I've been doing some quite fun stuff involving advanced maths over the last couple of days. It's good to do hard theoretical stuff from time to time. Because it's been a while, I had to revise my stats quite a bit. As soon as I got beyond binomial and normal distributions to chi-squared distributions, G-tests and beta functions, I turned to the textbooks. Actually, first I turned to Google, but all I could find was course listings for university courses saying things like 'This course includes the beta function, chi-squared distribution etc. etc.'. I guess most of the notes are only available on intranets etc. It is remarkably hard to find information online about advanced maths.
Wikipedia is actually a very good resource here (this is the kind of thing Wikipedia generally rocks at - the right answer is indisputable, so you don't get edit wars etc.). To see how comprehensive it is, check out the pages on the Beta function and the Beta distribution (though there's precious little about how to calculate either).
When it comes to numerical methods, there's still nothing to beat Numerical Recipes, but I didn't really want to have to write my own routines to calculate things like chi-square distributions. Luckily, after a bit of googling, I found that we could install statistics functions for PHP. Unfortunately, they are the worst-documented functions I have ever seen (check out the documentation for the chi-square distribution function!) but that's another blog post.
##Why I am looking up all these ridiculous things
I'm in the process of writing a new tool to help us do better at PPC management. It's a similar idea to Clickmuse's Adwords Optimizer, which we use and find handy, but instead of telling you when one advert is out-performing another, this is designed to spot keywords that need to be targeted differently (either moved into their own ad group in order to gain the benefit of a high-performing keyword, or deleted / moved out of an ad group to avoid pulling the performance of your other keywords down). Because quality score takes the performance of the ad group as a whole into account, and quality score feeds into your cost-per-click, it is becoming ever more important to monitor ad groups closely and group keywords together well.
At first, I thought this was a pretty simple statistical task - all I needed to do was look at each keyword and its clicks, and calculate the probability of getting this many clicks if the keyword had an underlying long-run click-through rate equal to that of the ad group it is in. If this probability was low (below 5%, say), we could be relatively confident that the keyword is an outlier compared to the rest of the group and should be moved out of the ad group.
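To make that first idea concrete, here is a minimal sketch in Python (not the PHP I was actually working in). The function names and the 5% default are illustrative assumptions, and it only looks at the upper tail, i.e. keywords with surprisingly many clicks; a keyword with surprisingly few clicks would need the lower tail instead.

```python
from math import comb


def binomial_upper_tail(n_clicks, impressions, ctr):
    """P(X >= n_clicks) for X ~ Binomial(impressions, ctr).

    Summed term by term, so only sensible for small impression counts.
    """
    return sum(
        comb(impressions, k) * ctr**k * (1 - ctr) ** (impressions - k)
        for k in range(n_clicks, impressions + 1)
    )


def is_high_outlier(n_clicks, impressions, group_ctr, threshold=0.05):
    """True if this many clicks would be unlikely at the ad group's CTR.

    `threshold` is the hypothetical 5% cut-off mentioned in the text.
    """
    return binomial_upper_tail(n_clicks, impressions, group_ctr) < threshold
```

For example, `is_high_outlier(8, 20, 0.1)` asks whether 8 clicks from 20 impressions is surprising for an ad group that converts 10% of impressions into clicks.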
I started coding this without thinking too much further (great software development process there) and hit a hurdle when I needed to calculate the probability of getting at least n clicks out of I impressions with a supposed click-through rate of p. It's simple when I is small (it's just a sum of binomial distribution probabilities) but as I grows, this involves calculating the factorials of a bunch of large numbers. At this point I started dredging up some of my old stats courses and remembered that you can use Normal and Poisson approximations to the Binomial distribution when the number of trials is large. Which one you use depends on the probability: for reasonably large expected numbers of clicks you can use the Normal distribution; for small expected numbers of clicks you have to use the Poisson distribution.
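As a rough illustration of the three options, here is a sketch in Python (again, not the PHP from the original project; all function names are made up for this example). It assumes 0 < p < 1, uses the log-gamma function to sidestep the huge factorials, and applies a continuity correction in the Normal approximation, which is a common choice rather than something taken from my original code.

```python
from math import erfc, exp, lgamma, log, sqrt


def log_binom_pmf(k, n, p):
    """log of C(n, k) * p^k * (1-p)^(n-k), via lgamma instead of factorials."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))


def exact_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summing the pmf in log space."""
    return sum(exp(log_binom_pmf(j, n, p)) for j in range(k, n + 1))


def normal_tail(k, n, p):
    """Normal approximation to P(X >= k), with a continuity correction."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    z = (k - 0.5 - mean) / sd
    return 0.5 * erfc(z / sqrt(2))


def poisson_tail(k, n, p):
    """Poisson approximation to P(X >= k), using rate lambda = n * p."""
    lam = n * p
    # P(X < k) is the sum of e^-lam * lam^j / j! for j = 0 .. k-1.
    below = sum(exp(-lam + j * log(lam) - lgamma(j + 1)) for j in range(k))
    return max(0.0, 1.0 - below)
```

Comparing `exact_tail(120, 10000, 0.01)` with `normal_tail(120, 10000, 0.01)`, or `exact_tail(6, 100000, 0.00002)` with `poisson_tail(6, 100000, 0.00002)`, gives a feel for how close each approximation is in the regime it is meant for. Note that `poisson_tail` computes one minus a sum, so it loses precision when the tail probability is extremely small.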
Calculating the normal distribution in PHP was pretty straightforward, but calculating exact Binomial or Poisson tail probabilities when large numbers are involved means calculating incomplete Beta and Gamma functions and other such fun. I needed to read up on the subject, which is why I turned to Google, failed, and turned to Wikipedia.
This post isn't really about what the solution to my problem was in the end, but just in case anyone is interested, I have gone with calculating a G-statistic for the set of data of:
|Rest of ad group||ca||nk|
This statistic then has a chi-squared distribution with 1 degree of freedom (I think - because there are 2 independent variables given the total number of impressions and we are estimating one parameter - the click-through rate of the whole ad group). Testing this for significance then tells us how likely a discrepancy at least this large would be if the keyword clicks and 'rest of ad group' clicks were drawn from a universe with the same click-through rate. I think this is a better approach because it allows for both the possibility that the keyword we are interested in has an unexpected number of clicks and that the rest of the ad group has an unexpected number of clicks.
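For anyone who wants to try this, here is a minimal sketch of the calculation in Python rather than PHP. It assumes the data form a 2×2 table of clicks and non-clicks for the keyword and for the rest of the ad group; that layout, and every name in the code, is an assumption made for illustration. It also assumes 1 degree of freedom, which is what the usual (rows − 1) × (columns − 1) rule gives for a 2×2 table, and uses the fact that the chi-squared tail probability with 1 degree of freedom can be written with the complementary error function.

```python
from math import erfc, log, sqrt


def g_statistic(keyword_clicks, keyword_impressions,
                rest_clicks, rest_impressions):
    """G = 2 * sum(O * ln(O / E)) over a 2x2 clicks / non-clicks table."""
    observed = [
        [keyword_clicks, keyword_impressions - keyword_clicks],
        [rest_clicks, rest_impressions - rest_clicks],
    ]
    total = keyword_impressions + rest_impressions
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]

    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g += o * log(o / expected)
    return 2 * g


def chi2_sf_1df(x):
    """P(chi-squared with 1 degree of freedom > x)."""
    return erfc(sqrt(x / 2))


# Hypothetical numbers: 30 clicks from 100 impressions for the keyword,
# 50 clicks from 500 impressions for the rest of the ad group.
g = g_statistic(30, 100, 50, 500)
p_value = chi2_sf_1df(g)
```

A small `p_value` (below 5%, say) would suggest the keyword's click-through rate really does differ from the rest of the ad group.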
##Where this all fails
I find the concept of degrees of freedom very difficult to grasp (always have, despite having it explained by some of the best minds in the world - I find it especially hard when you start talking about fractional degrees of freedom) and when I'm working with real-world examples, I'm never sure I've got it right.
Wikipedia is very poor on the subject and I can't find good explanations online.
If you happen to understand this kind of stuff and can tell me whether my problem has 1 or 2 degrees of freedom, it would be very much appreciated!