We've got a new tool to share! This tool allows you to measure the effect of any SEO split tests you might run on your website. You can find it here. This post will walk you through how to use it, but before we do that, let's jump back a step:
What is an SEO split test?
An SEO split test is where you make a change to a subset of the pages on a particular page template, then see how those pages perform compared with the pages you left unchanged.
For example, you might change the title tags on 50% of your product pages and see how they perform compared to the other half.
This is different from a CRO split test, where you show users different versions of the SAME page.
Time for some quick definitions

Variant: This is the set of pages with the change. In our example, these are the pages with the altered title tag.

Control: This is the set of pages where we made no change.
What do you need to use this tool?
You’ll need to have run (or be running) an SEO split test (see 'How does A/B testing for SEO work?' in this post for help). You’ll need to have separated your pages into two groups, made a change to a percentage of them and then downloaded the total organic traffic to each of the two buckets.
(One way to do this with GA is to send a hit-level custom dimension containing “control” or “variant” and then measure organic entrances. Alternatively, you could download the data for each individual page and then match the pages to your control and variant buckets.)
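As a rough sketch of the second approach (matching page-level data to buckets), the snippet below sums organic entrances per day for each bucket. The row layout, the page paths and the `daily_bucket_totals` helper are all made up for illustration; your analytics export will look different.

```python
from collections import defaultdict

def daily_bucket_totals(rows, bucket_by_page):
    """Sum organic entrances per day for the control and variant buckets.

    rows: iterable of (date, page_path, entrances) tuples (hypothetical export layout).
    bucket_by_page: dict mapping page_path to "control" or "variant".
    Pages that are not in the mapping are ignored.
    """
    totals = defaultdict(lambda: {"control": 0, "variant": 0})
    for date, page, entrances in rows:
        bucket = bucket_by_page.get(page)
        if bucket is not None:
            totals[date][bucket] += entrances
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return [(d, totals[d]["control"], totals[d]["variant"]) for d in sorted(totals)]

# Made-up example rows: (date, page, organic entrances)
rows = [
    ("2018-01-01", "/product/a", 120),
    ("2018-01-01", "/product/b", 80),
    ("2018-01-01", "/product/c", 95),
    ("2018-01-02", "/product/a", 130),
    ("2018-01-02", "/product/b", 70),
    ("2018-01-02", "/about", 40),  # not part of the test, so it is skipped
]
bucket_by_page = {
    "/product/a": "variant",
    "/product/b": "control",
    "/product/c": "control",
}

daily = daily_bucket_totals(rows, bucket_by_page)
for date, control_total, variant_total in daily:
    print(date, control_total, variant_total)
```

Each output row is one day with the control total and the variant total, which is the shape of data the tool asks for.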
Specifically what you need is:

Total organic entrances (or sessions) day by day for the sum of your control set of pages.

Total organic entrances (or sessions) day by day for the sum of your variant set of pages.
For both of these groups, you’ll need 100 days of this data before the test begins, plus however many days your test has been running for.
So if your test has been running for 14 days, you would need 114 days of data.
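If you want to sanity-check your data before pasting it in, a small sketch like the one below splits each series into the 100 pre-test days and the test days. The `split_pre_and_test` helper and the numbers are hypothetical, purely to show the arithmetic.

```python
PRE_PERIOD_DAYS = 100  # days of history needed before the test begins

def split_pre_and_test(series, pre_days=PRE_PERIOD_DAYS):
    """Split a daily series into (pre-test days, test days)."""
    if len(series) <= pre_days:
        raise ValueError(
            f"Need more than {pre_days} days of data, got {len(series)}"
        )
    return series[:pre_days], series[pre_days:]

# Made-up daily totals: 100 days of history plus a 14-day test = 114 days.
control = [4000 + 10 * (day % 7) for day in range(114)]
variant = [5000 + 12 * (day % 7) for day in range(114)]

# Both series must cover exactly the same days.
if len(control) != len(variant):
    raise ValueError("Control and variant must have the same number of days")

variant_pre, variant_test = split_pre_and_test(variant)
print(len(variant_pre), len(variant_test))
```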
Why 100 days of historic data? In short, this is what allows the maths behind this to work correctly.
Want to see an example data set? Here's one we've put together in a Google sheet.
How do you use it?
You enter the control and variant data into the boxes, choose the start date for your test and click run.
(The tool knows your test begins 100 days in, so the start date input is purely to set the axis correctly.)
The tool will then plot your variant against the control using the Causal Impact model. The start date will be highlighted on the graph, and you can see how the two perform relative to each other.
If the red line is positive, your change was good. If the blue line is higher, then your change was bad.
You can also download the data in a CSV to calculate how much better they perform.
How does this work?
We’re about to enter the wonderful world of maths, so brace yourselves.
This tool uses Google's Causal Impact model. You can find the academic paper here. There isn’t much written on this for readers who aren’t maths inclined, although I think this post was better than some of the others.
It’s a form of regression model and works kind of like this (simplification ahead).
Causal Impact lets you break down time series data (data which is day by day) into its component parts, i.e. seasonality, industry effects, and the underlying trend.
You provide Causal Impact with data to model those effects (seasonality, industry demand, etc.), and it then creates the model using those inputs and your time series data. By isolating the other effects, it allows you to see the true performance beneath them.
So how does it work in this case?
Well, our time series data is the variant set. We want to know how that set of pages would have performed if there had been no change, so we use the Causal Impact model to mimic that.
We provide a variable for time (you never see this) and a control set of data (what you enter), which then helps the model to account for any swings like sales or Google updates which should affect both the control and variant equally. This allows us to isolate and compare the variant and the modelled control, which will have accounted for seasonality and site-wide swings.
Why not just directly compare control and variant? We can’t, because of possible differences between the variant and control groups. The most obvious example is that the two groups may be different sizes, depending on how the pages got sorted.
For example, your variant may average 5,000 organic sessions a day, whereas your control may average only 4,000, so we can’t compare the absolute fluctuations in our two sections.
There’s more to it than that, but that is the easiest example to follow.
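To make the idea concrete, here is a deliberately simplified sketch. It is not what the tool or Causal Impact actually does (Causal Impact is a Bayesian structural time series model); it swaps in a plain least-squares line, fitted on the 100 pre-test days, to predict what the variant "should" have done from the control, and it leaves out the time variable. The function names and data are made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def estimate_uplift(control, variant, pre_days=100):
    """Daily gap between the actual variant and a modelled 'no change' variant."""
    # Learn the control-to-variant relationship from the pre-test period only.
    intercept, slope = fit_line(control[:pre_days], variant[:pre_days])
    # Use the control during the test to predict the variant without the change.
    modelled = [intercept + slope * c for c in control[pre_days:]]
    return [actual - m for actual, m in zip(variant[pre_days:], modelled)]

# Made-up data: 100 pre-test days plus 14 test days with a weekly pattern.
# The variant group is larger than the control (about 5,000 vs 4,000 a day).
control = [4000 + 200 * (day % 7) for day in range(114)]
variant = [1.25 * c for c in control]
for day in range(100, 114):
    variant[day] += 250  # the effect of our change during the test

uplift = estimate_uplift(control, variant)
total_uplift = sum(uplift)
print(round(total_uplift))
```

Because the model learns that the variant is normally about 1.25 times the control, the difference in group size and the shared weekly swings drop out, leaving only the extra sessions from the change.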
Why don’t we show statistical significance?
Statistical significance is an important concept. With any kind of statistical modelling there will be a band of error.
This gets a little more complicated, however, when looking at a prediction over time. If we were just comparing two days, we might be able to say that A > B by such an amount that the result is statistically significant.
However, if we’re comparing two time series, then what is important is the performance over time and not a one-off date. If one consistently outperforms the other, then what is important is the total aggregate sessions, not any individual day. Every individual day may be within the margin of error, and yet the total can still be significant.
This basically makes day-by-day significance misleading, which is what this graph would show. Instead you need to calculate significance on total aggregate sessions, i.e. total sessions to control vs total sessions to the variant, which you’ll need to do manually with any standard significance tool.
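A quick back-of-the-envelope illustration of why the total can be significant when no single day is. This is a sketch under strong assumptions that are mine, not the tool's: daily errors are independent, normally distributed, and share the same standard deviation, and all the numbers are invented.

```python
import math

def outside_band(effect, sd, z=1.96):
    """True if the effect falls outside a roughly 95% error band of +/- z * sd."""
    return abs(effect) > z * sd

daily_uplift = 100.0  # made-up average daily difference, in sessions
daily_sd = 80.0       # made-up standard deviation of the daily error
days = 14             # length of the test

# Any single day: the uplift sits inside the daily error band.
single_day_clear = outside_band(daily_uplift, daily_sd)

# Over the whole test the uplift adds up linearly, but independent
# errors only grow with the square root of the number of days.
total_uplift = daily_uplift * days
total_sd = daily_sd * math.sqrt(days)
total_clear = outside_band(total_uplift, total_sd)

print(single_day_clear, total_clear)
```

With these numbers, one day's uplift of 100 sits inside its band of about 157 sessions, while the 14-day total of 1,400 is well outside its band of about 587.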
Why have we made this tool?
We're fully bought in on split testing. We think it's the future of SEO and the way the industry is going. We even built an entire platform around it: DistilledODN.
But we also recognise that not everyone can afford large scale enterprise tools, so we wanted to make the basic maths available to everyone and encourage more industry testing.
While the maths here is a simpler version of what we use in DistilledODN (we can't invest the same scale of resources into testing different models in this tool as we can in a full piece of software), the base (Causal Impact) is there. And unlike in our ODN, where we have to use a generic set of maths that is applicable to anyone and anywhere, testing and calculating the numbers yourself means you can make adjustment calls that a platform can't. For example, if you know you've run a sale on one section of your site which deviates from the norm, you can exclude that when providing the numbers to the tool.
Anyway, enough waffle. I hope you all find it useful!