Why your CRO tests fail

I’ve been running some A/B tests on the Distilled website recently. It was the first time I’d got my hands dirty in the data for a little while, and the tests weren’t doing what I was expecting them to do.

I found, for example, that winning variants would routinely underperform when we rolled them out to live.

I’d been trying not to overthink the problem (I often get called a stats wonk) and simply to trust the tools at my disposal and follow common practices:

  • I had a user objection I was targeting
  • I had a reason for believing my B might outperform my A
  • I trialled a significant change
  • I ran standard software
  • I sought a 95% confidence level

It was only after getting some strange results and digging deeper into the statistics that I discovered how dangerous this was.

Do you pay any attention to the blend of traffic hitting your A/B tests?

If not, it’s very likely that you are inadvertently running much weaker tests than you thought. If your site is anything like ours (and many of our clients’), you have a channel or two that are pretty small in % of visitor terms but that convert at a multiple of the rate of your larger channels. For example, for us, email converts multiple times better than unbranded or referral traffic.

A typical A/B test will randomly allocate visitors to one or other of the variants. Have you ever thought about the fact that randomly sending more of your email visitors to A or B would dramatically skew the results?

No big deal, you probably think. Statistical significance is for stats wonks. I have a gut feel for these things. I’m sure that we’re still right plenty of the time.

It turns out that, under some pretty standard assumptions, we could easily find our 95% confidence level turned into an 80% confidence level or worse. Suddenly, you’ve gone from being wrong one time in twenty to being wrong one time in five. [Incidentally, if you allow yourself to peek at tests while they’re running, you could be doing a lot worse than this].

This is not a hypothetical problem. We ran a test on the DistilledU homepage to replace the product hero-shot (version A):

[Image: DistilledU hero-shot version (A)]

with a product video (version B):

[Image: DistilledU video version (B)]

B outperformed A by a decent margin in our first test (+15% with a 95% confidence level). Here are the results when we re-ran the test:

[Image: Results of peeking at the test]

B is converting 11% worse than A. What the...?

Why do we see high-performing variants underperform when we put them live?

After digging into the details a bit more, I suspected that the issue lay with our traffic blend. As I mentioned above, some of our smaller channels punch well above their weight in conversion rate terms. It occurred to me that if we accidentally sent more of our email traffic to one variant or the other, its higher conversion rate would skew the results pretty dramatically.
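
To put rough numbers on that mechanism, here’s a quick back-of-the-envelope sketch in Python. Every figure in it (10% of traffic from email converting at 8%, everything else at 2%, and a random 12%/8% split of the email traffic between the two arms) is an illustrative assumption rather than our real data:

    # Hypothetical numbers, purely to illustrate the mechanism: email is 10% of
    # traffic and converts at 8%; everything else converts at 2%. The two pages
    # are identical -- only the random share of email traffic differs per arm.

    def blended_rate(email_share, email_cr=0.08, other_cr=0.02):
        """Overall conversion rate of an arm, given its share of email traffic."""
        return email_share * email_cr + (1 - email_share) * other_cr

    rate_a = blended_rate(0.08)  # arm A happened to receive a bit less email traffic
    rate_b = blended_rate(0.12)  # arm B happened to receive a bit more

    print(f"A converts at {rate_a:.2%}, B at {rate_b:.2%}")
    print(f"Apparent lift of B over A: {rate_b / rate_a - 1:.1%}")
    # -> roughly a 10% apparent "win" for B, even though the pages are identical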

We allocated visitors to version A or B randomly. How likely was it that we would introduce this kind of skew?

I created a little script to help me understand the scale of the problem. It runs simulations using a variety of blends of traffic and assumed conversion rates of version A and version B. Each test is run for long enough that we can generally expect to reach confidence if there is a significant difference in real conversion rates. It then re-runs each test 500 times (you’d never do this in the real world!).
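
I’m not reproducing the full script here, but a minimal sketch of the idea looks something like this. The channel blend, conversion rates and sample size are hypothetical placeholders, and simulate_ab_test / confidence_b_beats_a are just names I’m using for illustration:

    import random
    from statistics import NormalDist

    def simulate_ab_test(n_per_arm, email_share=0.1, email_cr=0.08,
                         other_cr=0.02, lift_b=1.0, run_multiplier=1):
        """Simulate one A/B test on a blend of a small, high-converting 'email'
        channel and everything else. lift_b scales B's true conversion rates;
        lift_b=1.0 means A and B are really identical."""
        n = int(n_per_arm * run_multiplier)
        conversions = {"A": 0, "B": 0}
        for arm in ("A", "B"):
            lift = lift_b if arm == "B" else 1.0
            for _ in range(n):
                # Each visitor is randomly an email visitor or not, so the blend
                # each arm actually receives varies from run to run.
                cr = email_cr if random.random() < email_share else other_cr
                if random.random() < cr * lift:
                    conversions[arm] += 1
        return conversions["A"], conversions["B"], n

    def confidence_b_beats_a(conv_a, conv_b, n):
        """One-sided confidence that B's rate exceeds A's (two-proportion z-test)."""
        p_a, p_b = conv_a / n, conv_b / n
        pooled = (conv_a + conv_b) / (2 * n)
        se = (2 * pooled * (1 - pooled) / n) ** 0.5
        if se == 0:
            return 0.5
        return NormalDist().cdf((p_b - p_a) / se)

    # Re-run the same test many times and count how often it "wins" at 95%.
    wins = sum(
        confidence_b_beats_a(*simulate_ab_test(n_per_arm=20_000, lift_b=1.0)) >= 0.95
        for _ in range(500)
    )
    print(f"{wins} of 500 runs reached 95% confidence despite identical pages")

The real script varied the blend of traffic and the assumed conversion rates of A and B across runs; the loop at the end just shows the identical-pages case.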

We are seeking a 95% confidence level. What I found was that even if you are:

  1. Correctly setting out to run a test for the recommended length of time
  2. Avoiding peeking at the result part way through
  3. Calling a test successful if it achieves 95% confidence

As many as one in five of your “successful” results may in fact come from having accidentally (randomly) sent more high-converting (“email”) traffic to one variant or the other. (This explains why people sometimes find “successful” results when they are actually comparing two identical pages).

The glimpse of light at the end of the tunnel is that the longer you run a test for, the more the channel distribution converges (by the law of large numbers) to be the same for each variant. This means that we can fix the problem by running our tests for longer.
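
To get a feel for how quickly that convergence happens, here’s a tiny sketch (again assuming, purely for illustration, that email is 10% of traffic): the email share each arm receives is binomial, so the likely gap between the two arms shrinks with the square root of the sample size.

    # Assumed 10% email share, for illustration only. The email share each arm
    # actually receives is binomial, so its standard deviation falls as
    # 1/sqrt(n); the typical gap between the two arms shrinks accordingly.
    email_share = 0.10

    for n_per_arm in (1_000, 10_000, 100_000):
        sd_share = (email_share * (1 - email_share) / n_per_arm) ** 0.5
        sd_gap = (2 ** 0.5) * sd_share  # sd of the difference between the two arms
        print(f"n per arm = {n_per_arm:>7,}: typical email-share gap ~ {sd_gap:.2%}")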

How long should I run my split tests for?

I set the simulation up to run tests for 2, 4 and 8 times as long as the conventional wisdom suggests. Here are the results for a range of blends of traffic where the email traffic converts between 1.2x and 1.6x as well as the rest of the traffic (in theory, I was running to a 95% confidence level):

[Image: Effect of increasing trial size on effective power]

This shows that even running tests for 8 times as long as we previously thought might only get you back to a 90% confidence level (remember, the software will be reporting a 95% confidence level).
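
For anyone who wants to poke at this themselves, the longer runs can be modelled by reusing the hypothetical simulate_ab_test sketch from earlier and simply scaling the sample size - something like the following, where the assumed 15% true lift for B is just an example:

    # Reusing the hypothetical simulate_ab_test / confidence_b_beats_a sketches
    # from above: scale the run length and count how often a B with an assumed
    # 15% true lift reaches the reported 95% confidence threshold.
    for multiplier in (1, 2, 4, 8):
        wins = sum(
            confidence_b_beats_a(
                *simulate_ab_test(n_per_arm=20_000, lift_b=1.15,
                                  run_multiplier=multiplier)
            ) >= 0.95
            for _ in range(500)
        )
        print(f"{multiplier}x run length: {wins}/500 runs reached 95% confidence")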

Surely there’s a better way than just blindly running our tests for 8x as long?

I haven’t been able to work out a way of just calculating how long we need to run our tests for based on the actual traffic blend and conversion rate differences by channel. Luckily though, I think I’ve stumbled across something that might give us a solution (thanks to Mat at Mixcloud - whose Django A/B testing plugin we also happen to use). Mat was talking about how they routinely run A/A/B/B tests because they often find that configuration errors creep in and break their tests.

I’ve seen people recommend running A/A tests to detect setup errors or to set the sample size [PDF]. But since traffic blend can change over time, by running A/A/B/B and only accepting a result when the As and Bs have converged, I think we can avoid running tests forever.

A proposed “answer” for those who just want to run good tests

Sidenote: I don’t think I’ve quite nailed this yet - we probably want a slightly different test for checking that A1=A2 and B1=B2, one that rejects more readily (i.e. uses a lower confidence level for those checks). Can anyone suggest a better approach?

Next time you set up an A/B test, set up two identical versions of each variant - let’s call them A1 and A2, B1 and B2. (Our hypothesis is that the conversion rate of B is better than the conversion rate of A).

Decide on the confidence level you want to see (let’s use 95% as an example). At the end of a standard-length run (note that this will be twice as long as under an A/B test since you need to send the same amount of traffic to twice as many pages), using standard measurements of confidence:

  1. If we see a significant difference in the conversion rate of A1 and A2 (at the 95% level) we call the test a dud
  2. Likewise if we see a significant difference in the conversion rate of B1 and B2

If both of those pass, we declare B a winner if we also see a significant difference between the conversion rate of A1 and B1 (at the 95% level).
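
As a sketch of what that decision rule might look like in code (the function names and the made-up numbers at the end are mine, and the single 95% threshold is just the example level from above):

    from statistics import NormalDist

    def two_sided_confidence(conv_x, n_x, conv_y, n_y):
        """Two-sided confidence that two pages have different conversion rates,
        via a pooled two-proportion z-test."""
        p_x, p_y = conv_x / n_x, conv_y / n_y
        pooled = (conv_x + conv_y) / (n_x + n_y)
        se = (pooled * (1 - pooled) * (1 / n_x + 1 / n_y)) ** 0.5
        if se == 0:
            return 0.0
        return 2 * NormalDist().cdf(abs(p_x - p_y) / se) - 1

    def aabb_verdict(a1, a2, b1, b2, level=0.95):
        """Each argument is a (conversions, visitors) tuple for one page."""
        # 1. & 2. If the identical pages disagree, the traffic mix (or the setup)
        #         is suspect: throw the whole test away.
        if two_sided_confidence(*a1, *a2) >= level:
            return "dud"
        if two_sided_confidence(*b1, *b2) >= level:
            return "dud"
        # Otherwise compare A1 with B1 in the usual way.
        if two_sided_confidence(*a1, *b1) >= level:
            return "B wins" if b1[0] / b1[1] > a1[0] / a1[1] else "B loses"
        return "no clear difference"

    # Made-up example numbers: (conversions, visitors) for each of the four pages
    print(aabb_verdict(a1=(230, 10_000), a2=(245, 10_000),
                       b1=(290, 10_000), b2=(281, 10_000)))  # -> "B wins"

Per the sidenote above, you may well want a lower threshold (i.e. rejecting more tests as duds) for the A1=A2 and B1=B2 checks than for the final A versus B comparison.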

This methodology removes the need to run exceptionally long tests just to be confident that skewed results will be rare. Instead, it focuses on discarding tests that appear to have been skewed by an uneven traffic mix, leaving only the real results that we can confidently put live.

Will this work?

It’s hard to calculate the exact impacts of doing things this way, so I ran a follow-up simulation, using the same approach described above but with my proposed A/A/B/B methodology (you can find the code here). In the chart below (built on the same assumptions described above, but running A/A/B/B tests), you see:

  • Increasing numbers of successful tests (blue bar) - the longer we run the trial for, the more often we gain confidence in our outcome
  • A number of tests that we have to discard (green bar) because either the As or the Bs haven’t converged

[Image: A/A/B/B - effect of increased run length on trial outcomes]

Some thank yous

Thank you to:

About the author

Will Critchlow

Will founded Distilled with Duncan in 2005. Since then, he has consulted with some of the world’s largest organisations and most famous websites, spoken at most major industry events and regularly appeared in local and national press.