Why your CRO tests fail

I’ve been running some A/B tests on the Distilled website recently. It was the first time I’d got my hands dirty in the data for a little while, and the tests weren’t doing what I was expecting them to do.

I found, for example, that winning variants would routinely underperform when we rolled them out to the live site.

I’d been trying not to overthink the problem (I often get called a stats wonk) and simply to trust the tools at my disposal and follow common practices:

  • I had a user objection I was targeting
  • I had a reason for believing my B might outperform my A
  • I trialled a significant change
  • I ran standard software
  • I sought a 95% confidence level

It was only after getting some strange results and digging deeper into the statistics that I discovered how dangerous this was.

Do you pay any attention to the blend of traffic hitting your A/B tests?

If not, it’s very likely that you are inadvertently running much weaker tests than you thought. If your site is anything like ours (and many of our clients’), you have a channel or two that are pretty small in percentage-of-visitors terms but that convert at a multiple of the rate of your larger channels. For example, for us, email converts multiple times better than unbranded or referral traffic.

A typical A/B test will randomly allocate visitors to one or other of the variants. Have you ever thought about the fact that randomly sending more of your email visitors to A or B would dramatically skew the results?
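
To make that concrete, here’s a toy calculation (the channel shares and conversion rates below are made-up illustrative numbers, not our real data):

```python
# Toy illustration (made-up numbers): how an uneven split of a small,
# high-converting channel can manufacture an apparent lift between two
# otherwise identical pages.

email_cr, other_cr = 0.10, 0.02   # assume email converts 5x better than the rest

def blended_rate(email_share):
    """Overall conversion rate for a variant receiving `email_share` of its
    traffic from email, with the remainder from lower-converting channels."""
    return email_share * email_cr + (1 - email_share) * other_cr

rate_a = blended_rate(0.08)   # variant A happened to get 8% email traffic
rate_b = blended_rate(0.12)   # variant B happened to get 12% email traffic

print(f"A: {rate_a:.4f}  B: {rate_b:.4f}  apparent lift: {rate_b / rate_a - 1:+.1%}")
# A: 0.0264  B: 0.0296  apparent lift: +12.1% - with no real difference at all
```

Two identical pages, yet the one that happened to receive a few extra percentage points of email traffic looks like a double-digit winner.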

No big deal, you probably think. Statistical significance is for stats wonks. I have a gut feel for these things. I’m sure that we’re still right plenty of the time.

It turns out that, under some pretty standard assumptions, we could easily find our 95% confidence level turned into an 80% confidence level or worse. Suddenly, you’ve gone from being wrong one time in twenty to being wrong one time in five. [Incidentally, if you allow yourself to peek at tests while they’re running, you could be doing a lot worse than this].

This is not a hypothetical problem. We ran a test on the DistilledU homepage to replace the product hero-shot (version A):

[Image: DistilledU hero-shot version (A)]

with a product video (version B):

[Image: DistilledU video version (B)]

B outperformed A by a decent margin in our first test (+15% with a 95% confidence level). Here are the results when we re-ran the test:

[Chart: Results of peeking at the test]

B is converting 11% worse than A. What the...?

Why do we see high performing variants underperform when we put them live?

After digging into the details a bit more, I suspected that the issue lay with our traffic blend. As I mentioned above, some of our smaller channels punch well above their weight in conversion rate terms. It occurred to me that if we accidentally sent more of our email traffic to one variant or the other, its higher conversion rate would skew the results pretty dramatically.

We allocated visitors to version A or B randomly. How likely was it that we would introduce this kind of skew?

I created a little script to help me understand the scale of the problem. It runs simulations using a variety of blends of traffic and assumed conversion rates of version A and version B. Each test is run for long enough that we can generally expect to reach confidence if there is a significant difference in real conversion rates. It then re-runs each test 500 times (you’d never do this in the real world!).
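
Here’s a minimal sketch of that kind of simulation (this isn’t the actual script; the channel split, conversion rates and run length are assumed purely for illustration, and I’ve simplified to the case where A and B genuinely convert identically):

```python
# Sketch: simulate many A/B tests on a blended-traffic site where A and B
# truly convert identically, and count how often the run looks "significant".
import random
from math import sqrt

EMAIL_SHARE = 0.10                 # assumed: 10% of visitors arrive via email
EMAIL_CR, OTHER_CR = 0.10, 0.02    # assumed per-channel conversion rates
VISITORS_PER_TEST = 20_000         # roughly a "recommended length" run
N_SIMULATIONS = 500

def run_one_test():
    """Randomly allocate visitors to A or B; return [conversions, visitors] per arm."""
    stats = {"A": [0, 0], "B": [0, 0]}
    for _ in range(VISITORS_PER_TEST):
        variant = random.choice("AB")
        cr = EMAIL_CR if random.random() < EMAIL_SHARE else OTHER_CR
        stats[variant][0] += random.random() < cr
        stats[variant][1] += 1
    return stats

def significant(stats, z_crit=1.96):
    """Two-proportion z-test at (roughly) the 95% level."""
    (ca, na), (cb, nb) = stats["A"], stats["B"]
    pa, pb = ca / na, cb / nb
    pooled = (ca + cb) / (na + nb)
    se = sqrt(pooled * (1 - pooled) * (1 / na + 1 / nb))
    return abs(pa - pb) / se > z_crit

false_positives = sum(significant(run_one_test()) for _ in range(N_SIMULATIONS))
print(f"'Significant' results with no real difference: {false_positives}/{N_SIMULATIONS}")
```

Since A and B are identical by construction, any “significant” result in this sketch is a false positive.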

We are seeking a 95% confidence level. What I found was that even if you are:

  1. Correctly setting out to run a test for the recommended length of time
  2. Avoiding peeking at the result part way through
  3. Calling a test successful if it achieves 95% confidence

As many as one in five of your “successful” results may in fact come from having accidentally (randomly) sent more high-converting (“email”) traffic to one variant or the other. (This explains why people sometimes find “successful” results when they are actually comparing two identical pages).

The glimpse of light at the end of the tunnel is that the longer you run a test for, the more the channel distribution converges (by the law of large numbers) to be the same for each variant. This means that we can fix the problem by running our tests for longer.
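
To get a feel for how slowly that convergence happens, here’s a back-of-the-envelope calculation (again assuming, purely for illustration, that email is 10% of traffic). The likely skew in each variant’s email share shrinks only with the square root of the number of visitors:

```python
# Rough illustration of the convergence: the spread (standard error) of the
# email share each variant happens to receive shrinks only with sqrt(n).
from math import sqrt

EMAIL_SHARE = 0.10   # assumption: 10% of visitors arrive via email

for visitors_per_variant in (1_000, 4_000, 16_000):
    se = sqrt(EMAIL_SHARE * (1 - EMAIL_SHARE) / visitors_per_variant)
    print(f"n={visitors_per_variant:>6}: email share ~ 10% +/- {2 * se:.1%}")
```

In other words, quadrupling the run length only halves the likely skew, which is why simply running tests for longer is such an expensive fix.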

How long should I run my split tests for?

I set the simulation up to run tests for 2, 4 and 8 times as long as the conventional wisdom suggests. Here are the results for a range of traffic blends where the email traffic converts at between 1.2x and 1.6x the rate of the rest of the traffic (in theory, I was running to a 95% confidence level):

[Chart: Effect of increasing trial size on effective power]

This shows that even running tests for 8 times as long as we previously thought necessary might only get you back to a 90% confidence level (remember, the software will be reporting a 95% confidence level).

Surely there’s a better way than just blindly running our tests for 8x as long?

I haven’t been able to work out a way of just calculating how long we need to run our tests for based on the actual traffic blend and conversion rate differences by channel. Luckily though, I think I’ve stumbled across something that might give us a solution (thanks to Mat at Mixcloud - whose Django A/B testing plugin we also happen to use). Mat was talking about how they routinely run A/A/B/B tests because they often find that configuration errors creep in and break their tests.

I’ve seen people recommend running A/A tests to detect setup errors or to set the sample size [PDF]. But since traffic blend can change over time, by running A/A/B/B and only accepting a result when the As and Bs have converged, I think we can avoid running tests forever.

A proposed “answer” for those who just want to run good tests

Sidenote: I don’t think I’ve quite nailed this yet - we probably want a slightly different test for checking that A1=A2 and B1=B2, one that rejects more readily (i.e. uses a lower confidence level for those checks). Can anyone think of a better way of doing it?

Next time you set up an A/B test, set up two identical versions of each variant - let’s call them A1 and A2, B1 and B2. (Our hypothesis is that the conversion rate of B is better than the conversion rate of A).

Decide on the confidence level you want to see (let’s use 95% as an example). At the end of a standard-length run (note that this will be twice as long as under an A/B test since you need to send the same amount of traffic to twice as many pages), using standard measurements of confidence:

  1. If we see a significant difference in the conversion rate of A1 and A2 (at the 95% level) we call the test a dud
  2. Likewise if we see a significant difference in the conversion rate of B1 and B2

If both of those pass, we declare B a winner if we also see a significant difference between the conversion rate of A1 and B1 (at the 95% level).
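
Here’s the whole decision rule as a sketch in code, using a plain two-proportion z-test as the “standard measurement of confidence” (the counts in the example at the bottom are made up purely for illustration):

```python
# Sketch of the proposed A/A/B/B decision rule. The 95% threshold, the z-test
# and the (conversions, visitors) data structure are illustrative choices,
# not tied to any particular testing tool.
from math import sqrt, erfc

def p_value(conv1, n1, conv2, n2):
    """Two-sided p-value for the difference between two conversion rates
    (standard two-proportion z-test)."""
    p1, p2 = conv1 / n1, conv2 / n2
    pooled = (conv1 + conv2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    return erfc(z / sqrt(2))

def aabb_verdict(a1, a2, b1, b2, alpha=0.05):
    """Each argument is a (conversions, visitors) tuple for one arm."""
    if p_value(*a1, *a2) < alpha:
        return "dud: A1 and A2 disagree - discard and re-run"
    if p_value(*b1, *b2) < alpha:
        return "dud: B1 and B2 disagree - discard and re-run"
    if p_value(*a1, *b1) < alpha:
        pa, pb = a1[0] / a1[1], b1[0] / b1[1]
        return "winner: B beats A" if pb > pa else "winner: A beats B"
    return "no significant difference - keep A"

# Made-up counts purely for illustration: (conversions, visitors) per arm
print(aabb_verdict((120, 5000), (118, 5000), (155, 5000), (150, 5000)))
```

As per the sidenote above, the A1 vs A2 and B1 vs B2 checks could reasonably use a larger alpha than 0.05, so that more potentially skewed tests get thrown out.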

This methodology removes the need to run exceptionally long tests just to be confident that skewed results will be rare. Instead, it focuses on discarding tests that appear to have been skewed by uneven traffic mixes, leaving only real results we can confidently put live.

Will this work?

It’s hard to calculate the exact impacts of doing things this way, so I ran a follow-up simulation using the same approach described above, but with my proposed A/A/B/B methodology (you can find the code here). In the chart below (built on the same assumptions described above - but running A/A/B/B tests), you see:

  • Increasing numbers of successful tests (blue bar) - the longer we run the trial for, the more often we gain confidence in our outcome
  • A chunk of tests that we have to discard because either the As or the Bs haven’t converged (green bar)

[Chart: A/A/B/B - effect of increased run length on trial outcomes]

Some thank yous

Thank you to:

17 Comments

  1. Tom

    Fascinating stuff - maybe I'm barking up the wrong tree but I can't help but see the parallel between the A/A/B/B test and error checking codes.... It feels like AABB is just trying to encode the message twice and check if it's changed - we can learn from information theory and surely build a model that is not only more efficient at error checking but also error CORRECTING if we build the model well?

    It also feels like there might be a universal theory that you could use that somehow builds a model for "X channels, each converting at xx%" to try and model conversion changes across all channels - i.e. in a perfect world you'd be able to see that your test has found that A converts better with 95% confidence for 3 of your channels but 2 of your channels A only converts better with 60% confidence.

    • Will Critchlow

      Yeah - I suspect that if I were to dig through the research on the subject, I would find that there are some great algorithms for doing this kind of thing.

      I suspect that there is an approach using some kind of time series test that checks the variance is converging along with the multi-armed bandit stuff to avoid the peeking problem.

      I might write an open letter to the testing tool vendors asking them to work it out :)

  2. Alex Czarto

    Thanks for digging so deeply into this - great post and great suggestion to run A/A/B/B tests. Really good stuff.

    Here is a relatively simple formula (and good blog post) about how to test statistical significance on your test results (because you shouldn't always trust what your A/B testing tool tells you)
    http://blog.asmartbear.com/easy-statistics-for-adwords-ab-testing-and-hamsters.html

  3. The survey was thought-provoking on its own, but this post was excellent. I'm inclined to adopt the methodology here but it's tough. Given how much more data/time is needed to restore the 95% certainty, now I'm looking for an equation that will help me balance between achieving that certainty and running multiple tests in that time frame.

    I mean, even if 1/5 tests produce an invalid result, that's still 4/5 that are certain... most gamblers would dive on those odds... And if you could have run 8 "nearly certain tests" in the same time frame it takes to complete the "absolutely certain" test you still might win out over holding out for more absolute confirmation. I also feel like gross misreports are less likely to occur, since if the two variations have significantly different conversion rates, then the deviation that may occur due to traffic fluctuations shouldn't be enough to throw off the winner (although your example home page test is certainly kind of shocking). It's sort of like maximizing ROI/margins vs profits, there's nothing wrong with making a little less per unit if you can double your sales. I guess what I'm saying is, what's the balance in opportunity cost between running fewer "more certain" optimizer tests and running more "less certain" tests in less time?

    Sorry for the length, but I found this really engaging. Cheers!

    • Will Critchlow

      I'm considering recommending that we run A/A/B/B but at a lower p-value (80-85%).

      The interesting thing for me is how many people have had the reaction you have (paraphrasing, "we'll move to running less powerful tests to avoid having to run them for longer"). It makes me interested in how people chose to run 95% in the first place.

      I think the critical thing for me is that we run a test with approximately the power we think we are running. I think I'd rather run an 80% test that I know is an 80% test than a "95%" test that could be anywhere in the 75%+ range.

  4. Really enjoyed the post Will, very thought provoking.

    I'd be interested in knowing A) how many AABB tests are thrown away because they don't correlate and B) how many people have the willpower/discipline to throw away the results of an AABB test when A!=A or B!=B

    I think it's never going to be possible to create a purely scientific 'ceteris paribus' test as you could in a lab because humans are involved. However, based on your post, I think a smarter approach may need to be taken to distribute traffic into A or B based on traffic source. Currently large samples and long test runs are designed to reach an even distribution, but a helping hand may help cut down on these.

    Best post to date Will, nothing wrong with being a stats wonk

    • Will Critchlow

      The last chart tries to show how many otherwise successful tests are thrown away because A!=A or B!=B (the green bar).

      In the case where a test fails with A!=A or B!=B, you don't need to throw away the variation - you simply need to re-run the test (I think).

  5. I was scrolling through the comments and thought I might be very late to the party in adding my own "thanks." But then I realized I'm just used to the American date being all out of order, haha. Anyway, to the point:

    Thanks for writing this up! It's given me a lot to think about regarding our CRO and will start some interesting conversations in our office. I've emailed it to the guys running our tests and hope it starts some buzz.

  6. I might be barking up the wrong tree here, but from a practical POV, surely even with a 75% "certainty" of better performance it's worth changing your site - after all, you already have by running the test - they are on live visitors after all - and, thanks to the joy of the web, if it really does fail in "real life", you can always revert it back again?

    In particular, one needs to focus on by how much one variant outperforms the other - if they are close together then it is somewhat neither here nor there how confident you are about performance, as it makes relatively little difference either way. If there's a bigger gap then it makes more sense to worry about these things....?

    • Will Critchlow

      In many situations, you're right of course. Though it's interesting how many marketers I've spoken to chose 95% confidence initially but then are happy with only 75% when they discover this kind of thing is going on.

      Why not pick 75% from the outset in that case?

      Also - if you peek at tests as well as experiencing this kind of issue, it can easily drop further and become essentially random.

    • Will Critchlow

      Incidentally, when I surveyed a bunch of CRO people I saw around 2/3 using 95%+ as their chosen test power.

  7. Richard Vaughan

    I would tend to solve this problem by running A/A/B/B tests first to determine if any noise in the results was sufficient to require re-running multiple tests for specific channels. I would then target specific channels like email using query parameters (I know Optimizely lets you do this) to see if I could increase the overall real confidence level.

    Inconsistent or schizophrenic results post-test or a high noise level in A/A/B/B tests tend to indicate that a conversion funnel is perhaps not the ideal path for all user types and may need splitting for different entry paths etc. Which essentially becomes a product/design/marketing issue, not a measurement or mathematical one.

    • Will Critchlow

      Yep - if you're going to be running a lot of tests on noisy channels that have roughly static "noisiness" over time, then this is a great approach to converge on the right length and power of test.

  8. Lee Newell

    What CRO software do you use at distilled?

  9. In a case where there are known strong main effects (email converts better) and low natural incidence of this grouping factor, you might want to look into doing what is known as a stratified sample. In certain circumstances this is a preferable methodology to a random sample. I think it might sort out your problem.
