If you make a living in the startup world, you cannot have missed the rise of A/B tests, greatly boosted by Eric Ries’ book The Lean Startup. But what is A/B testing all about? And how do you get a data-driven approach to product development right for your website or web application?
What is A/B testing?
A/B testing requires two different versions of a page: your current version, and the version you want to change the page to. Every A/B experiment starts with a small hypothesis. For instance: in order to drive more traffic towards our signup page we need a friendly green button, instead of the blue one we have currently. To research and justify your changes, you route half your visitors to the first page and half to the second. Next, you monitor how many of the visitors perform the desired action (for example, signing up for your service) on each page, and you calculate the conversion rate for the old and the new page. The page with the higher conversion rate is probably the one you should use.
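The routing and the conversion-rate calculation can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular A/B testing tool: the function names, the experiment label and the visitor counts are all made up for the example. It assumes every visitor has a stable ID (a cookie value, for instance), so the same person always lands on the same version.

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str = "signup-button") -> str:
    """Route a visitor to 'A' or 'B', roughly 50/50, always the same for a given ID.

    Hashing the (hypothetical) experiment label together with the visitor ID
    keeps the split stable across visits without storing any state.
    """
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def conversion_rate(conversions: int, visitors: int) -> float:
    """Share of visitors who performed the desired action."""
    return conversions / visitors if visitors else 0.0


# Made-up counts, purely for illustration.
rate_a = conversion_rate(conversions=120, visitors=2400)  # current page
rate_b = conversion_rate(conversions=150, visitors=2400)  # page with the green button
print(f"A: {rate_a:.2%}  B: {rate_b:.2%}")
```

Whether a gap like the one between these two rates means anything is a separate question, which is what the rest of this post is about.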
For a funny intro on A/B testing, go check this presentation by Visual Website Optimizer. You’ll find more serious (and useful – who sells fish online anyway?!) slides by Ben Tilly on the topic on elem.com. Just one of the take-aways: A/B tests do not substitute for talking to users, usability tests, acceptance tests, unit tests or ‘thinking’.
To be considered during sprint planning, a feature normally must relate back to a testable hypothesis which attempts to predict its impact on key company metrics. If the upside of a hypothesis is promising enough, we will move quickly to prototype the feature, utilizing a combination of internal and third party A/B testing tools.
Joel Lewenstein (product designer at Quora) on handling an unexpected negative outcome of your A/B test:
When data confirms intuition, it’s an easier outcome to feel good about and move forward with. Harder to reconcile is when the data disagrees with intuition, i.e. Scenarios II and III. Underneath the inherent frustration and bewilderment that comes with these outcomes, there can be tremendous learning opportunities.
Unexpectedly bad data forces a reconsideration of assumptions:
Really thoughtful consideration of these results, without biases or assumptions, creates fertile ground for brainstorming new ideas with potentially high impact.
To really reap the benefits of this data-driven approach to product development, Joel considers (at least) two things critical. Test groups should be designed to test as few variables as possible. And those variables should be stated as clear, testable hypotheses like ‘I think people will sign up for a service that seems affordable, so I’m going to put our price information on the homepage’.
When A/B testing turns out to be A/A testing
Stavros Korokithakis recently wrote a great blog post on the pitfalls of A/B testing. Stavros:
The main problem with A/B testing is that you can’t really be sure that the results were actually statistically significant, rather than a fluke. If you show one person the first page and they sign up, and show one person the second page and they don’t sign up, does this mean that the first page is better? Clearly not, since it might be pure luck. So, we need a way to tell whether or not the difference in rate is due to randomness. (…) Wanting to verify that my A/B testing code was implemented properly, I gave it two variations of the page that were exactly the same, and which shouldn’t have produced any significant difference. To my great surprise, I saw that my tests reported conversions being up by a whole 30%, with 99.8% confidence, on one of the two (identical) versions! How could it be 99.8% sure that one page is 30% better, when there was no change between the two?
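One common way to ask "could this difference be pure luck?" is a two-proportion z-test. The sketch below is a generic textbook version using only the Python standard library; it is not necessarily the test Stavros or any specific A/B tool uses, and the function name is invented for the example. It relies on a normal approximation, which is only reasonable once each variant has a decent number of visitors and conversions.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two conversion rates.

    conv_* are conversion counts and n_* are visitor counts per variant.
    """
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Nobody (or everybody) converted: the data cannot separate the variants.
        return 0.0, 1.0
    z = (conv_b / n_b - conv_a / n_a) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# The "one visitor per page" situation from the quote: one signup versus none.
z, p = two_proportion_z_test(conv_a=1, n_a=1, conv_b=0, n_b=1)
print(f"z = {z:.2f}, p = {p:.2f}")
```

Even in this toy case the p-value stays well above the usual 0.05 threshold, which matches the intuition that a single visitor per page proves nothing. Note that a correct test like this one does not protect you from the surprise described in the quote; how often you look at the result matters too.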
Many online A/B testing frameworks let you automatically stop or conclude your test at the ‘moment of significance’, and there is very little discussion of false positive rates. In academic research, however, failing to control for testing errors and false positive rates is considered a huge faux pas. Mats Einarsen knows this trade-off all too well and writes the following in ‘Is your A/B testing effort just chasing statistical ghosts?’ on the Booking.com dev blog:
For anyone running A/B tests in a commercial setting there is little incentive to control your false positives. Why make it harder to show successful changes just to meet a standard no one cares about anyway? (…) It actually matters, and matters a lot if you care about your A/B experiments and what you learn from them.
To demonstrate how much it matters, he ran a simulation of how much impact you should expect repeated testing errors to have on your success rate. You can download the script and alter the variables to fit your metrics. The outcome? Out of 1,000 controlled experiments in which it was known that there was no difference between the variants, 771 reached 90% significance at one point or another. Moreover, 531 out of 1,000 reached 95% significance at some point.
The conclusion: if you use an A/B testing package that automatically turns on experiments based on significance alone, you’ll probably see your success rate go up regardless of the quality of your changes.
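To get a feel for this effect yourself, here is a rough Python sketch of the same kind of experiment. It is not Mats Einarsen's script: the function names, the 10% conversion rate, the number of visitors and the check interval are all assumptions picked to keep the run short, so the counts it prints will not match his. Both variants share the same true conversion rate, so every "significant" result is a false positive.

```python
import math
import random


def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value of a two-proportion z-test (normal approximation)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa(experiments=200, visitors_per_arm=2000, check_every=100,
                rate=0.10, alpha=0.05, seed=1):
    """Run A/A experiments and compare 'stop at first significance' with one final look.

    Returns (share significant at ANY check, share significant at the LAST check).
    visitors_per_arm should be a multiple of check_every so the last check
    coincides with the end of the experiment.
    """
    rng = random.Random(seed)
    hits_when_peeking = 0
    hits_at_end = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        ever_significant = False
        significant_now = False
        for n in range(1, visitors_per_arm + 1):
            # Both arms convert with the same probability: there is no real difference.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % check_every == 0:
                significant_now = p_value(conv_a, n, conv_b, n) < alpha
                ever_significant = ever_significant or significant_now
        hits_when_peeking += ever_significant
        hits_at_end += significant_now
    return hits_when_peeking / experiments, hits_at_end / experiments


peeking, at_end = simulate_aa()
print(f"significant at some check: {peeking:.1%}, significant at the end only: {at_end:.1%}")
```

The share of experiments that look significant at some check can never be lower than the share that is significant at the final check, and with repeated checks it is typically much higher. Lowering `check_every` (looking more often) tends to push the first number up further.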
How many subjects are needed for an A/B test?
The question of how many subjects are needed to achieve relevant results from an A/B test is one that puzzles many startup founders, as they usually have too small a user base or too few visitors to reach significance.
There are tools that help you figure out your sample size, like Evan Miller’s Sample Size Calculator. In statistics, G-tests are likelihood-ratio tests of statistical significance. You can find a G-test Calculator on elem.com.
Evan Miller wrote an industry-standard article on ‘how not to run an A/B test’ on April 18, 2010:
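If you want to see roughly what such calculators do, the sketch below shows both ideas in Python. The sample size function uses a common textbook approximation for comparing two proportions, and the G-test is the standard likelihood-ratio statistic for a 2x2 table; neither is guaranteed to match the exact formulas behind Evan Miller's calculator or the one on elem.com. The function names and the default 5% significance level and 80% power are assumptions made for the example.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, expected: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant to detect baseline -> expected.

    Textbook normal-approximation formula for a two-sided test of two proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    mean_rate = (baseline + expected) / 2
    numerator = (z_alpha * math.sqrt(2 * mean_rate * (1 - mean_rate))
                 + z_power * math.sqrt(baseline * (1 - baseline)
                                       + expected * (1 - expected))) ** 2
    return math.ceil(numerator / (expected - baseline) ** 2)


def g_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (G, p-value) for a 2x2 table of converted / not converted per variant."""
    total = n_a + n_b
    total_conv = conv_a + conv_b
    observed = [conv_a, n_a - conv_a, conv_b, n_b - conv_b]
    expected = [n_a * total_conv / total, n_a * (total - total_conv) / total,
                n_b * total_conv / total, n_b * (total - total_conv) / total]
    # G = 2 * sum(O * ln(O / E)); empty cells contribute nothing.
    g = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    g = max(g, 0.0)  # guard against tiny negative rounding errors
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return g, math.erfc(math.sqrt(g / 2))


# Made-up scenario: 10% baseline conversion, hoping to detect a lift to 12%.
print(sample_size_per_variant(0.10, 0.12), "visitors per variant")
```

For a lift from 10% to 12% this lands in the thousands of visitors per variant, which is exactly why small user bases struggle: the smaller the difference you want to detect, the more visitors you need.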
If you run experiments: the best way to avoid repeated significance testing errors is to not test significance repeatedly. Decide on a sample size in advance and wait until the experiment is over before you start believing the “chance of beating original” figures that the A/B testing software gives you. “Peeking” at the data is OK as long as you can restrain yourself from stopping an experiment before it has run its course.
Set your sample size before you start your A/B tests, and don’t blindly trust the significance calculations provided by your A/B tool. Some of them tell you “It’s significant now”, but that’s actually not true. Reach your sample size first, then (and only then) have a look at the data. To cite our very own Erik Bovee (SpeedInvest):
The plural of anecdote is probably not data.
Get a data-driven approach to product development right for your company
A/B tests are great to justify small iterations on your site, if you get the following prerequisites right:
- Your sample size (and thus user base) should be large enough. Testing a green versus a blue button with a hundred users will never result in an outcome you should act upon.
- Start your experiment with a clear (and small) hypothesis. “The green button will make everything better” is not a clear hypothesis.
- Don’t – ever – stop your experiment prematurely.
- Evaluate your tests to fix testing errors and avoid false positives.
- Don’t take it personally when data disagrees with intuition. And certainly don’t just go for option B anyway, regardless of the outcome of your A/B tests. What’s next? Anarchy?