Over the past few years I’ve done quite a bit of research on A/B testing. Initially I wanted to understand how to apply it within my own work. I quickly became interested in designing tests and interpreting results in various scenarios. Then I wanted to understand the math behind the tests. Then I really geeked out and started to reverse engineer some of the online calculators that are available.
The math of these tests is fascinating to me because there’s a bit of complexity involved. How do you decide when to end a test? How much confidence can you have that the result is correct? However complex the details, though, there is ultimately an intuition behind them.
The correct way to do an A/B test (using frequentist inference methods; Bayesian is another topic) is to start with a hypothesis and then calculate the required sample size. You start with a measure of your current metric, design a new version, and form a hypothesis about how much the new version will improve that metric. From there you can determine how many visitors your split test needs before it can reach a given significance level.
For example, suppose I have a web page with a 5% success rate (clicks through to some other page). I want to improve that, so I try a different message. If my hypothesis is that I’ll see a 5% relative increase in performance, that would improve my success rate to about 5.25% and require roughly 240,000 visitors to determine whether or not I’ve achieved my goal with 95% confidence. For most websites, that’s going to take a long time to complete. Long enough that it wouldn’t make sense to spend the time on it for such little gain. Plus, in this scenario I know that I’ll be showing the inferior version to at least 120,000 visitors.
If I raise my hypothesis to a 50% improvement, that places my target success rate at 7.5%. I only need about 3,000 visitors to reach 95% confidence in this second test. That’s a much more manageable number. Even on a site with meager traffic I can complete this test in a reasonable amount of time.
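The sample sizes in these two examples can be reproduced with the standard two-proportion sample size formula. Here’s a minimal sketch (my own, not taken from any particular calculator), assuming a two-sided 5% significance level and 80% power, which are the conventional defaults in most online calculators:

```python
from math import ceil, sqrt
from statistics import NormalDist

def ab_test_visitors(baseline, relative_lift, alpha=0.05, power=0.80):
    """Total visitors needed for a two-proportion z-test.

    baseline:      current success rate (e.g. 0.05 for 5%)
    relative_lift: hypothesized multiplier (e.g. 1.50 for a +50% improvement)
    """
    p1 = baseline
    p2 = baseline * relative_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p1 + p2) / 2                          # pooled success rate
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    per_arm = ceil(numerator / (p1 - p2) ** 2)     # visitors per variation
    return 2 * per_arm                             # both variations combined

print(ab_test_visitors(0.05, 1.05))  # hypothesized +5%:  ~244,000 visitors
print(ab_test_visitors(0.05, 1.50))  # hypothesized +50%: ~3,000 visitors
```

Exact totals vary a bit from calculator to calculator depending on the power setting and whether a continuity correction is applied, but the orders of magnitude match the numbers above.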
This doesn’t mean that I should just start hypothesizing that all of my tests will achieve a 50% improvement. In order to generate those types of results I need to start testing things that are very different from each other. In the online world, changing the color of a button isn’t going to make a 50% difference. You need to get creative and try different layouts, offers and imagery. Test small cartoons against large high resolution images. Test selling directly online versus having people go through a sales team. Test offering great sales support versus having a thorough knowledge base.
I’m also not suggesting that you shouldn’t test button colors. I do recommend testing everything that influences the behaviors you want to encourage. I just wouldn’t start with that test; since it’s likely a small difference, I’d hold it until you’ve tested other, more significant factors. I also wouldn’t settle for a 5% increase.
Notice that in the above example I can complete 80 versions of the second test (240,000 ÷ 3,000) within the same time frame as the first. If I set my goal to a 100% improvement, I can reach significance on more than 260 individual tests in the same time frame as my original test aiming for a 5% improvement. Now we’re getting somewhere. You absolutely should test 260 different ways to improve the results you’re seeing.
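As a back-of-the-envelope check on that claim, here is the same two-proportion calculation applied to a 100% improvement (5% → 10%), again assuming 80% power at a two-sided 5% significance level:

```python
from math import ceil, sqrt
from statistics import NormalDist

# Visitors needed to detect a 5% -> 10% jump (a 100% improvement),
# assuming 80% power at a two-sided 5% significance level.
p1, p2 = 0.05, 0.10
z = NormalDist().inv_cdf
p_bar = (p1 + p2) / 2
per_arm = ceil((z(0.975) * sqrt(2 * p_bar * (1 - p_bar))
                + z(0.80) * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
               / (p1 - p2) ** 2)
visitors_per_test = 2 * per_arm

print(visitors_per_test)             # ~870 visitors per test
print(240_000 // visitors_per_test)  # ~275 tests in the original budget
```

Roughly 870 visitors per test means the original 240,000-visitor budget covers about 275 such tests, comfortably more than 260.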
Practically speaking, what does it mean if I aim for a 50% increase in performance and only achieve a 5% bump? Not much, really. The test wasn’t sized to detect a difference that small, so the result isn’t statistically significant. Use your own judgment on whether or not to keep the new version. This isn’t the big difference you’re looking for, so make a choice (via whatever method) and move on to your next test.
As you iterate through test after test to improve results, an interesting thing will happen. You’ll end up with a product, or website, or marketing campaign that you didn’t envision at the start. This is because you haven’t been the designer; your customers have. You move forward in stages without a defined path. And you’re still making progress.
These principles aren’t new and are certainly being applied in marketing and business. I’d argue that they also apply outside of work, in less tangible areas of our lives. If you want to start saving more money, test not eating out versus giving up cable TV. See which one works best for you. As you experiment within your personal life you don’t need an Excel spreadsheet, but I do recommend quantifying the results somehow.
A/B testing is one of the coolest aspects of online marketing, in my estimation. You start with a few ideas and work your way forward to success through iterations, improving and gaining confidence as you go. It’s amazing to see your work morph into something that people engage with. I say test all your big ideas, and look for big successes.