
How many A:B tests do I have to run before it is meaningful? July 21, 2008

Posted by jeremyliew in A:B testing, product management.
trackback

Inside Facebook has a good post about how not to screw up your A:B testing; it's a useful reminder of how many tests you need to run before you know the results are statistically significant.

The author notes:

How many [tests] do we need to declare a statistically significant difference between [a design leading to action success rate] of p1 and one of p2? This is readily calculable:

* number of samples required per cell = 2.7 * (p1*(1-p1) + p2*(1-p2))/(p1-p2)^2

(By the way, the pre-factor of 2.7 has a one-sided confidence level of 95% and power of 50% baked into it. These have to do with the risk of choosing to switch when you shouldn’t and not switching when you should. We’re not running drug trials here so these two choices are fine for our purposes. The above calculation will determine the minimum and also the maximum you need to run.)

Thus, if you did this number of tests and found that the difference in action success was greater than (p1-p2), then you would have a 95% confidence level that the design being tested is responsible for the increase in success rate, and you would move to a new best practice.
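The formula above is easy to turn into a quick calculator. Here's a minimal sketch (the function name is my own; the 2.7 prefactor carries the one-sided 95% confidence and 50% power mentioned in the quote):

```python
def samples_per_cell(p1, p2, prefactor=2.7):
    """Minimum samples per test cell to detect a shift in action
    success rate from p1 to p2, using the rule-of-thumb prefactor
    (one-sided 95% confidence, 50% power) from the post."""
    return prefactor * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

# The post's example: detecting a lift from 0.5% to 0.575%
n = samples_per_cell(0.005, 0.00575)
print(round(n))  # roughly 51,000 per cell, in line with the ~52k cited below
```

Plugging in the post's own numbers reproduces the figure given later in the article.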

The author reminds developers to adhere to A:B testing best practices, including:

# Running the two cells concurrently
# Randomly assigning each user to a cell and making sure they stay in that cell for the duration of the test
# Scheduling the test to neutralize time-of-day and day-of-week effects
# Serving the test only to users from countries of interest

One thing that immediately emerges from this formula is that you don't need that many tests to determine whether a new design is working. For example, testing a design expected to increase the success rate from 0.5% to 0.575% requires only about 52,000 tests per cell. For apps and websites operating at scale, this does not take very long.

The danger is that, because of the overhead of putting up and taking down tests, "bad" test designs stay up for too long, exposing too many users to a worse experience than usual. While some people consider A:B testing to mean splitting users into equal groups, there is no such requirement. I'd advise developers to size their test cells at x% of their total traffic, where x% is a little more than what's required to hit the minimums calculated above over a week. This neutralizes time-of-day and day-of-week effects, minimizes the overhead of test setup, and ensures that not too many users are exposed to bad designs. It also allows multiple, independent tests to be run simultaneously.
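That sizing advice can also be sketched in a few lines. This is a hypothetical illustration, not anything prescribed in the post: the traffic figure and the 10% "a little more" buffer are assumptions of mine.

```python
def cell_fraction(samples_needed, weekly_traffic, headroom=1.1):
    """Fraction of total traffic to route to one test cell so the
    per-cell minimum is met over one full week. The headroom=1.1
    default is an illustrative 10% buffer ("a little more than
    required"), capped at 100% of traffic."""
    return min(1.0, headroom * samples_needed / weekly_traffic)

# Hypothetical example: ~51k samples needed, 5M weekly visitors
frac = cell_fraction(51_000, 5_000_000)
print(f"{frac:.2%}")  # a bit over 1% of traffic per cell
```

Running each cell at roughly 1% of traffic for a week would hit the minimum while leaving the other ~98% of users on the current design, and leaving room for other concurrent tests.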

Comments»

1. todd sawicki - July 21, 2008

Jeremy –
In my experience running loads of A/B tests with websites, landing pages and ads, I have rarely found that waiting for highly statistically significant traffic (i.e. 95% CI) is worth it. Trends tend to emerge with very low levels of traffic, and to your point, it's usually not worth waiting for 95% CI. With a consumer site, I have yet to see an 80% CI result differ meaningfully from a 95% CI result. My advice, from the dozens and dozens of examples I've run: as little as a few thousand page views or impressions can tell you all you need, and if a noticeable difference emerges – DON'T WAIT.
– todd

2. What is multi-vari analysis? | learnsigma - September 20, 2008

[…] How many A:B tests do I have to run before it is meaningful? […]

