The Challenge of Running Concurrent A/B Tests
When you are testing, coming up with ideas can be easy, but how you implement them can be hard. In a previous post, How to Set up an A/B Test, we covered the basics of how to set up and run a single test, but what if you wanted to run more than one A/B test at a time? How do you do that without compromising the results?
When running multiple A/B tests, creating a hypothesis is still key to getting successful results. Once you have a hypothesis, you know the ultimate goal of your test, which allows you to develop your testing strategy.
Let’s take an easy example: running multiple subject line tests at once. After creating your hypothesis, you can use Return Path’s sample size calculator to get the size of your test and control groups. Then you break your list up into as many random, non-overlapping test and control groups of that size as it allows, and run the subject line tests concurrently.
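To make the list-splitting step concrete, here is a minimal Python sketch. It assumes your list is simply a collection of subscriber IDs and that you already have a group size from a sample size calculator. The function name `split_into_groups`, the 10,000-subscriber list, and the group size of 1,600 are illustrative assumptions, not part of any particular tool.

```python
import random

def split_into_groups(subscribers, group_size, seed=None):
    """Shuffle the list, then cut it into non-overlapping groups of group_size.

    Hypothetical helper for illustration. Subscribers left over after the
    last full group are simply not assigned to any group.
    """
    rng = random.Random(seed)      # seed is optional; set it for a repeatable split
    shuffled = list(subscribers)   # copy so the original list is not reordered
    rng.shuffle(shuffled)
    n_groups = len(shuffled) // group_size
    return [shuffled[i * group_size:(i + 1) * group_size] for i in range(n_groups)]

# Illustrative numbers only: 10,000 subscribers, groups of 1,600 each.
subscribers = [f"subscriber_{i}" for i in range(10_000)]
groups = split_into_groups(subscribers, group_size=1_600, seed=42)

control_group = groups[0]   # one group keeps the usual subject line
test_groups = groups[1:]    # each remaining group gets one subject line variant
```

With these numbers the list yields six full groups, so you could run up to five subject line variants against one control. The 400 subscribers left over are not placed in any group.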
But what if you wanted to run two different types of tests at once, say a subject line test and a frequency test? These two tests are extremely different, and if they are not set up properly, running them concurrently can skew the results of both.
Timing for subject line vs. frequency testing
At a high level, a subject line test can generally be done in one to three days. Frequency testing, on the other hand, takes about three to four weeks to run. This is because with frequency tests, it's important to give people enough time to adjust to the new send frequency and then look at whether they behave the same or differently than they did before the frequency was changed.
For example, take a subscriber who is used to receiving mail from you three times a week, but reads only one of those emails. That same subscriber is then made part of a frequency test where their frequency is dropped to one email per week. For the first two weeks the subscriber doesn’t read any email, but in the third, fourth, and fifth weeks the subscriber reads the one email that is sent each week. This is because it takes people time to recognize that the email cadence has changed and to adjust their reading pattern accordingly. One email per week may be the optimal frequency for this subscriber, but at the beginning of the test period they didn't read any email because they were waiting for later emails. You will therefore be able to read the results only after people have recognized the new cadence; judging too quickly will lead to an incorrect conclusion.
Choosing sample groups for concurrent tests
Ideally, when running a subject line test at the same time as a frequency test, each test would have its own test and control groups, with no subscribers shared between the two tests. But this isn’t always possible. Say you have a frequency test running, but you also want to run a subject line test, and the only subscribers available to you are the people who are already in the frequency test. In order to avoid biasing the frequency test, you would have to use stratified sampling to create the subject line test and control groups. Stratified sampling is used when you know your population contains at least two distinct groups and you want a representative sample from each. In this case, the two groups are the test and control groups from the frequency test.
Say your frequency test and control groups have 5,000 subscribers each. If the subject line sample size is 1,600 for each of the test and control groups, then to create the subject line test group you would choose a random sample of 800 from the frequency test group and a random sample of 800 from the frequency control group. You would then repeat this for the subject line control group, drawing only from subscribers who have not already been selected. This ensures that even though the subject line test will have an effect on the frequency test, it should have a balanced effect on both the frequency test and control groups.
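The stratified draw described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frequency groups are plain lists of subscriber IDs, and the function name `stratified_subject_line_groups` and the ID format are hypothetical.

```python
import random

def stratified_subject_line_groups(freq_test, freq_control, per_stratum, seed=None):
    """Build subject line test/control groups that draw equally from each
    frequency group (the two strata).

    Hypothetical helper for illustration. Each stratum contributes
    per_stratum subscribers to the subject line test group and a different
    per_stratum subscribers to the subject line control group.
    """
    rng = random.Random(seed)
    groups = {"subject_test": [], "subject_control": []}
    for stratum in (freq_test, freq_control):
        # One draw without replacement, split in half, so nobody lands in both groups.
        picked = rng.sample(stratum, 2 * per_stratum)
        groups["subject_test"] += picked[:per_stratum]
        groups["subject_control"] += picked[per_stratum:]
    return groups

# Numbers from the example: 5,000 subscribers per frequency group, and
# 1,600 per subject line group, so 800 come from each stratum.
freq_test = [f"freq_test_{i}" for i in range(5_000)]
freq_control = [f"freq_control_{i}" for i in range(5_000)]
groups = stratified_subject_line_groups(freq_test, freq_control, per_stratum=800, seed=7)
```

Each subject line group ends up with 1,600 subscribers, half from each frequency group, and the 3,400 subscribers remaining in each frequency group are left out of the subject line test entirely.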
Running concurrent tests can get complicated, and even with the best-laid plans, tests can be corrupted and produce unexpected results. When running basic A/B tests, the best practice is to keep separate test and control groups for each test. If tests must overlap, make sure you understand all the possible implications, and take stratified samples to keep the tests as clean as possible. The only thing worse than having a test with inconclusive results is coming to a wrong conclusion because a test was corrupted.
Check back soon for the next blog post in this series!