A/B testing has had a profound impact on many fields, including medicine, agriculture, manufacturing, and advertising. In software development, A/B test experiments provide a valuable way to assess the impact of new features on customer behavior. In this series, we will describe the technical and statistical aspects of Twitter's A/B testing system.

 

This article, the first in the series, looks at why we do A/B testing and how to avoid its pitfalls.

 

Note: This article originally appeared on the Twitter blog and was translated by InfoQ China with permission from the author.

 


Experimentation is at the heart of Twitter's product development cycle. This culture of experimentation is underpinned by Twitter's heavy investment in tools, research, and training, which ensures that feature teams can seamlessly and rigorously test and validate their ideas.

 

The scale and variety of Twitter's experiments are enormous, ranging from subtle UI/UX changes, to new features, to improvements to machine learning models. We like to think of experimentation as an endless learning loop:

Hypothesis building: Propose ideas for new features or suggest improvements to existing features.

Define success metrics: Assess the "opportunity size" — the number of users affected by the change. Formally define the metrics that indicate success or failure of the experiment, and consider which tradeoffs are acceptable.

Test the hypothesis: Implement the proposed change, add the corresponding logging instrumentation, and run sanity checks to ensure the experiment is configured correctly.

Learn: Review the data collected during the experiment, draw lessons from it, and share them with other Twitter teams.

Ship: After collecting the data, determine whether the experiment has validated the hypothesis and decide whether or not to ship the change.

Build the next hypothesis: Form new hypotheses for further improvements based on what the experiment revealed.

A/B testing, decision making and innovation

Twitter's Product Instrumentation and Experimentation (PIE) team has done a great deal of experimentation. A/B testing has many benefits, but it also has many well-known pitfalls, and its results are often unexpected and counterintuitive. How do we avoid these traps? When should we recommend A/B testing for a feature or a proposed change? How do we stay agile in decision making while remaining rigorous when taking significant risks?

The benefits of A/B testing and incremental change

A culture of A/B testing feature changes focuses on small incremental gains, with most experiments yielding single-digit percentage improvements, sometimes only a fraction of a percentage point. So, some people ask, what's the point? Why not do something more impactful and revolutionary?

 

This much is true: when an experiment moves a metric at all, it usually does so only minimally; an experiment that improves a core metric by a few percentage points for most users is considered a remarkable success.

 

This is not a shortcoming of A/B testing itself; it is because it is hard to drastically improve the metrics of a mature product. Many ideas that look like home runs turn out not to produce any improvement at all: humans are remarkably bad at predicting what will work (see "Seven Rules of Thumb for Web Site Experimenters" for more on this). Most of the time, poor A/B test results let us detect early that what looks like a good idea may actually be a bad one. We would rather get the bad news as quickly as possible and go back to the drawing board; that is why we run experiments.

 

A/B testing is also a way to ensure that good ideas don't die prematurely and get a chance to be fully developed. When we really believe in an idea and the initial results don't meet our expectations, we can keep improving the product until it does meet them and can be released to millions of people. The alternative is to build some feel-good feature, release it, move on to other new ideas, and a year later have someone realize that nobody is using the feature, so it just quietly dies.

 

As we work on prototypes, iterating rapidly and measuring the impact of proposed changes lets our teams incorporate implicit user feedback into the product early. We can release a change, study what improves and what doesn't, form hypotheses about changes that would improve the product further, release those, and continue until we have something that can be rolled out to a wider audience.

 

Some might consider this kind of incremental change unglamorous. Of course, shipping "big ideas" sounds much better than making small improvements. But if you think about it, many small changes add up and compound. A product improvement strategy that shuns incremental change is largely bad policy: a good financial portfolio balances predictable, lower-return bets with risky, high-return ones, and managing a product is no different in this respect.
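To make the compounding point concrete, here is a minimal sketch; the baseline and the individual lifts are made up for illustration and are not real experiment results.

```python
# Illustrative arithmetic only: the lifts below are invented, not real data.
baseline = 100_000                              # e.g. daily count of some core user action
wins = [0.01, 0.02, 0.005, 0.03, 0.015]         # five shipped changes, each a small relative lift

value = baseline
for i, lift in enumerate(wins, start=1):
    value *= (1 + lift)
    print(f"after change {i}: {value:,.0f} ({value / baseline - 1:+.1%} vs. baseline)")

# Five "minor" wins of 0.5-3% each compound to roughly +8.2% overall,
# an aggregate movement that is hard to get from any single big bet.
```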

 

That said, there are plenty of things we can't, or shouldn't, test this way. Some changes are designed to create network effects that user-bucket-based A/B tests do not capture (although other techniques exist to quantify such effects). Certain features simply break when only a random percentage of people get them: group DMs, for example, would not be a usable feature in a simple A/B test, because those lucky enough to receive it would want to message people who didn't, making it largely useless. Other features are completely orthogonal: launching a new app like Periscope, for example, is not something you A/B test within Twitter. But once such an app is rolled out, A/B testing becomes an important way to drive incremental change within it, both the kinds that are easily measured and the kinds that are not.
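For context on what "user-bucket-based" means here, the sketch below shows a common, generic way such per-user assignment is done (a deterministic hash of the user ID); it is an assumed illustration, not a description of Twitter's internal system, and the names are hypothetical.

```python
import hashlib

def bucket(user_id: str, experiment: str, treatment_pct: float) -> str:
    """Deterministically assign a user to 'treatment' or 'control'.

    Hashing (experiment, user_id) gives each user a stable pseudo-random
    position in [0, 1); users below treatment_pct see the new feature.
    Generic illustration only, not Twitter's actual implementation.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    position = int(digest[:8], 16) / 0xFFFFFFFF
    return "treatment" if position < treatment_pct else "control"

# Per-user assignment like this is what breaks a feature such as group DMs:
# a treated user may try to message control users who never got the feature.
print(bucket("user-123", "group_dm_v1", treatment_pct=0.05))
```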

 

Another category of changes, major new features, is tested through user research on internal builds but released to all users at a single, high-profile moment for strategic marketing reasons. As an organization, we make this call when we believe it is in the best interest of both the product and the users. Even though an incremental rollout might produce a more polished initial release that more users try and adopt, we judge that we gain more from a big, coordinated launch; this is a tradeoff for the product owner to make. Once such a new feature is released, do we A/B test incremental changes to it? Of course! As ideas mature, we use sound scientific principles to guide their evolution, and experimentation is a key part of that process.

Experimental reliability

Now that we've made the case for running experiments, let's discuss how to avoid the pitfalls. Configuring and analyzing experiments is complex, and even ordinary human behavior can easily lead to bias and misreading of results. Several practices can reduce that risk.

Require a hypothesis

Experimentation tools typically surface large amounts of data and often let experimenters define custom metrics to measure the impact of a change. But this invites two of the most subtle traps in A/B testing: cherry-picking and HARKing (hypothesizing after the results are known), that is, selecting from many data points only the metrics that support your hypothesis, or looking at the data and adjusting the hypothesis to match the results. At Twitter, it's not uncommon for an experiment to collect hundreds of metrics, each of which can be broken down along many dimensions (user attributes, device types, countries, and so on), producing thousands of observations from which you can pick and choose to fit almost any story.
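A small simulation makes the danger concrete: even when a change truly does nothing, scanning thousands of metric/segment combinations at the usual 5% significance level will "find" a long list of apparent wins. The numbers here are generic, not Twitter data.

```python
import random

random.seed(0)

# Suppose a change has no real effect, and each of 2,000 metric/segment
# combinations is tested independently at the 5% significance level.
observations = 2_000
false_positives = sum(random.random() < 0.05 for _ in range(observations))

print(f"{false_positives} of {observations} comparisons look 'significant' by chance alone")
# Roughly 100 spurious "wins" are expected, plenty of material for a cherry-picked
# story unless the hypothesis and primary metrics were fixed up front.
```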

 

One way we steer experimenters away from cherry-picking is to ask them to explicitly specify, during the configuration phase, the metrics they intend to improve. Experimenters can track as many metrics as they want, but only a few can be labeled in this way, and the tool highlights those metrics on the results page. Experimenters are still free to explore all the other data collected and to form new hypotheses, but the original claim stays fixed and is easy to check.
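As a hypothetical sketch of what "explicitly specifying the metrics up front" might look like in an experiment configuration; the field names are invented for illustration and are not taken from Twitter's tooling.

```python
# Hypothetical experiment configuration; all field names are invented.
experiment_config = {
    "name": "compose_button_contrast_v2",
    "hypothesis": "A higher-contrast compose button increases Tweets sent per user.",
    "audience_pct": 5.0,                   # share of users exposed to the change
    "primary_metrics": [                   # declared before launch; highlighted in results
        "tweets_sent_per_user",
    ],
    "guardrail_metrics": [                 # should not regress beyond agreed thresholds
        "daily_active_usage",
        "report_rate",
    ],
    "tracked_metrics": "all_defaults",     # everything else stays exploratory
}
```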

The experimental process

No matter how good the tool, a poorly configured experiment will still produce unreliable results. At Twitter, we have invested in an experiment process that improves the odds that experiments are set up and run correctly. Most of the steps in this process are optional, but we have found that making it available and well documented greatly reduces time lost to rerunning experiments to collect more data, waiting out another App Store release cycle, and so on.

 

All experimenters are asked to document their experiments. What are you changing? What outcome do you expect? What is the desired "audience size" (the percentage of users who will see the feature)? Collecting this information not only ensures that experimenters have thought these questions through, but also lets us build an institutional learning database: an official record of the experiments that have been run and their results, including the negative ones, which later experiments can draw on.

 

Experimenters can also call on experiment shepherds: experienced engineers and data scientists who review hypotheses and proposed metrics to reduce the chance of an experiment going wrong. This step is optional and the shepherds' recommendations are not binding. The program has received a lot of positive feedback from participants, since people gain more confidence that their experiments are configured correctly, track the right metrics, and can be analyzed properly.

 

Some teams also hold weekly meetings to review experiment results and decide what should and should not be released to a wider audience. This helps counter cherry-picking and misinterpretation of statistical significance. It is important to note that this is not a "give me a reason to say no" meeting: we have shipped experiments that came out "red" and declined to ship ones that came out "green". What matters is being honest and clear about the expectations and results of a change, rather than tolerating stagnation or rewarding only short-term gains. Introducing these reviews has noticeably improved the overall quality of the changes we release. They are also interesting meetings, because we get to see all the work other teams are doing and what people are thinking about the product.

 

Another practice we use where possible is "holdbacks": push the feature to 99% of users (or some other high percentage) and watch how key metrics for them diverge over time from the 1% who are held back. This lets us iterate and release quickly while keeping an eye on the long-term impact of our changes, and it is also a good way to verify that the benefits seen in the experiment are actually realized.
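A minimal sketch of the holdback idea follows; the function name, the metric, and the per-user values are all fabricated for illustration, and this only shows the comparison, not Twitter's actual pipeline.

```python
from statistics import mean

def holdback_report(daily_metric: dict[str, list[float]]) -> None:
    """Compare one day's per-user metric values between launched users and the holdback."""
    launched = mean(daily_metric["launched"])
    holdback = mean(daily_metric["holdback"])
    print(f"launched: {launched:.2f}  holdback: {holdback:.2f}  "
          f"lift: {launched / holdback - 1:+.1%}")

# One day of fabricated per-user values for some key metric:
holdback_report({
    "launched": [3.1, 2.8, 3.4, 3.0, 2.9],   # ~99% of users, feature on
    "holdback": [2.7, 2.9, 2.6, 2.8, 2.75],  # ~1% of users, feature held back
})
```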

Experimentation training

One of the most effective ways to make sure experimenters are aware of the pitfalls is to train them. Twitter data scientists run several courses on experimentation and statistical intuition, which all new engineers take in their first few weeks at the company. The goal is to familiarize engineers, PMs, EMs, and other roles with experiment procedures, caveats, pitfalls, and best practices. Raising awareness of experiment quality and pitfalls helps us avoid wasting time on preventable errors and misunderstandings, letting people reach insights faster and improving both pace and quality.

What's next

In later articles, we will describe how our experimentation tool, DDG, works, and dig into some of the interesting statistical problems we have encountered: detecting biased buckets, running sanity checks with (or without) a second control group, automatically determining appropriate bucket sizes, session-based metrics, and handling outliers.

Acknowledgements

Thanks to Lucile Lu, Robert Chang, Nodira Khoussainova and Joshua Lande for their feedback on this article. Many people have contributed to the ideas and tools behind experimentation at Twitter. We would like to give special thanks to Cayley Torgeson, Chuang Liu, Madhu Muthukumar, Parag Agrawal, and Utkarsh Srivastava.

 

Read the original English post: https://blog.twitter.com/2015/the-what-and-why-of-product-experimentation-at-twitter-0

 

Thanks to Guo Lei for proofreading this article.

 

This article is reprinted from InfoQ with permission. Original link: http://www.infoq.com/cn/articles/twitter-ab-test-practise-part01