How can A/B testing help you get more out of your app

  • An introduction to in-app A/B testing: How A/B testing can help you get more out of your app
  • By Gavin Kinghall
  • Translated by: the Gold Miner Translation Project
  • Permalink to this article: github.com/xitu/gold-m…
  • Translator: mnikn
  • Proofread by: Swants, Winry01

A/B testing is a controlled experimental method for comparing two or more versions of something against a hypothesis that can be confirmed or rejected. It produces reliable results by deriving each test version from the same original version and measuring them under the same conditions. A/B tests are most effective when they run in real conditions and the participants do not know they are part of a test.

To build a representative sample for each version, the A/B test platform randomly assigns each user to version A, to version B, or excludes them from the test. It then ensures that every user has a consistent experience for the duration of the test (always A or always B) and passes additional metadata to the analytics platform so the impact on your metrics can be measured. Once the metrics have been analyzed, the best-performing version is chosen, and you can use the A/B test platform to gradually roll that winning version out to all users.
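To make the assignment mechanics concrete, here is a minimal Kotlin sketch of how such a platform might bucket users deterministically so that each user always sees the same version; the hashing scheme, the bucket count, and the 5% / 5% / 90% split are illustrative assumptions, not the behaviour of any particular product.

```kotlin
import java.security.MessageDigest

enum class Variant { EXCLUDED, A, B }

// Hash the user ID together with the experiment ID so the same user always lands in
// the same bucket for this experiment, while buckets stay independent across experiments.
fun bucketFor(userId: String, experimentId: String, buckets: Int = 100): Int {
    val digest = MessageDigest.getInstance("SHA-256")
        .digest("$experimentId:$userId".toByteArray())
    // Use the first four bytes as a non-negative integer, then reduce it to a bucket.
    val value = digest.take(4).fold(0) { acc, b -> (acc shl 8) or (b.toInt() and 0xFF) }
    return (value and Int.MAX_VALUE) % buckets
}

// 10% of users take part in the test: buckets 0-4 see version A, buckets 5-9 see
// version B, and the remaining 90% keep the existing experience.
fun assignVariant(userId: String, experimentId: String): Variant =
    when (bucketFor(userId, experimentId)) {
        in 0..4 -> Variant.A
        in 5..9 -> Variant.B
        else -> Variant.EXCLUDED
    }
```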

For example, you might hypothesize that bottom navigation in your app will drive more user engagement than tabs. You can design an A/B test that compares tabs (version A) with bottom navigation (version B). Your A/B test platform then generates a sample of users who are randomly assigned to version A or version B, and each user keeps seeing the same version throughout the test. At the end of the test, user engagement for version A is compared with user engagement for version B to see whether version B shows a statistically significant improvement. If it does, you have data to support changing the navigation style to bottom navigation and rolling that version out to all users.

Left: version A, tabs. Right: version B, bottom navigation.
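On the app side, applying the assigned version can be as simple as inflating a different layout. The sketch below is hypothetical: the "use_bottom_nav" flag and the layout resource names are assumptions, and in practice the flag would come from the A/B test platform's client library (see step 2).

```kotlin
import android.os.Bundle
import androidx.appcompat.app.AppCompatActivity

class MainActivity : AppCompatActivity() {

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        // Hard-coded source only to keep the sketch small; a real app would read the
        // flag from the A/B test platform's client library instead.
        val useBottomNav = intent.getBooleanExtra("use_bottom_nav", false)
        setContentView(
            if (useBottomNav) R.layout.activity_main_bottom_nav // version B
            else R.layout.activity_main_tabs                    // version A
        )
    }
}
```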

A note about store listing experiments in the Google Play Console

The Google Play Console also supports A/B testing of your store listing, which I won’t focus on in this article. Store listing experiments let you test different icons, feature graphics, promotional videos, short descriptions, and full descriptions to see whether the changes increase installs of your app. Store listing experiments focus on improving conversions, while the rest of this article discusses in-app A/B testing to improve post-install metrics such as retention, user engagement, and in-app purchase revenue.

In this article, I’ll cover the five key steps of A/B testing in your app:

  1. Establish a hypothesis
  2. Integrate an A/B test platform
  3. Test the hypothesis
  4. Analyze and draw conclusions
  5. Take action

Then, I’ll cover more advanced techniques that can be explored.

Step 1: Establish a hypothesis

A hypothesis offers an explanation for a phenomenon, and an A/B test is a way of determining whether that hypothesis holds. The hypothesis may come from examining existing data, it may be closer to a guess, or it may simply be a prediction (hypotheses about metrics for brand-new features are often predictions). In the navigation example, the hypothesis could be expressed this way: “Using bottom navigation instead of tabs will increase user engagement.” You can then use the test result to decide whether to change the app’s navigation style, based on how that change affects user engagement. It’s important to remember that the sole purpose of the test is to confirm or reject this hypothesis; it cannot tell you, for example, whether bottom navigation has a direct, positive impact on average revenue per user (ARPU).

What is being tested (what is A? What is B?)

The following table lists the most common scenarios to help you decide which versions to test, using our hypothetical navigation experiment as an example.

The “test exclusion” column represents users who do not take part in the test; their behavior does not contribute to the test results. So which scenario should you choose?

Choose between scenario 2 and scenario 3 based on what the hypothesis measures. If the metric is only relevant once the new feature exists (for example, the new feature requires an in-app purchase and the metric is in-app purchase revenue), choose scenario 2. If the metric also applies, and is measurable, before the new feature exists (for example, the new feature is a “favorites” mechanism and the metric is user engagement), choose scenario 3.

Note: In the following sections I’ll use scenario 1 for brevity. The same approach applies to scenarios 2 and 3; simply replace the “new 1” and “new 2” versions with the “existing” and “new” versions.

Who should be tested?

If observed behaviour is known to change because of a factor outside the hypothesis (for example, if behaviour is known to vary by country of residence while the hypothesis itself says nothing about country), you need to either hold that factor constant (test in a single country) or use a representative sample of the entire population (test across all countries).

The size of the representative sample can also be set as a percentage of the total population. For example, the test sample might be 10% of the population, with 5% receiving version A, 5% receiving version B, and the remaining 90% excluded from the test. Those 90% of users only ever see the existing features, never the new ones, and their behavior is excluded from the test metrics.

How long will the tests take?

Maximum time: user behavior usually varies with the time of day, the day of the week, the month, the season, and similar factors. To detect a real difference between versions, you need to balance statistical significance against the needs of the business (the business may not be able to wait until enough data has been collected). If you know that a metric varies over a short cycle, such as the time of day or the day of the week, try to have the test cover at least one full cycle. For metrics that vary over longer periods, it may be better to test for only a few weeks and extrapolate based on how the metric is known to change over time.

Minimum time: a test should run long enough to capture enough data to give statistically significant results. A typical minimum test size is about 1,000 users, but the real requirement depends on the distribution of the metric derived from the hypothesis. To reach significance in a reasonable amount of time, estimate how many users will be available over the intended test period, then choose the sampling percentage that makes the test statistically significant within that period. Some A/B test platforms manage this automatically and can also raise your sampling rate so that tests reach statistical significance faster.
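As a rough way to estimate that minimum, here is a sketch of the standard sample-size formula for comparing two proportions (for example, the share of “engaged” users on version A versus version B). The 1.96 and 0.84 constants correspond to a 5% two-sided significance level and 80% power, and the baseline rate and expected lift are assumptions chosen for illustration.

```kotlin
import kotlin.math.ceil
import kotlin.math.pow

// Estimate how many users each variant needs before a difference between two
// proportions can be detected with the given significance level and power.
fun usersPerVariant(
    baselineRate: Double,
    expectedRate: Double,
    zAlpha: Double = 1.96, // 5% two-sided significance
    zBeta: Double = 0.84   // 80% power
): Int {
    val variance = baselineRate * (1 - baselineRate) + expectedRate * (1 - expectedRate)
    val effect = expectedRate - baselineRate
    return ceil((zAlpha + zBeta).pow(2) * variance / effect.pow(2)).toInt()
}

fun main() {
    // Detecting a lift from 20% to 22% engagement needs roughly 6,500 users per variant;
    // divide that by your expected daily traffic to estimate how long the test must run.
    println(usersPerVariant(baselineRate = 0.20, expectedRate = 0.22))
}
```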

Step 2: Integrate the A/B test platform

Several A/B test platforms already exist, either as stand-alone products or as components of a larger analytics platform, such as Firebase Remote Config combined with Firebase Analytics. Through a client library, the platform sends a set of configuration parameters to the app. The app doesn’t know why it received a particular value, so it doesn’t know which part of a test it is in, or even whether it is in a test at all. The client simply follows the configuration and interprets the values. In the simplest case, the returned parameters are key-value pairs that control whether a given feature is enabled and, if so, which version to activate.
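As a concrete illustration, here is a minimal sketch of what the client-side integration might look like if Firebase Remote Config delivers those key-value pairs; the parameter name "use_bottom_nav" is a hypothetical example.

```kotlin
import com.google.firebase.ktx.Firebase
import com.google.firebase.remoteconfig.ktx.remoteConfig

// Fetch the remotely defined configuration and hand the navigation flag to the caller.
fun fetchNavigationVariant(onReady: (Boolean) -> Unit) {
    val remoteConfig = Firebase.remoteConfig

    // In-app defaults keep the existing behaviour for users who are excluded from the
    // test or who cannot reach the network.
    remoteConfig.setDefaultsAsync(mapOf("use_bottom_nav" to false))

    // Fetch the server-side values, activate them, then read the flag.
    remoteConfig.fetchAndActivate().addOnCompleteListener {
        onReady(remoteConfig.getBoolean("use_bottom_nav"))
    }
}
```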

In more complex cases, where a large amount of the app’s configuration is managed remotely, the app sends parameters to the A/B test platform, which uses them to select a more refined test configuration. For example, if your hypothesis only involves devices with XXXHDPI screen density, the app needs to send its screen density to the A/B test platform.
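One way to supply that extra data in a Firebase-based setup is to record it as an Analytics user property, which server-side targeting conditions can then match against; the property name and the density bucketing below are assumptions for this example.

```kotlin
import android.content.Context
import android.util.DisplayMetrics
import com.google.firebase.analytics.FirebaseAnalytics

// Report the device's screen-density bucket so the test platform can target
// (or exclude) devices that are not covered by the hypothesis.
fun reportScreenDensity(context: Context) {
    val densityDpi = context.resources.displayMetrics.densityDpi
    val bucket = if (densityDpi >= DisplayMetrics.DENSITY_XXXHIGH) "xxxhdpi" else "other"
    FirebaseAnalytics.getInstance(context)
        .setUserProperty("screen_density_bucket", bucket)
}
```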

Don’t reinvent the wheel

Simply pick one of the existing platforms that meets your A/B testing requirements; the habit you need to build is A/B testing itself and data-driven decision making.

Note: Managing large numbers of users, keeping their test state consistent, and distributing test participants fairly are hard problems. There is no need to solve them from scratch.

Of course, you still have to write the code for each version being tested. However, the decision about which version a given user sees at a given time should not sit in the app or in a custom service; it is handled by the A/B test platform, which uses a standard approach to centrally manage multiple tests over the same population at the same time. Implementing a simple A/B mechanism yourself only makes sense if you will only ever run a single test: for roughly the cost of hard-coding two tests, you can integrate an off-the-shelf A/B test platform instead.

Integrate with analytics

Choose an analytics platform that provides detailed test-state information and automatically segments your test population. Tight integration between the two platforms relies on each test’s configuration and versions being passed directly between the A/B test platform and the analytics platform: the A/B test platform assigns a unique reference to each version and shares it with the client and the analytics platform, so the client only needs to pass that reference to the analytics platform rather than the entire version configuration.
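A minimal sketch of that hand-off, assuming Firebase Analytics as the analytics platform: the client tags its analytics data with the variant reference it was given. The property and event names are invented for the example; tightly integrated platforms such as Firebase A/B Testing do this kind of tagging for you.

```kotlin
import android.content.Context
import android.os.Bundle
import com.google.firebase.analytics.FirebaseAnalytics

// Attach the variant reference (not the full configuration) to analytics data so that
// metrics can later be grouped by version.
fun tagAnalyticsWithVariant(context: Context, experimentId: String, variantId: String) {
    val analytics = FirebaseAnalytics.getInstance(context)

    // A user property keeps every subsequent event attributable to the variant...
    analytics.setUserProperty("exp_$experimentId", variantId)

    // ...and an explicit exposure event marks when the user first saw it.
    analytics.logEvent("experiment_exposure", Bundle().apply {
        putString("experiment_id", experimentId)
        putString("variant_id", variantId)
    })
}
```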

Remote configuration

An app that already has remote configuration contains most of the code needed to implement A/B tests; essentially, A/B testing adds server-side rules that determine which configuration is sent to the app. For apps that don’t have remote configuration, introducing an A/B testing platform is one of the best ways to add it.

Step 3: Test the hypothesis

Once your hypothesis is defined, your test is designed, and the A/B test platform is integrated, implementing the test versions is the easy part. Next, start the test. The A/B test platform assigns a sample of users to the test groups, assigns a version to each test user, and keeps assigning users over the desired period. More advanced platforms run the test until statistical significance is reached.

Monitor the test

I recommend monitoring the impact of the new versions during the test, including metrics not mentioned in the test hypothesis. If you see a negative impact, you may want to stop the test and get users back to the previous version as quickly as possible, minimizing the bad user experience. Some A/B test platforms can monitor tests automatically and alert you to unexpected negative effects; if yours can’t, you need to cross-reference any impact seen in your existing monitoring systems with the running tests to identify the “bad” version.

Note: If a test does have to be stopped early, treat the collected data with care, as it is not guaranteed to be a representative sample of the test population.

Step 4: Analyze and draw conclusions

Once the test completes normally, you can use the data collected in the analytics platform to determine the result. If the outcome metrics match the hypothesis, the hypothesis is confirmed; otherwise, it is rejected. How you determine whether an observation is statistically significant depends on the nature and distribution of the metric.
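For a conversion-style metric, a standard two-proportion z-test is one simple way to check significance. The sketch below is only an illustration of the idea: the traffic numbers are invented, and real platforms usually apply more careful statistics.

```kotlin
import kotlin.math.abs
import kotlin.math.sqrt

// Two-proportion z-test: is the difference in conversion rate between A and B
// larger than random noise would explain at the chosen significance level?
fun isSignificant(
    conversionsA: Int, usersA: Int,
    conversionsB: Int, usersB: Int,
    zThreshold: Double = 1.96 // 5% two-sided significance
): Boolean {
    val rateA = conversionsA.toDouble() / usersA
    val rateB = conversionsB.toDouble() / usersB
    // Pooled rate under the null hypothesis that both versions convert equally.
    val pooled = (conversionsA + conversionsB).toDouble() / (usersA + usersB)
    val standardError = sqrt(pooled * (1 - pooled) * (1.0 / usersA + 1.0 / usersB))
    val z = (rateB - rateA) / standardError
    return abs(z) > zThreshold
}

fun main() {
    // 5,000 users per variant, 20% engaged on A vs 22% on B gives z ≈ 2.46: significant.
    println(isSignificant(conversionsA = 1_000, usersA = 5_000,
                          conversionsB = 1_100, usersB = 5_000))
}
```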

If the hypothesis turns out to be wrong because the metric shows neither a positive nor a negative change, there is no reason to keep the new version. However, the new version may have had a positive impact on related but unexpected metrics. That may be a reason to adopt it, but it is usually better to run an additional test aimed specifically at those secondary metrics to confirm the effect. In practice, the results of one experiment often raise further questions and hypotheses.

Step 5: Take action

If the hypothesis is confirmed and the new version is better than the old one, update the “default” configuration parameters sent to the app so that it uses the new version. Once the new version has been the default for long enough, you can remove the old version’s code and resources in the next release of the app.

Staged rollouts

A common additional use of an A/B test platform is as a mechanism for staged rollouts, where the winning version from an A/B test gradually replaces the older version. The A/B test can be seen as a design test, while the staged rollout is a Vcurrent/Vnext test that confirms the chosen version does not adversely affect the wider population. You roll out iteratively by increasing the percentage of users who receive the new version (for example 0.01%, 0.1%, 1%, 3%, 7.5%, 25%, 50%, 100%), checking for adverse results before each step. You can also segment the rollout in other ways, such as by country, device type, or user group, or show the new version to specific groups of users (such as internal users) first.
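Here is a sketch of how such a percentage gate can stay consistent for each user while the rollout grows; the rollout name, the schedule, and the hashing scheme are assumptions for illustration.

```kotlin
// Each step only ever adds users: a user sees the new version once their bucket falls
// below the current rollout percentage, and stays on it as the percentage increases.
val rolloutSchedule = listOf(0.01, 0.1, 1.0, 3.0, 7.5, 25.0, 50.0, 100.0)

fun shouldSeeNewVersion(userId: String, rolloutPercent: Double): Boolean {
    // Deterministic bucket in [0, 10000), fine-grained enough for a 0.01% step.
    val bucket = ("nav_rollout:$userId".hashCode() and Int.MAX_VALUE) % 10_000
    return bucket < rolloutPercent * 100 // e.g. 1.0% admits buckets 0..99
}
```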

Further experiments

You can go beyond a simple A/B test to gain a deeper understanding of user behavior: for example, by running multiple tests simultaneously, or by comparing multiple versions within a single test to make testing more efficient.

Deeper segmentation and targeting

A/B test results can be segmented to reveal differences between groups of users and to locate where an effect comes from. In both cases, you may need a higher sampling rate or a longer test duration to reach statistical significance for each group. For example, the results of the tabs versus bottom navigation test may vary by country: user engagement may increase substantially in some countries, stay flat in others, and decline slightly elsewhere. In that situation, the A/B test platform can be configured with a different “default” version per country to maximize overall user engagement.

You can also target a test at a specific segment. For example, you could test only users who live in the United States, or only users who have previously used the tabbed navigation style.

A/n test

A/n testing is shorthand for testing more than two versions at once. These could be several new versions replacing an existing one, or several versions of a completely new feature compared against a version without it. As with deeper segmentation, you may find that different versions perform best for different groups.

Multivariate testing

A multivariate test is a single test that changes multiple parts of the app at once. Each unique combination of values is then treated as a single variant, just as in an A/n test; a small example follows below.

Multivariate testing is appropriate when several aspects of the app could affect the overall metric, but you cannot tell in advance which particular aspect is responsible for the effect.
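Here is a small sketch of flattening a multivariate test into variants; the two factors (button colour and navigation style) and their values are made-up examples.

```kotlin
// Every unique combination of factor values becomes one variant, assigned to users
// exactly like a single version in an ordinary A/n test.
data class Combination(val buttonColor: String, val navigation: String)

fun enumerateVariants(): List<Combination> {
    val buttonColors = listOf("blue", "green")
    val navigationStyles = listOf("tabs", "bottom_nav")
    return buttonColors.flatMap { color ->
        navigationStyles.map { nav -> Combination(color, nav) }
    }
}

fun main() {
    // 2 colours x 2 navigation styles = 4 variants.
    enumerateVariants().forEach(::println)
}
```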

Scale up the test

If multiple tests run simultaneously over the same population, they must be managed from the same platform. Some platforms scale to support thousands of tests running at once; some isolate tests completely (each user participates in at most one test at a time), while others share test users across tests (a user can be in several tests at once). The former is easier to manage, but quickly exhausts the pool of test users, which caps the number of parallel tests that can reach statistical significance. The latter is harder for the A/B test platform to manage, but places no upper limit on the number of parallel tests; the platform achieves this by treating each test as an additional grouping within every other test.

Self-selection

Self-selection lets users know that they are on a particular version in a particular test. Users can choose the version themselves or let the A/B test platform assign one. In either case, these users should be excluded from the metric analysis, because they are not participating unwittingly: they know it is a test, so they may show a biased response.

Conclusion

In-app A/B testing is a very flexible tool for making data-driven decisions about your app and, as I’ve highlighted in this article, it can help you make informed choices about new features. A/B testing lets you try out versions of almost any aspect of your app in the real world, with real users. To simplify the design, integration, execution, and analysis of in-app A/B tests, Google provides a set of tools that includes:

  • Firebase Remote Config (FRC) provides a client library that lets apps request and receive configuration, plus a rules-based cloud mechanism for defining which configuration each user receives. Remote Config helps you change your app’s behaviour without releasing a new version.
  • Firebase Remote Config combined with Firebase Analytics lets you define the versions used in an A/B test and track how they are rolled out.
  • Firebase Analytics provides metrics segmented by version and links directly to FRC.

What do you think?

Any questions or thoughts about using A/B testing? Post them in the comments below, or use the hashtag #AskPlayDev and we’ll reply from @googleplaydev, where we regularly share news and tips on how to be successful on Google Play.

Remember: analytics is critical to A/B testing. Together, A/B testing and analytics can broaden your view and help you design and develop the best app you can.
