1. Background

As Internet companies' products and businesses become more diverse, using data to drive business decisions has become inevitable. The AB experiment is a method and tool for judging, with data metrics, the effect of product features and operational strategies as they iterate: under the premise of simultaneity and homogeneity of the samples, it compares two or more groups in the same application scenario.

For example, an AB experiment can be broken down into the following steps:

**(1) Identify the metric to optimize:** e.g. improve the user conversion rate;

**(2) Propose a hypothesis:** e.g. change the UI of the front-end interaction;

**(3) Create the experiment;**

**(4) Measure the experimental effect:** e.g. the conversion rate of group A is 23% and that of group B is 11%;

**(5) Continue optimizing or end the experiment:** conclude that group A performs better and launch group A's strategy.

**Simultaneity:** both variants run at the same time, rather than variant A today and variant B tomorrow, which would be affected by other factors. **Homogeneity:** the user groups behind the two variants need to be as consistent as possible. Consider an extreme scenario in which variant A contains only women and variant B only men; it would then be impossible to tell whether the outcome is driven by the variant or by gender.

In iQiyi's daily business there are a large number of scenarios where AB experiments are applied. They fall into the following categories:

(1) Algorithms

Algorithm AB experiments are widely used in search, recommendation, advertising and other scenarios. Algorithm engineers use AB experiments to verify how much each new strategy (such as a recall or ranking change) improves the business.

(2) Product features

Changes to an Internet product's form are the part users perceive most directly, and a wrong product change can cause huge losses to the company. Trial and error with a small share of traffic via AB experiments verifies the effect before a full launch, reducing the risk of product iteration.

(3) Operations

In the era of refined operations, the operations team designs many operational strategies, for example user operations (onboarding new users, recalling silent users), membership operations (repurchase strategies for expired members) and content operations (placement of key content). AB experiments measure which of these strategies is more effective.

To meet the needs of these widespread application scenarios, **iQiyi has built an AB experiment platform covering experiment management, user traffic splitting, effect display and other functions.** This article details the implementation of the iQiyi AB experiment platform.

2. The traffic-splitting model and related concepts

Google's paper **Overlapping Experiment Infrastructure: More, Better, Faster Experimentation** is the theoretical foundation of AB experimentation in the Internet industry. The overlapping experiment framework described in the paper allows an almost unlimited number of simultaneous experiments to run in one app. The overall traffic-splitting model is as follows:

Figure 1: Traffic-splitting model

There are several core concepts in the model:

(1) Domain: a vertical slice of traffic. A domain can contain several layers, and a layer can contain several experiments.

(2) Layer: a set of one or more experiments whose traffic is mutually exclusive within the layer. Different layers are orthogonal to each other, so a user can take part in at most one experiment per layer but in experiments across many layers.

(3) **Experiment: includes the traffic-division configuration and the metric configuration.** An experiment divides the traffic allocated to it into multiple experimental groups and verifies its hypothesis by comparing the effects across those groups.
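For intuition, here is a minimal sketch (an assumption about the general technique, not iQiyi's actual implementation) of how layer orthogonality can be achieved: the user ID is hashed together with a per-layer salt, so the bucket a user falls into in one layer is statistically independent of the bucket in another layer.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class LayerBucketing {

    // Map a user into one of `buckets` slots for a given layer.
    // Hashing userId together with a per-layer salt makes the assignment
    // in one layer independent of the assignment in any other layer.
    public static int bucketOf(String userId, String layerSalt, int buckets) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest((userId + ":" + layerSalt).getBytes(StandardCharsets.UTF_8));
            // Use the first 4 bytes of the digest as an integer; floorMod keeps the result non-negative.
            int hash = ((digest[0] & 0xFF) << 24) | ((digest[1] & 0xFF) << 16)
                     | ((digest[2] & 0xFF) << 8) | (digest[3] & 0xFF);
            return Math.floorMod(hash, buckets);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // The same user lands in unrelated buckets in two different layers.
        System.out.println(bucketOf("user-42", "ui-layer", 100));       // bucket in the UI layer
        System.out.println(bucketOf("user-42", "ranking-layer", 100));  // bucket in the ranking layer
    }
}
```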

3. Architecture implementation

AB experiments are widely used across the industry. The iQiyi AB experiment platform was built with reference to successful cases from other companies, combined with the current state of iQiyi's technology and business.

As shown in the figure below, based on this survey and combined with iQiyi's business characteristics, several improvements were made over the common industry solutions.

**First, an initialization interface was added.** Some experiments on the APP home page have strict timeliness requirements, and fetching experiment configuration through the back-end service takes a certain amount of time. The company's cloud-control platform therefore calls the AB SDK in real time and delivers these time-sensitive experiment configurations the moment the APP is launched, saving the time of a separate SDK call.

**Second, in addition to the SDK provided to business teams, an API is offered to obtain AB groups through an online service.** For full-population experiments, the user's group is obtained through the SDK for timeliness; for precise-population experiments, the group is obtained through the encapsulated API to guarantee the accuracy of the population.

**Third, the biggest difference from the rest of the industry is that iQiyi's AB logging is not implemented on the front end; instead, the experimental group of each user in the behavioral data is obtained by replaying logs from the log service through the AB SDK.** The reasons for this improvement and its benefits are described in detail in the data collection section below.

As shown in the figure above, the iQiyi AB experiment platform mainly consists of the AB experiment management platform, AB SDK traffic splitting, and effect evaluation. The three modules are introduced in turn below.

(1) Experiment management platform

The AB experiment configuration platform includes experiment management and metric management.

Experiment management allows business teams to manage their own experiment configuration and life cycle on the platform.

A complete experiment configuration includes several core parts: the experiment population, associated metrics, traffic division and the whitelist.

**(1) Experiment population:** the set of user IDs selected for a given business need, currently divided into two types: full population and precise population.

The full population is the population without any restrictions; a precise population is one delineated from user profiles and behavior, such as users in a certain city with a preference for the movie channel.

**(2) Associated metrics:** each experiment needs to be associated with the metrics to be optimized. The business line's North Star metric is associated by default, and users can add other metrics as needed.

**(3) Traffic division:** divides the traffic allocated to the experiment into multiple groups, each with a unique identifier. The business side applies different strategies based on the identifier.

**(4) Whitelist:** assigns specified user IDs to a specified group. Whitelisted users match their experimental group directly, which facilitates testing and lets certain internal users experience the experimental treatment.
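To make the structure concrete, here is a small sketch of what such an experiment configuration might look like; all field names and values are illustrative assumptions rather than the platform's actual schema.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative data model for an experiment configuration; the field names are assumptions.
public class ExperimentConfig {
    String experimentId;               // unique experiment identifier
    String populationType;             // "FULL" or "PRECISE"
    String populationId;               // precise-population identifier, null for the full population
    List<String> metrics;              // associated metrics (North Star metric plus any extras)
    Map<String, Integer> trafficSplit; // insertion-ordered: group identifier -> percent of allocated traffic
    Map<String, String> whitelist;     // user ID -> group identifier, matched before hashing

    public static ExperimentConfig example() {
        ExperimentConfig cfg = new ExperimentConfig();
        cfg.experimentId = "tv-channel-ui-revision";
        cfg.populationType = "PRECISE";
        cfg.populationId = "tv-drama-channel-visitors";
        cfg.metrics = List.of("per_capita_feature_time", "ctr");
        cfg.trafficSplit = new LinkedHashMap<>();
        cfg.trafficSplit.put("A_control", 50);  // control group takes buckets 0-49
        cfg.trafficSplit.put("B_new_ui", 50);   // experimental group takes buckets 50-99
        cfg.whitelist = Map.of("internal-user-001", "B_new_ui");
        return cfg;
    }
}
```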

After an experiment is created, the groups may show large differences in effect simply because of population characteristics and other reasons, which hinders effect analysis.

To solve this problem, we provide an AA-split function to verify in advance that there is no significant difference in the effect metrics between the groups. The whole process is as follows:

Figure 3: AA split

(2) Traffic-splitting module

The AB system provides both an HTTP service and SDKs for business services to use.

At first, the AB system only provided an HTTP service for traffic splitting. The advantage of the HTTP service is that it is flexible to change and places no constraint on the client's language. However, it requires a request to the back-end service for every split, and the service may jitter for network reasons, which is unacceptable for latency-sensitive services.

Subsequently, a Java SDK and a C++ SDK were developed. The SDK performs the split on the client side by periodically fetching the experiment configuration from the AB service, avoiding per-request network communication and shortening the split time. The SDK split process is shown below:

Figure 4: SDK traffic-splitting flow
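A rough sketch of the split decision such an SDK performs against its locally cached configuration, reusing the ExperimentConfig and LayerBucketing sketches above (class and method names are illustrative, not the actual SDK API): first match the whitelist, then hash the user into a bucket and walk the traffic split.

```java
import java.util.Map;
import java.util.Optional;

// Illustrative client-side split logic; not the actual iQiyi SDK API.
public class AbClientSketch {
    private volatile ExperimentConfig cachedConfig;   // refreshed periodically from the AB service (refresh loop omitted)

    public AbClientSketch(ExperimentConfig initialConfig) {
        this.cachedConfig = initialConfig;
    }

    // Returns the group identifier for this user, or empty if the user is outside the experiment.
    public Optional<String> groupOf(String userId) {
        ExperimentConfig cfg = cachedConfig;

        // 1. Whitelisted users match their configured group directly.
        String forced = cfg.whitelist.get(userId);
        if (forced != null) {
            return Optional.of(forced);
        }

        // 2. Hash the user into one of 100 buckets for this experiment's layer, then walk
        //    the (insertion-ordered) traffic split to find which group the bucket falls in.
        int bucket = LayerBucketing.bucketOf(userId, cfg.experimentId, 100);
        int cumulative = 0;
        for (Map.Entry<String, Integer> entry : cfg.trafficSplit.entrySet()) {
            cumulative += entry.getValue();
            if (bucket < cumulative) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();   // bucket outside the traffic allocated to this experiment
    }
}
```

Because the configuration is cached locally and refreshed periodically, each split is a pure in-memory computation with no network round trip.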

(3) Effect evaluation

Effect evaluation is one of the core links of an AB experiment. Besides ensuring the fairness of user traffic splitting, the AB experiment platform also needs to collect experiment data, compute effect statistics and perform statistical analysis to measure the success or failure of an experiment, closing the AB experiment loop.

1. Data collection

Traditional AB experiment data collection relies on client-side tracking points, and the company also used tracking points for its initial statistics. As the number of experiments grew, this development mode became too heavy and prone to incorrect or missing tracking points, with the following consequences:

(1) Troubleshooting takes a lot of time;

**(2) Once a tracking problem is found, experiment data cannot be observed in time, and** fixing it requires a new release, which greatly reduces the efficiency of experiment iteration.

Therefore, we decided to abandon tracking-point-based statistics. Inspired by the replay capability of the recommendation system, the Libra SDK simulates its online calls over the company's log data, replaying the logs to determine which AB experiment and group each user belongs to. This reduces the dependence on APP tracking points, allows experiment result data to be produced quickly, and enables timely evaluation of experiment results.
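A simplified illustration of the replay idea, reusing the sketches above (the real pipeline runs over the company's log service rather than an in-memory list): each behavior log record is tagged with the user's group by re-running the same split logic the SDK uses online.

```java
import java.util.List;
import java.util.Optional;

// Illustrative offline replay: tag each behavior log record with the user's experiment group.
public class LogReplaySketch {

    record BehaviorLog(String userId, String event, double value) {}

    public static void main(String[] args) {
        AbClientSketch ab = new AbClientSketch(ExperimentConfig.example());

        List<BehaviorLog> logs = List.of(
                new BehaviorLog("user-42", "play_duration_seconds", 1800),
                new BehaviorLog("user-7", "play_duration_seconds", 600));

        for (BehaviorLog log : logs) {
            Optional<String> group = ab.groupOf(log.userId());
            // In the real pipeline the tagged record would be written back for metric aggregation.
            System.out.printf("%s -> group %s, %s=%.0f%n",
                    log.userId(), group.orElse("not in experiment"), log.event(), log.value());
        }
    }
}
```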

2. Effect statistics

In the previous sections, two important points were mentioned:

(1) The AB experiment management platform connects to the metric system to manage all metrics.

(2) The user's AB experiment group is obtained through offline log replay.

Based on these two improvements, most experiment results can be computed on the platform itself, without analysts or data engineers having to develop customized reports.

3. Statistical analysis

The previous section covered how the effect data are computed.

There are generally some differences in metrics between the experimental group and the control group.

As shown in the table below, the CTR of experimental group B increased by 0.01pp compared with the control group. Can we then conclude that the experiment is successful and launch group B's strategy in full? Obviously not, because we cannot yet judge whether this fluctuation is within a reasonable range; the influence of small random fluctuations in the experiment has not been ruled out.

Therefore, we need to determine from a statistical point of view whether the result is significant, i.e. whether there is a significant difference between the experimental group and the control group. Before introducing iQiyi's statistical testing approach, the commonly used statistical tests are briefly introduced.

**Z test:** generally used to test differences in means for large samples (sample size greater than 30). It uses the standard normal distribution to infer the probability of the observed difference, so as to judge whether the difference between two means is significant. It is also known domestically as the U test.

**t test:** mainly used for normally distributed samples with a small sample size (e.g. n < 30) and unknown population standard deviation σ. The t test uses the t distribution to infer the probability of the observed difference, so as to judge whether the difference between two means is significant.

**Chi-square test:** measures the degree of deviation between the actual observed values and the theoretically inferred values of a sample. This deviation determines the chi-square value: the larger the chi-square value, the greater the deviation, and vice versa; if the two sets of values are exactly equal, the chi-square value is 0, indicating that the observations agree with the theory perfectly.

iQiyi ultimately chose the t test. Why? The main reasons are as follows:

(1) **As described in the previous sections, the AB experiment platform already computes the various metrics each experiment cares about, such as** per-capita viewing time, CTR, UCTR and other daily metrics of the corresponding scenario;

(2) Based on these metrics, taking at least 7 days of the experiment as one cycle (entertainment apps have a weekend effect), we can quickly determine whether there is a significant difference between the experimental group and the control group.

For space reasons, the full t-test algorithm is not described here; the sketch below only illustrates the general idea.
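As a rough illustration only, assuming daily per-group metric values are used as the samples and the Apache Commons Math library is on the classpath (this is not iQiyi's actual implementation):

```java
import org.apache.commons.math3.stat.inference.TTest;

// Minimal sketch: compare 7 daily values of a metric (e.g. CTR) between the
// control group A and the experimental group B with a two-sided Welch t test.
public class SignificanceSketch {
    public static void main(String[] args) {
        double[] groupA = {0.210, 0.215, 0.208, 0.212, 0.219, 0.225, 0.222};  // control, 7 days
        double[] groupB = {0.221, 0.226, 0.218, 0.224, 0.230, 0.236, 0.233};  // experiment, 7 days

        TTest tTest = new TTest();
        double pValue = tTest.tTest(groupA, groupB);   // two-sided p-value, unequal variances

        System.out.printf("p-value = %.4f -> %s%n", pValue,
                pValue < 0.05 ? "difference is statistically significant"
                              : "difference is not statistically significant");
    }
}
```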

4. A practical example

At present, the iQiyi AB experiment system serves multiple business lines of the company, with more than 1,300 experiments launched, truly realizing a data-driven business.

The picture below shows an AB experiment on a **page UI revision conducted by the product team on the TV drama channel page of the iQiyi APP.** Increasing user time spent is one of the core goals of the iQiyi APP. Following the AB experiment steps, the product team analyzed the profile characteristics of users who visit the TV drama channel page and adjusted the page's UI interaction design accordingly, as shown in the figure below:

A: TV drama channel page (old)

B: TV drama channel page (new)

Group A was the control group (the old channel page style) and group B was the experimental group (the new channel page).

Finally, measured through the AB experiment, the per-capita feature time on the new channel page increased by 17.85%, a significant improvement as shown in the trend chart. The data demonstrate that the new channel page is significantly better than the old one, proving the product optimization successful and providing a basis for the product decision.

At present, the platform is still iterating rapidly, and improvements are planned in the following directions:

(1) Effect evaluation currently focuses on statistical significance; statistical power urgently needs to be taken into account as well;

(2) Some experiments have insufficient sample sizes, making it difficult to draw accurate conclusions. A sample-size prediction feature will later be provided based on the target uplift of the experiment.
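For reference, one common way such a sample-size prediction could work is the standard two-sample approximation based on the target uplift; the sketch below is an assumption about the approach, not the planned implementation.

```java
// Illustrative per-group sample-size estimate for detecting a target uplift `delta`
// in a metric with standard deviation `sigma`, at significance level 0.05 and power 0.80:
// n = 2 * (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2
public class SampleSizeSketch {
    public static long perGroupSampleSize(double sigma, double delta) {
        double zAlphaHalf = 1.96;   // two-sided alpha = 0.05
        double zBeta = 0.84;        // power = 0.80
        double n = 2 * Math.pow(zAlphaHalf + zBeta, 2) * sigma * sigma / (delta * delta);
        return (long) Math.ceil(n);
    }

    public static void main(String[] args) {
        // e.g. detect an absolute CTR uplift of 0.002 when the per-user CTR standard deviation is 0.1.
        System.out.println(perGroupSampleSize(0.1, 0.002) + " users per group");
    }
}
```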