How to optimize A/B test grouping

Background

A/B testing is used in experiment-driven engineering to verify the effect of a feature. For an experiment to be considered valid, the difference between the control group and the experimental group needs to be below a certain threshold, such as 0.4%.

Complete Randomization (CR) is the approach commonly used in industry: hash the user ID, take the result modulo 100, and place users with the same result in the same bucket. In practice, the experimental and control groups formed this way can still differ on some key indicators, and when those differences are large, the credibility of the experiment results suffers.
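A minimal sketch of the CR bucketing described above. The class name, the use of CRC32 as the hash, and the per-experiment salt are illustrative assumptions; the text only specifies "hash the user ID and take it modulo 100".

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Sketch of complete-randomization bucketing: hash the user ID, then take it modulo 100. */
final class CrBucketing {
    static final int BUCKET_COUNT = 100;

    // CRC32 is only an illustrative hash choice. The salt (hypothetical) lets each
    // experiment shuffle users independently of other experiments.
    static int bucketOf(String userId, String experimentSalt) {
        CRC32 crc = new CRC32();
        crc.update((experimentSalt + ":" + userId).getBytes(StandardCharsets.UTF_8));
        // getValue() returns a non-negative long, so the remainder is in [0, 100)
        return (int) (crc.getValue() % BUCKET_COUNT);
    }

    public static void main(String[] args) {
        System.out.println(bucketOf("user-42", "exp-001"));
    }
}
```

The same user always lands in the same bucket for a given salt, which is what makes the grouping reproducible, but nothing here looks at the user's indicators, which is why the groups can still end up unbalanced.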

Grouping options

1 AA test as a workaround

Because completely random sampling leaves inherent uncertainty in the baseline indicators, and this cannot be avoided at the technical level, the industry uses an AA test to verify the baseline first. That is, different groups of users are given the same strategy in a dry run of the experiment, and the two groups are checked for a statistically significant difference in baseline indicators. If there is one, the two samples differ and the users need to be regrouped until the baseline indicators are basically the same.

Then re-run the experiment to see whether the uneven distribution between the two groups has disappeared. For reference, see "Simpson’s Paradox and solution of A/B experiment".
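One way to check "statistically significant difference in baseline indicators" is a two-sample Welch statistic. The sketch below is an assumption about how such a check could look, not the method used by any particular platform; the class and method names are hypothetical, and the 1.96 cutoff is the usual large-sample approximation for a two-sided 5% level.

```java
/** Sketch of an AA check: compare one baseline indicator across two groups. */
final class AaCheck {
    static double mean(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    // Unbiased sample variance; needs at least two observations.
    static double sampleVariance(double[] x) {
        double m = mean(x), s = 0;
        for (double v : x) s += (v - m) * (v - m);
        return s / (x.length - 1);
    }

    // Welch statistic: difference in means divided by its standard error.
    // Assumes at least one group has non-zero variance.
    static double welchStatistic(double[] a, double[] b) {
        double se = Math.sqrt(sampleVariance(a) / a.length + sampleVariance(b) / b.length);
        return (mean(a) - mean(b)) / se;
    }

    // Large-sample approximation: |t| < 1.96 means no significant difference at the 5% level.
    static boolean looksBalanced(double[] a, double[] b) {
        return Math.abs(welchStatistic(a, b)) < 1.96;
    }
}
```

With small groups a proper t distribution with Welch degrees of freedom would be more accurate than the fixed 1.96 cutoff.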

2 RR (Rerandomization)

RR (Rerandomization) is the engineering evolution of the AA test: after each CR grouping, check whether the difference between the resulting groups is below the threshold set for the experiment (for example, a difference below 0.1 can be ignored), and regroup if it is not.

Compared with CR, RR trades computation time for better-balanced groups. In effect, it is an automated AA test.

Disadvantages: In my view it is not a good fit, because if the difference stays too large the program may loop forever, or never produce a good grouping.
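A minimal sketch of rerandomization on a single numeric indicator. To address the endless-loop risk, it assumes a hypothetical `maxAttempts` cap and returns the best split seen so far; the class, the `Split` holder and the relative-difference measure are illustrative choices, not taken from any specific platform.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch of rerandomization with a bounded number of attempts. */
final class Rerandomization {
    /** The best split found and its relative difference in means. */
    static final class Split {
        final List<Double> groupA;
        final List<Double> groupB;
        final double relDiff;
        Split(List<Double> groupA, List<Double> groupB, double relDiff) {
            this.groupA = groupA;
            this.groupB = groupB;
            this.relDiff = relDiff;
        }
    }

    static double mean(List<Double> xs) {
        double s = 0;
        for (double v : xs) s += v;
        return s / xs.size();
    }

    static double relativeDiff(List<Double> a, List<Double> b) {
        double ma = mean(a), mb = mean(b);
        double denom = Math.max(Math.abs(ma), Math.abs(mb));
        return denom == 0 ? 0 : Math.abs(ma - mb) / denom;
    }

    // Shuffle and halve until the difference is below the threshold or attempts run out.
    // Expects at least two values and maxAttempts >= 1.
    static Split rerandomize(List<Double> values, double threshold, int maxAttempts, long seed) {
        Random rnd = new Random(seed);
        List<Double> shuffled = new ArrayList<>(values);
        Split best = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Collections.shuffle(shuffled, rnd);
            int half = shuffled.size() / 2;
            List<Double> a = new ArrayList<>(shuffled.subList(0, half));
            List<Double> b = new ArrayList<>(shuffled.subList(half, shuffled.size()));
            double d = relativeDiff(a, b);
            if (best == null || d < best.relDiff) best = new Split(a, b, d);
            if (d < threshold) break; // good enough, stop early
        }
        return best;
    }
}
```

The caller still has to decide what to do when the returned `relDiff` is above the threshold, for example falling back to a different grouping method.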

3 Adaptive grouping algorithm

The adaptive grouping method groups only once, yet keeps the distribution of the selected observation indicators basically consistent across groups, which can greatly reduce the relative error.

Compared with traditional CR grouping, the adaptive grouping algorithm is more complex. While traversing the population, each group records the number of samples allocated to it so far and the distribution of those samples on the selected observation indicators.

When the next object is taken from the population being split, each group computes a distribution score for its observation indicators under the assumption that the object is assigned to it. These pre-allocation scores are then combined to obtain the final probability of assigning the object to each group.
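The following is only a simplified illustration of that idea for two groups and one numeric indicator. It is not the algorithm used by any company: the scoring formula (gap in means plus a weighted gap in group sizes), the biased-coin rule, and the parameters `followProb` and `sizeWeight` are all assumptions made for the sketch.

```java
import java.util.Random;

/** Simplified sketch of adaptive allocation for two groups and one numeric indicator. */
final class AdaptiveGroupingSketch {
    private final double[] sum = new double[2];   // running indicator sum per group
    private final int[] count = new int[2];       // running sample count per group
    private final Random rnd;
    private final double followProb;              // chance of following the better-scoring group
    private final double sizeWeight;              // penalty weight on unequal group sizes

    AdaptiveGroupingSketch(long seed, double followProb, double sizeWeight) {
        this.rnd = new Random(seed);
        this.followProb = followProb;
        this.sizeWeight = sizeWeight;
    }

    // Imbalance score if the value were assigned to group g; lower is better.
    // Only called once both groups are non-empty.
    private double scoreIfAssigned(int g, double value) {
        double[] s = sum.clone();
        int[] c = count.clone();
        s[g] += value;
        c[g]++;
        double meanGap = Math.abs(s[0] / c[0] - s[1] / c[1]);
        return meanGap + sizeWeight * Math.abs(c[0] - c[1]);
    }

    /** Assigns one value and returns the chosen group, 0 or 1. */
    int assign(double value) {
        int chosen;
        if (count[0] == 0) {
            chosen = 0;
        } else if (count[1] == 0) {
            chosen = 1;
        } else {
            int preferred = scoreIfAssigned(0, value) <= scoreIfAssigned(1, value) ? 0 : 1;
            // Biased coin: usually follow the preferred group, sometimes the other one
            chosen = rnd.nextDouble() < followProb ? preferred : 1 - preferred;
        }
        sum[chosen] += value;
        count[chosen]++;
        return chosen;
    }

    int size(int g) { return count[g]; }

    double mean(int g) { return sum[g] / count[g]; }
}
```

Setting `followProb` to 1.0 makes the allocation fully greedy; lower values keep some randomness in the assignment at the cost of balance.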

The Adaptive algorithms built in Didi's and Meituan's data-driven engineering work can be used as a reference.

[Figure: Adaptive grouping process]

Disadvantages: The idea of this algorithm is similar to a greedy algorithm. Before each sample is assigned to a group, a large amount of computation has to be carried out, including direct and indirect allocation probabilities, balance coefficients, distribution functions and so on. Some of the mathematics involved takes time to understand. Because this is internal data-driven engineering at each company, few details have been published; I searched for a long time and did not find a concrete algorithm design.

Ideas you can try

Scenario example: during business development, a new learning-experience feature has been built, but it is not clear whether it is suitable for a full rollout. An A/B test is therefore used, with the requirement that the difference on each individual indicator be less than 0.4%.

Example monitoring indicators

Indicator	Definition
User age	0 ~ 1, 1 ~ 2, 2 ~ 3, 3 ~ 4, 4 ~ 5, 5 ~ 6, 6 ~ 7, 7 ~ 8 years old, etc.
User type	Paying users, regular users
User completion rate	Mean over the same time period

There are two modes:

Proportion mode: an experimental user has a categorical attribute, such as age band or user type. When users are distributed, the proportion of each attribute value should differ little between groups.

Mean mode: each user corresponds to a numeric value, such as learning time or number of courses. When users are distributed, the mean of this value within a group should differ little from the other groups.
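The two modes translate into two different balance measures. The sketch below is a hypothetical helper, and it makes one assumption the text leaves open: the threshold (for example 0.4%, i.e. 0.004) is applied as an absolute difference for proportions and as a relative difference for means.

```java
import java.util.List;

/** Sketch of the balance measures behind the proportion mode and the mean mode. */
final class BalanceMetrics {
    // Proportion mode: share of users in a group that carry the target attribute value.
    static double proportion(List<String> attributeValues, String target) {
        long hits = attributeValues.stream().filter(target::equals).count();
        return (double) hits / attributeValues.size();
    }

    // Mean mode: average of a numeric value over a group (0.0 for an empty group).
    static double mean(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    // Absolute gap, used here for proportions (assumption).
    static double absoluteDiff(double a, double b) {
        return Math.abs(a - b);
    }

    // Relative gap, used here for means (assumption).
    static double relativeDiff(double a, double b) {
        double denom = Math.max(Math.abs(a), Math.abs(b));
        return denom == 0 ? 0 : Math.abs(a - b) / denom;
    }
}
```

For example, paying-user shares of 0.500 and 0.498 in the two groups give an absolute difference of 0.002, which would pass a 0.004 threshold.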

How to handle differences in proportion mode?

The general idea is to split each category in two, dividing its users as evenly as possible between the experimental and control buckets.

For example:

Suppose users fall into two types: upgrade and experience class transfer. The users of each type are counted separately, and each type is then divided evenly between the two groups.

Reference code:

public List<List<UserObject>> getByUserType(List<UserObject> userObjectList) {
    List<List<UserObject>> ans = new ArrayList<>();
    List<UserObject> part1 = new ArrayList<>();
    List<UserObject> part2 = new ArrayList<>();
    // Group users by user type
    Map<Integer, List<UserObject>> detailsMap01 = userObjectList.stream()
            .collect(Collectors.groupingBy(UserObject::getUserType));
    // Split each type between the two parts
    for (List<UserObject> list : detailsMap01.values()) {
        partitionList(list, part1, part2);
    }
    ans.add(part1);
    ans.add(part2);
    return ans;
}
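The reference code calls `partitionList` without showing it. A possible version is sketched below as an assumption: it shuffles the users of one type and deals them alternately, always to the part that is currently smaller, so each type ends up split within one user of even. The generic signature and the extra `Random` overload are illustrative additions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: split one category evenly between two parts. */
final class PartitionSketch {
    // Three-argument form matching the call in the reference code.
    static <T> void partitionList(List<T> list, List<T> part1, List<T> part2) {
        partitionList(list, part1, part2, new Random());
    }

    // Seedable form, useful for reproducible groupings.
    static <T> void partitionList(List<T> list, List<T> part1, List<T> part2, Random rnd) {
        List<T> shuffled = new ArrayList<>(list); // leave the caller's list untouched
        Collections.shuffle(shuffled, rnd);
        for (T item : shuffled) {
            // Deal to the smaller part; on a tie, part1 goes first
            if (part1.size() <= part2.size()) {
                part1.add(item);
            } else {
                part2.add(item);
            }
        }
    }
}
```

Because the two parts never differ in size by more than one, a leftover user from an odd-sized type is compensated by the next type, which keeps the totals even as well.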

How to handle differences in mean mode?

There is a known array-splitting problem, worth studying, in which an array is divided into two subarrays with the same mean. That algorithm, however, looks for two subarrays whose means are exactly identical. For an A/B experiment it is enough for the means to be as close as possible, with only a small difference.

Here is a relatively simple approximation algorithm:

Sort the values from largest to smallest, then assign them to the two groups one at a time, each time adding the next value to the group whose current sum is smaller.

The partition code can look like this:

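public List<List<UserObject>> getByDevideAvgList(List<UserObject> allUserList) {
    // Sort by score, largest first; Integer.compare avoids the overflow risk of subtraction
    allUserList.sort((o1, o2) -> Integer.compare(o2.getCompleteCoureScore(), o1.getCompleteCoureScore()));

    long sum1 = 0, sum2 = 0;
    List<UserObject> part1 = new ArrayList<>();
    List<UserObject> part2 = new ArrayList<>();
    for (UserObject userObject : allUserList) {
        // Give the next user to the group whose running sum is currently smaller,
        // and update that sum so the comparison stays meaningful
        if (sum1 > sum2) {
            part2.add(userObject);
            sum2 += userObject.getCompleteCoureScore();
        } else {
            part1.add(userObject);
            sum1 += userObject.getCompleteCoureScore();
        }
    }

    List<List<UserObject>> ans = new ArrayList<>();
    ans.add(part1);
    ans.add(part2);
    return ans;
}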

Thinking about the relation between one-dimensional and multidimensional indicators

CR is a completely random grouping method. From the userId perspective, hashing the userId and taking the modulus is already very random and uniform, yet the difference on each indicator still falls short of expectations. Looking at the problem from another angle: whichever indicators differ, deliberately even the groups out on those indicators.

When an A/B experiment monitors only one indicator, the difference can of course be controlled fairly precisely. With multiple indicators, how can the difference on every indicator be kept as small as possible at the same time? This is a question worth thinking about.

First, assume that the monitoring indicators are independent and do not affect each other. In other words, if two or three of them are correlated, one of them can stand in for the rest, and the others need not be used for splitting.

When grouping users, an indicator ordering approach can be used: randomly determine the order in which the indicators are applied, then split on each in turn. Since the indicators do not affect each other, this achieves the effect of manually controlling the differences.

Specific experimental operation steps

Step 1: Randomly determine the execution order of the indicators

Number the user's age, completion rate and type as 1, 2 and 3, and determine the execution order randomly. For example, one run can use the order 123.

Step 2: Split into groups one indicator at a time

Each execution splits the existing groups further, and the two sides of every split should be as even as possible, with as small a difference as possible.

Step 3: Select one group each from bucket A and bucket B for the experiment

The first split already produces two buckets. When choosing a control group and a treatment group, take them from the two buckets: for example, bucket A group 1 and bucket B group 1. Several such groups can be selected for the experiment.

If there is only one experimental group and one control group, combine groups across buckets, such as bucket A group 1 + bucket A group 3 + bucket B group 1 + bucket B group 3, so that every sub-group used has been balanced through the indicator calculations.

Step 4: Retry

Verify the differences between the experimental group and the control group. If they are too large, rearrange the execution order of the indicators and regroup, retrying several times.

Change the execution order from step 1, for example to 213, repeat the steps above, and check whether the difference threshold is still exceeded. Note: make sure the selected experimental and control groups have each been calculated and controlled on every monitored indicator.

The general steps are as follows:
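Steps 1 and 2 can be sketched in code roughly as below. This is one possible reading of the procedure, not a definitive implementation: the `User` fields, the `Splitter` interface and the two-way split at every level are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;
import java.util.function.Function;

/** Sketch: split users on each indicator in a random order, halving every bucket each time. */
final class SequentialSplitSketch {
    static final class User {
        final int ageBand;
        final int userType;
        final int completionScore;
        User(int ageBand, int userType, int completionScore) {
            this.ageBand = ageBand;
            this.userType = userType;
            this.completionScore = completionScore;
        }
    }

    /** Divides one bucket into two halves balanced on a single indicator. */
    interface Splitter {
        List<List<User>> split(List<User> bucket);
    }

    // Proportion mode: deal the members of each category alternately into the two halves.
    static Splitter byCategory(Function<User, Integer> key) {
        return bucket -> {
            Map<Integer, List<User>> byKey = new TreeMap<>();
            for (User u : bucket) byKey.computeIfAbsent(key.apply(u), k -> new ArrayList<>()).add(u);
            List<User> a = new ArrayList<>(), b = new ArrayList<>();
            for (List<User> members : byKey.values())
                for (User u : members) (a.size() <= b.size() ? a : b).add(u);
            return Arrays.asList(a, b);
        };
    }

    // Mean mode: sort descending, always give the next user to the half with the smaller sum.
    static Splitter byScore() {
        return bucket -> {
            List<User> sorted = new ArrayList<>(bucket);
            sorted.sort(Comparator.comparingInt((User u) -> u.completionScore).reversed());
            List<User> a = new ArrayList<>(), b = new ArrayList<>();
            long sumA = 0, sumB = 0;
            for (User u : sorted) {
                if (sumA <= sumB) { a.add(u); sumA += u.completionScore; }
                else { b.add(u); sumB += u.completionScore; }
            }
            return Arrays.asList(a, b);
        };
    }

    // Step 1: shuffle the indicator order. Step 2: split every bucket on each indicator in turn.
    static List<List<User>> splitAll(List<User> users, List<Splitter> splitters, Random rnd) {
        List<Splitter> order = new ArrayList<>(splitters);
        Collections.shuffle(order, rnd);
        List<List<User>> buckets = new ArrayList<>();
        buckets.add(users);
        for (Splitter s : order) {
            List<List<User>> next = new ArrayList<>();
            for (List<User> bucket : buckets) next.addAll(s.split(bucket));
            buckets = next;
        }
        return buckets;
    }
}
```

In the returned list, the first half of the leaf buckets descends from bucket A of the first split and the second half from bucket B, so step 3 amounts to pairing a leaf from each half. Step 4 corresponds to calling `splitAll` again with a different seed when the measured differences exceed the threshold.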

Follow-up optimizations

1 The splits can be arranged as 4 x 5 x 5, so that the final number of sub-buckets is exactly 100. When traffic is allocated to A/B test groups, it can then be assigned in whole percentages.
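Assuming 4 x 5 x 5 means a fan-out of 4, then 5, then 5 across three splitting levels, a leaf bucket can be numbered 0 to 99 from its position at each level. The class and method names below are hypothetical.

```java
/** Sketch: map a position in a 4 x 5 x 5 split tree to a bucket number in 0..99. */
final class BucketIndex {
    static final int[] FANOUT = {4, 5, 5};

    // first is in 0..3, second and third are in 0..4
    static int leafIndex(int first, int second, int third) {
        return (first * FANOUT[1] + second) * FANOUT[2] + third;
    }
}
```

Each leaf then carries 1% of the traffic, so an experiment that needs, say, 10% of users can be given any 10 leaves.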

2 Indicators with higher volatility are harder to split evenly, so split on them as late as possible.

Conclusion

In daily work, when business development calls for solving a problem like this, you can start with detailed technical research, think independently, combine what you find with your own ideas, and try them in practice. If the result is enough to meet the business scenario, that is ideal. Over time, later iterations can refine it further.