Introduction

Definition of A/B testing

A/B testing is data-driven. It enables flexible traffic splitting, so different versions of the same product can be online at the same time; by recording and analyzing user behavior data on each version, the effects of the versions can be compared, the scientific rigor and accuracy of the results are ensured to the greatest extent, and product teams are helped to make scientific decisions.

Metrics computed for each version from user behavior data are the sole basis for evaluating experiment results.

Metric product design

Figure 1. New metrics

The metric system adopts a registration approach: users register metrics under their own business domains and business lines. To register a metric, the user must specify its calculation formula (SQL), which has to conform to the SQL template; user-defined dimensions are supported (automatic dimension registration / joined dimension tables). On the analysis side, users can view both preset and registered metrics and perform multidimensional analysis on them.
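For illustration only, a registered metric and its template-conforming SQL might look like the sketch below; the table, field, and placeholder names are assumptions, not the platform's actual schema.

    // Hypothetical shape of a registered metric. The SQL follows a template:
    // it exposes a user id, the mandatory time field, optional custom
    // dimensions, and the metric's numerator/denominator.
    case class MetricDef(
      name:   String,      // metric name shown on the analysis page
      domain: String,      // business domain / line it is registered under
      sql:    String,      // calculation formula conforming to the SQL template
      dims:   Seq[String]  // user-defined dimensions (auto-registered)
    )

    val adClickRate = MetricDef(
      name   = "ad_click_rate",
      domain = "ads",
      sql    = """SELECT user_id, event_date, city AS dim_city,
                 |       SUM(click_cnt)  AS numerator,
                 |       SUM(expose_cnt) AS denominator
                 |FROM dw.ad_behavior_detail
                 |WHERE event_date = '${calc_date}'
                 |GROUP BY user_id, event_date, city""".stripMargin,
      dims   = Seq("dim_city")
    )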

On the product side, both the current day's and the cumulative (from the experiment start date up to the previous day) samples and metrics need to be calculated.

Metric technical architecture



Figure 2. Metric design architecture diagram

Multiple experiments are calculated in parallel; multiple metrics within a single experiment are calculated serially.

The data to be calculated for a single experiment is as follows:

Figure 3. Experimental calculation data

Metric calculation has the following characteristics: for each experiment the sample must be calculated first, and metrics are then computed over that sample. The p-value of a basic metric depends on the sample and the metric value, while the p-value of a composite metric requires reconstructing the sample based on its denominator. The sample may vary from experiment to experiment: users can customize it, or fall back on the traffic-splitting service log as the default sample.
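As a minimal sketch of the statistics step, the p-value of a basic rate metric can be computed with a two-proportion z-test once the sample sizes and metric values of the control and treatment groups are known. This is illustrative only, not the platform's exact statistics code.

    // Two-proportion z-test: p-value for a basic rate metric, given
    // conversions (x) and sample sizes (n) of control and treatment.
    import org.apache.commons.math3.distribution.NormalDistribution

    def twoProportionPValue(x1: Long, n1: Long, x2: Long, n2: Long): Double = {
      val p1     = x1.toDouble / n1
      val p2     = x2.toDouble / n2
      val pooled = (x1 + x2).toDouble / (n1 + n2)   // pooled rate under H0
      val se     = math.sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2))
      val z      = (p1 - p2) / se
      val normal = new NormalDistribution(0, 1)
      2 * (1 - normal.cumulativeProbability(math.abs(z)))   // two-sided
    }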

Metric calculation optimization

Stage 1: Engine and architecture optimization

Computing engine, Spark: Spark was selected as the core computing engine based on performance and maturity; its advantages over Hive need no discussion here.

OLAP, ClickHouse: metric analysis requires multidimensional, complex analysis over detail data, which any mature OLAP engine can perform. The main reasons for choosing CK are: 1> CK's performance and maturity are proven; 2> a CK cluster was available for direct use.

Calculation method: multiple experiments in parallel, multiple metrics within a single experiment in series. Experiments have little correlation with each other, which suits parallel computation. Within a single experiment, metrics are strongly correlated; for example, the components of a composite metric must be calculated before the composite itself, so serial calculation is reasonable. In implementation, this is handled by a scheduled shell script that dynamically loops over the experiments.
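The production scheduler is a shell script, but the parallel/serial structure it implements can be sketched as follows; the experiment ids, metric names, and helper functions here are placeholders.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    // Placeholder: in production this reads the experiment/metric config.
    def loadExperimentConfig(): Map[String, Seq[String]] =
      Map("exp_1001" -> Seq("ctr", "gmv_per_user"),
          "exp_1002" -> Seq("ctr"))

    // Placeholder: in production this submits the generic Spark job.
    def computeMetric(expId: String, metric: String): Unit = ()

    val jobs = loadExperimentConfig().toSeq.map { case (expId, metrics) =>
      Future {                                         // experiments in parallel
        metrics.foreach(m => computeMetric(expId, m))  // metrics in series
      }
    }
    Await.result(Future.sequence(jobs), Duration.Inf)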

Implementation: the calculation for a single experiment is done by a generic Spark program. When the offline task is scheduled, it reads the experiment's metric configuration, calculates the detail data first, and then the metric values.

When the AB platform first launched, the numbers of experiments and metrics were small (10+ experiments, 50+ metrics), which basically met our needs; the overall calculation time was about 2-3 hours.

After the platform had run for half a year, the growing number of experiments stretched the overall runtime to 5-6 hours, even with parallelism increased to 10 and single-task resources increased to 100 cores and 800 GB of memory. How could we optimize?

Stage 2: Calculation model optimization

We picked a typical advertising experiment and logged the time consumed at each stage for each metric. Analyzing the execution time of each step showed that calculating samples and cumulative metric values took particularly long. The rule for cumulative values was: scan the range from the experiment's start date to the calculation date. A metric's definition SQL must include a time field, and the program automatically fills in the date range. So if an experiment had been running for three months, three months of data had to be scanned for the calculation; the large data volume caused poor computational efficiency and very long calculation times.
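Conceptually, the old rule amounted to rewriting each metric SQL's mandatory time field into a full-range predicate, as in this sketch (the placeholder name is an assumption):

    // Old cumulative rule: scan [experiment start date, calculation date].
    // For a 3-month experiment, every daily run rescans 3 months of data.
    def buildCumulativeSql(metricSql: String,
                           expStart: String, calcDate: String): String =
      metricSql.replace("${date_filter}",
        s"event_date BETWEEN '$expStart' AND '$calcDate'")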

We collected statistics on the running durations of experiments closed within the past 3 months:

Figure 4. Statistics of experimental running time

As the figure shows, experiments running longer than one month account for nearly 60% of the total, which is a rather high proportion.

How can the cumulative calculation be optimized?

Previous calculation model:

Figure 5. Original calculation model

The advantage of this calculation model is that each run is independent: the cumulative value for any date can be calculated on demand. Its defects are: 1> a large amount of data must be scanned; 2> the calculation is not accurate, because metric data generated before a user entered the sample is still counted in. For example, if a user converts on May 1 but only joins the experiment on May 2, this algorithm still counts that conversion toward the experiment, introducing a slight error.

Optimized model:

Figure 6. New calculation model

The advantages of this calculation model are: 1> the performance of cumulative calculation improves greatly, and the calculation no longer gets slower and slower as time goes on; 2> the results are more accurate. The disadvantage is that day n's cumulative calculation depends on day n-1's cumulative detail data, which adds complexity.
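A minimal Spark sketch of the incremental model, assuming hypothetical table and column names: day n's cumulative detail is day n-1's cumulative detail merged with day n's daily detail, so each run scans only one day of raw data.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    def cumulativeDetail(spark: SparkSession, expId: String,
                         day: String, prevDay: String): DataFrame = {
      val daily = spark.table("ab.metric_detail_daily")      // day n only
        .where(col("exp_id") === expId && col("dt") === day)
      val prev  = spark.table("ab.metric_detail_cum")        // day n-1 result
        .where(col("exp_id") === expId && col("dt") === prevDay)
      prev.unionByName(daily)                                // merge the two
        .groupBy("exp_id", "group_name", "user_id")
        .agg(sum("numerator").as("numerator"),
             sum("denominator").as("denominator"))
        .withColumn("dt", lit(day))                          // day n's cumulative detail
    }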

After this optimization, the calculation time of each metric became fairly controllable, and no experiment's metrics took an exceptionally long time anymore. The overall calculation time stayed at about 3 hours.

Stage 3: Batch optimization of rate metrics

After the Stage 2 optimization, the platform could handle the overall calculation of 50+ experiments and 200+ metrics, and the calculation time of a single experiment (5 metrics) stabilized at about 40 minutes.

At the beginning of the year, after the online-school resource management and control platform was fully connected to AB, the number of experiments rose sharply, reaching 150+ per day with 600+ metrics, and the overall calculation time stretched to as long as 10 hours.

Analyzing the experiments and metric data, we found that experiments coming from the resource control platform select several fixed metrics by default, which means the same metric appears in N experiments. Could we batch-calculate the experimental data for a single metric?

AB metric calculation was designed as serial within an experiment and parallel across experiments because each experiment lets users customize the sample; the platform's traffic-split log is used as the fallback only when no custom sample is given. The AB platform supports many kinds of metrics, and those suited to batch calculation are the ones whose definition SQL already contains the experiment name and group name and that are basic rate metrics, i.e., they need no join with the sample (see the sketch after Figure 7).

Figure 7. Optimization scheme for rate metrics
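A sketch of what a batched rate-metric job might look like (table and field names are illustrative): because the definition SQL already produces the experiment and group names and needs no sample join, one scan serves every experiment that shares the metric.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("batch-rate-metric").getOrCreate()

    // One pass computes the rate metric for all experiments and groups at once.
    val batchRates = spark.sql(
      """SELECT exp_id,
        |       group_name,
        |       SUM(click_cnt)  AS numerator,
        |       SUM(expose_cnt) AS denominator,
        |       SUM(click_cnt) / SUM(expose_cnt) AS rate
        |FROM dw.ad_behavior_detail
        |WHERE dt = '2021-06-01'
        |GROUP BY exp_id, group_name""".stripMargin)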

After this scheme was implemented, most of the platform's experiments could be batch-calculated, which accelerated execution. The overall metric calculation time dropped to 5 hours, a roughly twofold performance improvement.

Stage 4: Batch optimization of mean metrics

Stage 3 batch-optimized the rate metrics, but the system also has many mean metrics that are selected by several experiments. Some of these metrics are complicated to calculate, and computing a single day's data can take half an hour. For example:

Figure 8. Metric SQL

This is the metric SQL. In the actual calculation, it must be joined with the sample again, and the mean detail data and mean values are then calculated over the sample. The whole calculation becomes complicated, and a single run takes about half an hour. How else can we optimize?

Scheme 1:

We use Spark's checkpoint mechanism to cache the detail data calculated in the first pass. The cached detail data can then be used directly to calculate the mean, p-value, and confidence interval.
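A minimal sketch of Scheme 1, assuming a hypothetical checkpoint directory and a stand-in for the detail SQL of Figure 8:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.avg

    val spark = SparkSession.builder().appName("mean-metric").getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/ab_checkpoint")  // hypothetical path

    // Illustrative stand-in for the metric's detail SQL (Figure 8).
    val metricDetailSql =
      """SELECT exp_id, group_name, user_id, SUM(stay_time) AS metric_value
        |FROM dw.user_behavior_detail
        |WHERE dt = '2021-06-01'
        |GROUP BY exp_id, group_name, user_id""".stripMargin

    // Materialized once to HDFS; mean, p-value, and confidence interval
    // all reuse `detail` instead of re-running the expensive SQL.
    val detail = spark.sql(metricDetailSql).checkpoint()
    val means  = detail.groupBy("exp_id", "group_name")
                       .agg(avg("metric_value").as("mean"))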

After implementation, the overall computing time decreased. However, because checkpointing writes data to HDFS, Spark falls back to disk-based I/O much as Hive does, so the overall performance improves, but not significantly.

Scheme 2:

Could we batch-calculate and cache into CK as in Stage 3? Yes, though the model is a bit more complicated.

Figure 9. Optimization scheme for mean metrics

The core of this solution is as follows: the metric SQL first calculates temporary detail data; when calculating metric details, the program fetches them directly from CK, dynamically choosing the data source via Spark's cross-data-source capability; finally, the detail data is written back to CK.
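A sketch of the cross-data-source step, with an assumed JDBC URL, driver, and table names: cached detail is read back from CK, today's temporary detail is computed in Spark, and the increment is written back to CK for the next run.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("mean-metric-batch").getOrCreate()

    val ckOpts = Map(
      "url"     -> "jdbc:clickhouse://ck-host:8123/ab",      // hypothetical cluster
      "driver"  -> "ru.yandex.clickhouse.ClickHouseDriver",  // legacy CK JDBC driver
      "dbtable" -> "ab.mean_metric_detail"
    )

    // Read previously cached detail directly from CK instead of recomputing it.
    val cached = spark.read.format("jdbc").options(ckOpts).load()

    // Today's temporary detail, computed from the warehouse (illustrative SQL).
    val today = spark.sql(
      """SELECT exp_id, group_name, user_id, SUM(stay_time) AS metric_value
        |FROM dw.user_behavior_detail
        |WHERE dt = '2021-06-01'
        |GROUP BY exp_id, group_name, user_id""".stripMargin)

    // Cumulative detail for means / p-values = cached detail plus today's.
    val detail = cached.unionByName(today)

    // Write today's increment back to CK for the next run and for analysis.
    today.write.format("jdbc").options(ckOpts).mode("append").save()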

Conclusion

After these optimizations of AB metric calculation, the overall calculation time for 150+ experiments and 600+ metrics stayed at 2-3 hours. Most importantly, as experiments and metrics increase, the growth in overall calculation time remains controllable rather than geometric, which achieves the desired optimization effect.