Recently we tried a deep learning model. Its offline AUC was higher than the baseline's, and its online CTR also beat the baseline. However, when we computed AUC from the actual online logs, the deep learning model's AUC was significantly lower than the baseline's, even though in previous model iterations offline AUC had always tracked online CTR.

This is strange. In general, because online samples are biased, when the new model is evaluated on both the baseline traffic and the experimental traffic, its AUC on the experimental traffic should be slightly higher than on the baseline traffic.

The intuition is that the new model surfaces some new good samples of its own, whereas the baseline traffic comes from the baseline's own recommendation and ranking feedback loop, so those good samples never appear there and the new model's AUC on that traffic should be lower. In our experiment, however, the opposite was true.

Analyzing the GAUC metric

Our first reaction was that the AUC metric itself is distorted: because ranking is personalized, ranking results are not comparable across users. One user's negative samples may score higher than another user's positive samples, which distorts the global AUC.

For example, suppose there are two users A and B, each with 10 items, 5 of which are positive samples; we write A+, A-, B+, B- for the two users' positive and negative samples, so 10 of the 20 items are positives. Suppose the model's predicted scores fall in the order A+, A-, B+, B-. If the two users' results are mixed together, the AUC is not very high, because 5 positive samples rank behind 5 negatives; but looked at per user, every positive sample ranks ahead of every negative, and the AUC is 1. Clearly, evaluating each user separately reflects the model better, since it removes the differences between the users themselves.
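To make this concrete, here is a minimal sketch of the toy case above, with made-up scores that realize the A+ > A- > B+ > B- ordering; per-user AUC is 1.0 for both users while the pooled AUC drops to 0.75 (computed with sklearn's roc_auc_score):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy example: two users, 10 items each, 5 positives per user.
# Every item of user A is scored above every item of user B, but within
# each user the positives are scored above the negatives.
y_a = np.array([1] * 5 + [0] * 5)
s_a = np.array([0.90, 0.88, 0.86, 0.84, 0.82,   # A+ scores
                0.80, 0.78, 0.76, 0.74, 0.72])  # A- scores
y_b = np.array([1] * 5 + [0] * 5)
s_b = np.array([0.50, 0.48, 0.46, 0.44, 0.42,   # B+ scores
                0.40, 0.38, 0.36, 0.34, 0.32])  # B- scores

# Per-user AUC: 1.0 for both, since each user's positives outrank their negatives.
print(roc_auc_score(y_a, s_a), roc_auc_score(y_b, s_b))   # 1.0 1.0

# Pooled AUC: A's negatives outrank B's positives, so it drops to 0.75.
print(roc_auc_score(np.concatenate([y_a, y_b]),
                    np.concatenate([s_a, s_b])))           # 0.75
```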

To account for differences across users' samples, a per-user sample weight is usually added to the AUC calculation, which makes the metric more reasonable. The premise of the situation above is that the two users' prediction results interfere with each other: the more concentrated the scores are, the more one user's negatives can outscore another user's positives, so the global metric makes the online ranking look worse, even though each user's own AUC is undisturbed and the online metrics do not actually drop.

We used Ali's GAUC definition:
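In this definition (popularized by the DIN paper), GAUC is the impression-weighted (or click-weighted) average of per-user AUC, GAUC = Σ_i (w_i × AUC_i) / Σ_i w_i, where w_i is user i's impression count and users with only positive or only negative samples are skipped. A minimal sketch of the computation, assuming the evaluation log sits in a pandas DataFrame with hypothetical columns user_id, label, and score:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def gauc(df: pd.DataFrame) -> float:
    """Impression-weighted average of per-user AUC (Ali/DIN-style GAUC).

    df columns (hypothetical): user_id, label (0/1), score.
    Users with only clicks or only non-clicks are skipped, since their
    per-user AUC is undefined.
    """
    weighted_sum, total_weight = 0.0, 0.0
    for _, g in df.groupby("user_id"):
        if g["label"].nunique() < 2:
            continue                          # per-user AUC undefined
        w = len(g)                            # weight = impressions of this user
        weighted_sum += w * roc_auc_score(g["label"], g["score"])
        total_weight += w
    return weighted_sum / total_weight
```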


The recomputed metrics are as follows:


After switching to GAUC, the diff narrowed, but the baseline was still higher, indicating that GAUC and CTR were also inconsistent.

User score distribution


From the per-user AUC distribution we can see that the number of users who were exposed but never clicked increases, and the baseline does poorly in the AUC [0, 0.1] interval. Users whose AUC is 0 do not account for much exposure: for example, 113 users (0.4%) versus 139 users (1.2%), while exposure grows by 1.565%, suggesting that beyond the exposure itself, impressions lean toward samples where users show behavior.
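To reproduce this kind of breakdown, we can bucket users by per-user AUC and compare user counts and exposure share across buckets. A rough sketch, reusing the hypothetical DataFrame layout from above and putting users without both clicks and non-clicks into the AUC = 0 bucket (one possible convention, not necessarily the exact one used here):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def user_auc_distribution(df: pd.DataFrame) -> pd.DataFrame:
    """Bucket users by per-user AUC; users with no clicks fall into the 0 bucket."""
    rows = []
    for uid, g in df.groupby("user_id"):
        if g["label"].nunique() < 2:
            auc = 0.0                      # exposure-only (or click-only) users
        else:
            auc = roc_auc_score(g["label"], g["score"])
        rows.append({"user_id": uid, "auc": auc, "impressions": len(g)})
    users = pd.DataFrame(rows)
    bins = np.arange(0.0, 1.1, 0.1)        # [0, 0.1), [0.1, 0.2), ..., up to 1.0
    users["bucket"] = pd.cut(users["auc"], bins=bins, include_lowest=True)
    return users.groupby("bucket", observed=False).agg(
        n_users=("user_id", "count"),
        exposure_share=("impressions",
                        lambda s: s.sum() / users["impressions"].sum()),
    )
```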

We then removed those samples and considered only users who clicked.



We can see that for users in the AUC interval [0.6, 1], the experiment's predictions are closer to the real label distribution than the baseline's. We believe the proportion of map-wall traffic is higher among users with AUC = 0, where the experiment is slightly above the baseline; among users with behavioral feedback, the experiment's AUC has the advantage in the high end of the distribution.

Notes on per-user AUC calculation


In the figure above, we selected 5 users whose per-user AUC was similar under the baseline and the experiment, yet whose AUC differed greatly between the two once the samples of these users were pooled. This indicates a large gap between the score distributions of the experimental and baseline models, and GAUC alone cannot explain the problem.
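The effect can be reproduced in isolation: two models that rank each user's items equally well, but where one model places different users' scores on very different scales, will show identical per-user AUCs and very different pooled AUCs. A small simulation sketch (synthetic data, not our production logs):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Two users, 50 impressions each; both models see the same labels.
labels = {u: rng.integers(0, 2, size=50) for u in ("A", "B")}

# Model 1: both users' scores live on a shared, comparable scale.
scores_m1 = {u: labels[u] * 0.3 + rng.normal(0.5, 0.1, size=50) for u in ("A", "B")}

# Model 2: same per-user ranking quality, but user B's scores are shifted
# far above user A's, so cross-user comparisons break down.
scores_m2 = {"A": scores_m1["A"], "B": scores_m1["B"] + 5.0}

for name, scores in (("model 1", scores_m1), ("model 2", scores_m2)):
    per_user = [roc_auc_score(labels[u], scores[u]) for u in ("A", "B")]
    pooled = roc_auc_score(np.concatenate([labels["A"], labels["B"]]),
                           np.concatenate([scores["A"], scores["B"]]))
    # Per-user AUCs match across models; the pooled AUC does not.
    print(name, "per-user:", [round(a, 3) for a in per_user],
          "pooled:", round(pooled, 3))
```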

By sampling each user's predicted item scores along a single browsing path, we found that the baseline and the experiment produce different score slopes, which makes the situation described above more likely to occur.
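One way to quantify this "slope" difference is to compare, per user, how spread out each model's predicted scores are (for example the standard deviation or max-min range over one browsing path); a flatter spread for the deep model is consistent with its pooled AUC suffering more. A rough sketch under the same hypothetical DataFrame layout:

```python
import pandas as pd

def per_user_score_spread(df: pd.DataFrame) -> pd.DataFrame:
    """Per-user dispersion of predicted scores (std and max-min range)."""
    return df.groupby("user_id")["score"].agg(
        score_std="std",
        score_range=lambda s: s.max() - s.min(),
    )

# Comparing the two models on the same impressions (hypothetical frames):
# spread_base = per_user_score_spread(baseline_df)
# spread_exp  = per_user_score_spread(experiment_df)
# print(spread_base.mean(), spread_exp.mean())
```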




We found that the baseline model's score predictions show significantly sharper differentiation than the experiment's.

Conclusion

Because the baseline is a large-scale discrete model with 0/1 features, hitting or missing a personalized feature produces a large score diff. The deep model, by contrast, is based on embeddings, so its scores vary much more smoothly under personalization, which leaves the samples in the set more similar and harder to distinguish. The score distributions of the two models therefore differ dramatically; as a result, their AUCs are not directly comparable, and GAUC can only be used as a reference.

When the model changes dramatically, many unexpected problems arise, and in-depth analysis leads to a better understanding of the model and data.