Recently, the Alimama algorithm team published a paper entitled “Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate”, which introduces a new CVR estimation model. The model addresses the problems of sample selection bias and sparse training data, which traditional CVR prediction models struggle to overcome, and the team has released what it describes as the industry’s first large-scale data set containing sequential user behavior (see the download address at the end of the article). The paper has been accepted by SIGIR 2018 (the International ACM SIGIR Conference on Research and Development in Information Retrieval), the top conference in the field of information retrieval.

Accurate estimation of conversion rates is critical in industrial-scale applications such as information retrieval, recommendation systems, and online advertising delivery systems. Take the recommendation system of an e-commerce platform as an example: maximizing gross merchandise volume (GMV) is one of the platform’s important goals, and GMV = traffic × click-through rate × conversion rate × average order value, so the conversion rate is an important factor in optimizing GMV. From the perspective of user experience, an accurately predicted conversion rate is used to balance users’ click preferences against their purchase preferences.

The new CVR prediction model proposed by the Alimama algorithm team in this paper is called the “Entire Space Multi-Task Model”, hereinafter referred to as ESMM. The ESMM model makes novel use of the sequential nature of user behavior to learn both the post-view click-through rate (CTR) and the post-click conversion rate (CVR) over the entire sample space. It addresses the problems of sample selection bias and training data sparsity, which traditional CVR prediction models struggle to overcome.

Taking an e-commerce platform as an example, after viewing the list of recommended items displayed by the system, users may click the items they are interested in and then purchase them. In other words, user behavior follows a sequential decision pattern: impression → click → conversion. The CVR model estimates the probability that a user buys an item after seeing it and clicking through to its details page.

That is, pCVR = p(conversion | click, impression).

Suppose the training data set is $S = \{(x_i, y_i \to z_i)\}|_{i=1}^{N}$, where each sample $(x, y \to z)$ is drawn from a distribution over the domain $X \times Y \times Z$. Here $X$ is the feature space, $Y$ and $Z$ are the label spaces, and $N$ is the total number of samples in the data set. In the CVR estimation task, $x$ is a high-dimensional, sparse, multi-field feature vector, and $y$ and $z$ take the value 0 or 1, indicating whether a click and whether a purchase occurred, respectively. The notation $y \to z$ reveals the sequential nature of user behavior: a click event generally occurs before a purchase event. The goal of the CVR model is to predict the conditional probability pCVR; the two probabilities related to it are the click-through rate pCTR and the click-through-and-conversion rate pCTCVR. The relationship between them is as follows:

$$\underbrace{p(y = 1, z = 1 \mid x)}_{pCTCVR} = \underbrace{p(y = 1 \mid x)}_{pCTR} \times \underbrace{p(z = 1 \mid y = 1, x)}_{pCVR}$$
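The relationship pCTCVR = pCTR × pCVR can be checked numerically on a toy log. The sketch below is only an illustration: the log records and the helper name `empirical_rates` are made up here and do not come from the paper or its data set.

```python
# Hypothetical impression log: one (clicked y, converted z) pair per impression.
# The values are invented for illustration only.
logs = [(1, 1), (1, 0), (0, 0), (0, 0), (1, 0), (0, 0), (0, 0), (1, 1)]


def empirical_rates(records):
    """Return (pCTR, pCVR, pCTCVR) as empirical frequencies over the log."""
    n = len(records)
    clicks = sum(y for y, _ in records)
    # A conversion only counts when it follows a click (y = 1 and z = 1).
    click_and_buy = sum(1 for y, z in records if y == 1 and z == 1)
    p_ctr = clicks / n              # p(y = 1 | x), over all impressions
    p_cvr = click_and_buy / clicks  # p(z = 1 | y = 1, x), over clicks only
    p_ctcvr = click_and_buy / n     # p(y = 1, z = 1 | x), over all impressions
    return p_ctr, p_cvr, p_ctcvr


p_ctr, p_cvr, p_ctcvr = empirical_rates(logs)
# The product of the two factors recovers the joint rate.
assert abs(p_ctcvr - p_ctr * p_cvr) < 1e-12
print(p_ctr, p_cvr, p_ctcvr)
```

Note that pCTR and pCTCVR are both frequencies over all impressions, while pCVR is a frequency over clicked impressions only.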

Traditional CVR prediction tasks usually adopt techniques similar to those used for CTR prediction (yq.aliyun.com/articles/56…), such as the deep learning models that have become popular recently. However, unlike the CTR prediction task, the CVR prediction task faces some unique challenges: 1) sample selection bias; 2) sparse training data; 3) delayed feedback.



Figure 1. Training sample space

The problem of delayed feedback is beyond the scope of this article. The following is a brief introduction to sample selection bias and training data sparsity. As shown in Figure 1, the outermost large ellipse is the entire sample space $S$, and the set of samples with click events ($y = 1$) is $S_c = \{(x_j, z_j) \mid y_j = 1\}|_{j=1}^{M}$, which corresponds to the shaded area in the figure.

A traditional CVR model is trained on the samples in this set, yet the trained model has to make predictions over the entire sample space. Because click events are far rarer than impression events, $S_c$ is only a small subset of $S$, and the features extracted from $S_c$ are biased relative to those extracted from $S$, sometimes very different. A training set constructed this way is therefore effectively sampled from a distribution that differs from the true one, which to some extent violates a premise on which machine learning algorithms rely: training and test samples must be drawn independently from the same distribution, that is, the independent and identically distributed (i.i.d.) assumption. In short, sample selection bias refers to training on samples drawn from a relatively small subset of the sample space while requiring the model to make predictions over the entire sample space. Sample selection bias can harm the generalization performance of the learned model.

The number of items the recommendation system shows to users is much larger than the number of items users click, and users who click at all account for only a small fraction of all users. As a result, the click sample space $S_c$ is typically one to three orders of magnitude smaller than the entire sample space $S$. As shown in Table 1, on the training data set released from Taobao, $S_c$ makes up only 4% of the entire sample space $S$. This is the problem of training data sparsity. Highly sparse training data makes model learning very difficult.
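Both problems can be seen on a toy log. In the sketch below, a single scalar feature stands in for the real high-dimensional sparse vector, and the rule that only the highest-valued impressions are clicked is an assumption made purely for illustration. The click space ends up covering 4% of the log, and its feature mean differs sharply from that of the entire space.

```python
# Hypothetical impression log: (feature x, click y, conversion z).
# Only impressions with the largest feature values are clicked (assumed rule).
S = [(i / 100, int(i >= 96), int(i >= 98)) for i in range(100)]

# S_c keeps only clicked impressions: the only data a traditional CVR model sees.
S_c = [(x, z) for x, y, z in S if y == 1]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


click_share = len(S_c) / len(S)           # data sparsity: share of S covered by S_c
mean_x_all = mean(x for x, _, _ in S)     # feature mean over the entire space
mean_x_clicked = mean(x for x, _ in S_c)  # feature mean over the click space

print(f"|S_c| / |S| = {click_share:.0%}")
print(f"mean x over S = {mean_x_all:.3f}, over S_c = {mean_x_clicked:.3f}")
```

A model fitted on `S_c` alone never sees feature values typical of `S`, yet at serving time it is asked to score every impression in `S`.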



The ESMM model proposed by the Alimama algorithm team draws on the idea of multi-task learning and introduces two auxiliary learning tasks that fit pCTR and pCTCVR respectively, thereby addressing both of the challenges described above at the same time. The ESMM model makes full use of the sequential nature of user behavior; its architecture is shown in Figure 2.



Figure 2. ESMM model

Overall, the ESMM model simultaneously outputs the estimated pCTR, pCVR, and pCTCVR for a given impression. It is composed mainly of two sub-networks: the left sub-network fits pCVR and the right sub-network fits pCTR. The two sub-networks have exactly the same structure, which is referred to here as the BASE model. The outputs of the two sub-networks are multiplied to obtain pCTCVR, which serves as the output of the overall task.
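The data flow of this architecture can be sketched in a few lines of plain Python. This is a minimal illustration rather than the team's implementation: each tower is reduced to a single logistic layer standing in for the BASE sub-network, the embeddings are sum-pooled, and all sizes, names, and random initial values are assumptions.

```python
import math
import random

random.seed(0)
EMBED_DIM = 4    # assumed embedding width
VOCAB_SIZE = 10  # assumed number of distinct sparse feature ids

# One embedding table, looked up by BOTH towers.
embedding = [[random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
             for _ in range(VOCAB_SIZE)]


def make_tower():
    """One logistic layer as a stand-in for the BASE sub-network."""
    return {"w": [random.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)], "b": 0.0}


cvr_tower = make_tower()  # left sub-network, fits pCVR
ctr_tower = make_tower()  # right sub-network, fits pCTR


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def embed(feature_ids):
    """Sum-pool the embeddings of the active sparse feature ids."""
    pooled = [0.0] * EMBED_DIM
    for fid in feature_ids:
        for k in range(EMBED_DIM):
            pooled[k] += embedding[fid][k]
    return pooled


def tower_forward(tower, pooled):
    logit = sum(w * v for w, v in zip(tower["w"], pooled)) + tower["b"]
    return sigmoid(logit)


def esmm_forward(feature_ids):
    """Return (pCTR, pCVR, pCTCVR) for one impression."""
    pooled = embed(feature_ids)
    p_ctr = tower_forward(ctr_tower, pooled)
    p_cvr = tower_forward(cvr_tower, pooled)
    p_ctcvr = p_ctr * p_cvr  # the product is the output of the overall task
    return p_ctr, p_cvr, p_ctcvr


p_ctr, p_cvr, p_ctcvr = esmm_forward([1, 3, 7])
print(p_ctr, p_cvr, p_ctcvr)
```

Because pCTCVR is formed as a product of two probabilities, it always stays in [0, 1] and never exceeds either factor.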

It should be emphasized that the ESMM model has two main characteristics that make it different from the traditional CVR prediction model, as described below.

  1. Modeling over the entire sample space. As the following equation shows, pCVR can be derived by first estimating pCTR and pCTCVR. In principle, this is equivalent to training two separate models to fit pCTR and pCTCVR, and then dividing pCTCVR by pCTR to obtain the final target, pCVR:

$$p(z = 1 \mid y = 1, x) = \frac{p(y = 1, z = 1 \mid x)}{p(y = 1 \mid x)}$$

  • However, pCTR is usually very small, and dividing by a very small floating-point number can cause numerical instability (for example, overflow in the computation). The ESMM model therefore takes the multiplicative form rather than division.
  • pCTR and pCTCVR are the two main factors that the ESMM model estimates, and both are modeled over the entire sample space; pCVR is only an intermediate variable. The ESMM model is thus modeled over the entire sample space, rather than only over the click sample space as traditional CVR prediction models are.
  2. Shared feature representation. The ESMM model borrows the idea of transfer learning and shares the embedding dictionary (the feature representation) between the two sub-networks. The embedding layer maps large-scale sparse input data to low-dimensional representation vectors. The parameters of this layer account for most of the parameters of the whole network and need a large number of training samples to be learned well. Because the CTR task has far more training samples than the CVR task, the feature representation sharing mechanism lets the CVR sub-task also learn from samples that were shown but not clicked, which greatly alleviates the problem of training data sparsity.
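The argument against division can be illustrated with a small numeric sketch. The estimates below are hypothetical outputs of two separately trained models, chosen only to show the effect. A tiny absolute error in either estimate is enough to push the ratio above 1, whereas a product of two probabilities always remains a valid probability.

```python
def cvr_by_division(p_ctcvr_hat, p_ctr_hat):
    """DIVISION-style estimate: the ratio of two separately estimated rates."""
    return p_ctcvr_hat / p_ctr_hat


def ctcvr_by_multiplication(p_ctr_hat, p_cvr_hat):
    """ESMM-style composition: the product of two probabilities."""
    return p_ctr_hat * p_cvr_hat


# Hypothetical estimates; pCTR is small, as is typical in practice.
p_ctr_hat = 0.0008
p_ctcvr_hat = 0.0011  # an absolute error of a few 1e-4 puts it above pCTR

ratio = cvr_by_division(p_ctcvr_hat, p_ctr_hat)
print(ratio)  # exceeds 1, so it is not a valid probability

# The same small shift in the denominator visibly moves the ratio.
print(cvr_by_division(0.0004, 0.0008), cvr_by_division(0.0004, 0.0007))

# The multiplicative form stays inside [0, 1] for any pair of probabilities.
product = ctcvr_by_multiplication(p_ctr_hat, 0.5)
print(product)
```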

It should be added that the loss function of the ESMM model consists of two parts, corresponding to the two sub-tasks of pCTR and pCTCVR. Its form is as follows:

$$L(\theta_{cvr}, \theta_{ctr}) = \sum_{i=1}^{N} l\big(y_i, f(x_i; \theta_{ctr})\big) + \sum_{i=1}^{N} l\big(y_i \,\&\, z_i, f(x_i; \theta_{ctr}) \times f(x_i; \theta_{cvr})\big)$$

Here, $\theta_{ctr}$ and $\theta_{cvr}$ are the parameters of the CTR network and the CVR network respectively, $f(x_i; \theta)$ is the output of the corresponding sub-network, and $l(\cdot)$ is the cross-entropy loss function. In the CTR task, impressions that were clicked are labeled as positive samples and impressions that were not clicked as negative samples. In the CTCVR task, impressions with both a click and a purchase are labeled as positive samples; all others are labeled as negative samples.
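A two-part loss of this kind, one cross-entropy term for the CTR task and one for the CTCVR task, can be sketched as follows. The function names, the clipping constant `eps`, and the sample batch are assumptions for illustration; in a real system the probabilities would come from the two sub-networks rather than being supplied by hand.

```python
import math


def cross_entropy(label, prob, eps=1e-12):
    """Binary cross-entropy for one sample; prob is clipped to avoid log(0)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def esmm_loss(batch):
    """Sum the CTR and CTCVR losses over ALL impressions in the batch.

    batch: iterable of (y, z, p_ctr, p_cvr), where y and z are 0/1 labels and
    p_ctr, p_cvr are the outputs of the two sub-networks.
    """
    total = 0.0
    for y, z, p_ctr, p_cvr in batch:
        total += cross_entropy(y, p_ctr)              # CTR task, label y
        total += cross_entropy(y & z, p_ctr * p_cvr)  # CTCVR task, label y & z
    return total


# Hypothetical batch: not clicked, clicked without purchase, clicked and purchased.
batch = [(0, 0, 0.1, 0.2), (1, 0, 0.6, 0.3), (1, 1, 0.7, 0.5)]
loss = esmm_loss(batch)
print(loss)
```

There is no loss term on pCVR itself. The CVR tower receives its gradient only through the product in the CTCVR term, and that term is evaluated on every impression, including those that were never clicked.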

Because the ESMM model makes novel use of the sequential nature of user behavior to model over the entire sample space, no existing public data set was suitable for testing it. The Alimama algorithm team therefore collected a new data set containing sequential user behavior from Taobao’s production recommendation system and released a sampled version; the download address is: tianchi.aliyun.com/datalab/dat… . Experiments were then run on both the public data set and the Taobao production data set, and ESMM achieved better performance than several other mainstream competing models.



Table 2 compares the AUC of different algorithms on the public data set. The BASE model is the sub-network on the left of the ESMM model. Because it is trained only on the click sample space, it suffers from sample selection bias and data sparsity, so its performance is relatively poor. The DIVISION model trains separate models to fit CTR and CTCVR, then divides the CTCVR model’s prediction by the CTR model’s prediction to obtain the CVR prediction. The ESMM-NS model is a variant of ESMM with the feature representation sharing mechanism removed. AMAN, OVERSAMPLING, and UNBIAS are three competing models.

Figure 3. Performance comparison of several algorithms on the Taobao production data set

Figure 3 compares the test results of the ESMM model on the Taobao production data set. Compared with the BASE model, the ESMM model improves AUC by 2.18% on the CVR task and 2.32% on the CTCVR task. An AUC gain of 0.1% is usually considered a significant improvement.
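For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the function name `auc` and the toy labels and scores are illustrative and unrelated to the experiments reported here.

```python
def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This O(P * N) form is only meant for
    small illustrative inputs.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auc(labels, scores))
```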

To sum up, the ESMM model is a novel CVR prediction method. It is the first to exploit the sequential nature of user behavior to model over the entire sample space, which avoids the sample selection bias and training data sparsity that traditional CVR models often encounter, and it achieves notable results. In addition, the sub-networks in ESMM can be replaced by any learning model, so the ESMM framework can easily be combined with other models to absorb their strengths and further improve performance. The ESMM modeling idea also generalizes readily to full-funnel prediction of multi-stage behavior in e-commerce, such as the chain ranking → impression → click → conversion, which leaves considerable room for further exploration.

Original link: arxiv.org/abs/1804.07…

Data set download address: tianchi.aliyun.com/datalab/dat…