Today, I would like to introduce two articles from Alibaba's recommendation team: Deep Interest Network for Click-Through Rate Prediction and Deep Interest Evolution Network for Click-Through Rate Prediction. This is the first part, which mainly covers the former.

Deep Interest Network (DIN)

Deep Interest Network was proposed in 2017, and its ideas are well worth learning. CTR prediction is an important problem in e-commerce systems and, more broadly, in all recommendation systems. In recent years, deep learning methods have been widely used for CTR prediction. The paper argues that most current deep learning models can be considered Embedding&MLP structures: large-scale, sparse input data is first compressed into low-dimensional embeddings, which become fixed-length input vectors fed into fully connected networks. These models greatly reduce the workload of feature engineering, so they are widely popular.

However, these methods still have disadvantages, and the biggest one is that a fixed-dimension vector cannot fully express the diversity of user interests. We could certainly enlarge the dimension of the vector, but this greatly increases the risk of overfitting. In recommendation systems, interest is generally expressed through past behavior. On the other hand, we don't actually need high-dimensional vectors: users may have a wide range of interests, but we do not need all of them when scoring a particular item. For example, to predict whether a user is going to buy a pair of shoes, we care about her interest in clothes and shoes, not kitchen supplies. The paper calls this property local activation, and attempts to use the attention mechanism to find a better balance between diversity and local activation.

In the Taobao system, an advertisement is a commodity, i.e. an item in the sense of a general recommendation system. The recommendation pipeline consists of two stages: a retrieval stage, generally called matching, followed by ranking. Taobao uses collaborative filtering to complete the matching process and select candidates, and then uses a CTR prediction model to rank them.

Most of the features fed into the ranking model are multi-group categorical features, represented as one-hot or multi-hot vectors. A typical input looks like this:
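To make the two encodings concrete, here is a minimal sketch (the vocabulary size and feature names are illustrative, not Taobao's):

```python
import numpy as np

# Hypothetical vocabulary size; real systems use millions of goods IDs.
NUM_GOODS = 10

def one_hot(idx, size):
    """Encode a single-valued categorical feature (e.g. the candidate ad's ID)."""
    v = np.zeros(size, dtype=np.float32)
    v[idx] = 1.0
    return v

def multi_hot(indices, size):
    """Encode a set-valued feature (e.g. the goods a user has visited)."""
    v = np.zeros(size, dtype=np.float32)
    v[indices] = 1.0
    return v

ad_id = one_hot(3, NUM_GOODS)              # exactly one position is hot
visited = multi_hot([1, 3, 7], NUM_GOODS)  # several positions are hot
```

Note that the multi-hot behavior feature has a different number of active positions for every user, which is exactly the variable-length problem the pooling layer later has to solve.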

Base Model

Taobao originally used an Embedding&MLP model, which we call the Base Model here. This model consists of the following parts:

Embedding Layer

The Embedding Layer converts the high-dimensional binary vectors mentioned above into dense, low-dimensional vectors. For example, the one-hot goods-ID vector, whose dimension equals the vocabulary size, is converted into a low-dimensional embedding vector. The Embedding Layer works as a dictionary lookup and follows these rules:

  • If it is a one-hot vector, we convert it to a single embedding vector.
  • If it is a multi-hot vector, we convert it to a list of embedding vectors.
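Both rules reduce to indexing into an embedding table; a sketch with a toy table (sizes are illustrative, and the table would be learned in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_GOODS, EMB_DIM = 10, 4
emb_table = rng.normal(size=(NUM_GOODS, EMB_DIM))  # learned parameters in reality

# One-hot feature (a single goods ID) -> one embedding vector.
single = emb_table[3]

# Multi-hot feature (the user's visited goods) -> a list of embedding
# vectors whose length varies from user to user.
visited_ids = [1, 3, 7]
emb_list = emb_table[visited_ids]  # shape (3, EMB_DIM)
```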

Pooling layer and Concat layer

As mentioned before, different users have different numbers of behaviors, so we need to combine a variable number of embedding vectors into a fixed-dimension input vector for the MLP. A pooling layer does this; sum-pooling and average pooling are commonly used. The pooled vectors of the different feature groups are then combined in the concat layer.

MLP

The MLP layer is responsible for the prediction, and the loss function is log loss. The structure of the entire model is shown below.
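A minimal sketch of this prediction part, assuming ReLU hidden layers and a sigmoid output (the layer sizes here are illustrative, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_predict(x, weights):
    """Forward pass of a small fully connected network.
    `weights` is a list of (W, b) pairs, one per layer."""
    h = x
    for W, b in weights[:-1]:
        h = np.maximum(W @ h + b, 0.0)  # ReLU hidden layers
    W, b = weights[-1]
    return sigmoid(W @ h + b)           # predicted CTR in (0, 1)

def log_loss(y_true, p):
    """Negative log-likelihood of the observed click label."""
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

rng = np.random.default_rng(0)
weights = [(rng.normal(size=(8, 12)) * 0.1, np.zeros(8)),
           (rng.normal(size=(1, 8)) * 0.1, np.zeros(1))]
x = rng.normal(size=12)  # the concat layer's output
p = mlp_predict(x, weights)
```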

Deep Interest Network

The pooling layer does its job, but it loses a lot of information. The key idea of DIN is to learn the locally activated information with an attention mechanism, thus minimizing this information loss. DIN introduces an activation unit and uses the following architecture:
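The attention-weighted pooling can be sketched as below. Note the activation unit in the paper is a small MLP over the behavior embedding, the ad embedding, and their interaction; here a single linear layer `w` stands in for it, which is a simplifying assumption:

```python
import numpy as np

def din_pooling(behavior_embs, ad_emb, w):
    """Attention-weighted sum-pooling: each past behavior is weighted by
    its relevance to the candidate ad (local activation)."""
    n = behavior_embs.shape[0]
    feats = np.concatenate(
        [behavior_embs,                 # past behavior
         behavior_embs * ad_emb,        # element-wise interaction term
         np.tile(ad_emb, (n, 1))],      # candidate ad, repeated per row
        axis=1)
    scores = feats @ w                  # one relevance weight per behavior
    # DIN deliberately skips softmax normalization, so the magnitude of
    # the pooled vector can reflect the intensity of the matched interest.
    return (scores[:, None] * behavior_embs).sum(axis=0)

rng = np.random.default_rng(0)
EMB_DIM = 4
behaviors = rng.normal(size=(5, EMB_DIM))  # five past behaviors
ad = rng.normal(size=EMB_DIM)
w = rng.normal(size=3 * EMB_DIM)           # stand-in for the activation unit
user_interest = din_pooling(behaviors, ad, w)
```

With this, behaviors relevant to the candidate ad dominate the pooled vector, while irrelevant ones (the kitchen supplies in the earlier example) are suppressed instead of averaged in.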

Training Techniques

The article also proposes some training techniques.

Mini-batch Aware Regularization

We use regularization to prevent overfitting. However, L2 regularization requires updating all parameters in every mini-batch, which imposes a heavy computational burden and is especially unacceptable on large data sets. The paper therefore proposes Mini-batch Aware Regularization, which exploits the sparsity of the data set to reduce the amount of computation. This is not the focus here; see the original article for details.

Data Adaptive Activation Function

PReLU is a common activation function. Its expression is as follows:
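Written out (the formula image did not survive; this is the standard form, consistent with the DIN paper's notation):

```latex
f(s) = p(s) \cdot s + \bigl(1 - p(s)\bigr) \cdot \alpha s,
\qquad p(s) = \mathbb{1}[s > 0]
```

Dice, the data adaptive activation proposed in the paper, keeps this form but replaces the hard indicator with a smooth, data-dependent gate, $p(s) = \sigma\!\bigl((s - E[s]) / \sqrt{\mathrm{Var}[s] + \epsilon}\bigr)$, so the rectification point adapts to the distribution of the layer's inputs.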

Metrics

We generally use AUC to evaluate the results. Alibaba used a weighted AUC and also proposed another metric named RelaImpr.
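As defined in the DIN paper, RelaImpr measures relative improvement over the base model, using 0.5 (the AUC of random guessing) as the zero point:

```latex
\text{RelaImpr} = \left( \frac{\text{AUC}(\text{measured model}) - 0.5}{\text{AUC}(\text{base model}) - 0.5} - 1 \right) \times 100\%
```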

The results of the A/B test are as follows: