0x00 Introduction

Deep Interest Network (DIN) was proposed by Alimama's precision targeting and base algorithm team in June 2017. It is a CTR prediction model for the e-commerce industry that focuses on making full use of, and mining information from, users' historical behavior data.

In this series of articles, we will read the paper and its source code, and along the way review some related deep learning concepts and their TensorFlow implementations.

This article is the first in the series: paper interpretation. Thanks to everyone whose work is shared here; please refer to the links at the end of this article.

0x01 Paper Summary

1.1 Summary

Deep Interest Network (DIN) was proposed by Alimama's precision targeting and base algorithm team in June 2017. It is a CTR prediction model for the e-commerce industry that focuses on making full use of, and mining information from, users' historical behavior data.

DIN introduces the attention mechanism to construct a different abstract representation of the user for each candidate ad, so as to capture the user's current interests more accurately under a fixed embedding dimension.

The core idea is that users' interests are diverse, and different parts of a user's interests are locally activated by different candidate ads.

1.2 Article Information

  • Deep Interest Network for Click-Through Rate Prediction
  • Address: arxiv.org/abs/1706.06…
  • Code address: github.com/zhougr1993/… ; also worth reading is the implementation in github.com/mouna99/die…

1.3 Core Views

The paper first reviews existing CTR prediction models, most of which follow the same pattern:

  • First, a large number of sparse categorical features are mapped into a low-dimensional space via embedding.
  • Then, the low-dimensional representations of these features are combined and transformed group-wise according to the feature category to form fixed-length vectors (commonly via sum pooling / mean pooling).
  • Finally, these vectors are concatenated and fed into a multi-layer perceptron (MLP) to learn the nonlinear relationships among the features.

There is a problem with this pattern. In the e-commerce setting, for example, a user's interests can be described by their historical behavior (the goods, shops, or categories they have visited). But if, following the existing pattern, the user's interests are always mapped to the same fixed-length vector regardless of the candidate ad, the expressive power of the model is greatly limited; after all, users' interests are diverse.

Expressing users' diverse interests is the bottleneck of the Embedding & MLP pattern: the dimension-constrained user representation vector cannot capture them all.

To solve this problem, the paper proposes the DIN network. For each candidate ad, the relevance between that ad and the user's historical behavior is taken into account so that a representation of user interest is learned adaptively. Specifically, the paper introduces the Local Activation Unit, which is based on the attention mechanism: user interest is represented as a weighted sum of the user's historical behaviors, where the weights are learned from the interaction between the candidate ad and the historical behaviors.

In addition, two techniques, Mini-batch Aware Regularization and the Dice activation function, are introduced to help train the large network.

1.4 Terminology

Diversity: users are interested in a variety of products when they visit e-commerce sites; that is, their interests are very broad. For example, from the historical behavior of a young mother we can see that her interests are wide-ranging: sweaters, handbags, earrings, children's wear, sportswear, and so on.

Local Activation: whether a user clicks on a recommended item depends on only a small part of their historical behavior, not all of it; only a portion of the historical data drives the click on a candidate ad. For example, a person who loves swimming has previously bought travel books, ice cream, potato chips, and a swimming cap. The candidate product (ad) now recommended to him is goggles. Whether he clicks on this ad has nothing to do with whether he previously bought chips, books, or ice cream; it is related to his earlier purchase of the swimming cap. That is, in this case, only part of the historical data (the swimming cap) matters, and the rest is of little use.

0x02 Sorting Out the Ideas

This section is mainly excerpted from: Build Wide & Deep by hand with NumPy (see the references).

2.1 Memorization and Generalization

One of the main challenges of recommendation systems is to address both Memorization and Generalization. Memorization, based on historical behavior data, recommends items that are usually directly related to what the user has already interacted with. Generalization learns new feature combinations and improves the diversity of recommended items. In Wide & Deep (and similarly in DeepFM), the wide and deep components correspond to Memorization and Generalization respectively.

2.1.1 Memorization

For CTR prediction problems with large-scale discrete features, applying a nonlinear transformation to the features and then using a linear model is a very common industry practice; the most popular form is "LR + feature cross products". Memorization constructs these nonlinear features through a series of hand-crafted cross products, capturing higher-order correlations between sparse features, that is, "remembering" feature pairs that have appeared together in the historical data.

For example,

Feature 1 -- Major: {computer, humanities, other}
Feature 2 -- Has downloaded the song "Sorrow": {yes, no}

The two one-hot features have 3 and 2 dimensions respectively, and the corresponding cross-product feature is:

Feature 3 -- Major ☓ Has downloaded "Sorrow": {computer ∧ yes, computer ∧ no, humanities ∧ yes, humanities ∧ no, other ∧ yes, other ∧ no}

A typical example is the LR model, which takes a large number of raw sparse features and cross-product features as input. Many raw dense (real-valued) features are also converted into sparse features by bucketing. The advantages of this approach are that the model is highly interpretable, the implementation is fast and efficient, and feature importance is easy to analyze; it has proved very effective in industry.
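To make the cross product concrete, here is a minimal NumPy sketch (the vocabularies and helper functions are illustrative, not taken from any paper): it one-hot encodes the two features above and builds their cross product as the flattened outer product of the two one-hot vectors.

```python
import numpy as np

# Illustrative vocabularies for the two categorical features above.
MAJORS = ["computer", "humanities", "other"]   # feature 1, 3 values
DOWNLOADED = ["yes", "no"]                     # feature 2, 2 values

def one_hot(value, vocab):
    """Return a one-hot vector for `value` over `vocab`."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(value)] = 1.0
    return vec

def cross_product(vec_a, vec_b):
    """Cross-product transformation: outer product of two one-hot vectors,
    flattened into a single sparse vector (here 3 x 2 = 6 dimensions)."""
    return np.outer(vec_a, vec_b).reshape(-1)

major = one_hot("computer", MAJORS)          # [1, 0, 0]
downloaded = one_hot("yes", DOWNLOADED)      # [1, 0]
crossed = cross_product(major, downloaded)   # [1, 0, 0, 0, 0, 0] -> "computer ∧ yes"
print(crossed)
```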

The disadvantages of Memorization are:

  • More manual design is needed;
  • Overfitting may occur. This can be understood as follows: if every feature is crossed with every other, it is almost equivalent to purely memorizing each training sample; this extreme case is the most fine-grained cross product. Generalization can be improved by constructing coarser-grained feature crosses.
  • It cannot capture feature pairs that never appear in the training data. In the example above, if no one in any major has downloaded "Sorrow", the co-occurrence frequency of the two features is 0, and the corresponding weight learned by the model will also be 0.

2.1.2 Generalization

Generalization learns low-dimensional dense embeddings for sparse features to capture feature correlations, and the learned embeddings carry some semantic information. Generalization can be compared to word vectors in NLP: word vectors of different words are correlated, so generalization works by transferring along these correlations. Representative models are DNN and FM.

The advantage of Generalization is that it requires little manual involvement and generalizes better to feature combinations that have never occurred in the history.

In recommendation systems, when the user-item matrix is very sparse, such as users with niche hobbies and rarely-interacted items, a neural network can hardly learn effective embeddings for users and items. In this case most user-item pairs should be uncorrelated, yet the dense embedding approach still produces non-zero predictions for all pairs, which leads to over-generalization and recommending less relevant items. This is where Memorization shows its advantage of "remembering" those particular feature combinations.

2.2 Development Context

The various NN and FM models may seem bewildering, but once you grasp their line of development, paying attention to how they memorize and generalize, how they handle high-dimensional sparse categorical features, and how they realize feature crossing, you will find that all sorts of impressive-looking new algorithms are just refinements along these lines, patching one branch or another. Seen this way, the various NN and FM variants are no longer isolated acronyms in your mind; they can be woven into a coherent network.

Compared with real-valued features, sparse categorical/ID features are the "first-class citizens" of recommendation and search and have been studied more. Even real-valued features such as historical exposure counts, click counts, and CTR are usually converted into categorical features by bucketing before being fed into the model.

However, sparse categorical/ID features also have drawbacks: a single feature has weak expressive power, feature combinations explode combinatorially, and uneven distribution leads to uneven training. A series of new techniques have been developed to address this.

A single categorical/ID feature is very weak in expressive power, so feature crossing is needed to strengthen it, and various algorithms have been derived around how to perform feature crossing.

Deep neural networks (DNNs) first map categorical/ID features into dense vectors via embedding and then feed them to the network, so that the DNN can automatically learn deep crosses between these features and improve generalization.

0x03 DNN

3.1 Deep Model Ideas

Accurate CTR estimation requires carefully balancing the interests of users, advertisers, and the platform. After years of development, CTR prediction has evolved from LR/FM, to ensemble models (RF/GBDT/XGBoost), to deep CTR prediction models (FNN/PNN/WDL/DeepFM/DIN). The main thread running through this evolution is: how can the model mine combination features automatically?

Such as:

  • Wide&Deep and DeepFM: the combination of high-order and low-order features is used to improve the expression ability of the model;
  • PNN: A product layer (inner product and outer product) is introduced before MLP, which emphasizes the crossover mode between feature Embedding vectors and makes it easier for the model to capture the cross-information of features.

Or consider Alibaba's starting point:

The first thing considered was dimensionality reduction, and on that basis, feature combination, so a DNN was a natural choice. Another consideration was modeling the user behavior sequence: after a user opens the mobile Taobao app, they first click a product in one section, then click a product in "guess you like it", and what they finally do in search is influenced by the preceding behavior. There are many methods that realize such an idea indirectly, but it is hard for an LR model to support such features in direct modeling, so an RNN model comes to mind easily.

3.2 Inside DNN Models

Most DNN models follow the basic Embedding + MLP architecture: the original high-dimensional discrete features are mapped to fixed-length low-dimensional embedding vectors, and these embedding vectors are used as input to several fully connected layers that fit high-order nonlinear relationships. Finally, the output is normalized to 0~1 by a sigmoid (or similar), representing the click probability. Compared with traditional LR, GBDT, and FM models, this kind of DNN model removes much of the manual feature construction and can learn nonlinear relationships between features.

The usual process is:

Sparse Features -> Embedding Vector -> Pooling Layer -> MLPs -> Sigmoid -> Output
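As a rough illustration of this pipeline, here is a minimal NumPy forward pass. It is only an outline of the Embedding & MLP pattern, not the paper's actual base model: vocabulary sizes, shapes, and the randomly initialized parameters are all illustrative, and a real model would learn these parameters with TensorFlow.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE, EMB_DIM, HIDDEN = 1000, 8, 16
goods_emb = rng.normal(size=(VOCAB_SIZE, EMB_DIM))  # embedding table for goods ids
user_emb = rng.normal(size=(100, EMB_DIM))          # embedding table for user ids
W1 = rng.normal(size=(2 * EMB_DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(size=(HIDDEN, 1)); b2 = np.zeros(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def base_model_forward(user_id, behavior_goods_ids):
    """Sparse features -> embedding -> sum pooling -> concat -> MLP -> sigmoid."""
    behavior_vectors = goods_emb[behavior_goods_ids]        # (seq_len, EMB_DIM)
    user_interest = behavior_vectors.sum(axis=0)            # sum pooling -> fixed length
    x = np.concatenate([user_emb[user_id], user_interest])  # concat layer
    h = np.maximum(0.0, x @ W1 + b1)                        # ReLU hidden layer
    return sigmoid(h @ W2 + b2)                             # click probability

print(base_model_forward(user_id=7, behavior_goods_ids=[3, 42, 917]))
```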

3.3 Working Mechanism

The Base Model shown below is the pattern adopted by most existing CTR models

In the figure, the red, blue, and pink nodes represent Goods IDs, Shop IDs, and Category IDs respectively. White nodes represent the other input features (for example, the user features on the left, such as the User ID). Note that Goods 1 through Goods N describe the user's historical behavior, and the ad itself is also a commodity with the three features Goods/Shop/Cate ID.

Observing the working mechanism of the Base Model from the bottom up:

  • The first module: feature representation.
    • Features can be roughly divided into four categories: user profile, user behavior, ad, and context.
    • The target ad is one of the inputs.
    • Each category contains multiple fields: the user profile includes gender, age, and so on; user behavior includes the goods the user has visited; the ad includes ad ID, shop ID, etc.; context includes display type, time, and so on.
    • Some features can be encoded as one-hot representations; for example, female can be encoded as [0,1]. Other features are multi-hot encoded; unlike one-hot encoding, a multi-hot vector may contain multiple 1s.
    • In CTR sequence models it is worth noting that each behavior field contains a list of behaviors, and each behavior corresponds to a one-hot vector.
  • The second module: embedding layer.
    • By learning low-dimensional vector representations of the features, the high-dimensional sparse feature matrix is transformed into a low-dimensional dense feature matrix.
    • Each field has its own embedding matrix.
    • Note that the number of columns of the behavior embedding list E is not fixed, because each user's historical behavior differs; it therefore cannot simply be concatenated end to end with the embedding vectors of the other fields as input to the MLP layer.
  • The third module: pooling layer.
    • Since different users have different numbers of behaviors, the embedding lists have inconsistent sizes, while fully connected layers can only handle fixed-dimension input, so a fixed-length vector is obtained through the pooling layer.
    • This layer applies sum pooling to E: the embedding vectors of a category are pooled into a fixed-length vector, solving the variable-dimension problem.
  • The fourth module: concat layer.
    • After the embedding and pooling layers, the original sparse features have been converted into several fixed-length abstract representation vectors of user interest.
    • The concat layer then aggregates these vectors and outputs a single abstract representation vector of the user's interest, which serves as the input to the MLP layer.
  • The fifth module: MLP layer. It takes the vector output by the concat layer as input and automatically learns high-order cross features among the data.
  • Loss function: the loss widely used in deep-learning-based CTR models is the negative log-likelihood (LogLoss), which supervises the overall prediction with the click label.

3.4 Model Features

Advantages:

  • The neural network can fit high-order nonlinear relationships, reducing the amount of manual feature engineering.

Disadvantages:

  • The diversity of user interests that can be expressed is limited (this is the biggest bottleneck). Each user's history contains a different number of clicks and carries a large amount of interest information. To model these diverse interests, the existing pattern encodes them into a single fixed-length vector (the user representation, standing for the user's interests) via pooling (sum or average), so information is lost. For example:
    • A k-dimensional vector can express at most k independent interests, but a user's interests may number more than k;
    • The size of k significantly affects the amount of computation. A larger k generally works better, i.e., expanding the vector dimension, but this increases the number of parameters to learn and the risk of overfitting on limited data.
  • The relationship between the user and the ad is not considered. In e-commerce, user behavior data contains a large amount of user interest information, and previous work did not model the unique structure of behavior data (Diversity + Local Activation). For example, for the same user, if the candidate ad changes, the user's interest is still expressed by the same vector, which clearly limits the expressive power of the model; after all, a user's interests are rich and changing.
  • The mining and representation of implicit features is ignored. The DNN model takes the user's behavior directly as the user's interest. Behavior is the carrier of interest and can reflect interest, but expressing interest directly by behavior is somewhat inappropriate. Behaviors are sequential; if, like most existing models, one adopts the behavior-as-interest approach, the dependencies between behaviors are ignored. In addition, the interest at the current moment often directly leads to the next behavior.
  • Changes in interest are ignored. As mentioned earlier, users' interests keep changing. For example, a user's preference for clothes changes with seasons, fashion trends, and personal taste, presenting a continuous trend of change. On the Taobao platform, users' interests are rich and varied, the evolution of each interest is largely independent of the others, and only the interests related to the target product influence the final behavior.
  • It is unnecessary to compress all of a user's interests into one vector, because only part of the user's interests affects the current behavior (clicking or not clicking the candidate ad). For example, a female swimmer will click on recommended goggles mostly because she bought a swimsuit, not because of the shoes on last week's shopping list.

0x04 DIN

To solve these problems of the DNN base model, Alibaba proposed the DIN model. The core idea is that users' interests are diverse, and different parts of a user's interests produce different local activations for a specific ad. DIN models both Diversity and Local Activation.

Instead of expressing all of a user's different interests with one and the same vector, DIN adaptively computes a representation vector of user interest for a given ad by considering the relevance of the historical behaviors to that ad; this representation vector therefore varies from ad to ad. Specifically, DIN introduces a local activation unit that attends to the relevant user interests by soft-searching the relevant parts of the historical behavior, and uses a weighted sum to obtain the representation of user interest with respect to the candidate ad. Behaviors that are highly relevant to the candidate ad receive higher activation weights and dominate the user interest representation. Because the representation vector differs across ads, the expressive power of the model is greatly improved.

4.1 Innovations

Deep Interest Network has the following innovations:

  • For’m:DIN is used for a wide range of user interestsan interest distributionTo represent the Diversity model by Pooling (weighted sum).
  • In view of the Local Activation:
    • DNN loses a lot of information by directly solving sum or average. Therefore, DIN is slightly improved to realize Local Activation by using attention mechanism, learn user interest vector dynamically from user history behavior, and construct different user abstract representation for different ads, thus realizing that when the data dimension is certain, Capture users’ current interests more accurately.
    • The historical behaviors of users are weighted differently. For different ads, different behavior ids are assigned different weights, which are jointly determined by the current behavior ID and candidate ads. This is the Attention mechanism. That is, for the current candidate Ad, delocalized activation (Local Activate) relevant historical interest information.
    • The more relevant the historical behavior is to the current candidate Ad, the higher the score will beattention score“Will dominate this forecast.
  • In CTR, features are sparse and have high dimensions, and overfitting is usually prevented by means of L1, L2 and Dropout. Since the traditional L2 regularization computs all the parameters, the model parameters of CTR are often in the hundreds of millions. DIN proposed a regularization method, which gives different regularization weights to features of different frequencies in each small batch iteration.
  • Since the traditional activation function, such as Relu, outputs 0 when the input is less than 0, the iteration speed of many network nodes will be slow. Although PRelu speeds up the iteration speed, its segmentation point is 0 by default. Actually, the segmentation point should be determined by the data. Therefore, DIN proposed a data dynamic adaptive activation function Dice.
  • Model training for large-scale sparse data: when THE DEPTH of DNN is deep (with many parameters) and the input is very sparse, it is easy to overfit. DIN proposed Adaptive regularizaion to prevent overfitting with remarkable effect.
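As a reference for the Dice point above, here is a minimal NumPy sketch of the Dice activation, showing only training-time behavior with batch statistics (in the paper, α is a learned parameter and moving averages of the statistics are used at inference):

```python
import numpy as np

def dice(x, alpha=0.1, eps=1e-8):
    """Dice activation: f(x) = p(x) * x + (1 - p(x)) * alpha * x,
    where p(x) is a sigmoid of the batch-normalized input, so the
    rectification point adapts to the data distribution instead of being fixed at 0."""
    mean = x.mean(axis=0)                 # per-unit batch statistics (training time)
    var = x.var(axis=0)
    x_norm = (x - mean) / np.sqrt(var + eps)
    p = 1.0 / (1.0 + np.exp(-x_norm))     # control gate p(x)
    return p * x + (1.0 - p) * alpha * x

batch = np.random.randn(4, 3)             # (batch_size, units)
print(dice(batch))
```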

4.2 Architecture

The DIN architecture diagram is as follows:

DIN models both Diversity and Local Activation, as illustrated in the figure below.

Let’s take a look at each part of the system.

0x05 Features

5.1 Feature Classification

In the paper, the authors divide the features of Alibaba's display advertising system into four categories:

1) User portrait features;

2) User behavior features, i.e., the sequence of goods clicked by the user, whose length differs from user to user;

3) The ad to be shown; an ad is actually itself a commodity;

4) Context features;

Each feature category includes multiple feature fields. For example, the user profile features include gender, age, etc.; user behavior features include the goods clicked by the user, their categories, and the shops they belong to; the context includes time, etc.

5.2 Input Features

The input features in CTR are typically:

  • High-dimensional
  • Very sparse

Some feature fields are single-valued features, and different feature values are mutually exclusive. For example, gender can only belong to male or female, which can be transformed into one-hot representation.

Other feature fields are multi-valued discrete features, such as user behavior: a user may click multiple goods, forming a clicked-goods sequence, which can only be represented by multi-hot encoding. Unlike one-hot encoding, a multi-hot vector may contain multiple 1s. For example:

  • The videos a user has watched and searched on YouTube: there is more than one of each, but the number watched and searched is tiny (very sparse) relative to all videos.

  • In e-commerce, a user may have interacted with multiple goods_ids and shop_ids, which directly makes the length of the historical behavior id sequence differ from user to user.
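A minimal sketch of the difference (the goods vocabulary is illustrative): a single-valued field becomes a one-hot vector, while a clicked-goods field becomes a multi-hot vector that may contain several 1s.

```python
import numpy as np

GOODS_VOCAB = ["swimsuit", "goggles", "milk_powder", "diaper", "t_shirt"]

def one_hot(value, vocab):
    vec = np.zeros(len(vocab))
    vec[vocab.index(value)] = 1.0
    return vec

def multi_hot(values, vocab):
    vec = np.zeros(len(vocab))
    for v in values:
        vec[vocab.index(v)] = 1.0            # several positions can be 1
    return vec

print(one_hot("goggles", GOODS_VOCAB))                       # single-valued feature
print(multi_hot(["swimsuit", "milk_powder"], GOODS_VOCAB))   # clicked-goods sequence
```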

5.3 Feature Processing

No manual feature combination or crossing is performed; instead, the DNN is used to learn the interaction information between features.

Processing single-valued features is relatively simple; multi-valued features are slightly more troublesome, because they make the sample length differ from user to user. How is this solved? Embedding -> Pooling + Attention.

0x06 Embedding

The application of deep learning to recommendation and search is built on sparse ID-class features, and its main tool is embedding: the "exact match" of ID-class features is changed into a "fuzzy lookup" to improve generalization. That is, high-dimensional, sparse categorical/ID features are mapped into low-dimensional dense vectors by embedding.

6.1 Characteristics

Features of Embedding are as follows:

  • Deep learning applications in recommendation systems, for example the various NN and FM models, are all based on embedding.
  • The features of high-dimensional and sparse categorical/ ID are first-class citizens in the recommendation system.
  • In Embedding layer, each feature domain corresponds to an Embedding matrix.
  • The function of embedding is to transform the original “exact matching” of the high dimensional and sparse categorical/ ID class into “fuzzy search” among vectors, which improves the scalability.
  • Embedding in recommendation systems is also different from Embedding in NLP.
    • In NLP, each position holds exactly one word, so embedding usually amounts to extracting the row vector corresponding to that word from the embedding matrix;
    • In a recommendation system, a field often contains multiple features. Merging the embeddings of multiple features into one vector is called Pooling. For example, if the features under an App field are "WeChat: 0.9, Weibo: 0.5, Taobao: 0.3", the result is Embedding = 0.9 * WeChat vector + 0.5 * Weibo vector + 0.3 * Taobao vector.

6.2 Variable-Length Features

An MLP can only accept fixed-length input, but the length of each user's clicked-goods sequence over a period of time may differ; it is a variable-length feature. How should such a feature be handled?

Generally it is handled by a pooling layer, so let's look at the pooling layer.

0x07 Pooling Layer

The function of pooling is to transform the list of embedding vectors into a single fixed-length vector, solving the variable-dimension problem.

7.1 The Role of Pooling

Users have multiple interests, which leads to two problems:

  • When users express their interests, their historical behaviors often involve multiple categorical/ ID features, such as multiple products clicked, multiple videos watched and multiple search terms entered, which involve multiple good_id and shop_id.
  • Different users have different numbers of historical behaviors, so the multi-hot behavior features produce embedding vector lists of different lengths, while the fully connected layers require fixed-length input.

To reduce the dimensionality and make arithmetic operations between goods and shops meaningful, we first embed the ID features.

So how do we model users' diverse interests? We "merge" the list of low-dimensional vectors obtained by embedding these ID features into one vector as the representation of user interest.

Since fully connected layers require fixed-length input, we must "merge" them into a fixed-length vector before feeding the DNN.

This "merging" is what is called Pooling.

7.2 Implementation Method

There are different strategies for the Pooling process:

  • YouTube DNN takes the simplest and most intuitive approach: it simply averages the embedding vectors of the videos the user has watched and the keywords the user has searched.

  • Neural Factorization Machine compresses n (n = number of features) k-dimensional vectors into one k-dimensional vector through the so-called Bi-Interaction Pooling, which completes the pooling while realizing second-order feature crossing.

  • DIN implements pooling by weighted average of embedding vectors, and the “weight” is calculated by the attention mechanism.

  • Deep-learning-based text classification faces the same problem of compressing multiple word vectors in a paragraph into one vector that represents the paragraph. A common method is to feed the word vectors into an RNN and let the RNN's output vector at the last time step represent the "combined" result. DIEN clearly borrowed this idea, modifying the GRU structure so that the attention score controls the gate.

7.3 Pooling in the DNN Base Model

The DNN base model adopts pooling; generally there are two methods, sum pooling (element-wise accumulation) and average pooling (element-wise averaging). All the resulting vectors are then concatenated to obtain the overall representation vector of the instance.

  • Sum pooling sums the embeddings of the clicked goods along each dimension. For example, if there are 10 goods embeddings in the click sequence and the embedding dimension is 16, the 10 embeddings are summed over dimensions 1 to 16.
  • Average pooling averages the embeddings along each dimension. After pooling, no matter how many goods were clicked, the final representation vector has the same dimension as each individual goods embedding.
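A minimal NumPy sketch of the two pooling variants over variable-length behavior sequences (the padding-and-mask handling is an illustrative detail, not something prescribed by the paper):

```python
import numpy as np

def pool_behaviors(behavior_emb, mask, mode="sum"):
    """behavior_emb: (batch, max_len, emb_dim) padded behavior embeddings;
    mask: (batch, max_len) with 1 for real behaviors, 0 for padding.
    Returns a fixed-length (batch, emb_dim) vector regardless of sequence length."""
    masked = behavior_emb * mask[:, :, None]    # zero out padded positions
    pooled = masked.sum(axis=1)                 # sum pooling over the sequence
    if mode == "mean":
        pooled = pooled / np.maximum(mask.sum(axis=1, keepdims=True), 1.0)
    return pooled

emb = np.random.randn(2, 4, 16)                 # 2 users, up to 4 behaviors, dim 16
mask = np.array([[1, 1, 1, 0],                  # user 0 clicked 3 goods
                 [1, 1, 0, 0]])                 # user 1 clicked 2 goods
print(pool_behaviors(emb, mask, "sum").shape)   # (2, 16): fixed length either way
```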

In the base model, for any candidate to be predicted, whether the candidate is clothing, electronics, or anything else, the user's representation vector is fixed and unchanged; it does not discriminate between candidates.

7.4 DIN

Returning to Alibaba's display advertising system: as shown in the architecture diagram, each commodity has three feature fields, the commodity itself, its category, and the shop it belongs to. For each commodity, its representation vector is obtained by adding the embeddings of these three features.

In the architecture diagram, pooling is applied to the commodity sequence; the resulting vector is the representation of the user behavior sequence, which is then concatenated with the other features as input to the MLP.

Apart from the candidate's embedding, the remaining embedding part of the MLP input can be regarded as the user's representation vector.

A careful study of the Pooling Layer in the Base Model will find that the Pooling operation loses a lot of information.

Therefore, DIN uses weighted-sum pooling to model Diversity: a plain sum cannot reflect differences among interests, while weighting can.

That is, DIN implements pooling by weighted average of embedding vectors, and the “weight” is calculated by the attention mechanism.

0x08 Attention Mechanism

A simple way to understand the attention mechanism: for different ads, the weights of the user's historical behaviors with respect to that ad are different. Suppose a user has three historical behaviors A, B, and C. For ad D, the weights of A, B, and C might be 0.8, 0.1, and 0.1; for ad E, they might be 0.3, 0.6, and 0.1. These weights are exactly what the Activation Unit in the architecture diagram must learn.

The DIN model is essentially the DNN base model plus attention. Pooling is implemented through attention, and the user interest vector differs depending on the candidate item, giving each candidate its own view of the user's interests.

The objective of the model is to fully mine the relationship between user interest and the candidate ad from the user's historical behavior. Whether or not a user clicks an ad often depends on only some of their earlier interests, which is the basis for applying the attention mechanism here. Both the user's interest behaviors and the candidate ads are mapped into the embedding space, so the relationship between them is learned in that space.

8.1 The Problem

DIN’s attention mechanism was designed in part to use a vector of fix length to depict the user’s interest in different items, a seemingly simple point that is difficult in practice.

  • In order to obtain a fixed length Embedding Vector representation in the traditional DNN model, the original Embedding Vector representation is described inEmbedding LayerI’ll add one after thatPooling Layer. Pooling can be done by sum or average. I end up with a fixed lengthEmbedding VectorIs an abstract representation of a user’s interest, often referred to asUser Representation. The downside is that some information is lost.
  • The dimension of user Embedding Vector is K, which represents k independent interests at most. But the user’s interests are far more than K, how to do?
  • The traditional DNN model is inEmbedding Layer -> Pooling LayerWhen the expression of user interest is obtained, the relationship between user and advertisement is not considered, that is, the weight between different advertisements is consistent. Such traditional estimation methods use the same vector to represent a user for different products (advertising). If you want to express multiple interests in this case, the simplest solution would be to increase the dimension of the user vector, which would lead to overfitting and computational stress.

So DIN tries to solve this problem with a attention-like mechanism.

8.2 Attention mechanism

As the name implies, the attention mechanism means that when making a prediction the model pays different amounts of attention to different user behaviors: "relevant" history gets more attention, while "irrelevant" history can even be ignored. That is, different features are assigned different weights so that certain features dominate this particular prediction, as if the model were paying attention to them.

Reflected in the model, this idea is also intuitive. For example, in a video recommendation model, DIN can add user historical behavior features, say the roughly 20 most recent show_ids and video_ids the user watched, run them through the attention network, and finally merge the result with the non-behavioral features in the MLP.

DIN uses the attention mechanism to model local activation better: when computing the user interest representation, different historical behaviors receive different weights, i.e., local activation is realized via Embedding Layer -> Pooling Layer + Attention. Seen from the backward pass, the current candidate ad activates the user's relevant historical interests in reverse, assigning different weights to different historical behaviors.

Instead of representing a user's interest as a single point, DIN proposes a distribution that varies from moment to moment. The distribution can be multimodal, indicating that each person has multiple interests; a peak indicates an interest, and its height indicates the interest's intensity. Thus, for different candidate ads, the user's interest intensity differs, i.e., it changes as the candidate ad changes. Because user interest is a multi-peak function, almost unlimited expressive power can be obtained even in a low-dimensional space.

In other words: assume the embedding vector representing user interest is Vu and the candidate ad is Va; then Vu is a function of Va. That is, for different ads, the user has different interest representations (different embedding vectors):

Vu = f(Va) = Σi wi · Vi = Σi g(Vi, Va) · Vi, where the sum runs over the user's N historical behaviors.

Among them:

  • Vi represents the embedding vector of behavior id i, such as a goods_id or shop_id;
  • Vu is the weighted sum of all the behaviors and represents user interest;
  • The candidate ad influences the weight of each behavior id; this is Local Activation;
  • The weight wi represents the contribution of each behavior id to the total user interest embedding vector, given the current candidate ad Va. In the actual implementation, the weight is produced by a small unit g(Vi, Va), built with the Dice activation function, that takes Vi and Va as inputs.

8.3 Implementation

DIN doesn’t have a direct attention mechanism. Because the user interest vector should be different for different AD candidates.

The Local Activation Unit borrows the attention mechanism from NMT to implement its own form of attention. It learns the relationship between the candidate ad and the user's historical behaviors, produces the correlation between the candidate ad and each historical behavior (i.e., the weight parameters), and then takes a weighted sum over the behavior sequence to obtain the representation of user interest. In other words, the user shows a different interest representation for each ad: even if the historical behaviors are the same, the weight of each behavior differs.

When DIN pools, the items related to the candidate receive larger weights, while items unrelated to the candidate receive smaller ones; this is the attention idea. The attention score is computed by letting the candidate interact with each item in the click sequence; the input to this computation includes the embedding vectors of the item and the candidate, as well as their outer product. Different candidates therefore yield different user representation vectors, giving the model greater flexibility.

In DIN, the user interest vector computed by the Local Activation Unit for candidate ad A is:

Vu(A) = f(Va, e1, e2, …, eH) = Σj g(ej, Va) · ej = Σj wj · ej, where the sum runs over the H behaviors in the user's history.

Among them,

  • ej represents the embedding vector of one of user U's historical behaviors (e.g., a goods_id or shop_id); the behavior sequence has length H;
  • Vu represents the weighted sum of all the user's behavior embedding vectors and stands for user interest;
  • Va represents the embedding vector of ad A;
  • wj represents the weight of ej;
  • The weight represents the contribution of each behavior to the total user interest embedding vector, given the current candidate ad Va;
  • In the implementation, the weight wj is fitted by a function computed by the Activation Unit, written as g(ej, Va), with ej and Va as inputs and the Dice activation function inside;
  • The candidate ad influences the weight of each behavior, i.e., Local Activation;
  • g(·) is a feed-forward network whose output serves as the local-activation weight multiplying the corresponding behavior embedding.

With this computation, user U's final interest vector changes with the ad A; this is the "many-faceted user interest". For example, for a user who previously bought milk powder and a swimsuit, showing her goggles is more likely to recall the swimsuit she bought, while showing her a diaper clearly recalls the milk powder.

In DIN's attention mechanism, the user interest vector Vu is a weighted average of the embedding vectors of the items in the user's history. The weight wj of the j-th historical item is jointly determined by the embedding vector ej of that item and the embedding vector Va of the candidate (the function g). Thus the same user has different interest vectors when facing different candidates, realizing a per-candidate view of the user.

The main difference between DIN and the base model lies in the activation unit. This structure computes a weight from the similarity between the ad embedding and each behavior embedding, then takes the weighted sum over the behavior sequence, which achieves excellent performance.
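Below is a minimal NumPy sketch of this idea, with random untrained parameters and a simple two-layer activation unit whose input is the concatenation of the behavior embedding, the candidate embedding, and their element-wise product; the exact interaction features and activation differ between the paper and the open-source implementations, so treat this only as an outline. Note that, as Section 8.4 explains, the weights are not softmax-normalized.

```python
import numpy as np

rng = np.random.default_rng(1)
EMB_DIM, HIDDEN = 8, 16
W1 = rng.normal(size=(3 * EMB_DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(size=(HIDDEN, 1)); b2 = np.zeros(1)

def activation_unit(e_j, v_a):
    """Compute the activation weight w_j = g(e_j, v_a) for one historical behavior."""
    x = np.concatenate([e_j, v_a, e_j * v_a])   # behavior, candidate, their interaction
    h = np.maximum(0.0, x @ W1 + b1)            # PReLU/Dice in the paper; ReLU here
    return (h @ W2 + b2).item()                 # scalar weight, deliberately not normalized

def din_user_interest(behavior_embs, candidate_emb):
    """Vu(A) = sum_j g(e_j, v_a) * e_j : weighted-sum pooling of the behaviors."""
    weights = np.array([activation_unit(e, candidate_emb) for e in behavior_embs])
    return (weights[:, None] * behavior_embs).sum(axis=0)

behaviors = rng.normal(size=(5, EMB_DIM))             # 5 historical behavior embeddings
candidate = rng.normal(size=EMB_DIM)                  # candidate ad embedding
print(din_user_interest(behaviors, candidate).shape)  # (8,) - changes with the candidate
```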

8.4 Normalization

Generally speaking, when doing attention, you need to normalize all the scores through Softmax. This has two advantages: one is to ensure that the weights are non-negative, and the other is to ensure that the sum of the weights is 1.

However, the DIN paper emphasizes that the attention scores over the click sequence are not normalized: the scores are used directly in the weighted sum with the corresponding goods embedding vectors, in order to preserve the user's interest intensity. For example, if a user's click sequence is 90% clothes and 10% electronics, and CTR must be predicted for a T-shirt and a mobile phone, then the T-shirt activates most of the user's behaviors, making the user behavior vector computed for the T-shirt numerically larger.

0x09 Evaluation Metric

The evaluation metric is GAUC, a measure introduced by Alibaba. Practice has shown GAUC to be more stable and reliable than AUC.

AUC represents the probability that a positive sample is scored higher than a negative sample. In CTR scenarios, the prediction is used to rank the candidate ads for each individual user, but users differ: some users naturally click more. Earlier evaluation computed AUC over all samples regardless of user. The paper instead uses GAUC, a user-level AUC: the AUC is computed per user and then averaged, weighted by each user's number of clicks or impressions, which removes the influence of user bias and describes the model's performance more accurately.
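A minimal sketch of GAUC, assuming each user's impression count is used as the weight and scikit-learn's roc_auc_score for the per-user AUC (users whose labels are all 0 or all 1 are skipped, a common practical convention rather than something specified here):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gauc(user_ids, labels, scores):
    """Weighted average of per-user AUC, weighted by each user's impression count."""
    user_ids, labels, scores = map(np.asarray, (user_ids, labels, scores))
    total_weight, weighted_auc = 0.0, 0.0
    for uid in np.unique(user_ids):
        idx = user_ids == uid
        y, s = labels[idx], scores[idx]
        if y.min() == y.max():          # AUC undefined if this user has only one class
            continue
        weight = idx.sum()              # impressions of this user
        weighted_auc += weight * roc_auc_score(y, s)
        total_weight += weight
    return weighted_auc / total_weight

print(gauc(user_ids=[1, 1, 1, 2, 2],
           labels=[1, 0, 0, 1, 0],
           scores=[0.9, 0.3, 0.4, 0.2, 0.6]))
```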

0x10 Adaptive Regularization

Because the deep model is complex and the input is sparse, there are many parameters and overfitting happens very easily.

CTR input is sparse and high-dimensional. The usual methods of preventing overfitting, L1, L2, and Dropout, did not achieve very good results in the paper's experiments. User data follows a long-tail law: many feature ids appear only a few times, while a small fraction appear many times. This adds a lot of noise during training and aggravates overfitting.

A simple remedy is to manually filter out feature ids that appear infrequently. The drawbacks: the information lost is hard to assess, and the threshold setting is very crude.

The solution given by DIN is:

  1. According to the frequency of feature ID, the intensity of regularization should be adjusted accordingly.
  2. For those with high frequency, less regularization intensity is given;
  3. For those with low frequency, greater regularization intensity is given.

As for the improvement to L2 regularization: during SGD optimization, each mini-batch contains only part of the training data, and backpropagation only updates the parameters of the non-zero features. Once L2 is added, however, the parameters of the whole network, including all feature embeddings, would have to be updated in every step; this computation is huge and unacceptable. The paper therefore applies L2 regularization only to the embedding parameters of the features that appear in each mini-batch.
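A minimal sketch of the idea (the function and variable names are illustrative; the paper's update scales the L2 term for embedding row j by 1/n_j, the feature's global occurrence count, and applies it only when that feature appears in the current mini-batch):

```python
import numpy as np

def mba_reg_update(emb_table, batch_feature_ids, feature_counts, grad, lr=0.01, lam=0.01):
    """Mini-batch aware L2: only rows whose feature id appears in this batch are
    regularized, and the penalty is scaled by 1 / n_j (the feature's global frequency),
    so rare, noisy ids get a relatively stronger penalty than frequent ones."""
    emb_table -= lr * grad                           # ordinary gradient step
    for j in np.unique(batch_feature_ids):           # only rows touched by this mini-batch
        emb_table[j] -= lr * lam * emb_table[j] / feature_counts[j]
    return emb_table

emb = np.random.randn(5, 4)                          # tiny embedding table
counts = np.array([100, 3, 50, 1, 7])                # global occurrence count n_j
batch_ids = np.array([1, 1, 4])                      # ids appearing in this batch
emb = mba_reg_update(emb, batch_ids, counts, grad=np.zeros_like(emb))
```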

0x11 Summary

The thesis is summarized as follows:

  1. Users have multiple interests and have interacted with many goods_ids and shop_ids. To reduce the dimensionality and make arithmetic operations between goods and shops meaningful, these ids are first embedded. How are the user's diverse interests then modeled? Pooling is used to sum or average the embedding vectors; this also solves the problem of different users having inputs of different lengths, yielding a fixed-length vector. That vector is the user representation, i.e., the representation of user interest.
  2. However, taking the sum or average directly loses a lot of information. So a slight improvement is made: different behavior ids are assigned different weights, jointly determined by the behavior id and the candidate ad. This is the attention mechanism, which enables Local Activation.
  3. DIN uses the activation unit to capture Local Activation, and weighted sum pooling to capture the Diversity structure.
  4. For model training and optimization, DIN proposes the Dice activation function and adaptive (mini-batch aware) regularization, which significantly improve model performance and convergence speed.

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

If you want to get a timely news feed of personal articles, or want to see the technical information of personal recommendations, please pay attention.

0xFF Reference

Build Wide & Deep by hand with NumPy

How Google implements The Wide & Deep Model (1)

How to make recommendations using deep learning on YouTube

Also review Deep Interest Evolution Network

The evolution of Ali CTR algorithm from DIN to DIEN

Chapter 7 Artificial Intelligence, 7.6 APPLICATION of DNN in Search Scenarios (Author: Renzhong)

#Paper Reading# Deep Interest Network for Click-Through Rate Prediction

【Paper Reading】Deep Interest Evolution Network for Click-Through Rate Prediction

Deep Interest Evolution Network for Click-Through Rate Prediction

Deep Interest Evolution Network(AAAI 2019)

Deep Interest Evolution Network for Click-Through Rate Prediction

DIN(Deep Interest Network): core ideas + source code to read notes

Calculating advertising CTR Estimation Series (5)– Ali’s Deep Interest Network Theory

Detailed explanation of Deep Interest NetWork model principle of CTR prediction

LSTM that everyone can understand

Understand RNN, LSTM and GRU from the driven graph

Machine Learning (I) — NN & LSTM

Li Hongyi machine Learning (2016)

Recommendation system meets deep learning (24)– Deep interest evolution network DIEN principle and actual combat!

ImportError: DLL load failed: from google.protobuf.pyext import _message

DIN deep interest network introduction and source analysis

Deep Interest Network for Click-Through Rate Prediction

Ali CTR Prediction Trilogy (2): Deep Interest Evolution Network for Click-Through Rate Prediction

Deep Interest Network interpretation

Deep Interest Network (DIN)

DIN paper official implementation analysis

Ali DIN source code how to model user sequence (1) : Base scheme

How to model user sequences (2) : DIN and feature Engineering perspectives

Ali Deep Interest Network (DIN) paper translation

Recommendation system meets deep learning (18)– Probe into ali’s deep Interest Network (DIN) analysis and implementation

[Paper introduction] 2018 Alibaba CTR prediction model –DIN(Deep Interest Network), attached with TF2.0 recurrence code

2019 Ali CTR Prediction Model –DIEN(Deep Interest Evolution Network)