0x00 Introduction

We introduced Alibaba’s Deep Interest Network (DIN) earlier. A year later, Alibaba upgraded its model again to the Deep Interest Evolution Network (DIEN).

In this series of articles, we will review some concepts related to deep learning and the implementation of TensorFlow by reading the DIN & DIEN papers and source code.

This is the sixth article in the series: an interpretation of the DIEN paper. It draws on a large number of other articles; sincere thanks to their authors for sharing. Links are collected at the end of this article.

0x01 Paper Summary

1.1 Article Information

  • Deep Interest Evolution Network for Click-Through Rate Prediction
  • Address: arxiv.org/pdf/1809.03…
  • Code address: github.com/mouna99/die…

1.2 Basic Views

1.2.1 DIN issues

DIN ignores changes in interest.

  • Users’ interests are constantly changing. For example, users’ preferences for clothes will change with the seasons, fashion trends and personal tastes, presenting a continuous trend of change.
  • On the Taobao platform, users’ interests are diverse, and each interest evolves largely independently of the others.
  • Moreover, only the interest related to the target item influences the final behavior.

1.2.2 DIEN innovation

The author points out that previous CTR estimation methods directly treat the representation vector of user behavior as the user’s interest, instead of modeling the latent interest through a dedicated representation. Therefore, DIEN was proposed [key point: interest directly drives consecutive behaviors, so it is necessary to model user interest and its evolution, and to mine from the user’s historical behavior the interest, and the evolution of that interest, related to the target item].

DIEN has two key modules:

  • One is the interest extractor layer:
    • Latent interest is extracted from explicit user behaviors by simulating the user’s interest transfer process, mainly using a GRU plus an auxiliary loss. DIN does not consider temporal relationships within the user history, while DIEN uses a GRU to model the user history as a time series.
    • The downside of using a GRU directly is that the hidden state only captures the dependencies between behaviors, not the interest itself. Since the click on the target item is triggered by the final interest, a plain GRU can only learn the dependence between behaviors and cannot reflect the user’s interest well.
    • Innovation: because the interest state at each step directly leads to the next consecutive behavior, the authors propose an auxiliary loss that uses the next behavior to supervise the learning of the interest state.
  • One is the interest evolving layer:
    • The diversity of interests leads to the phenomenon of interest drift: in adjacent visits, a user’s intentions may be very different, and a behavior may depend on a behavior from long ago.
    • On top of the interest sequence obtained from the interest extractor layer, an attention mechanism is added to simulate the interest evolution process related to the current target advertisement. A GRU with an Attentional Update Gate (AUGRU) is used to model the process of interest change.
    • AUGRU strengthens the influence of relevant interests during interest evolution and weakens the effect of irrelevant interests caused by interest drift. By introducing the attention mechanism into the update gate, AUGRU can model the evolution of the specific interests relevant to different target items.

In short: at each training step, an auxiliary loss is introduced for the Interest Extractor layer, and an attention mechanism is added to the Interest Evolving layer.

1.3 Terminology

Latent interest: interest itself is latent; users’ interactions with the system are the carrier through which interest is expressed.

Interest evolving: influenced by the external environment and internal cognitive changes, users’ interests tend to change over time. Taking clothes shopping as an example, users’ clothing preferences change with the seasons, fashion trends, and personal taste.

Therefore, to predict the click-through rate well, we must capture not only the user’s interest but also the process by which it changes.

0x02 Overview

2.1 Model Architecture

The DIEN architecture is as follows:

Similar to DIN, the overall architecture is input layer + embedding layer + connection layer + multi-layer fully connected neural network + output layer.

Unlike DIN, DIEN organizes user behaviors as sequential data and replaces the simple cross-product activation unit with an attention-based GRU network.

DIEN is divided into several layers, from bottom to top:

  • Behavior Layer: converts the items the user has browsed into corresponding embeddings and sorts them by browsing time; that is, the raw ID-class behavior sequence features are converted into an embedding behavior sequence.
  • Interest Extractor Layer: extracts the user’s interest sequence from the behavior sequence by simulating the user’s interest transfer process.
  • Interest Evolving Layer: adds an attention mechanism on top of the Interest Extractor Layer to simulate the evolution of the interests related to the current target advertisement, modeling the evolution process of interest related to the target item.
  • Finally, the interest representation is concatenated with the embedding vectors of the ad, user profile, and context, and an MLP completes the final prediction.

To be more specific:

  • The user history is a time series; if it is fed into an RNN, the last state can be considered to contain all the historical information. The author therefore uses a two-layer GRU to model user interest.
    • The sequence of embeddings of the items the user has historically interacted with is fed into the first GRU layer, which outputs the user’s interest at each time step. This layer is called the Interest Extractor Layer.
    • The output of the first layer is fed into the second GRU layer, whose update gate is controlled by an attention score (calculated from the first layer’s output vectors and the candidate item). This layer is called the Interest Evolving Layer.
  • The last state of the Interest Evolving Layer serves as the vector representation of the user’s interest and is fed into an MLP together with the features of the ad and context to predict the click-through rate.
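The data flow above can be sketched end to end. This is a minimal illustration, not the released code: the callables `gru1_step`, `augru_step`, `attn_fn`, and `mlp` are hypothetical stand-ins for the learned sub-modules.

```python
import numpy as np

def dien_forward(behavior_embs, ad_emb, profile_emb, context_emb,
                 gru1_step, augru_step, attn_fn, mlp):
    """High-level DIEN data flow; the callables stand in for learned sub-modules."""
    # 1) Interest Extractor Layer: the first GRU runs over the behavior sequence
    h = np.zeros_like(behavior_embs[0])
    interests = []
    for e_t in behavior_embs:
        h = gru1_step(e_t, h)
        interests.append(h)
    # 2) Interest Evolving Layer: attention w.r.t. the ad drives the second GRU
    scores = attn_fn(np.stack(interests), ad_emb)
    h2 = np.zeros_like(h)
    for h_t, a_t in zip(interests, scores):
        h2 = augru_step(h2, h_t, a_t)
    # 3) Concatenate the final interest state with ad/profile/context, run the MLP
    x = np.concatenate([h2, ad_emb, profile_emb, context_emb])
    return mlp(x)

# Toy stand-ins just to show the plumbing
d = 4
out = dien_forward(
    [np.full(d, 0.1)] * 3, np.ones(d), np.ones(d), np.ones(d),
    gru1_step=lambda e, h: 0.5 * h + 0.5 * e,             # toy recurrent step
    augru_step=lambda h, x, a: (1 - a) * h + a * x,       # toy attentional update
    attn_fn=lambda H, ad: np.full(len(H), 1.0 / len(H)),  # uniform attention
    mlp=lambda x: 1.0 / (1.0 + np.exp(-x.sum())),         # toy sigmoid "MLP"
)
```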

0x03 Interest Extractor Layer

3.1 Previous Work

The DIEN paper notes that some previous algorithms introduced RNNs to exploit the dependencies in the behavior sequence (i.e., the sequence of browsed items), which works better than direct pooling. The problem with these algorithms, however, is that they take the RNN’s hidden output directly as the expression of the user’s interest. The item embeddings are the real expression of the items and a direct reflection of user interest, while the RNN’s hidden output vectors may not truly express user interest.

Because the interest at the current moment directly influences the behavior at the next moment, while a plain GRU is not targeted at expressing interest, an auxiliary loss function is designed that uses the behavior at the next moment to supervise the learning of the interest at the current moment. This forces the RNN’s hidden outputs to interact with the item embeddings, as shown by the Auxiliary Loss on the left of the architecture diagram.

That is:

  • The user’s behaviors are sequence data generated over time, so an RNN with a GRU structure is used.
  • The user’s current interest directly leads to the next behavior, so an auxiliary loss function is designed that supervises the learning of the current moment’s interest with the next moment’s behavior.

3.2 How GRU Helps

The basic structure of the interest extractor layer is a network of Gated Recurrent Units (GRU), shown as the yellow region of the architecture diagram, which uses a GRU to model the dependence between user behaviors.

User behavior in an e-commerce system is rich: even over a short period such as two weeks, the historical behavior sequence can be very long. To balance efficiency and performance, a GRU is used to model the behaviors.

The input of the GRU is the user’s time-sorted behavior sequence, that is, the items corresponding to the behaviors (item vectors ordered by time step). Compared with the traditional sequence models RNN and LSTM, the GRU alleviates the vanishing-gradient problem of vanilla RNNs; compared with LSTM, the GRU has fewer parameters and converges faster in training.

The notation is as follows:

  • At time step t, the GRU takes input e(t) and outputs the hidden state h(t);
  • the input vector e(t+1) of the next time step is the positive sample, and a negative sample e'(t+1) is randomly drawn, with e'(t+1) ≠ e(t+1);
  • h(t) takes the inner product with the positive and negative sample vectors.
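To make the notation concrete, here is a minimal NumPy sketch of one GRU step and the inner products later fed to the auxiliary loss. The weights are random and the dimensions are made up for illustration; this is not the released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with random weights (for illustration only)."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Stacked weights for update gate u, reset gate r, and candidate state
        self.W = rng.normal(0, 0.1, (3, hidden_dim, input_dim))
        self.U = rng.normal(0, 0.1, (3, hidden_dim, hidden_dim))
        self.b = np.zeros((3, hidden_dim))

    def step(self, e_t, h_prev):
        u = sigmoid(self.W[0] @ e_t + self.U[0] @ h_prev + self.b[0])  # update gate
        r = sigmoid(self.W[1] @ e_t + self.U[1] @ h_prev + self.b[1])  # reset gate
        h_tilde = np.tanh(self.W[2] @ e_t + self.U[2] @ (r * h_prev) + self.b[2])
        return (1 - u) * h_prev + u * h_tilde  # h(t)

cell = GRUCell(input_dim=4, hidden_dim=4)
h = np.zeros(4)
e_t = np.ones(4)            # behavior embedding at step t
h = cell.step(e_t, h)       # hidden state h(t), read as the interest state
e_pos = np.full(4, 0.5)     # e(t+1): the real next behavior (positive sample)
e_neg = -np.full(4, 0.5)    # e'(t+1): a sampled negative
score_pos, score_neg = h @ e_pos, h @ e_neg  # inner products for the auxiliary loss
```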

3.3 Auxiliary loss

For the hidden state of a sequence model to effectively represent latent interest, additional supervision of the hidden state, such as introducing ranking information, should be applied. Ranking losses have been widely used in the ranking tasks of recommendation systems.

3.3.1 Auxiliary loss

The auxiliary loss comes from all click records rather than from the target ad alone, which also helps avoid vanishing gradients. Clicked samples are taken as positive samples, and unclicked samples as negative samples.

DIEN defines auxiliary loss as follows:
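The formula image did not survive extraction; the definition amounts to a per-step binary classification: the real next behavior should score high against h(t), and a sampled negative low. A NumPy sketch follows, with shapes and names of my own choosing, not the released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def auxiliary_loss(hidden_states, pos_embs, neg_embs):
    """Auxiliary loss over one behavior sequence.

    hidden_states: h(1..T-1), shape (T-1, d) -- interest states from the GRU
    pos_embs:      e(2..T),   shape (T-1, d) -- the real next-clicked items
    neg_embs:      e'(2..T),  shape (T-1, d) -- randomly sampled negatives
    """
    pos_scores = np.sum(hidden_states * pos_embs, axis=1)  # inner products
    neg_scores = np.sum(hidden_states * neg_embs, axis=1)
    # Binary log-likelihood: the real next click is positive, the sampled item negative
    return -np.mean(np.log(sigmoid(pos_scores)) + np.log(1.0 - sigmoid(neg_scores)))

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))
loss = auxiliary_loss(h, pos_embs=h, neg_embs=-h)  # toy check: h aligned with positives
```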

3.3.2 Global loss

The global loss function used by DIEN is L = L(target) + α · L(aux), where:

  • L(target) is the loss function of the CTR prediction task;
  • the CTR loss and the auxiliary loss are added together to optimize the whole network;
  • α is a hyperparameter that balances the final CTR prediction against the interest representation.

3.3.3 Effect of auxiliary loss

DIEN points out that the GRU can only learn the dependencies between behaviors and does not reflect user interests well. The target label only contains the supervision signal for the final interest; the intermediate hidden states h(t) receive no supervision signal to guide learning. Interest leads to multiple consecutive behaviors, so the auxiliary loss was introduced to improve the accuracy of the interest representation.

Specifically, the behavior b(t+1) at time t+1 is used as supervision to learn the hidden vector h(t). Besides using the real next behavior as the positive sample, negative examples can be drawn randomly from items the user has never interacted with, or from items shown to the user but not clicked. The positive and negative samples represent item vectors that the user did and did not click, respectively.
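Negative sampling for the auxiliary loss can be as simple as drawing from items other than the real next click. The helper below uses my own naming and samples uniformly from a given pool; it is a toy sketch, not the paper's exact sampler.

```python
import random

def sample_negatives(next_item, item_pool, k=1, seed=42):
    """Draw k negatives for the auxiliary loss: any item except the real next click."""
    rng = random.Random(seed)
    candidates = [item for item in item_pool if item != next_item]
    return rng.sample(candidates, k)

negs = sample_negatives(next_item=3, item_pool=list(range(10)), k=2)
```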

The advantages of introducing Auxiliary Loss include:

  • It helps the hidden states of the GRU better represent user interest. With the auxiliary loss, each hidden state of the GRU represents the user’s interest state at that time step, and the concatenation of all interest states forms an interest sequence.

  • When an RNN models long sequences, the gradient may not propagate well back to the beginning of the sequence; introducing an auxiliary supervision signal at each position of the sequence reduces the difficulty of optimization to some extent.

  • The auxiliary loss provides more semantic information for embedding learning, yielding better item embeddings.

3.4 Summary

After the interest extractor layer composed of the GRU, the user’s behavior vectors b(t) are further abstracted into interest state vectors h(t).

In other words, the function of the interest extractor layer is to mine the connections between items in the behavior sequence and to extract and express the user’s interest.

0x04 Interest Evolving Layer

The main goal of Interest Evolution Layer is to depict the Evolution process of user Interest.

User interests are constantly changing:

  • The preferences of users in a certain period of time have a certain concentration. A user might buy books at one time and clothes at another;
  • Each interest has its own evolution trend, and there is little interaction between different kinds of interests. For example, the interests of buying books and clothes are basically unrelated to each other.

This change will directly affect the user’s click decision. Modeling the evolution of user interests has two benefits:

  • Tracking users’ interest can enable us to include more historical information when learning the expression of final interest.
  • CTR prediction can be better based on the changing trend of interest.

4.1 Evolution Rules

With the change of external environment and internal cognition, users’ interests are constantly changing, so users’ behaviors are affected by different interests. Compared with the interest extraction layer, the biggest feature of the interest evolution layer is the introduction of Attention mechanism, in order to more specifically simulate the interest evolution path related to target advertising.

A recommendation model can never be separated from its specific business scenario. In Alibaba’s e-commerce environment, users are very likely to be interested in several categories of goods at the same time, for example browsing items under the “clothes” category while buying a “mechanical keyboard”. Thus, when the target item is an electronic product, the evolution path of the interest related to “mechanical keyboard” matters more than the path related to “clothes”.

The evolution of user interest has the following rules:

  • Interest drift: due to the diversity of interests, interest may drift. A user’s interest in a given period has a certain focus; for example, a user might keep buying books for a while and then switch to clothes.
  • Interest individuality: each kind of interest has its own development trend, and different kinds of interest rarely affect each other; for example, interest in buying books and interest in buying clothes are basically unrelated. We therefore focus only on the evolution associated with the target item.

4.2 AUGRU

Based on the above rules, the interest evolving layer introduces the attention mechanism through AUGRU (GRU with Attentional Update Gate). The relevance is computed from the interest state and the target item; AUGRU strengthens the influence of relevant interests and weakens that of irrelevant ones, so as to capture the interest related to the target item and its evolution.

By analyzing the characteristics of interest evolution, the author combines the local activation ability of the attention mechanism with the sequential learning ability of the GRU to model interest evolution. At each step of the GRU, the attention mechanism strengthens the influence of relevant interest and attenuates interference from interest drift.

Given the user’s interest representations, the role of the interest evolving layer is to capture the interest evolution pattern related to the candidate item, shown as the red region of the architecture diagram, where the second GRU is used. The embedding vector of the candidate interacts with the hidden output vectors of the first GRU to produce the attention scores. Note that, unlike DIN, the attention scores are normalized with a softmax. The attention score reflects the relationship between the target item and the current interest state: the stronger the correlation, the larger the score.
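The attention score in the paper is a bilinear form between each hidden state and the candidate’s embedding, normalized by a softmax over the sequence: a(t) = exp(h(t)ᵀ W e_a) / Σ_j exp(h(j)ᵀ W e_a). A sketch, with a random W standing in for the learned matrix:

```python
import numpy as np

def attention_scores(hidden_states, candidate, W):
    """Softmax over t of the bilinear scores h_t^T W e_a."""
    logits = hidden_states @ W @ candidate  # shape (T,)
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(1)
T, d = 6, 4
h = rng.normal(size=(T, d))    # interest states from the first GRU
e_a = rng.normal(size=d)       # candidate ad embedding
W = rng.normal(size=(d, d))    # random stand-in for the learned matrix
a = attention_scores(h, e_a, W)
```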

4.3 Attention

How is the attention mechanism added to the GRU? Three methods are tried in the paper:

  • GRU with attentional input (AIGRU): applies the attention mechanism at the input.

    AIGRU uses the attention score to influence the input of the interest evolving layer, simply multiplying the input by the attention score. Ideally, less relevant interests would have smaller input values, so the evolution trend of the interest related to the target item could be modeled. However, AIGRU does not work well: even a zero input still changes the hidden state of the GRU, so less relevant interests still affect the learning of interest evolution.

  • Attention-based GRU (AGRU): replaces the GRU’s update gate with the attention score, which directly controls the updating of the hidden state.

    AGRU uses the attention score to control hidden-state updates directly, weakening the influence of less relevant interests during interest evolution. Embedding attention into the GRU strengthens the effect of the attention mechanism and helps AGRU overcome the defect of AIGRU. However, AGRU uses a scalar (the attention score) in place of a vector (the update gate), ignoring the differences in importance between dimensions.

  • GRU with attentional update gate (AUGRU)

    AUGRU keeps the original dimensional information of the update gate and scales all dimensions of the update gate by the attention score, so that less relevant interests have less influence on the hidden state. AUGRU more effectively avoids the interference brought by interest drift and promotes the smooth evolution of the relevant interest.

AUGRU is the most effective of the three. The attention score is multiplied into the update gate in place of the original update gate, hence the name AUGRU, where A refers to attention and U refers to the update gate.
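A minimal AUGRU step, showing how the scalar attention score scales the (vector) update gate, in contrast to AGRU, which replaces the gate with the scalar outright. Weights are random stand-ins, not the released parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def augru_step(h_prev, x_t, a_t, params):
    """One AUGRU step: the update gate u is scaled by the attention score a_t,
    so low-relevance interests barely change the hidden state."""
    Wu, Uu, Wr, Ur, Wh, Uh = params
    u = sigmoid(Wu @ x_t + Uu @ h_prev)           # ordinary update gate (a vector)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)           # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))
    u_att = a_t * u                               # AUGRU: attention scales the gate
    return (1.0 - u_att) * h_prev + u_att * h_tilde

rng = np.random.default_rng(0)
d = 4
params = tuple(rng.normal(0, 0.1, (d, d)) for _ in range(6))
h_prev = rng.normal(size=d)   # previous hidden state of the second GRU
x_t = rng.normal(size=d)      # interest state h(t) from the first GRU
h_relevant = augru_step(h_prev, x_t, 1.0, params)    # a_t = 1: full update
h_irrelevant = augru_step(h_prev, x_t, 0.0, params)  # a_t = 0: state unchanged
```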

Let’s also note a drawback of DIEN. A GRU is a serial structure that must be computed step by step in time. DIEN has two GRUs, and the second must compute attention based on the results of the first, so it can only run after the first GRU has finished entirely. The two GRUs cannot be computed in parallel, which may cause long serving latency: the longer the sequence, the longer the delay. The paper states that the input sequence length in the industrial scenario is 50, so the cumulative latency of the two GRUs is equivalent to that of a single GRU over a sequence of length 100.

4.4 Characteristics

The advantages of modeling interest evolution are as follows:

  • The interest evolution module can provide more relevant historical information for the final interest representation.
  • It allows better CTR prediction for the target item based on the evolution trend of interest.

The interest evolution layer combines the local activation ability of attention mechanism with the sequential learning ability of GRU to achieve the goal of modeling interest evolution.

0x05 Summary

DIEN’s main contributions are as follows:

  • This paper focuses on the evolution of interest in e-commerce system and proposes a new network architecture to model the evolution process of interest. Interest evolution model makes interest representation richer and CTR prediction more accurate.
  • Instead of taking behavior directly as interest, DIEN has a dedicated interest extractor layer, and a kind of auxiliary loss is proposed to solve the problem that the GRU’s hidden state expresses interest poorly.
  • The interest evolution layer is designed. The interest evolution layer effectively simulates the interest evolution process related to the target project.

The next article will introduce the overall architecture of the model source code, so stay tuned.

0xEE Personal information

★★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

If you want to get a timely news feed of personal articles, or want to see the technical information of personal recommendations, please pay attention.

0xFF Reference

blog.csdn.net/John159151/…

[Paper Reading] Deep Interest Evolution Network for Click-Through Rate Prediction

Also review Deep Interest Evolution Network

zhuanlan.zhihu.com/p/134170462

Deep Interest Evolution Network for Click-Through Rate Prediction

LSTM that everyone can understand

Understand RNN, LSTM and GRU from the driven graph

Machine learning (I) — NN&LSTm

Li Hongyi machine Learning (2016)

Recommendation system meets deep learning (24) – Deep interest evolution network DIEN principle and practice!

ImportError: DLL load failed — from google.protobuf.pyext import _message

DIN deep interest network introduction and source analysis

blog.csdn.net/qq_35564813…

Ali CTR Prediction Trilogy (2): Deep Interest Evolution Network for Click-Through Rate Prediction


2019 Alibaba CTR Prediction Model — DIEN(Deep Interest Evolution Network)