0x00 Abstract

Deep Interest Network (DIN) was proposed by Alimama's precise targeting and basic algorithm team in June 2017. It targets CTR prediction for the e-commerce industry and focuses on making full use of, and mining, the information in historical user behavior data.

This is the third article in the series; it analyzes the overall idea of the DIN source code, based on the implementation at github.com/mouna99/die… .

Because this project includes several models, such as DIN and DIEN, some files are used only by the DIEN model; these are mentioned here in passing and will be explained in a dedicated article later.

0x01 File Overview

Data files mainly include:

  • uid_voc.pkl: user dictionary, mapping user names to ids;
  • mid_voc.pkl: movie dictionary, mapping items to ids;
  • cat_voc.pkl: category dictionary, mapping categories to ids;
  • item-info: category information of each item;
  • reviews-info: review metadata in the format userID, itemID, score, timestamp, used for negative sampling;
  • local_train_splitByUser: training data in the format label, user name, target item, target item category, history items, and the corresponding categories of the history items;
  • local_test_splitByUser: test data in the same format as the training data;

The code mainly includes:

  • rnn.py: modifies the original RNN in TensorFlow so that attention can be combined with the RNN;
  • vecAttGruCell.py: modifies the GRU source code to add attention to it, implementing the AUGRU structure;
  • data_iterator.py: data iterator, used to feed data in continuously;
  • utils.py: auxiliary functions, such as the Dice activation function, attention score calculation, etc.;
  • model.py: model file (contains DIN, DIEN, and baseline models);
  • train.py: entry point of the model, responsible for training, saving the model, and testing.

0x02 Overall Architecture

DIN tries to capture the different degrees of similarity between previously clicked items and the target item.

Let us first take the architecture diagram from the paper for illustration.

  • Deep Interest Network has the following innovations:

    1. For Diversity: to represent a user's wide range of interests, DIN uses an interest distribution, i.e. it models Diversity with Pooling (weighted sum).
    2. For Local Activation: DIN uses the attention mechanism to realize Local Activation. It dynamically learns the user interest vector from the user's historical behavior and builds a different user representation for each candidate ad, so as to capture the user's current interest more accurately. The historical behaviors of a user are weighted differently, and these weights differ for different ads. That is, for the current candidate ad, the model locally activates (Local Activate) the relevant historical interest information: historical behaviors that are more relevant to the current candidate ad receive higher attention scores and dominate the prediction.
    3. In CTR prediction, features are sparse and high-dimensional, and overfitting is usually prevented with L1/L2 regularization and Dropout. Since traditional L2 regularization is computed over all parameters, and CTR models often have hundreds of millions of parameters, DIN proposed a regularization method that assigns different regularization weights to features of different frequencies in each mini-batch iteration (a minimal sketch of this idea appears right after this list).
    4. Traditional activation functions such as ReLU output 0 when the input is less than 0, which slows down the updates of many network nodes. PReLU speeds this up, but its rectification point is fixed at 0, whereas it should really be determined by the data. DIN therefore proposed Dice, a data-adaptive activation function.
    5. Model training for large-scale sparse data: when the DNN is deep (with many parameters) and the input is very sparse, it is easy to overfit. DIN proposed adaptive regularization to prevent overfitting, with remarkable effect.
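As a concrete illustration of item 3, here is a minimal sketch of the mini-batch aware regularization idea from the paper. It is not the code of this repo; mini_batch_aware_l2, embedding_var, batch_ids and feature_freq are assumed names. Only the embedding rows of features that actually appear in the current mini-batch are penalized, and each row's penalty is scaled by the inverse of that feature's frequency.

import tensorflow as tf

def mini_batch_aware_l2(embedding_var, batch_ids, feature_freq, lam=1e-6):
    # embedding_var: [V, D] embedding table; batch_ids: [B] feature ids seen in this mini-batch
    # feature_freq: [V] occurrence count of each feature id (assumed precomputed over the training set)
    ids, _ = tf.unique(batch_ids)                       # only regularize ids that occur in this batch
    rows = tf.nn.embedding_lookup(embedding_var, ids)   # [n, D] embedding rows to penalize
    freq = tf.cast(tf.gather(feature_freq, ids), tf.float32)
    weights = 1.0 / tf.maximum(freq, 1.0)               # rarer features get a larger penalty weight
    return lam * tf.reduce_sum(weights * tf.reduce_sum(tf.square(rows), axis=1))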

0x03 Overall code

The DIN code starts with train.py, which first evaluates the test set with the initial model and then calls train():

  • Get training data and test data, both of which are data iterators for continuous data input
  • Generate the corresponding model based on model_type
  • During batch training, the test set is evaluated every test_iter iterations.

The code is as follows:

def train(
        train_file = "local_train_splitByUser",
        test_file = "local_test_splitByUser",
        uid_voc = "uid_voc.pkl",
        mid_voc = "mid_voc.pkl",
        cat_voc = "cat_voc.pkl",
        batch_size = 128,
        maxlen = 100,
        test_iter = 100,
        save_iter = 100,
        model_type = 'DNN',
        seed = 2):

    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
        ## Training data
        train_data = DataIterator(train_file, uid_voc, mid_voc, cat_voc, batch_size, maxlen, shuffle_each_epoch=False)
        ## Test data
        test_data = DataIterator(test_file, uid_voc, mid_voc, cat_voc, batch_size, maxlen)
        n_uid, n_mid, n_cat = train_data.get_n()

        ......
        
        elif model_type == 'DIN':
            model = Model_DIN(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
        elif model_type == 'DIEN':
            model = Model_DIN_V2_Gru_Vec_attGru_Neg(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
            
        ......    

        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())

        iter = 0
        lr = 0.001
        for itr in range(3):
            loss_sum = 0.0
            accuracy_sum = 0.
            aux_loss_sum = 0.
            for src, tgt in train_data:
                uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, noclk_mids, noclk_cats = prepare_data(src, tgt, maxlen, return_neg=True)
                loss, acc, aux_loss = model.train(sess, [uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, lr, noclk_mids, noclk_cats])
                loss_sum += loss
                accuracy_sum += acc
                aux_loss_sum += aux_loss
                iter += 1
                if (iter % test_iter) == 0:
                    eval(sess, test_data, model, best_model_path)
                    loss_sum = 0.0
                    accuracy_sum = 0.0
                    aux_loss_sum = 0.0
                if (iter % save_iter) == 0:
                    model.save(sess, model_path+"--"+str(iter))
            lr *= 0.5

0x04 Model base class

The base class of the model is Model, and its constructor __init__ can be understood as the Behavior Layer: its main function is to convert the items browsed by the user into the corresponding embeddings and arrange them by browsing time, i.e. to turn the original ID-based behavior sequence into an embedding behavior sequence.

4.1 Basic Logic

The basic logic is as follows:

  • Construct the various placeholder variables under the 'Inputs' scope;
  • Build the embedding lookup tables for users and items under the 'Embedding_layer' scope, converting the input data into the corresponding embeddings;
  • Combine the various embedding vectors; for example, the embedding of the item ID and the embedding of the item's category ID are concatenated together as the item embedding.

4.2 Module Analysis

Below, B is the batch size, T is the sequence length, and H is the hidden size. The initialization variables in the program are as follows:

EMBEDDING_DIM = 18
HIDDEN_SIZE = 18 * 2
ATTENTION_SIZE = 18 * 2
best_auc = 0.0

4.2.1 Building variables

The first is to build the placeholder variable.

with tf.name_scope('Inputs'):
    # shape: [B, T], history item id sequence; T is the sequence length
    self.mid_his_batch_ph = tf.placeholder(tf.int32, [None, None], name='mid_his_batch_ph')
    # shape: [B, T], history category id sequence; T is the sequence length
    self.cat_his_batch_ph = tf.placeholder(tf.int32, [None, None], name='cat_his_batch_ph')
    # shape: [B], user id sequence. (B: batch size)
    self.uid_batch_ph = tf.placeholder(tf.int32, [None, ], name='uid_batch_ph')
    # shape: [B], movie id sequence. (B: batch size)
    self.mid_batch_ph = tf.placeholder(tf.int32, [None, ], name='mid_batch_ph')
    # shape: [B], category id sequence. (B: batch size)
    self.cat_batch_ph = tf.placeholder(tf.int32, [None, ], name='cat_batch_ph')
    # shape: [B, T], marks which positions of the history sequence are real
    self.mask = tf.placeholder(tf.float32, [None, None], name='mask')
    # shape: [B]; sl: sequence length, the actual length of each User Behavior sequence
    self.seq_len_ph = tf.placeholder(tf.int32, [None], name='seq_len_ph')
    # y: label corresponding to the target item, 1 for positive samples, 0 for negative samples
    self.target_ph = tf.placeholder(tf.float32, [None, None], name='target_ph')
    # Learning rate
    self.lr = tf.placeholder(tf.float64, [])
    self.use_negsampling = use_negsampling
    if use_negsampling:
        # negative sampling generates 3 item ids per position
        self.noclk_mid_batch_ph = tf.placeholder(tf.int32, [None, None, None], name='noclk_mid_batch_ph')
        self.noclk_cat_batch_ph = tf.placeholder(tf.int32, [None, None, None], name='noclk_cat_batch_ph')

See the run-time variables below for details of the various shapes

self = {Model_DIN_V2_Gru_Vec_attGru_Neg} 
 cat_batch_ph = {Tensor} Tensor("Inputs/cat_batch_ph:0", shape=(?,), dtype=int32)
 uid_batch_ph = {Tensor} Tensor("Inputs/uid_batch_ph:0", shape=(?,), dtype=int32)
 mid_batch_ph = {Tensor} Tensor("Inputs/mid_batch_ph:0", shape=(?,), dtype=int32)
 cat_his_batch_ph = {Tensor} Tensor("Inputs/cat_his_batch_ph:0", shape=(?, ?), dtype=int32)
 mid_his_batch_ph = {Tensor} Tensor("Inputs/mid_his_batch_ph:0", shape=(?, ?), dtype=int32)
 lr = {Tensor} Tensor("Inputs/Placeholder:0", shape=(), dtype=float64)
 mask = {Tensor} Tensor("Inputs/mask:0", shape=(?, ?), dtype=float32)
 seq_len_ph = {Tensor} Tensor("Inputs/seq_len_ph:0", shape=(?,), dtype=int32)
 target_ph = {Tensor} Tensor("Inputs/target_ph:0", shape=(?, ?), dtype=float32)
 noclk_cat_batch_ph = {Tensor} Tensor("Inputs/noclk_cat_batch_ph:0", shape=(?, ?, ?), dtype=int32)
 noclk_mid_batch_ph = {Tensor} Tensor("Inputs/noclk_mid_batch_ph:0", shape=(?, ?, ?), dtype=int32)
 use_negsampling = {bool} True

4.2.2 build embedding

Then the embedding lookup tables for users and items are constructed, and the input data is converted into the corresponding embeddings, i.e. sparse features are converted into dense features. The principle and code of the embedding layer will be analyzed in a dedicated article in this series.

In what follows, U is the hash bucket size of user_id, I is the hash bucket size of item_id, and C is the hash bucket size of cat_id.

Note that a variable such as self.mid_his_batch_ph holds the user's historical behavior sequence and has size [B, T], so after embedding_lookup the output has size [B, T, H/2].

# Embedding layer
with tf.name_scope('Embedding_layer'):
    # shape: [U, H/2], uid embedding weight. U is the hash bucket size of user_id
    self.uid_embeddings_var = tf.get_variable("uid_embedding_var", [n_uid, EMBEDDING_DIM])
    # Look up the uid embedding vector from the uid embedding weight
    self.uid_batch_embedded = tf.nn.embedding_lookup(self.uid_embeddings_var, self.uid_batch_ph)

    # shape: [I, H/2], mid embedding weight. I is the hash bucket size of item_id
    self.mid_embeddings_var = tf.get_variable("mid_embedding_var", [n_mid, EMBEDDING_DIM])
    # Look up the mid embedding vector from the mid embedding weight
    self.mid_batch_embedded = tf.nn.embedding_lookup(self.mid_embeddings_var, self.mid_batch_ph)
    # mid history embedding vectors
    # Note that self.mid_his_batch_ph holds the user's historical behavior sequence, size [B, T],
    # so the embedding_lookup output has size [B, T, H/2]
    self.mid_his_batch_embedded = tf.nn.embedding_lookup(self.mid_embeddings_var, self.mid_his_batch_ph)
    # mid history embedding vectors of negative samples
    if self.use_negsampling:
        self.noclk_mid_his_batch_embedded = tf.nn.embedding_lookup(self.mid_embeddings_var, self.noclk_mid_batch_ph)

    # shape: [C, H/2], cate_id embedding weight. C is the hash bucket size of cat_id
    self.cat_embeddings_var = tf.get_variable("cat_embedding_var", [n_cat, EMBEDDING_DIM])
    # Look up the cat embedding vector from the cat embedding weight
    # For example, cat_embeddings_var is (1601, 18), cat_batch_ph is (?,), so cat_batch_embedded is (?, 18)
    self.cat_batch_embedded = tf.nn.embedding_lookup(self.cat_embeddings_var, self.cat_batch_ph)
    # cat history embedding vectors
    self.cat_his_batch_embedded = tf.nn.embedding_lookup(self.cat_embeddings_var, self.cat_his_batch_ph)
    # cat history embedding vectors of negative samples
    if self.use_negsampling:
        self.noclk_cat_his_batch_embedded = tf.nn.embedding_lookup(self.cat_embeddings_var, self.noclk_cat_batch_ph)

See the run-time variables below for details of the various shapes

self = {Model_DIN_V2_Gru_Vec_attGru_Neg} 
 cat_embeddings_var = {Variable} <tf.Variable 'cat_embedding_var:0' shape=(1601, 18) dtype=float32_ref>
 uid_embeddings_var = {Variable} <tf.Variable 'uid_embedding_var:0' shape=(543060, 18) dtype=float32_ref>
 mid_embeddings_var = {Variable} <tf.Variable 'mid_embedding_var:0' shape=(367983, 18) dtype=float32_ref>

 cat_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup_4:0", shape=(?, 18), dtype=float32)
 mid_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup_1:0", shape=(?, 18), dtype=float32)
 uid_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup:0", shape=(?, 18), dtype=float32)

 cat_his_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup_5:0", shape=(?, ?, 18), dtype=float32)
 mid_his_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup_2:0", shape=(?, ?, 18), dtype=float32)

 noclk_cat_his_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup_6:0", shape=(?, ?, ?, 18), dtype=float32)
 noclk_mid_his_batch_embedded = {Tensor} Tensor("Embedding_layer/embedding_lookup_3:0", shape=(?, ?, ?, 18), dtype=float32)

4.2.3 Concatenating embeddings

This part combines the various embedding vectors. For example, the embedding of the item ID and the embedding of the item's category ID are concatenated together as the item embedding.

Shape:

  • Note that in the previous step, a variable such as self.mid_his_batch_ph holds the user's historical behavior sequence and has size [B, T], so after embedding_lookup the output has size [B, T, H/2].
  • Here the embeddings of goods and cate are concatenated to get [B, T, H]. Notice that the axis parameter of tf.concat is 2.

Notes on the logic:

The first step is self.item_eb = tf.concat([self.mid_batch_embedded, self.cat_batch_embedded], 1). The result, stored in i_emb, is the concatenation of the goods embedding and the category embedding. For example, [[mid1, mid2], [mid3, mid4]] and [[cid1, cid2], [cid3, cid4]] are concatenated to obtain [[mid1, mid2, cid1, cid2], [mid3, mid4, cid3, cid4]].

Corresponding to the architecture diagram:

The second step is self.item_his_eb = tf.concat([self.mid_his_batch_embedded, self.cat_his_batch_embedded], 2). These two history tensors hold the user's historical behavior sequences and have size [B, T], so after embedding_lookup the outputs have size [B, T, H/2]. The embeddings of goods and cate are then concatenated to get size [B, T, H]; notice that the axis parameter of tf.concat is 2. For example, [[[mid1, mid2]]] and [[[cid1, cid2]]] are concatenated to get [[[mid1, mid2, cid1, cid2]]].

Corresponding to the architecture diagram:

The third step is tf.reduce_sum(self.item_his_eb, 1), which sums over the time dimension (axis 1) and thus reduces the dimension.

For example, [[MID1, MID2,cid1, cid2], [MID3, MID4, cid3, cid4]] yields [[MID1 + MID3, MID2 + MID4,cid1 + cid3, cid2 + cid4]].

The specific code is as follows:

# Positive-sample embedding concatenation: item and cate. The goods embedding and category embedding of the target item are concatenated
self.item_eb = tf.concat([self.mid_batch_embedded, self.cat_batch_embedded], 1)
# The embeddings of goods and cate are concatenated to size [B, T, H]; notice that the axis parameter of tf.concat is 2
self.item_his_eb = tf.concat([self.mid_his_batch_embedded, self.cat_his_batch_embedded], 2)
# Summing over the time dimension (axis 1) reduces the dimension
self.item_his_eb_sum = tf.reduce_sum(self.item_his_eb, 1)
# item_eb is (128, 36), item_his_eb is (128, ?, 36); the latter is read from real data, e.g. it could be (128, 6, 36)

# Negative-sample embedding concatenation: item and cate. The goods embedding and category embedding are concatenated
if self.use_negsampling:
    # 0 means only using the first negative item ID. 3 item IDs are input in line 24.
    self.noclk_item_his_eb = tf.concat(
        [self.noclk_mid_his_batch_embedded[:, :, 0, :], self.noclk_cat_his_batch_embedded[:, :, 0, :]], -1)
    # cat embedding (18) is concatenated with item embedding (18)
    self.noclk_item_his_eb = tf.reshape(self.noclk_item_his_eb,
                                        [-1, tf.shape(self.noclk_mid_his_batch_embedded)[1], 36])
    self.noclk_his_eb = tf.concat([self.noclk_mid_his_batch_embedded, self.noclk_cat_his_batch_embedded], -1)
    self.noclk_his_eb_sum_1 = tf.reduce_sum(self.noclk_his_eb, 2)
    self.noclk_his_eb_sum = tf.reduce_sum(self.noclk_his_eb_sum_1, 1)

See the run-time variables below for details of the various shapes

self = {Model_DIN_V2_Gru_Vec_attGru_Neg} 
 item_eb = {Tensor} Tensor("concat:0", shape=(?, 36), dtype=float32)
 item_his_eb = {Tensor} Tensor("concat_1:0", shape=(?, ?, 36), dtype=float32)
 item_his_eb_sum = {Tensor} Tensor("Sum:0", shape=(?, 36), dtype=float32)

 noclk_item_his_eb = {Tensor} Tensor("Reshape:0", shape=(?, ?, 36), dtype=float32)
 noclk_his_eb = {Tensor} Tensor("concat_3:0", shape=(?, ?, ?, 36), dtype=float32)
 noclk_his_eb_sum = {Tensor} Tensor("Sum_2:0", shape=(?, 36), dtype=float32)
 noclk_his_eb_sum_1 = {Tensor} Tensor("Sum_1:0", shape=(?, ?, 36), dtype=float32)

0x05 Model_DIN

Model_DIN is the model for DIN implementation.

class Model_DIN(Model):
    def __init__(self, n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE, use_negsampling=False):
        super(Model_DIN, self).__init__(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE,
                                           ATTENTION_SIZE,
                                           use_negsampling)

        # Attention layer
        with tf.name_scope('Attention_layer'):
            attention_output = din_attention(self.item_eb, self.item_his_eb, ATTENTION_SIZE, self.mask)
            att_fea = tf.reduce_sum(attention_output, 1)

        inp = tf.concat([self.uid_batch_embedded, self.item_eb, self.item_his_eb_sum, self.item_eb * self.item_his_eb_sum, att_fea], -1)
        # Fully connected layer
        self.build_fcn_net(inp, use_dice=True)

The overall idea is simple:

  • Attention layer
  • Fully connected layer

Specific analysis is as follows.

5.1 Attention mechanism

The Attention mechanism can be described as follows: imagine the constituent elements of the Source as a series of <Key, Value> pairs. Given an element Query from the Target, we compute the similarity or correlation between the Query and each Key to obtain the weight coefficient of the Value corresponding to that Key, and then take the weighted sum of the Values to get the final Attention value. So essentially, the Attention mechanism is a weighted sum of the Values of the elements in the Source, where the Query and the Keys are used to compute the weight coefficients of the corresponding Values. The essential idea can be written as the following formula:
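In its commonly used abstract form, with $L_x$ denoting the length of the Source:

$$\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}(\mathrm{Query}, \mathrm{Key}_i) \cdot \mathrm{Value}_i$$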

Of course, conceptually, Attention can still be understood as selectively picking out a small amount of important information from a large amount of information and focusing on it, while ignoring the mostly unimportant rest. The focusing process is reflected in the computation of the weight coefficients: the larger the weight, the more focus falls on the corresponding Value. That is, the weight represents the importance of the information, and the Value is the information itself.

Another way to think about it: the Attention mechanism can also be regarded as Soft Addressing. The Source is treated as the contents of a memory, whose elements consist of an address Key and a Value. Given a Query, we take Key=Query and retrieve the corresponding Value from memory, i.e. the Attention value. Unlike ordinary addressing, which retrieves exactly one item from storage, soft addressing may retrieve contents from every Key address, where the importance of each retrieved content depends on the similarity between the Query and that Key. The weighted sum of the Values then gives the final Value, i.e. the Attention value. So it makes sense that many researchers view the Attention mechanism as a special case of soft addressing.

As for the specific calculation process of Attention mechanism, if most current methods are abstracted, it can be summarized into two processes:

  • The first procedure calculates the weight coefficients based on the Query and Key.
  • The second procedure weights and sums values according to the weight coefficients;

The first process can be subdivided into two stages:

  • The first small stage calculates the similarity or correlation between Query and Key.
  • In the second small stage, the original score of the first small stage is normalized.

In this way, the calculation process of Attention can be abstracted into three stages as shown in the figure.

In the first stage, different functions and computation mechanisms can be used to calculate the similarity or correlation between the Query and a particular Key_i. The most common methods are the vector dot product, cosine similarity, or an additional small neural network.

In the second stage, a SoftMax-like computation is applied to numerically convert the first-stage scores. On the one hand, it normalizes them, turning the raw scores into a probability distribution in which the weights of all elements sum to 1; on the other hand, SoftMax's built-in mechanism highlights the weights of the important elements. The following formula is generally used:
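The commonly used form of this normalization is:

$$a_i = \mathrm{Softmax}(\mathrm{Sim}_i) = \frac{e^{\mathrm{Sim}_i}}{\sum_{j=1}^{L_x} e^{\mathrm{Sim}_j}}$$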

The second-stage result a_i is the weight coefficient corresponding to Value_i, and the weighted sum then gives the Attention value:
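A standard rendering of this weighted sum:

$$\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} a_i \cdot \mathrm{Value}_i$$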

Through the calculation of the above three stages, the Attention value for Query can be calculated. At present, most concrete calculation methods of Attention mechanism conform to the above three-stage abstract calculation process.

5.2 Attention implementation

In DIN, all field features of each item are concatenated to form the item's embedding. Instead of simply sum-pooling all item embeddings in the sequence, an Activation Unit module computes a correlation weight between each item embedding in the sequence and the candidate item embedding.

This functionality is implemented in the din_attention function:

5.2.1 Invocation

It is called as follows:

attention_output = din_attention(self.item_eb, self.item_his_eb, ATTENTION_SIZE, self.mask)

The relevant parameters are:

  • query: the embedding corresponding to the candidate ad, shape [B, H], i.e. i_emb;
  • facts: the embeddings of the user's historical behaviors, shape [B, T, H], i.e. h_emb;
  • mask: marks which behaviors in each sequence of the batch are real, shape [B, T]. Because the user behavior sequences in a batch are not all the same length, while the input keys dimension is fixed (padded to the maximum history length), the real length is stored in self.sl and the mask is generated beforehand to select the real historical behaviors, telling the model which positions are padding and which should be used to compute the user's interest distribution;
  • B: batch size; T: the maximum length of the user's historical behavior sequence; H: embedding size;
  • attention_output: the output, i.e. the representation of the user's interest;

The parameter variables at run time are as follows:

self = {Model_DIN_V2_Gru_Vec_attGru_Neg} 
 item_eb = {Tensor} Tensor("concat:0", shape=(?, 36), dtype=float32)
 item_his_eb = {Tensor} Tensor("concat_1:0", shape=(?, ?, 36), dtype=float32)
 mask = {Tensor} Tensor("Inputs/mask:0", shape=(?, ?), dtype=float32)

5.2.2 The role of the mask

As for the role of the mask, here is a review in the context of the Transformer:

A mask hides certain values so that they have no effect when parameters are updated. The Transformer model involves two kinds of masks: the padding mask and the sequence mask. The padding mask is used in every scaled dot-product attention, while the sequence mask is only used in the decoder's self-attention.

Padding Mask

What is a padding mask? Because the input sequences in a batch have different lengths, we need to align them. Specifically, short sequences are padded with zeros, and if a sequence is too long, it is truncated and the excess discarded. Since these padded positions are meaningless, the attention mechanism should not focus on them, so some processing is needed.

To do this, a very large negative number (approaching negative infinity) is added to the values at these positions, so that after SoftMax the probability at these positions approaches 0. The padding mask is actually a boolean tensor, where false marks the positions to be processed.

Sequence mask

The sequence mask is designed to prevent the decoder from seeing future information. That is, for a sequence, at time step t the decoder's output should depend only on the outputs before t, not on the outputs after t. So we need a way to hide the information after t.

How is that done? It is also simple: generate an upper triangular matrix whose upper-triangle values are all 0, and apply this matrix to each sequence.

For the decoder's self-attention, the scaled dot-product attention inside it needs both the padding mask and the sequence mask as attn_mask; the implementation adds the two masks together as attn_mask.

In all other cases, attn_mask equals the padding mask.

DIN uses the padding mask.
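As a minimal, self-contained sketch of how the padding mask works in this setting (the shapes mirror din_attention below, but the tensors and numbers here are made up purely for illustration):

import tensorflow as tf

scores = tf.random_normal([2, 1, 5])                         # [B, 1, T] raw attention scores (toy values)
seq_len = tf.constant([3, 5])                                # real lengths of the two sequences
key_masks = tf.expand_dims(tf.sequence_mask(seq_len, 5), 1)  # [B, 1, T], True at real positions
paddings = tf.ones_like(scores) * (-2 ** 32 + 1)             # very large negative number
masked_scores = tf.where(key_masks, scores, paddings)        # padded positions become ~ -inf
weights = tf.nn.softmax(masked_scores)                       # padded positions get ~0 weight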

5.2.3 Basic Logic

The basic logic of din_attention(query, facts, attention_size, mask) is as follows:

  • If time_major, transpose: (T,B,D) => (B,T,D);
  • Convert the mask:
    • Use tf.ones_like(mask) to build a tensor of the same shape as mask whose elements are all 1;
    • Convert mask from int to bool using tf.equal. tf.equal checks whether two inputs are equal: True where they are equal, False where they are not;
  • Convert the query to the same shape as facts. Here T varies with each batch of training data; for example, one user's behavior sequence has length 5 while another's has length 15.
    • query is [B, H] and is converted to queries of shape (B, T, H);
    • This is done so that a weight can be computed between the candidate item and each element of the user behavior sequence. The code is tf.tile(query, [1, tf.shape(facts)[1]]); tf.shape(facts)[1] equals T, query is [B, H], so after tile the result is [B, T * H];
    • queries is then reshaped to [B, T, H], the same shape as facts;
  • Before the MLP, do more operations to capture the relationship between the behavior items and the candidate item: subtraction, element-wise multiplication, etc. This yields the input of the Local Activation Unit, i.e. the candidate ad embedding (queries), the user history behavior embeddings (facts), plus the cross features between them, all concatenated;
  • The attention operation computes the correlation between the query and the keys. Specifically, the weight between queries and each key in facts is obtained through a three-layer neural network whose output has a single node.
    • d_layer_3_all's shape is [B, T, 1];
    • It is then reshaped to [B, 1, T]; the dimension of size T (axis 2) holds the weight for each position in the user behavior sequence;
    • The attention output, scores, is [B, 1, T];
  • Get the scores that have real meaning:
    • use key_masks = tf.expand_dims(mask, 1) to expand the mask from [B, T] to [B, 1, T];
    • Use tf.ones_like(scores) to construct a tensor of the same shape as scores whose elements are all 1;
    • The padded positions are filled with a very small (large negative) number so that e^{x} is approximately 0 in the later softmax;
    • Perform the padding operation on [B, 1, T]. To ignore the effect of padding on the whole, the code uses tf.where to reset the weights of the padded positions (the vacant items of each sample sequence) to a minimum value (-2 ** 32 + 1) instead of 0;
    • use tf.where(key_masks, scores, paddings) to get the truly meaningful scores;
  • Scale is a standard attention operation; after scaling, the result would be fed into softmax to get the final weights. But in this code it is commented out;
  • After normalization by softmax, the normalized weights are obtained;
  • Having obtained the correct weights (scores) and the user's historical behavior sequence (facts), a weighted sum gives the representation of the user's interest:
    • In SUM mode, the user interest representation is obtained by matrix multiplication. Specifically, scores is [B, 1, T], the weight of each historical behavior; facts is the historical behavior sequence of size [B, T, H]; matrix multiplication of the two gives an output of [B, 1, H].
    • Otherwise, the Hadamard product is used:
      • First reshape scores from [B, 1, T] to [B, T] (Batch * Time);
      • Use expand_dims to add one dimension at the end of scores;
      • Then compute the Hadamard product: [B, T, H] x [B, T, 1] = [B, T, H];
      • Finally reshape the output to Batch * Time * Hidden Size.

The specific code is as follows:

def din_attention(query, facts, attention_size, mask, stag='null', mode='SUM', softmax_stag=1, time_major=False, return_alphas=False):
    '''
    query: embedding corresponding to the candidate ad, shape [B, H], i.e. i_emb;
    facts: embeddings of the user behavior sequence, shape [B, T, H], i.e. h_emb; T is the padded length, and each embedding of size H represents one item;
    mask: marks which behaviors in each sequence of the batch are real, shape [B, T];
    '''
    if isinstance(facts, tuple):
        # In case of Bi-RNN, concatenate the forward and the backward RNN outputs.
        facts = tf.concat(facts, 2)
        print ("querry_size mismatch")
        query = tf.concat(values = [
        query,
        query,
        ], axis=1)

    if time_major:
        # (T,B,D) => (B,T,D)
        facts = tf.array_ops.transpose(facts, [1, 0, 2])

    # Convert the mask
    mask = tf.equal(mask, tf.ones_like(mask))
    facts_size = facts.get_shape().as_list()[-1]  # D value - hidden size of the RNN layer
    querry_size = query.get_shape().as_list()[-1] # H, here 36

    # 1. Expand the query dimension to the history dimension T
    # query is [B, H]; convert it to queries with dimension (B, T, H), in order to compute a weight
    # between the candidate item and each element of the user behavior sequence
    # query: Tensor("concat:0", shape=(?, 36), dtype=float32)
    # tf.shape(facts)[1] is T; query is [B, H], so after tile the result is [B, T * H]
    queries = tf.tile(query, [1, tf.shape(facts)[1]]) # [B, T * H]
    # queries: Tensor("Attention_layer/Tile:0", shape=(?, ?), dtype=float32)
    # queries needs to be reshaped to the same size as facts
    queries = tf.reshape(queries, tf.shape(facts)) # [B, T * H] -> [B, T, H]
    # queries: Tensor("Attention_layer/Reshape:0", shape=(?, ?, 36), dtype=float32)

    # 2. Before the MLP, do more operations to capture the relationship between the behavior items
    # and the candidate item: subtraction, element-wise multiplication, etc. This gives the input of
    # the Local Activation Unit: the candidate ad embedding (queries), the user history behavior
    # embeddings (facts), plus the cross features between them, concatenated
    din_all = tf.concat([queries, facts, queries-facts, queries*facts], axis=-1) # [B, T, 4*H]

    # 3. Attention operation: obtain the weights through a several-layer MLP whose output has a single node
    d_layer_1_all = tf.layers.dense(din_all, 80, activation=tf.nn.sigmoid, name='f1_att' + stag)
    d_layer_2_all = tf.layers.dense(d_layer_1_all, 40, activation=tf.nn.sigmoid, name='f2_att' + stag)
    d_layer_3_all = tf.layers.dense(d_layer_2_all, 1, activation=None, name='f3_att' + stag)
    # d_layer_3_all's shape is [B, T, 1]
    # Next reshape it to [B, 1, T]; the dimension of size T (axis 2) holds the weight for each behavior in the sequence
    d_layer_3_all = tf.reshape(d_layer_3_all, [-1, 1, tf.shape(facts)[1]])
    scores = d_layer_3_all  # [B, 1, T]

    # 4. Get the scores that have real meaning
    # key_masks = tf.sequence_mask(facts_length, tf.shape(facts)[1]) # [B, T]
    key_masks = tf.expand_dims(mask, 1) # [B, 1, T]
    # The padded positions are filled with a very small (large negative) number so that e^{x} is approximately 0 in the later softmax
    paddings = tf.ones_like(scores) * (-2 ** 32 + 1)  # note: initialized to a very small value
    # Padding operation on [B, 1, T]: to ignore the effect of padding on the whole, the code uses tf.where
    # to reset the weights of padded positions (the vacant items of each sample sequence) to the minimum value (-2 ** 32 + 1) instead of 0
    scores = tf.where(key_masks, scores, paddings)  # [B, 1, T]

    # 5. Scale: a standard attention operation; after scaling, the result would be fed into softmax to get the final weights. Commented out here.
    # scores = scores / (facts.get_shape().as_list()[-1] ** 0.5)

    # 6. Activation, to obtain the normalized weights
    if softmax_stag:
        scores = tf.nn.softmax(scores)  # [B, 1, T]

    # 7. With the correct weights (scores) and the user history behavior sequence (facts),
    # matrix multiplication gives the representation of the user's interest
    # Weighted sum
    if mode == 'SUM':
        # scores is [B, 1, T], the weight of each historical behavior;
        # facts is the historical behavior sequence, size [B, T, H];
        # [B, 1, T] x [B, T, H] = [B, 1, H], i.e. B * (1 * T) * (T * H)
        # output corresponds to the attention-weighted result, i.e. w in formula (3) of the paper
        output = tf.matmul(scores, facts)  # [B, 1, H]
        # output = tf.reshape(output, [-1, tf.shape(facts)[-1]])
    else:
        # Reshape scores from [B, 1, T] to [B, T] (Batch * Time)
        scores = tf.reshape(scores, [-1, tf.shape(facts)[1]])
        # Hadamard product: [B, T, H] x [B, T, 1] = [B, T, H]
        output = facts * tf.expand_dims(scores, -1)
        output = tf.reshape(output, tf.shape(facts)) # Batch * Time * Hidden Size
    return output

The program runtime variables are as follows:

attention_size = {int} 36
d_layer_1_all = {Tensor} Tensor("Attention_layer/f1_attnull/Sigmoid:0", shape=(?, ?, 80), dtype=float32)
d_layer_2_all = {Tensor} Tensor("Attention_layer/f2_attnull/Sigmoid:0", shape=(?, ?, 40), dtype=float32)
d_layer_3_all = {Tensor} Tensor("Attention_layer/Reshape_1:0", shape=(?, 1, ?), dtype=float32)
din_all = {Tensor} Tensor("Attention_layer/concat:0", shape=(?, ?, 144), dtype=float32)
facts = {Tensor} Tensor("concat_1:0", shape=(?, ?, 36), dtype=float32)
facts_size = {int} 36
key_masks = {Tensor} Tensor("Attention_layer/ExpandDims:0", shape=(?, 1, ?), dtype=bool)
mask = {Tensor} Tensor("Attention_layer/Equal:0", shape=(?, ?), dtype=bool)
mode = {str} 'SUM'
output = {Tensor} Tensor("Attention_layer/MatMul:0", shape=(?, 1, 36), dtype=float32)
paddings = {Tensor} Tensor("Attention_layer/mul_1:0", shape=(?, 1, ?), dtype=float32)
queries = {Tensor} Tensor("Attention_layer/Reshape:0", shape=(?, ?, 36), dtype=float32)
querry_size = {int} 36
query = {Tensor} Tensor("concat:0", shape=(?, 36), dtype=float32)
return_alphas = {bool} False
scores = {Tensor} Tensor("Attention_layer/Reshape_3:0", shape=(?, 1, ?), dtype=float32)
softmax_stag = {int} 1
stag = {str} 'null'
time_major = {bool} False

0x06 Fully Connected Layer

Now that we have the concatenated dense representation vector, the next step is to use fully connected layers to automatically learn nonlinear combinations of the features.

The final CTR estimate is then obtained through a multi-layer neural network, invoked as a function call:

# Attention layer
with tf.name_scope('Attention_layer'):
    attention_output = din_attention(self.item_eb, self.item_his_eb, ATTENTION_SIZE, self.mask)
    att_fea = tf.reduce_sum(attention_output, 1)
    tf.summary.histogram('att_fea', att_fea)
inp = tf.concat([self.uid_batch_embedded, self.item_eb, self.item_his_eb_sum, self.item_eb * self.item_his_eb_sum, att_fea], -1)
# Fully connected layer
self.build_fcn_net(inp, use_dice=True)  # call the multi-layer neural network

Corresponding to the paper:

This multi-layer neural network contains several fully connected layers, which are essentially linear transformations from one feature space to another. Any dimension of the target space, i.e. a unit in the hidden layer, is affected by every dimension of the source space. Loosely speaking, the target vector is a weighted sum of the source vectors.
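In formula form, this is just the standard fully connected layer (nothing specific to this repo): each output unit is a weighted sum of all input units plus a bias,

$$y_j = \sum_i W_{ji} x_i + b_j, \qquad \text{i.e.}\quad y = Wx + b.$$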

The logic is as follows:

  • First apply Batch Normalization;
  • Add a fully connected layer tf.layers.dense(bn1, 200, activation=None, name='f1');
  • Activate with Dice or PReLU;
  • Add a fully connected layer tf.layers.dense(dnn1, 80, activation=None, name='f2');
  • Activate with Dice or PReLU;
  • Add a fully connected layer tf.layers.dense(dnn2, 2, activation=None, name='f3');
  • Get the output self.y_hat = tf.nn.softmax(dnn3) + 0.00000001;
  • Cross-entropy loss and optimizer initialization:
    • the cross entropy is - tf.reduce_mean(tf.log(self.y_hat) * self.target_ph);
    • if negative sampling is used, the auxiliary loss is added;
    • use AdamOptimizer, tf.train.AdamOptimizer(learning_rate=self.lr).minimize(self.loss), so that the loss can subsequently be minimized;
  • Compute the accuracy.

The specific code is as follows:

def build_fcn_net(self, inp, use_dice=False):
    bn1 = tf.layers.batch_normalization(inputs=inp, name='bn1')
    dnn1 = tf.layers.dense(bn1, 200, activation=None, name='f1')
    if use_dice:
        dnn1 = dice(dnn1, name='dice_1')
    else:
        dnn1 = prelu(dnn1, 'prelu1')

    dnn2 = tf.layers.dense(dnn1, 80, activation=None, name='f2')
    if use_dice:
        dnn2 = dice(dnn2, name='dice_2')
    else:
        dnn2 = prelu(dnn2, 'prelu2')
    dnn3 = tf.layers.dense(dnn2, 2, activation=None, name='f3')
    self.y_hat = tf.nn.softmax(dnn3) + 0.00000001

    with tf.name_scope('Metrics'):
        # Cross-entropy loss and optimizer initialization
        ctr_loss = - tf.reduce_mean(tf.log(self.y_hat) * self.target_ph)
        self.loss = ctr_loss
        if self.use_negsampling:
            self.loss += self.aux_loss
        tf.summary.scalar('loss', self.loss)
        self.optimizer = tf.train.AdamOptimizer(learning_rate=self.lr).minimize(self.loss)

        # Accuracy metric
        self.accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.round(self.y_hat), self.target_ph), tf.float32))
        tf.summary.scalar('accuracy', self.accuracy)

    self.merged = tf.summary.merge_all()
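build_fcn_net calls dice(), which is defined in utils.py of the repo. The following is only a minimal sketch of the Dice idea described in 0x02 (the batch statistics determine where the rectification point sits), with simplified variable handling rather than the repo's exact implementation:

import tensorflow as tf

def dice_sketch(x, name=''):
    # Learnable slope for the "negative" branch, one per feature dimension
    alpha = tf.get_variable('dice_alpha' + name, x.get_shape()[-1],
                            initializer=tf.constant_initializer(0.0), dtype=tf.float32)
    mean, var = tf.nn.moments(x, axes=[0])        # batch statistics decide the rectification point
    x_normed = (x - mean) / tf.sqrt(var + 1e-8)   # standardize the input
    p = tf.sigmoid(x_normed)                      # data-dependent gating probability p(x)
    return p * x + (1.0 - p) * alpha * x          # smooth interpolation between x and alpha * x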

0x07 Training model

Train the model with model.train.

The input data of model.train are as follows:

  • The user id;
  • The item id of the target;
  • The category id of the target item;
  • The item id list of the user's historical behaviors;
  • The category id list corresponding to the user's historical behavior items;
  • The mask of the historical behaviors;
  • The target (label);
  • The length of the historical behavior sequence;
  • The learning rate;
  • The negative-sampling data.

Train code is as follows:

def train(self, sess, inps):
    if self.use_negsampling:
        loss, accuracy, aux_loss, _ = sess.run([self.loss, self.accuracy, self.aux_loss, self.optimizer], feed_dict={
            self.uid_batch_ph: inps[0],
            self.mid_batch_ph: inps[1],
            self.cat_batch_ph: inps[2],
            self.mid_his_batch_ph: inps[3],
            self.cat_his_batch_ph: inps[4],
            self.mask: inps[5],
            self.target_ph: inps[6],
            self.seq_len_ph: inps[7],
            self.lr: inps[8],
            self.noclk_mid_batch_ph: inps[9],
            self.noclk_cat_batch_ph: inps[10],
        })
        return loss, accuracy, aux_loss
    else:
        loss, accuracy, _ = sess.run([self.loss, self.accuracy, self.optimizer], feed_dict={
            self.uid_batch_ph: inps[0],
            self.mid_batch_ph: inps[1],
            self.cat_batch_ph: inps[2],
            self.mid_his_batch_ph: inps[3],
            self.cat_his_batch_ph: inps[4],
            self.mask: inps[5],
            self.target_ph: inps[6],
            self.seq_len_ph: inps[7],
            self.lr: inps[8],
        })
        return loss, accuracy, 0

0x08 AUC

A word about the calc_auc function, which at first looks like it should be a complex algorithm but turns out to be the simplest possible implementation.

def calc_auc(raw_arr):
    arr = sorted(raw_arr, key=lambda d:d[0], reverse=True)
    pos, neg = 0., 0.
    for record in arr: # Calculate the number of positive samples and negative samples
        if record[1] == 1.:
            pos += 1
        else:
            neg += 1

    fp, tp = 0., 0.
    xy_arr = []
    for record in arr:
        if record[1] == 1.:
            tp += 1
        else:
            fp += 1
        xy_arr.append([fp/neg, tp/pos])

    auc = 0.
    prev_x = 0.
    prev_y = 0.
    # Trapezoid rule: each segment contributes (x - prev_x) * (y + prev_y) / 2,
    # i.e. (upper base + lower base) * height / 2, which is exactly the trapezoid area
    for x, y in xy_arr: 
        if x != prev_x:
            auc += ((x - prev_x) * (y + prev_y) / 2.)
            prev_x = x
            prev_y = y

    return auc
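A quick usage example with made-up (prediction, label) pairs, to show the expected input format:

# raw_arr is a list of (prediction, label) pairs; label is 1. for positive samples and 0. for negative samples
raw_arr = [(0.9, 1.), (0.8, 0.), (0.7, 1.), (0.3, 0.), (0.2, 1.)]
print(calc_auc(raw_arr))  # sorts by prediction and accumulates trapezoid areas under the ROC curve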

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

If you want to get a timely news feed of personal articles, or want to see the technical information of personal recommendations, please pay attention.

0xFF Reference

Build Wide & Deep by hand with NumPy

How Google implements The Wide & Deep Model (1)

How to make recommendations using deep Learning on Youtube

Also review Deep Interest Evolution Network

The evolution of Ali CTR algorithm from DIN to DIEN

Chapter 7 Artificial Intelligence, 7.6 APPLICATION of DNN in Search Scenarios (Author: Renzhong)

#Paper Reading# Deep Interest Network for Click-Through Rate Prediction

【 Paper Reading 】Deep Interest Evolution Network for click-through Rate Prediction

Also review Deep Interest Evolution Network

Deep Interest Evolution Network for Click-Through Rate Prediction

Deep Interest Evolution Network(AAAI 2019)

Deep Interest Evolution Network for click-through Rate Prediction

DIN(Deep Interest Network): core ideas + source code to read notes

Calculating advertising CTR Estimation Series (5)– Ali’s Deep Interest Network Theory

Detailed explanation of Deep Interest NetWork model principle of CTR prediction

LSTM that everyone can understand

Understand RNN, LSTM and GRU from the driven graph

Machine learning (I) — NN&LSTm

Li Hongyi machine Learning (2016)

Recommendation system meets deep learning (24)– Deep interest evolution network DIEN principle and actual combat!

Import terror: DLL load failed from google.protobuf.pyext import _message

DIN deep interest network introduction and source analysis

Deep Interest Network for click-through Rate Prediction

Deep Interest Network for click-through Rate Prediction

Ali CTR Prediction Trilogy (2) : Deep Interest Evolution Network for click-through Rate Prediction

Deep Interest Network interpretation

Deep Interest Network (DIN)

DIN paper official implementation analysis

Ali DIN source code how to model user sequence (1) : Base scheme

How to model user sequences (2) : DIN and feature Engineering perspectives

Ali Deep Interest Network (DIN) paper translation

Recommendation system meets deep learning (24)– Deep interest evolution network DIEN principle and actual combat!

Recommendation system meets deep learning (18)– Probe into ali’s deep Interest Network (DIN) analysis and implementation

[Paper introduction] 2018 Alibaba CTR prediction model –DIN(Deep Interest Network), attached with TF2.0 recurrence code

2019 Ali CTR Prediction Model –DIEN(Deep Interest Evolution Network)

Attention mechanism in deep learning

Attention is all you need