0x00 Abstract

DIEN is short for Alibaba's Deep Interest Evolution Network.

When we previously read the DIEN source code, it was based on the implementation at github.com/mouna99/die… .

Later, while continuing on to DSIN, we found another DIEN implementation in the DSIN code at github.com/shenweichen… .

We read and organized that implementation, hence this article.

0x01 Background

1.1 Code Evolution

As we all know, Alibaba's models evolved in the order DIN, DIEN, DSIN, ……

Correspondingly, there are three versions of the code.

For the first version, the author states outright that its efficiency is poor and recommends the second version at github.com/mouna99/die… . That version is pure TensorFlow code, solid and down-to-earth trick by trick, and it reads smoothly.

The third version is based on Keras and DeepCTR. Compared with the previous version, it has been upgraded from a guerrilla force to a regular army, full of high-quality components and established routines.

1.2 DeepCTR

DeepCTR is a big part of how the code became a "regular army", as explained below.

DeepCTR is an easy-to-use CTR model framework that integrates popular deep learning models and is well suited for learning recommendation system models.

  • GitHub portal
  • Official documentation portal

This project mainly implements some current deep-learning-based click-through rate prediction algorithms, such as PNN, WDL, DeepFM, MLR, DeepCross, AFM, NFM, DIN, DIEN, xDeepFM, AutoInt, etc., and provides a consistent calling interface for all of them.

1.2.1 Unified Perspective

DeepCTR not only makes it easier to get started with and compare click-through rate prediction models; it also gives us the opportunity to learn how to build models from these excellent sources.

DeepCTR is designed for students who are interested in deep learning and CTR prediction algorithms, so that they can use this package to:

  • Look at each model from a unified perspective
  • Conduct quick and simple comparison experiments
  • Build new models quickly using existing components

1.2.2 Modular design

DeepCTR abstracts the structure of existing deep-learning-based click-through rate prediction models and adopts a modular design in which components are highly reusable and independent of one another. According to the functions of its internal components, a deep-learning-based click-through rate prediction model can be divided into the following four modules:

  • Input module
  • Embedding module
  • Feature extraction module
  • Prediction output module

All models are built strictly from these four modules; the input, embedding and output parts are basically shared, and the differences between models lie mainly in the feature extraction part. A minimal sketch of this split follows.
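To make the four-module split concrete, here is a minimal, hypothetical Keras sketch; it is not DeepCTR's actual code, and the feature names, vocabulary sizes and layer sizes are made up purely for illustration:

from tensorflow.keras.layers import Input, Embedding, Flatten, Concatenate, Dense
from tensorflow.keras.models import Model

# 1) Input module
user_id = Input(shape=(1,), name='user_id')
item_id = Input(shape=(1,), name='item_id')

# 2) Embedding module
user_emb = Flatten()(Embedding(input_dim=10000, output_dim=8)(user_id))
item_emb = Flatten()(Embedding(input_dim=50000, output_dim=8)(item_id))

# 3) Feature extraction module (this is where DIN / DIEN / DSIN differ)
x = Concatenate()([user_emb, item_emb])
x = Dense(64, activation='relu')(x)

# 4) Prediction output module
out = Dense(1, activation='sigmoid')(x)
model = Model(inputs=[user_id, item_id], outputs=out)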

1.2.3 Framework Advantages

  • The overall structure is clear and flexible. Linear returns a logit, FM returns a logit, and deep contains the results of the middle layers. In each model, the last layer of deep is wrapped, it is decided whether linear, FM and deep are needed, and finally a fully connected layer is attached.
  • Main modules and structures used: Concatenate (list to tensor), Dense (the final fully connected layer), Embedding (sparse, dense, sequence), Input (sparse, dense, sequence), and routine Keras operations: optimizers, regularization terms.
  • compute_output_shape, compute_mask, get_config

Here’s a closer look at the latest version of DIEN.

0x02 Test data

DIEN uses data from Tianchi.

1. Download Dataset [Ad Display/Click Data on Taobao.com](https://tianchi.aliyun.com/dataset/dataDetail?dataId=56)
2. Extract the files into the ``raw_data`` directory

Ali_Display_Ad_Click is a dataset provided by Alibaba for estimating the click-through rate of display ads on Taobao.

2.1 Introduction to data sets

| Data name | Description | Fields |
| --- | --- | --- |
| raw_sample | Skeleton of the original samples | user ID, AD ID, time, resource position, whether clicked |
| ad_feature | Basic information about ads | AD ID, AD campaign ID, category ID, brand ID |
| user_profile | Basic information about users | user ID, age group, gender, etc. |
| raw_behavior_log | User behavior logs | user ID, behavior type, time, product category ID, brand ID |

2.2 Raw sample skeleton raw_sample

The original sample skeleton was formed by randomly sampling the AD display/click logs of 1.14 million users over an 8-day period on the Taobao website (26 million records). The fields are described as follows:

  • (1) user_id: desensitized user ID;
  • (2) adgroup_id: desensitized AD unit ID;
  • (3) time_stamp: timestamp;
  • (4) pid: resource position;
  • (5) noclk: 1 means no click, 0 means click;
  • (6) clk: 0 means no click, 1 means click;

We used the first 7 days as the training sample (20170506-20170512), and the 8th day as the test sample (20170513).

2.3 AD information table ad_feature

This data set covers the basic information for all the ads in raw_sample. The fields are described as follows:

  • (1) adgroup_id: desensitized AD ID;
  • (2) cate_id: desensitized commodity category ID;
  • (3) campaign_id: desensitized campaign ID;
  • (4) customer_id: desensitized advertiser ID;
  • (5) brand: desensitized brand ID;
  • (6) price: price of the item ("baby" in Taobao terminology)

One AD ID corresponds to one item ("baby"); an item belongs to one category and one brand.

2.4 Basic User information user_profile

This data set covers the basic information of all users in raw_sample. The fields are described as follows:

  • (1) userid: desensitized user ID;
  • (2) cms_segid: micro group ID;
  • (3) cms_group_id: cms_group_id;
  • (4) final_gender_code: gender; 1: male, 2: female;
  • (5) age_level: age group;
  • (6) pvalue_level: consumption level; 1: low, 2: medium, 3: high;
  • (7) shopping_level: shopping depth; 1: shallow user, 2: medium user, 3: deep user;
  • (8) occupation: whether the user is a college student; 1: yes, 0: no;
  • (9) new_user_class_level: city level

2.5 User behavior log behavior_log

This data set covers the shopping behavior of all users in raw_sample over 22 days (about 700 million records). The fields are described as follows:

  • (1) user: desensitized user ID;
  • (2) time_stamp: timestamp;
  • (3) btag: behavior type, one of the following four types:

| Type | Description |
| --- | --- |
| ipv | browse |
| cart | add to cart |
| fav | favorite (like) |
| buy | purchase |

  • (4) cate: desensitized commodity category;
  • (5) brand: desensitized brand;

With user + time_stamp as the key, there are many duplicate records. This is because different types of behavior data are recorded by different departments; when they were packaged together, a small deviation was introduced (that is, two records with the same time_stamp may actually correspond to two slightly different times).

2.6 Typical research scenarios

Predict the probability that a user clicks an exposed AD, based on the user's historical shopping behavior.

Baseline AUC: 0.622

0x03 Directory structure

The code directory structure is as follows: the first five files are for data processing, the models directory contains the model definitions, and the train_xxx files are the training code.

.
├── 0_gen_sampled_data.py
├── 1_gen_session.py
├── 2_gen_dien_input.py
├── 2_gen_din_input.py
├── 2_gen_dsin_input.py
├── config.py
├── config.pyc
├── models
│   ├── __init__.py
│   ├── dien.py
│   ├── din.py
│   └── dsin.py
├── train_dien.py
├── train_din.py
└── train_dsin.py

0x04 Data construction

Let’s look at the data construction section.

4.1 Generation of sampled data

0_gen_sampled_data.py is used to generate sampled data:

  • The basic logic is as follows (a simplified sketch follows this list):
  • Sample users at a given ratio;
  • Extract the records of the sampled users from the original sample skeleton raw_sample;
  • Deduplicate the user data;
  • Extract the records of the sampled users from the behavior data;
  • Fill missing values of ad['brand'] with -1;
  • Use LabelEncoder to encode the features, numbering the text features:
    • Merge and deduplicate ad['cate_id'] and log['cate'], then encode;
    • Merge and deduplicate ad['brand'] and log['brand'], then encode.
  • Drop the btag column from log;
  • Drop invalid time_stamp data from log;
  • Finally, store the results to files.
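The following is a minimal, hypothetical sketch of this logic, not the repository's actual 0_gen_sampled_data.py; it assumes the Tianchi CSV files have been extracted into raw_data and uses the column names described above:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

FRAC = 0.25  # sampling ratio (assumed value)

user = pd.read_csv('../raw_data/user_profile.csv')
raw_sample = pd.read_csv('../raw_data/raw_sample.csv')
ad = pd.read_csv('../raw_data/ad_feature.csv')
log = pd.read_csv('../raw_data/behavior_log.csv')

# sample users at a given ratio and deduplicate
sampled_users = user['userid'].drop_duplicates().sample(frac=FRAC, random_state=2019)

# keep only the records of the sampled users
raw_sample = raw_sample[raw_sample['user'].isin(sampled_users)]
log = log[log['user'].isin(sampled_users)]

# fill missing brand values
ad['brand'] = ad['brand'].fillna(-1)
log['brand'] = log['brand'].fillna(-1)

# merge + deduplicate the category/brand vocabularies, then encode both tables consistently
lbe_cate = LabelEncoder().fit(pd.concat([ad['cate_id'], log['cate']]).drop_duplicates())
ad['cate_id'] = lbe_cate.transform(ad['cate_id'])
log['cate'] = lbe_cate.transform(log['cate'])

lbe_brand = LabelEncoder().fit(pd.concat([ad['brand'], log['brand']]).drop_duplicates())
ad['brand'] = lbe_brand.transform(ad['brand'])
log['brand'] = lbe_brand.transform(log['brand'])

# drop the unused behavior-type column and persist the sampled skeleton
log = log.drop(columns=['btag'])
pd.to_pickle(raw_sample, '../sampled_data/raw_sample_' + str(FRAC) + '.pkl')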

4.2 Generating the input required by DIEN

2_gen_dien_input.py is used to generate the input required by DIEN.

First, get the user-session-related files produced from the sampled data (DIEN itself does not need the session concept):

FILE_NUM = len(
    list(
        filter(lambda x: x.startswith('user_hist_session_' + str(FRAC) + '_din_'), os.listdir('../sampled_data/'))))

Iterate over these files and merge their contents into user_hist_session:

for i in range(FILE_NUM):
    user_hist_session_ = pd.read_pickle(
        '../sampled_data/user_hist_session_' + str(FRAC) + '_din_' + str(i) + '.pkl')
    user_hist_session.update(user_hist_session_)
    del user_hist_session_

The session data is then generated with gen_sess_feature_dien; the dictionaries are built row by row, with a tqdm progress bar.

Generate dicts in which each value is a list of user behaviors (cate_id, brand, time_stamp):

sess_input_dict = {'cate_id': [], 'brand': []}
neg_sess_input_dict = {'cate_id': [], 'brand': []}
sess_input_length = []
for row in tqdm(sample_sub[['user', 'time_stamp']].iterrows()):
    a, b, n_a, n_b, c = gen_sess_feature_dien(row)
    sess_input_dict['cate_id'].append(a)
    sess_input_dict['brand'].append(b)
    neg_sess_input_dict['cate_id'].append(n_a)
    neg_sess_input_dict['brand'].append(n_b)
    sess_input_length.append(c)

The gen_sess_feature_dien function obtains the session data:

  • The current user's session is traversed from back to front to locate the records whose timestamps precede the sample's time_stamp;
  • Through for e in cur_sess[max(0, i + 1 - sess_max_len):i + 1], the last sess_max_len records of that session are fetched;
  • The sample function is used to generate negative sample data;
  • The results are the 'cate_id' and 'brand' lists.
def gen_sess_feature_dien(row):
    sess_max_len = DIN_SESS_MAX_LEN
    sess_input_dict = {'cate_id': [0], 'brand': [0]}
    neg_sess_input_dict = {'cate_id': [0], 'brand': [0]}
    sess_input_length = 0
    user, time_stamp = row[1]['user'], row[1]['time_stamp']
    if user not in user_hist_session or len(user_hist_session[user]) == 0:

        sess_input_dict['cate_id'] = [0]
        sess_input_dict['brand'] = [0]
        neg_sess_input_dict['cate_id'] = [0]
        neg_sess_input_dict['brand'] = [0]
        sess_input_length = 0
    else:
        cur_sess = user_hist_session[user][0]
        for i in reversed(range(len(cur_sess))):
            if cur_sess[i][2] < time_stamp:
                sess_input_dict['cate_id'] = [e[0]
                                              for e in cur_sess[max(0, i + 1 - sess_max_len):i + 1]]
                sess_input_dict['brand'] = [e[1]
                                            for e in cur_sess[max(0, i + 1 - sess_max_len):i + 1]]

                neg_sess_input_dict = {'cate_id': [], 'brand': []}

                for c in sess_input_dict['cate_id']:
                    neg_cate, neg_brand = sample(c)
                    neg_sess_input_dict['cate_id'].append(neg_cate)
                    neg_sess_input_dict['brand'].append(neg_brand)

                sess_input_length = len(sess_input_dict['brand'])
                break
    return sess_input_dict['cate_id'], sess_input_dict['brand'], neg_sess_input_dict['cate_id'], neg_sess_input_dict[
        'brand'], sess_input_length

sample generates negative sample data: draw a random index, and if the corresponding category equals cate_id, draw again until a different category is found. This yields one negative sample.

def sample(cate_id):
    global ad
    while True:
        i = np.random.randint(0, ad.shape[0])
        sample_cate = ad.iloc[i]['cate_id']
        if sample_cate != cate_id:
            break
    return sample_cate, ad.iloc[i]['brand']

Fill the missing values in user with -1; rename the new_user_class_level column (removing the trailing space in the original column name); rename user to 'userid':

user = user.fillna(-1)
user.rename(
    columns={'new_user_class_level ': 'new_user_class_level'}, inplace=True)
sample_sub.rename(columns={'user': 'userid'}, inplace=True)

Join sample_sub, user and ad into data:

data = pd.merge(sample_sub, user, how='left', on='userid', )
data = pd.merge(data, ad, how='left', on='adgroup_id')

Encode the sparse_features with LabelEncoder, numbering the text features.

Apply standard scaling to the dense_features:

sparse_features = ['userid', 'adgroup_id', 'pid', 'cms_segid', 'cms_group_id', 'final_gender_code', 'age_level',
                   'pvalue_level', 'shopping_level', 'occupation', 'new_user_class_level', 'campaign_id', 'customer']
dense_features = ['price']

for feat in tqdm(sparse_features):
    lbe = LabelEncoder()  # or Hash
    data[feat] = lbe.fit_transform(data[feat])
mms = StandardScaler()
data[dense_features] = mms.fit_transform(data[dense_features])

For sparse_features and dense_features, build SingleFeat objects respectively; SingleFeat is a namedtuple defined in DeepCTR (a rough sketch of it follows the code below).

sparse_feature_list = [SingleFeat(feat, data[feat].nunique(
) + 1) for feat in sparse_features + ['cate_id', 'brand']]

dense_feature_list = [SingleFeat(feat, 1) for feat in dense_features]
sess_feature = ['cate_id', 'brand']
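For reference, SingleFeat can be thought of roughly as the following namedtuple. This is only an illustrative sketch; the actual definition in DeepCTR may carry extra fields (e.g. hashing or dtype flags):

from collections import namedtuple

# illustrative only: the feature name and its vocabulary size (dimension)
SingleFeat = namedtuple('SingleFeat', ['name', 'dimension'])

feat = SingleFeat('cate_id', 5000)  # hypothetical vocabulary size
print(feat.name, feat.dimension)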

For the features in sess_feature, the values are taken from sess_input_dict and neg_sess_input_dict and padded into sequences. These are the positive and negative session behavior data:

sess_input = [pad_sequences(
    sess_input_dict[feat], maxlen=DIN_SESS_MAX_LEN, padding='post') for feat in sess_feature]
neg_sess_input = [pad_sequences(neg_sess_input_dict[feat], maxlen=DIN_SESS_MAX_LEN, padding='post') for feat in
                  sess_feature]

For the features in sparse_feature_list and dense_feature_list, the corresponding values in data are collected to build model_input.

sess_input, neg_sess_input and [np.array(sess_input_length)] are combined into sess_lists.

sess_lists is then appended to model_input:

model_input = [data[feat.name].values for feat in sparse_feature_list] + \
              [data[feat.name].values for feat in dense_feature_list]
sess_lists = sess_input + neg_sess_input + [np.array(sess_input_length)]
model_input += sess_lists

The next step is to store the data into a file.

pd.to_pickle(model_input, '../model_input/dien_input_' +
             str(FRAC) + '_' + str(DIN_SESS_MAX_LEN) + '.pkl')
pd.to_pickle(data['clk'].values, '../model_input/dien_label_' +
             str(FRAC) + '_' + str(DIN_SESS_MAX_LEN) + '.pkl')
try:
    pd.to_pickle({'sparse': sparse_feature_list, 'dense': dense_feature_list},
                 '../model_input/dien_fd_' + str(FRAC) + '_' + str(DIN_SESS_MAX_LEN) + '.pkl', )
except:
    pd.to_pickle({'sparse': sparse_feature_list, 'dense': dense_feature_list},
                 '../model_input/dien_fd_' + str(FRAC) + '_' + str(DIN_SESS_MAX_LEN) + '.pkl', )

0x05 DIEN model

Specifically, the model code can be divided into two parts:

  • train_dien.py is responsible for building the model and the surrounding training logic;
  • dien.py is the concrete model implementation.

5.1 train_dien.py

First, read in the data:

fd = pd.read_pickle('../model_input/dien_fd_' +
                    str(FRAC) + '_' + str(SESS_MAX_LEN) + '.pkl')
model_input = pd.read_pickle(
    '../model_input/dien_input_' + str(FRAC) + '_' + str(SESS_MAX_LEN) + '.pkl')
label = pd.read_pickle('../model_input/dien_label_' +
                       str(FRAC) + '_' + str(SESS_MAX_LEN) + '.pkl')

sample_sub = pd.read_pickle(
    '../sampled_data/raw_sample_' + str(FRAC) + '.pkl')

Split the training and test sets by timestamp and build the labels:

sample_sub['idx'] = list(range(sample_sub.shape[0]))
# 1494633600 is 2017-05-13 00:00:00 UTC: the first 7 days are used for training, the 8th day for testing
train_idx = sample_sub.loc[sample_sub.time_stamp < 1494633600, 'idx'].values
test_idx = sample_sub.loc[sample_sub.time_stamp >= 1494633600, 'idx'].values

train_input = [i[train_idx] for i in model_input]
test_input = [i[test_idx] for i in model_input]

train_label = label[train_idx]
test_label = label[test_idx]

sess_len_max = SESS_MAX_LEN
BATCH_SIZE = 4096
sess_feature = ['cate_id', 'brand']
TEST_BATCH_SIZE = 2 ** 14

Generate the Keras model, so that we can then call fit and predict on it:

model = DIEN(fd, sess_feature, 4, sess_len_max, "AUGRU", att_hidden_units=(64, 16),
             att_activation='sigmoid', use_negsampling=DIEN_NEG_SAMPLING)

Compile, fit, predict.

model.compile('adagrad', 'binary_crossentropy',
              metrics=['binary_crossentropy', ])

if DIEN_NEG_SAMPLING:
    hist_ = model.fit(train_input, train_label, batch_size=BATCH_SIZE,
                      epochs=1, initial_epoch=0, verbose=1, )
    pred_ans = model.predict(test_input, TEST_BATCH_SIZE)
else:
    # without negative sampling, drop the two negative-sample session inputs:
    # train_input[:-3] removes neg cate/brand and the length, train_input[-1:] adds the length back
    hist_ = model.fit(train_input[:-3] + train_input[-1:], train_label, batch_size=BATCH_SIZE, epochs=1,
                      initial_epoch=0, verbose=1, )
    pred_ans = model.predict(
        test_input[:-3] + test_input[-1:], TEST_BATCH_SIZE)
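The script then evaluates the predictions against the baseline AUC mentioned earlier. The exact evaluation code is not shown here; a typical sketch using scikit-learn, with test_label and pred_ans from above, would be:

from sklearn.metrics import log_loss, roc_auc_score

# compare against the baseline AUC of 0.622 from the dataset description
print("test LogLoss", round(log_loss(test_label, pred_ans), 4))
print("test AUC", round(roc_auc_score(test_label, pred_ans), 4))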

5.2 dien.py

Here is the model code, the core of this article. Since the general idea of DIEN is unchanged, we will focus on the differences from the earlier second version (github.com/mouna99/die…).

5.2.1 The second version

Let's first recall the main code of the second version, for comparison.

class Model_DIN_V2_Gru_Vec_attGru(Model):
    def __init__(self, n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE, use_negsampling=False):
        super(Model_DIN_V2_Gru_Vec_attGru, self).__init__(n_uid, n_mid, n_cat,
                                                          EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE,
                                                          use_negsampling)

        # RNN layer(-s)
        with tf.name_scope('rnn_1'):
            rnn_outputs, _ = dynamic_rnn(GRUCell(HIDDEN_SIZE), inputs=self.item_his_eb, sequence_length=self.seq_len_ph, dtype=tf.float32, scope="gru1")
            tf.summary.histogram('GRU_outputs', rnn_outputs)

        # Attention layer
        with tf.name_scope('Attention_layer_1'):
            att_outputs, alphas = din_fcn_attention(self.item_eb, rnn_outputs, ATTENTION_SIZE, self.mask,
                                                    softmax_stag=1, stag='1_1', mode='LIST', return_alphas=True)
            tf.summary.histogram('alpha_outputs', alphas)

        with tf.name_scope('rnn_2'):
            rnn_outputs2, final_state2 = dynamic_rnn(VecAttGRUCell(HIDDEN_SIZE), inputs=rnn_outputs, att_scores=tf.expand_dims(alphas, -1),
                                                     sequence_length=self.seq_len_ph, dtype=tf.float32,
                                                     scope="gru2")
            tf.summary.histogram('GRU2_Final_State', final_state2)

        inp = tf.concat([self.uid_batch_embedded, self.item_eb, self.item_his_eb_sum, self.item_eb * self.item_his_eb_sum, final_state2], 1)
        self.build_fcn_net(inp, use_dice=True)

Now let's look at the latest third version.

5.2.2 Input Parameters

The input parameters of DIEN are as follows; their meaning can be roughly understood from the docstring.

def DIEN(feature_dim_dict, seq_feature_list, embedding_size=8, hist_len_max=16,
         gru_type="GRU", use_negsampling=False, alpha=1.0, use_bn=False, dnn_hidden_units=(200, 80),
         dnn_activation='relu',
         att_hidden_units=(64, 16), att_activation="dice", att_weight_normalization=True,
         l2_reg_dnn=0, l2_reg_embedding=1e-6, dnn_dropout=0, init_std=0.0001, seed=1024, task='binary'):
    """Instantiates the Deep Interest Evolution Network architecture.

    :param feature_dim_dict: dict,to indicate sparse field (**now only support sparse feature**)like {'sparse':{'field_1':4,'field_2':3,'field_3':2},'dense':[]}
    :param seq_feature_list: list,to indicate  sequence sparse field (**now only support sparse feature**),must be a subset of ``feature_dim_dict["sparse"]``
    :param embedding_size: positive integer,sparse feature embedding_size.
    :param hist_len_max: positive int, to indicate the max length of seq input
    :param gru_type: str,can be GRU AIGRU AUGRU AGRU
    :param use_negsampling: bool, whether or not use negtive sampling
    :param alpha: float ,weight of auxiliary_loss
    :param use_bn: bool. Whether use BatchNormalization before activation or not in deep net
    :param dnn_hidden_units: list,list of positive integer or empty list, the layer number and units in each layer of DNN
    :param dnn_activation: Activation function to use in DNN
    :param att_hidden_units: list,list of positive integer , the layer number and units in each layer of attention net
    :param att_activation: Activation function to use in attention net
    :param att_weight_normalization: bool.Whether normalize the attention score of local activation unit.
    :param l2_reg_dnn: float. L2 regularizer strength applied to DNN
    :param l2_reg_embedding: float. L2 regularizer strength applied to embedding vector
    :param dnn_dropout: float in [0,1), the probability we will drop out a given DNN coordinate.
    :param init_std: float,to use as the initialize std of embedding vector
    :param seed: integer ,to use as random seed.
    :param task: str, ``"binary"`` for  binary logloss or  ``"regression"`` for regression loss
    :return: A Keras model instance.

    """

5.2.3 Building vectors

Two kinds of vectors are involved here: the dense vector and the sparse vector. A dense vector stores every value, including zeros, while a sparse vector stores only the index positions and values of the non-zero entries. The sparse vector only shows its advantage when the data volume is relatively large; a small example follows.
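As a minimal illustration (not code from this repository), the same vector can be stored densely or sparsely:

# dense representation: every position is stored, including zeros
dense = [1.0, 0.0, 0.0, 0.0, 3.5, 0.0]

# sparse representation: only the non-zero positions and their values
sparse = {"size": 6, "indices": [0, 4], "values": [1.0, 3.5]}

# reconstructing the dense form from the sparse form
recovered = [0.0] * sparse["size"]
for idx, val in zip(sparse["indices"], sparse["values"]):
    recovered[idx] = val
assert recovered == dense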

The get_input function builds the inputs from the feature dictionary:

def get_input(feature_dim_dict, seq_feature_list, seq_max_len):
    sparse_input, dense_input = create_singlefeat_inputdict(feature_dim_dict)
    user_behavior_input = OrderedDict()
    for i, feat in enumerate(seq_feature_list):
        user_behavior_input[feat] = Input(shape=(seq_max_len,), name='seq_' + str(i) + '-' + feat)
    user_behavior_length = Input(shape=(1,), name='seq_length')
    return sparse_input, dense_input, user_behavior_input, user_behavior_length

Then traverse feature_dim_dict to build the embedding dictionary, in which each item is name: Embedding. Its purpose is to look up, for each specific input variable in sparse_input, the corresponding embedding in sparse_embedding_dict.

sparse_embedding_dict = {feat.name: Embedding(feat.dimension, embedding_size,
                                              embeddings_initializer=RandomNormal(
                                                  mean=0.0, stddev=init_std, seed=seed),
                                              embeddings_regularizer=l2(
                                                  l2_reg_embedding),
                                              name='sparse_emb_' + str(i) + '-' + feat.name) for i, feat in
                         enumerate(feature_dim_dict["sparse"])}

Get the embedding variables; each embedding_dict[feat] is a matrix:

query_emb_list = get_embedding_vec_list(sparse_embedding_dict, sparse_input, feature_dim_dict["sparse"],
                                        return_feat_list=seq_feature_list)

Then concatenate them (a sketch of concat_fun follows the snippet):

query_emb = concat_fun(query_emb_list)
keys_emb = concat_fun(keys_emb_list)
deep_input_emb = concat_fun(deep_input_emb_list)
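concat_fun is a small helper in DeepCTR. To the best of my understanding it behaves roughly like the following sketch, returning the single tensor unchanged if there is only one, otherwise concatenating along the last axis:

from tensorflow.keras.layers import Concatenate

def concat_fun(inputs, axis=-1):
    # single tensor: nothing to concatenate
    if len(inputs) == 1:
        return inputs[0]
    # multiple tensors: concatenate along the given axis (last axis by default)
    return Concatenate(axis=axis)(inputs)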

5.2.4 Interest evolution layer

Next, the interest evolution layer is invoked:

hist, aux_loss_1 = interest_evolution(keys_emb, query_emb, user_behavior_length, gru_type=gru_type,
                                      use_neg=use_negsampling, neg_concat_behavior=neg_concat_behavior,
                                      embedding_size=embedding_size, att_hidden_size=att_hidden_units,
                                      att_activation=att_activation,
                                      att_weight_normalization=att_weight_normalization, )

Among them:

  • DynamicGRU is equivalent to dynamic_rnn in the second version, i.e. the first layer 'rnn_1';
  • auxiliary_loss is almost identical to the second version;
  • auxiliary_net differs only in the last step, y_hat = tf.nn.sigmoid(dnn3).

The specific code is as follows:

def interest_evolution(concat_behavior, deep_input_item, user_behavior_length, gru_type="GRU", use_neg=False,
                       neg_concat_behavior=None, embedding_size=8, att_hidden_size=(64, 16), att_activation='sigmoid',
                       att_weight_normalization=False, ):

    aux_loss_1 = None

    rnn_outputs = DynamicGRU(embedding_size * 2, return_sequence=True,
                             name="gru1")([concat_behavior, user_behavior_length])

    if gru_type == "AUGRU" and use_neg:
        aux_loss_1 = auxiliary_loss(rnn_outputs[:, :-1, :], concat_behavior[:, 1:, :],

                                    neg_concat_behavior[:, 1:, :],

                                    tf.subtract(user_behavior_length, 1), stag="gru")  # [:, 1:]

    if gru_type == "GRU":
        rnn_outputs2 = DynamicGRU(embedding_size * 2, return_sequence=True,
                                  name="gru2")([rnn_outputs, user_behavior_length])
        hist = AttentionSequencePoolingLayer(att_hidden_units=att_hidden_size, att_activation=att_activation,
                                             weight_normalization=att_weight_normalization, return_score=False)([
            deep_input_item, rnn_outputs2, user_behavior_length])

    else:  # AIGRU AGRU AUGRU

        scores = AttentionSequencePoolingLayer(att_hidden_units=att_hidden_size, att_activation=att_activation,
                                               weight_normalization=att_weight_normalization, return_score=True)([
            deep_input_item, rnn_outputs, user_behavior_length])

        if gru_type == "AIGRU":
            hist = multiply([rnn_outputs, Permute([2, 1])(scores)])
            final_state2 = DynamicGRU(embedding_size * 2, gru_type="GRU", return_sequence=False, name='gru2')(
                [hist, user_behavior_length])
        else:  # AGRU AUGRU
            final_state2 = DynamicGRU(embedding_size * 2, gru_type=gru_type, return_sequence=False,
                                      name='gru2')([rnn_outputs, user_behavior_length, Permute([2, 1])(scores)])
        hist = final_state2
    return hist, aux_loss_1
5.2.4.1 DynamicGRU 1

DynamicGRU is equivalent to dynamic_rnn in the second version; this is the first layer, 'rnn_1'.

This layer corresponds to the yellow part of the architecture diagram, namely the Interest Extractor Layer, whose main component is the GRU.

Its main function is to extract the user's interest sequence from the behavior sequence by simulating the process of interest migration. The item embeddings of the user's behavior history are fed into the dynamic RNN (the first-layer GRU).

rnn_outputs = DynamicGRU(embedding_size * 2, return_sequence=True,
                         name="gru1")([concat_behavior, user_behavior_length])
5.2.4.2 auxiliary_loss

The computation of the auxiliary loss is essentially a binary classification model; it corresponds to the auxiliary loss formula in the paper (reproduced below).
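For reference, the auxiliary loss defined in the DIEN paper is roughly the following, where h denotes the hidden states of the first GRU, e_b the embeddings of the actually clicked (positive) next behaviors, ê_b the embeddings of the sampled negative behaviors, and σ(·) the auxiliary network with a sigmoid output:

$$
L_{aux} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t}\Big[\log \sigma\big(\mathbf{h}_t^i, \mathbf{e}_b^i[t+1]\big) + \log\big(1-\sigma(\mathbf{h}_t^i, \hat{\mathbf{e}}_b^i[t+1])\big)\Big]
$$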

auxiliary_loss is almost identical to the second version:

def auxiliary_loss(h_states, click_seq, noclick_seq, mask, stag=None):
    #:param h_states:
    #:param click_seq:
    #:param noclick_seq: #[B,T-1,E]
    #:param mask:#[B,1]
    #:param stag:
    #:return:
    hist_len, _ = click_seq.get_shape().as_list()[1:]
    mask = tf.sequence_mask(mask, hist_len)
    mask = mask[:, 0, :]
    mask = tf.cast(mask, tf.float32)
    # penultimate first dimension concat, the rest unchanged
    click_input_ = tf.concat([h_states, click_seq], -1)
    # penultimate first dimension concat, the rest unchanged
    noclick_input_ = tf.concat([h_states, noclick_seq], -1)
    # Obtain the last y_hat of positive sample
    click_prop_ = auxiliary_net(click_input_, stag=stag)[:, :, 0]
    # Obtain the last y_hat of negative sample
    noclick_prop_ = auxiliary_net(noclick_input_, stag=stag)[
                    :, :, 0]  # [B,T-1]
    # Log loss and mask true historical behavior
    click_loss_ = - tf.reshape(tf.log(click_prop_),
                               [-1, tf.shape(click_seq)[1]]) * mask
    noclick_loss_ = - \
                        tf.reshape(tf.log(1.0 - noclick_prop_),
                                   [-1, tf.shape(noclick_seq)[1]]) * mask
    loss_ = tf.reduce_mean(click_loss_ + noclick_loss_)

    return loss_
5.2.4.3 auxiliary_net

auxiliary_net differs only in the last step, y_hat = tf.nn.sigmoid(dnn3).

def auxiliary_net(in_, stag='auxiliary_net'):
    bn1 = tf.layers.batch_normalization(
        inputs=in_, name='bn1' + stag, reuse=tf.AUTO_REUSE)
    dnn1 = tf.layers.dense(bn1, 100, activation=None,
                           name='f1' + stag, reuse=tf.AUTO_REUSE)
    dnn1 = tf.nn.sigmoid(dnn1)
    dnn2 = tf.layers.dense(dnn1, 50, activation=None,
                           name='f2' + stag, reuse=tf.AUTO_REUSE)
    dnn2 = tf.nn.sigmoid(dnn2)
    dnn3 = tf.layers.dense(dnn2, 1, activation=None,
                           name='f3' + stag, reuse=tf.AUTO_REUSE)
    y_hat = tf.nn.sigmoid(dnn3)
    return y_hat
5.2.4.4 AttentionSequencePoolingLayer

This part is implemented in DeepCTR and corresponds to din_fcn_attention in the second version; the two can be compared side by side.

In DIEN, the role of the 'Attention_layer_1' layer is to add an attention mechanism on top of the interest extraction layer, in order to model the interest evolution process related to the current target AD. That is, the output of the first GRU layer is fed into the second GRU layer, and the attention score (computed from the first layer's output vectors and the candidate item) is used to control the update gate of the second GRU layer.

class AttentionSequencePoolingLayer(Layer):
    """The Attentional sequence pooling operation used in DIN.

      Input shape
        - A list of three tensor: [query, keys, keys_length]
        - query is a 3D tensor with shape: ``(batch_size, 1, embedding_size)``
        - keys is a 3D tensor with shape: ``(batch_size, T, embedding_size)``
        - keys_length is a 2D tensor with shape: ``(batch_size, 1)``

      Output shape
        - 3D tensor with shape: ``(batch_size, 1, embedding_size)``.

      Arguments
        - **att_hidden_units**: list of positive integer, the attention net layer number and units in each layer.
        - **att_activation**: Activation function to use in attention net.
        - **weight_normalization**: bool. Whether normalize the attention score of local activation unit.
        - **supports_masking**: If True, the input need to support masking.

      References
        - [Zhou G, Zhu X, Song C, et al. Deep interest network for click-through rate prediction[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2018: 1059-1068.](https://arxiv.org/pdf/1706.06978.pdf)
    """

    def __init__(self, att_hidden_units=(80, 40), att_activation='sigmoid', weight_normalization=False,
                 return_score=False,
                 supports_masking=False, **kwargs):
        self.att_hidden_units = att_hidden_units
        self.att_activation = att_activation
        self.weight_normalization = weight_normalization
        self.return_score = return_score
        super(AttentionSequencePoolingLayer, self).__init__(**kwargs)
        self.supports_masking = supports_masking

    def build(self, input_shape):
        if not self.supports_masking:
            if not isinstance(input_shape, list) or len(input_shape) != 3:
                raise ValueError('...')
            if len(input_shape[0]) != 3 or len(input_shape[1]) != 3 or len(input_shape[2]) != 2:
                raise ValueError('...')
            if input_shape[0][-1] != input_shape[1][-1] or input_shape[0][1] != 1 or input_shape[2][1] != 1:
                raise ValueError('...')
        else:
            pass
        self.local_att = LocalActivationUnit(
            self.att_hidden_units, self.att_activation, l2_reg=0, dropout_rate=0, use_bn=False, seed=1024, )
        super(AttentionSequencePoolingLayer, self).build(
            input_shape)  # Be sure to call this somewhere!

    def call(self, inputs, mask=None, training=None, **kwargs):
        if self.supports_masking:
            if mask is None:
                raise ValueError('...')
            queries, keys = inputs
            key_masks = tf.expand_dims(mask[-1], axis=1)
        else:
            queries, keys, keys_length = inputs
            hist_len = keys.get_shape()[1]
            key_masks = tf.sequence_mask(keys_length, hist_len)

        attention_score = self.local_att([queries, keys], training=training)
        outputs = tf.transpose(attention_score, (0, 2, 1))

        if self.weight_normalization:
            paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)
        else:
            paddings = tf.zeros_like(outputs)

        outputs = tf.where(key_masks, outputs, paddings)

        if self.weight_normalization:
            outputs = tf.nn.softmax(outputs)

        if not self.return_score:
            outputs = tf.matmul(outputs, keys)

        outputs._uses_learning_phase = attention_score._uses_learning_phase

        return outputs
5.2.4.5 DynamicGRU 2

The attention score is fed into the GRU as part of its input.

    if gru_type == "AIGRU":
        hist = multiply([rnn_outputs, Permute([2, 1])(scores)])
        final_state2 = DynamicGRU(embedding_size * 2, gru_type="GRU", return_sequence=False, name='gru2')(
            [hist, user_behavior_length])
    else:  # AGRU AUGRU
        final_state2 = DynamicGRU(embedding_size * 2, gru_type=gru_type, return_sequence=False,
                                  name='gru2')([rnn_outputs, user_behavior_length, Permute([2, 1])(scores)])
    hist = final_state2

This part is also implemented in DeepCTR and corresponds to the GRU part of the second version; VecAttGRUCell and the related cells were migrated over.

class DynamicGRU(Layer):
    def __init__(self, num_units=None, gru_type='GRU', return_sequence=True, **kwargs):

        self.num_units = num_units
        self.return_sequence = return_sequence
        self.gru_type = gru_type
        super(DynamicGRU, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        input_seq_shape = input_shape[0]
        if self.num_units is None:
            self.num_units = input_seq_shape.as_list()[-1]
        if self.gru_type == "AGRU":
            self.gru_cell = QAAttGRUCell(self.num_units)
        elif self.gru_type == "AUGRU":
            self.gru_cell = VecAttGRUCell(self.num_units)
        else:
            self.gru_cell = tf.nn.rnn_cell.GRUCell(self.num_units)

        # Be sure to call this somewhere!
        super(DynamicGRU, self).build(input_shape)

    def call(self, input_list):
        """
        :param concated_embeds_value: None * field_size * embedding_size
        :return: None*1
        """
        if self.gru_type == "GRU" or self.gru_type == "AIGRU":
            rnn_input, sequence_length = input_list
            att_score = None
        else:
            rnn_input, sequence_length, att_score = input_list

        rnn_output, hidden_state = dynamic_rnn(self.gru_cell, inputs=rnn_input, att_scores=att_score,
                                               sequence_length=tf.squeeze(sequence_length, ), dtype=tf.float32, scope=self.name)
        if self.return_sequence:
            return rnn_output
        else:
            return tf.expand_dims(hidden_state, axis=1)

5.2.5 DNN fully connected layer

Now that we have the concatenated dense representation vector, the next step is to use fully connected layers to automatically learn nonlinear combinations of the features.

A multi-layer neural network then produces the final CTR estimate; this part is a straightforward function call, corresponding to the final MLP in the paper.

The code is as follows:

deep_input_emb = Concatenate()([deep_input_emb, hist])

deep_input_emb = tf.keras.layers.Flatten()(deep_input_emb)
if len(dense_input) > 0:
    deep_input_emb = Concatenate()(
        [deep_input_emb] + list(dense_input.values()))

output = DNN(dnn_hidden_units, dnn_activation, l2_reg_dnn,
             dnn_dropout, use_bn, seed)(deep_input_emb)
final_logit = Dense(1, use_bias=False)(output)
output = PredictionLayer(task)(final_logit)

model_input_list = get_inputs_list(
    [sparse_input, dense_input, user_behavior_input])

if use_negsampling:
    model_input_list += list(neg_user_behavior_input.values())

model_input_list += [user_behavior_length]

model = tf.keras.models.Model(inputs=model_input_list, outputs=output)

if use_negsampling:
    model.add_loss(alpha * aux_loss_1)
tf.keras.backend.get_session().run(tf.global_variables_initializer())
return model

At this point, the analysis of the Keras version is basically complete.

0xEE Personal information

Thoughts on life and technology

WeChat official account: Rosie's Thinking

If you want to receive timely updates of my articles, or check out the technical material I recommend, please follow this account.

0xFF References

AD click through rate prediction: a brief introduction to the DeepCTR library

DeepCTR: Easy-to-use and scalable deep learning click-through rate prediction algorithm package

Deepctr framework code reading