0x00 Abstract

Deep Interest Network (DIN) was proposed by Alimama's precise targeting retrieval and basic algorithm team in June 2017. It is a CTR prediction model for the e-commerce industry that focuses on making full use of, and mining information from, historical user behavior data.

By interpreting the paper and the source code, this series of articles reviews some concepts related to deep learning and their TensorFlow implementation. This second article analyzes how the training data is generated and how user sequences are modeled.

0x01 What data is required for DIN

Let's first summarize how DIN works:

  • CTR models generally abstract the user's behavior sequence into a feature, referred to here as the behavior embedding.
  • Earlier prediction models treat all items in a user's behavior sequence equally, for example with simple pooling, possibly adding time decay.
  • DIN digs deeper into user intent: the relevance between each historical behavior and the candidate item differs. Building on this observation, a module that computes this relevance (later called attention) is used to weight the pooling of the behavior sequence and obtain the desired embedding (see the sketch after this list).
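
To make the weighted-pooling idea concrete, here is a minimal numpy sketch (my own illustration, not the DIN implementation: DIN scores relevance with a small MLP over the embeddings rather than a plain dot product, and does not have to normalize the weights with softmax):

# Minimal sketch of attention-weighted pooling over a user's behavior sequence.
import numpy as np

def attention_pooling(history_emb, candidate_emb):
    # history_emb: (T, d) embeddings of the user's historical behaviors
    # candidate_emb: (d,) embedding of the candidate item
    scores = np.dot(history_emb, candidate_emb)           # relevance of each behavior to the candidate
    weights = np.exp(scores) / np.exp(scores).sum()       # normalize the weights
    return (weights[:, None] * history_emb).sum(axis=0)   # weighted sum pooling -> (d,)

hist = np.random.rand(5, 8)    # 5 historical behaviors, embedding dim 8
cand = np.random.rand(8)
user_interest = attention_pooling(hist, cand)   # interest representation w.r.t. this candidate
print(user_interest.shape)     # (8,)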

It can be seen that the user behavior sequence is the core input. Around it, a series of related data about users, items, and item attributes is needed. So DIN requires the following data:

  • User dictionary: the id corresponding to each user name;
  • Movie dictionary: the id corresponding to each item;
  • Category dictionary: the id corresponding to each category;
  • The category information corresponding to each item;
  • Training data, in the format: label, user name, target item, target item category, history items, and the categories of the history items;
  • Test data, in the same format as the training data.

0x02 How to Generate the Data

The prepare_data.sh script handles the data processing and generates the various data files. Its content is as follows:

export PATH="~/anaconda4/bin:$PATH"

wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books.json.gz
wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_Books.json.gz

gunzip reviews_Books.json.gz
gunzip meta_Books.json.gz

python script/process_data.py meta_Books.json reviews_Books_5.json
python script/local_aggretor.py
python script/split_by_user.py
python script/generate_voc.py

These processing scripts do the following:

  • process_data.py: generates the metadata files, builds negative samples, and separates samples;
  • local_aggretor.py: generates user behavior sequences;
  • split_by_user.py: splits the data into training and test sets;
  • generate_voc.py: generates three data dictionaries, for users, movies, and categories respectively.

2.1 Basic Data

This paper uses Amazon Product Data, which contains two files: reviews_Electronics_5.json and meta_Electronics.json.

Among them:

  • The reviews file holds the information generated when users purchase and review products, including product ID, timestamp, review text, and so on.
  • The meta file holds information about the product itself, including product ID, name, category, related-product information, and so on.

The specific format is as follows:

reviews_Electronics data:

  reviewerID     - reviewer ID, e.g. [A2SUAM1J3GNN3B]
  asin           - product ID, e.g. [0000013714]
  reviewerName   - reviewer nickname
  helpful        - helpfulness rating of the review, e.g. 2/3
  reviewText     - text of the review
  overall        - product rating
  summary        - summary of the review
  unixReviewTime - review time (Unix timestamp)
  reviewTime     - review time (raw)

meta_Electronics data:

  asin        - product ID
  title       - product name
  imUrl       - product image URL
  categories  - list of categories the product belongs to
  description - product description

User behavior in this dataset is rich: every user and every item has at least five reviews. The features used include goods_id, cate_id, the user's reviewed goods_id_list, and cate_id_list. All behaviors of a user are denoted (b1, b2, ..., bk, ..., bn).

The task is to predict the (k+1)-th reviewed item based on the first k reviewed items. Training samples are generated for each user with k = 1, 2, ..., n-2.
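
As a minimal illustration of how this framing turns a single user's sequence into samples (a sketch based on the description above, not the repository code):

# Sketch only: turn one user's time-ordered behaviors b1..bn into (history -> next item) samples,
# using k = 1 .. n-2 for training and leaving the last item for testing.
def make_samples(behaviors):
    samples = []
    for k in range(1, len(behaviors) - 1):   # k = 1 .. n-2
        history = behaviors[:k]              # the first k reviewed items
        target = behaviors[k]                # the (k+1)-th item to predict
        samples.append((history, target))
    return samples

print(make_samples(["b1", "b2", "b3", "b4", "b5"]))
# [(['b1'], 'b2'), (['b1', 'b2'], 'b3'), (['b1', 'b2', 'b3'], 'b4')]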

2.2 Data Processing

2.2.1 Generating Metadata

By processing these two JSON files, we can generate two metadata files: item-info and reviews-info.

python script/process_data.py meta_Books.json reviews_Books_5.json

The specific code is as follows; it performs a simple extraction:

def process_meta(file):
    fi = open(file, "r")
    fo = open("item-info", "w")
    for line in fi:
        obj = eval(line)
        cat = obj["categories"][0][-1]
        print >> fo, obj["asin"] + "\t" + cat

def process_reviews(file):
    fi = open(file, "r")
    user_map = {}
    fo = open("reviews-info", "w")
    for line in fi:
        obj = eval(line)
        userID = obj["reviewerID"]
        itemID = obj["asin"]
        rating = obj["overall"]
        time = obj["unixReviewTime"]
        print >> fo, userID + "\t" + itemID + "\t" + str(rating) + "\t" + str(time)

The generated files are as follows.

The format of reviews-info is: userID, itemID, rating, timestamp

A2S166WSCFIFP5	000100039X	5.0	1071100800
A1BM81XB4QHOA3	000100039X	5.0	1390003200
A1MOSTXNIO5MPJ	000100039X	5.0	1317081600
A2XQ5LZHTD4AFT	000100039X	5.0	1033948800
A3V1MKC2BVWY48	000100039X	5.0	1390780800
A12387207U8U24	000100039X	5.0	1206662400

item-info maps a product ID to the category the product belongs to, acting like a mapping table; for example, product 0001048791 corresponds to the Books category.

0001048791	Books
0001048775	Books
0001048236	Books
0000401048	Books
0001019880	Books
0001048813	Books

2.2.2 Building the Sample List

Negative samples are constructed by the manual_join function. The specific logic is as follows:

  • Build item_list, the list of all item IDs (used later for random negative sampling);
  • Build the behavior sequences of all users. Each element of a user's sequence is a 2-tuple (the raw line "userID \t itemID \t rating \t timestamp", timestamp);
  • Iterate over each user:
    • Sort the user's behavior sequence by timestamp;
    • For each sorted user behavior, build two samples:
      • A negative sample: replace the item ID of the behavior with a randomly selected item ID and set the click label to 0;
      • A positive sample: the behavior itself, with the click label set to 1;
      • Write both samples to the output file.

For example, the item list is:

item_list = 
 0000000 = {str} '000100039X'
 0000001 = {str} '000100039X'
 0000002 = {str} '000100039X'
 0000003 = {str} '000100039X'
 0000004 = {str} '000100039X'
 0000005 = {str} '000100039X'

The sequence of user behaviors is:

user_map = {dict: 603668}
 'A1BM81XB4QHOA3' = {list: 6}
  0 = {tuple: 2} ('A1BM81XB4QHOA3\t000100039X\t5.0\t1390003200', 1390003200.0)
  1 = {tuple: 2} ('A1BM81XB4QHOA3\t0060838582\t5.0\t1190851200', 1190851200.0)
  2 = {tuple: 2} ('A1BM81XB4QHOA3\t0743241924\t4.0\t1143158400', 1143158400.0)
  3 = {tuple: 2} ('A1BM81XB4QHOA3\t0848732391\t2.0\t1300060800', 1300060800.0)
  4 = {tuple: 2} ('A1BM81XB4QHOA3\t0884271781\t5.0\t1403308800', 1403308800.0)
  5 = {tuple: 2} ('A1BM81XB4QHOA3\t1885535104\t5.0\t1390003200', 1390003200.0)
 'A1MOSTXNIO5MPJ' = {list: 9}
  0 = {tuple: 2} ('A1MOSTXNIO5MPJ\t000100039X\t5.0\t1317081600', 1317081600.0)
  1 = {tuple: 2} ('A1MOSTXNIO5MPJ\t0143142941\t4.0\t1211760000', 1211760000.0)
  2 = {tuple: 2} ('A1MOSTXNIO5MPJ\t0310325366\t1.0\t1259712000', 1259712000.0)
  3 = {tuple: 2} ('A1MOSTXNIO5MPJ\t0393062112\t5.0\t1179964800', 1179964800.0)
  4 = {tuple: 2} ('A1MOSTXNIO5MPJ\t0872203247\t3.0\t1211760000', 1211760000.0)
  5 = {tuple: 2} ('A1MOSTXNIO5MPJ\t1455504181\t5.0\t1398297600', 1398297600.0)
  6 = {tuple: 2} ('A1MOSTXNIO5MPJ\t1596917024\t5.0\t1369440000', 1369440000.0)
  7 = {tuple: 2} ('A1MOSTXNIO5MPJ\t1600610676\t5.0\t1276128000', 1276128000.0)
  8 = {tuple: 2} ('A1MOSTXNIO5MPJ\t9380340141\t3.0\t1369440000', 1369440000.0)

The specific code is as follows:

def manual_join():
    f_rev = open("reviews-info", "r")
    user_map = {}
    item_list = []
    for line in f_rev:
        line = line.strip()
        items = line.split("\t")
        if items[0] not in user_map:
            user_map[items[0]] = []
        user_map[items[0]].append(("\t".join(items), float(items[-1])))
        item_list.append(items[1])

    f_meta = open("item-info", "r")
    meta_map = {}
    for line in f_meta:
        arr = line.strip().split("\t")
        if arr[0] not in meta_map:
            meta_map[arr[0]] = arr[1]
            arr = line.strip().split("\t")

    fo = open("jointed-new", "w")
    for key in user_map:
        sorted_user_bh = sorted(user_map[key], key=lambda x: x[1])  # sort the user's behaviors by timestamp
        for line, t in sorted_user_bh:
            # for every user behavior
            items = line.split("\t")
            asin = items[1]
            j = 0
            while True:
                asin_neg_index = random.randint(0, len(item_list) - 1)  # index of a random item ID
                asin_neg = item_list[asin_neg_index]  # the random item ID
                if asin_neg == asin:  # if it happens to be the positive item ID, draw again
                    continue
                items[1] = asin_neg
                # write the negative sample
                print >> fo, "0" + "\t" + "\t".join(items) + "\t" + meta_map[asin_neg]
                j += 1
                if j == 1:  # negative sampling frequency
                    break
            # write the positive sample
            if asin in meta_map:
                print >> fo, "1" + "\t" + line + "\t" + meta_map[asin]
            else:
                print >> fo, "1" + "\t" + line + "\t" + "default_cat"

An extract of the resulting file is as follows; it contains a series of positive and negative samples.

0	A10000012B7CGYKOMPQ4L	140004314X	5.0	1355616000	Books
1	A10000012B7CGYKOMPQ4L	000100039X	5.0	1355616000	Books
0	A10000012B7CGYKOMPQ4L	1477817603	5.0	1355616000	Books
1	A10000012B7CGYKOMPQ4L	0393967972	5.0	1355616000	Books
0	A10000012B7CGYKOMPQ4L	0778329933	5.0	1355616000	Books
1	A10000012B7CGYKOMPQ4L	0446691437	5.0	1355616000	Books
0	A10000012B7CGYKOMPQ4L	B006P5CH1O	4.0	1355616000	Collections & Anthologies

2.2.3 Sample separation

This step tags the samples so that each user's last two samples on the timeline can be identified.

  • Read jointed-new from the previous step;
  • Count the number of records of each user in user_count;
  • Walk through jointed-new again:
    • If the line is one of the user's last two records, prefix it with 20190119;
    • Otherwise (the user's earlier records), prefix it with 20180118;
    • Write the new records to jointed-new-split-info;

So in jointed-new-split-info, the records prefixed with 20190119 are the last two records of each user's behavior: one positive sample and one negative sample, the latest two in time.

The code is as follows:

def split_test():
    fi = open("jointed-new", "r")
    fo = open("jointed-new-split-info", "w")
    user_count = {}
    for line in fi:
        line = line.strip()
        user = line.split("\t")[1]
        if user not in user_count:
            user_count[user] = 0
        user_count[user] += 1
    fi.seek(0)
    i = 0
    last_user = "A26ZDKC53OP6JD"
    for line in fi:
        line = line.strip()
        user = line.split("\t")[1]
        if user == last_user:
            if i < user_count[user] - 2:  # 1 + negative samples
                print >> fo, "20180118" + "\t" + line
            else:
                print >> fo, "20190119" + "\t" + line
        else:
            last_user = user
            i = 0
            if i < user_count[user] - 2:
                print >> fo, "20180118" + "\t" + line
            else:
                print >> fo, "20190119" + "\t" + line
        i += 1

The resulting file looks like this:

20180118	0	A10000012B7CGYKOMPQ4L	140004314X	5.0	1355616000	Books
20180118	1	A10000012B7CGYKOMPQ4L	000100039X	5.0	1355616000	Books
20180118	0	A10000012B7CGYKOMPQ4L	1477817603	5.0	1355616000	Books
20180118	1	A10000012B7CGYKOMPQ4L	0393967972	5.0	1355616000	Books
20180118	0	A10000012B7CGYKOMPQ4L	0778329933	5.0	1355616000	Books
20180118	1	A10000012B7CGYKOMPQ4L	0446691437	5.0	1355616000	Books
20180118	0	A10000012B7CGYKOMPQ4L	B006P5CH1O	4.0	1355616000	Collections & Anthologies
20180118	1	A10000012B7CGYKOMPQ4L	0486227081	4.0	1355616000	Books
20180118	0	A10000012B7CGYKOMPQ4L	B00HWI5OP4	4.0	1355616000	United States
20180118	1	A10000012B7CGYKOMPQ4L	048622709X	4.0	1355616000	Books
20180118	0	A10000012B7CGYKOMPQ4L	1475005873	4.0	1355616000	Books
20180118	1	A10000012B7CGYKOMPQ4L	0486274268	4.0	1355616000	Books
20180118	0	A10000012B7CGYKOMPQ4L	098960571X	4.0	1355616000	Books
20180118	1	A10000012B7CGYKOMPQ4L	0486404730	4.0	1355616000	Books
20190119	0	A10000012B7CGYKOMPQ4L	1495459225	4.0	1355616000	Books
20190119	1	A10000012B7CGYKOMPQ4L	0830604790	4.0	1355616000	Books

2.2.4 Generating behavior sequences

local_aggretor.py is used to generate the user behavior sequences.

For example, for the user with reviewerID = 0, whose pos_list is [13179, 17993, 28326, 29247, 62275], the generated training samples have the form (reviewerID, hist, pos_item, 1) and (reviewerID, hist, neg_item, 0).

Note that hist contains neither pos_item nor neg_item; it only contains the items clicked before pos_item. DIN uses an attention-like mechanism in which only past behaviors influence the current prediction, so it makes sense that hist only contains items clicked before pos_item.

The specific logic is:

  • Traverse all lines of jointed-new-split-info:
    • Keep accumulating the item IDs and category IDs of the clicked (label 1) behaviors;
      • If the line starts with 20180118, write the sample to local_train;
      • If the line starts with 20190119, write the sample to local_test.

Because the 20190119 lines are the last two in time, the final local_test file ends up with two cumulative behavior sequences per user, i.e. sequences that include the user's whole history from beginning to end.

The file naming is a bit odd here, because the actual training and testing later both use data from the local_test file.

There is one positive sample and one negative sample per user, and the two lines are identical except for the last item ID and the click label.

The specific code is as follows:

fin = open("jointed-new-split-info", "r")
ftrain = open("local_train", "w")
ftest = open("local_test", "w")

last_user = "0"
common_fea = ""
line_idx = 0
for line in fin:
    items = line.strip().split("\t")
    ds = items[0]
    clk = int(items[1])
    user = items[2]
    movie_id = items[3]
    dt = items[5]
    cat1 = items[6]

    if ds == "20180118":
        fo = ftrain
    else:
        fo = ftest
    if user != last_user:
        movie_id_list = []
        cate1_list = []
    else:
        history_clk_num = len(movie_id_list)
        cat_str = ""
        mid_str = ""
        for c1 in cate1_list:
            cat_str += c1 + "\x02"   # the list fields are joined with the non-printable '\x02' separator
        for mid in movie_id_list:
            mid_str += mid + "\x02"  # (it does not show up in the printed extracts below)
        if len(cat_str) > 0: cat_str = cat_str[:-1]
        if len(mid_str) > 0: mid_str = mid_str[:-1]
        if history_clk_num >= 1:    # 8 is the average length of user behavior
            print >> fo, items[1] + "\t" + user + "\t" + movie_id + "\t" + cat1 + "\t" + mid_str + "\t" + cat_str
    last_user = user
    if clk:  # only clicked (label 1) behaviors are accumulated
        movie_id_list.append(movie_id)  # accumulate the corresponding movie id
        cate1_list.append(cat1)  # accumulate the corresponding category id
    line_idx += 1


Finally, an extract of the local_test data is as follows (the items within the two sequence columns are separated by the invisible '\x02' character, so they look concatenated here):

0	A10000012B7CGYKOMPQ4L	1495459225	Books	000100039X039396797204466914370486227081048622709X04862742680486404730	BooksBooksBooksBooksBooksBooksBooks
1	A10000012B7CGYKOMPQ4L	0830604790	Books	000100039X039396797204466914370486227081048622709X04862742680486404730	BooksBooksBooksBooksBooksBooksBooks

2.2.5 Splitting into Training and Test Sets

split_by_user.py is used to split the data set.

For each pair of lines, a random integer between 1 and 10 is drawn; if it is exactly 2, the pair is used as the validation (test) data set.

import random

fi = open("local_test", "r")
ftrain = open("local_train_splitByUser", "w")
ftest = open("local_test_splitByUser", "w")

while True:
    rand_int = random.randint(1, 10)
    noclk_line = fi.readline().strip()
    clk_line = fi.readline().strip()
    if noclk_line == "" or clk_line == "":
        break
    if rand_int == 2:
        print >> ftest, noclk_line
        print >> ftest, clk_line
    else:
        print >> ftrain, noclk_line
        print >> ftrain, clk_line

Examples are as follows:

The format is: label, user ID, candidate item ID, candidate item category, behavior sequence, category sequence

0	A3BI7R43VUZ1TY	B00JNHU0T2	Literature & Fiction	0989464105B00B01691C14778097321608442845	BooksLiterature & FictionBooksBooks
1	A3BI7R43VUZ1TY	0989464121	Books	0989464105B00B01691C14778097321608442845	BooksLiterature & FictionBooksBooks

2.2.6 Generating a Data dictionary

generate_voc.py generates three data dictionaries: user, movie, and category. They contain all user IDs, all movie IDs, and all category IDs respectively; each entry is simply mapped to an index.

The movie IDs, categories, and reviewerIDs are used to produce three maps (movie_map, cate_map, and uid_map). The key is the original value and the value is its index (entries are sorted by frequency of occurrence; the movie and category dictionaries reserve index 0 for default_mid / default_cat). The corresponding columns of the original data can then be converted to these indices.

import cPickle

f_train = open("local_train_splitByUser", "r")
uid_dict = {}
mid_dict = {}
cat_dict = {}

iddd = 0
for line in f_train:
    arr = line.strip("\n").split("\t")
    clk = arr[0]
    uid = arr[1]
    mid = arr[2]
    cat = arr[3]
    mid_list = arr[4]
    cat_list = arr[5]
    if uid not in uid_dict:
        uid_dict[uid] = 0
    uid_dict[uid] += 1
    if mid not in mid_dict:
        mid_dict[mid] = 0
    mid_dict[mid] += 1
    if cat not in cat_dict:
        cat_dict[cat] = 0
    cat_dict[cat] += 1
    if len(mid_list) == 0:
        continue
    for m in mid_list.split("\x02"):  # the list columns use the '\x02' separator (see local_aggretor.py above)
        if m not in mid_dict:
            mid_dict[m] = 0
        mid_dict[m] += 1
    iddd += 1
    for c in cat_list.split("\x02"):
        if c not in cat_dict:
            cat_dict[c] = 0
        cat_dict[c] += 1

sorted_uid_dict = sorted(uid_dict.iteritems(), key=lambda x: x[1], reverse=True)
sorted_mid_dict = sorted(mid_dict.iteritems(), key=lambda x: x[1], reverse=True)
sorted_cat_dict = sorted(cat_dict.iteritems(), key=lambda x: x[1], reverse=True)

uid_voc = {}
index = 0
for key, value in sorted_uid_dict:
    uid_voc[key] = index
    index += 1

mid_voc = {}
mid_voc["default_mid"] = 0
index = 1
for key, value in sorted_mid_dict:
    mid_voc[key] = index
    index += 1

cat_voc = {}
cat_voc["default_cat"] = 0
index = 1
for key, value in sorted_cat_dict:
    cat_voc[key] = index
    index += 1

cPickle.dump(uid_voc, open("uid_voc.pkl", "w"))
cPickle.dump(mid_voc, open("mid_voc.pkl", "w"))
cPickle.dump(cat_voc, open("cat_voc.pkl", "w"))

Finally, we get the files that the DIN model will consume:

  • uid_voc.pkl: user dictionary, mapping user names to ids;
  • mid_voc.pkl: movie dictionary, mapping items to ids;
  • cat_voc.pkl: category dictionary, mapping categories to ids;
  • item-info: the category information of each item;
  • reviews-info: review metadata in the format userID, itemID, rating, timestamp, used for negative sampling;
  • local_train_splitByUser: training data, in the format label, user name, target item, target item category, history items, and the categories of the history items;
  • local_test_splitByUser: test data, in the same format as the training data.

0x03 How to Use the Data

3.1 Training data

train.py first evaluates the test set once with the initial model, then trains batch by batch and periodically evaluates the test set again (every test_iter batches).

The code for the lite version is as follows:

def train(
        train_file = "local_train_splitByUser",
        test_file = "local_test_splitByUser",
        uid_voc = "uid_voc.pkl",
        mid_voc = "mid_voc.pkl",
        cat_voc = "cat_voc.pkl",
        batch_size = 128,
        maxlen = 100,
        test_iter = 100,
        save_iter = 100,
        model_type = 'DNN',
        seed = 2,
):
    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
        # get training data and test data
        train_data = DataIterator(train_file, uid_voc, mid_voc, cat_voc, batch_size, maxlen, shuffle_each_epoch=False)
        test_data = DataIterator(test_file, uid_voc, mid_voc, cat_voc, batch_size, maxlen)
        n_uid, n_mid, n_cat = train_data.get_n()
        # build the model
        model = Model_DIN(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)
        iter = 0
        lr = 0.001
        for itr in range(3):
            loss_sum = 0.0
            accuracy_sum = 0.
            aux_loss_sum = 0.
            for src, tgt in train_data:
                # prepare data
                uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, noclk_mids, noclk_cats = prepare_data(src, tgt, maxlen, return_neg=True)
                # train on one batch
                loss, acc, aux_loss = model.train(sess, [uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, lr, noclk_mids, noclk_cats])

                loss_sum += loss
                accuracy_sum += acc
                aux_loss_sum += aux_loss
                iter += 1
                if (iter % test_iter) == 0:
                    eval(sess, test_data, model, best_model_path)
                    loss_sum = 0.0
                    accuracy_sum = 0.0
                    aux_loss_sum = 0.0
                if (iter % save_iter) == 0:
                    model.save(sess, model_path + "--" + str(iter))
            lr *= 0.5

3.2 Iteratively read

DataIterator is an iterator that returns the next batch of data on each call. This code handles how the data is split into batches and how the iterator is constructed.

As mentioned above, the format of the training data is: label, user ID, candidate item ID, candidate item category, behavior sequence, category sequence.

3.2.1 initialization

The basic logic of the __init__ function is:

  • Load the three dictionaries into self.source_dicts, i.e. [uid_voc, mid_voc, cat_voc];
  • Build self.meta_id_map, which maps each movie index to its category index, i.e. construct the mapping between movie ID and category ID; the key line is self.meta_id_map[mid_idx] = cat_idx;
  • Read reviews-info to build the list of item indices needed for negative sampling;
  • Record various basic statistics, such as the number of users, the number of movies, and so on.

The code is as follows:

class DataIterator:

    def __init__(self, source,
                 uid_voc,
                 mid_voc,
                 cat_voc,
                 batch_size=128,
                 maxlen=100,
                 skip_empty=False,
                 shuffle_each_epoch=False,
                 sort_by_length=True,
                 max_batch_size=20,
                 minlen=None):
        if shuffle_each_epoch:
            self.source_orig = source
            self.source = shuffle.main(self.source_orig, temporary=True)
        else:
            self.source = fopen(source, 'r')
        self.source_dicts = []
        # self.source_dicts = [uid_voc, mid_voc, cat_voc]
        for source_dict in [uid_voc, mid_voc, cat_voc]:
            self.source_dicts.append(load_dict(source_dict))

        # build the movie-id -> category-id mapping: self.meta_id_map[mid_idx] = cat_idx
        f_meta = open("item-info", "r")
        meta_map = {}
        for line in f_meta:
            arr = line.strip().split("\t")
            if arr[0] not in meta_map:
                meta_map[arr[0]] = arr[1]
        self.meta_id_map = {}
        for key in meta_map:
            val = meta_map[key]
            if key in self.source_dicts[1]:
                mid_idx = self.source_dicts[1][key]
            else:
                mid_idx = 0
            if val in self.source_dicts[2]:
                cat_idx = self.source_dicts[2][val]
            else:
                cat_idx = 0
            self.meta_id_map[mid_idx] = cat_idx

        # read reviews-info to build the list of item indices used for negative sampling
        f_review = open("reviews-info", "r")
        self.mid_list_for_random = []
        for line in f_review:
            arr = line.strip().split("\t")
            tmp_idx = 0
            if arr[1] in self.source_dicts[1]:
                tmp_idx = self.source_dicts[1][arr[1]]
            self.mid_list_for_random.append(tmp_idx)

        # basic statistics such as the number of users, the number of movies, etc.
        self.batch_size = batch_size
        self.maxlen = maxlen
        self.minlen = minlen
        self.skip_empty = skip_empty

        self.n_uid = len(self.source_dicts[0])
        self.n_mid = len(self.source_dicts[1])
        self.n_cat = len(self.source_dicts[2])

        self.shuffle = shuffle_each_epoch
        self.sort_by_length = sort_by_length

        self.source_buffer = []
        self.k = batch_size * max_batch_size

        self.end_of_data = False

The final data is as follows:

self = {DataIterator} <data_iterator.DataIterator object at 0x000001F56CB44BA8>
 batch_size = {int} 128
 k = {int} 2560
 maxlen = {int} 100
 meta_id_map = {dict: 367983} {0: 1572, 115840: 1, 282448: 1, 198250: 1, 4275: 1, 260890: 1, 260584: 1, 110331: 1, 116224: 1, 2704: 1, 298259: 1, 47792: 1, 186701: 1, 121548: 1, 147230: 1, 238085: 1, 367828: 1, 270505: 1, 354813: 1, ...}
 mid_list_for_random = {list: 8898041} [4275, 4275, 4275, 4275, 4275, 4275, 4275, 4275, ...]
 minlen = {NoneType} None
 n_cat = {int} 1601
 n_mid = {int} 367983
 n_uid = {int} 543060
 shuffle = {bool} False
 skip_empty = {bool} False
 sort_by_length = {bool} True
 source = {TextIOWrapper} <_io.TextIOWrapper name='local_train_splitByUser' mode='r' encoding='cp936'>
 source_buffer = {list: 0} []
 source_dicts = {list: 3}
  0 = {dict: 543060} {'ASEARD9XL1EWO': 449136, 'AZPJ9LUT0FEPY': 0, 'A2NRV79GKAU726': 16, 'A2GEQVDX2LL4V3': 266686, 'A3R04FKEYE19T6': 354817, 'A3VGDQOR56W6KZ': 4, ...}
  1 = {dict: 367983} {'1594483752': 47396, '0738700797': 159716, '1439110239': 193476, ...}
  2 = {dict: 1601} {'Residential': 1281, 'Poetry': 250, 'Winter Sports': 1390, ...}

3.2.2 Iteratively Read

When reading iteratively, the logic is as follows:

  • If self.source_buffer has no data, read k lines from the file; this can be seen as filling the maximum buffer in one go;
  • If sort_by_length is set, sort the buffer by the length of the user history behaviors;
  • The internal iteration then repeatedly pops one record from self.source_buffer:
    • Map the user's historical movie IDs to indices, giving mid_list;
    • Map the historical category IDs to indices, giving cat_list;
    • For each pos_mid in mid_list, generate five negatively sampled historical behaviors: draw 5 ids from mid_list_for_random, redrawing whenever an id equals pos_mid; i.e. for each historical behavior of the user, 5 items are sampled as negatives;
    • Append [uid, mid, cat, mid_list, cat_list, noclk_mid_list, noclk_cat_list] to source as training data;
    • Append [float(ss[0]), 1 - float(ss[0])] to target as the label;
    • Once batch_size is reached, break out of the internal iteration and return the batch, i.e. a list of at most 128 records.

See the specific code below:

def __next__(self):
    if self.end_of_data:
        self.end_of_data = False
        self.reset()
        raise StopIteration

    source = []
    target = []

    # if self.source_buffer is empty, read up to k lines, i.e. fill the maximum buffer in one go
    if len(self.source_buffer) == 0:
        # for k_ in xrange(self.k):
        for k_ in range(self.k):
            ss = self.source.readline()
            if ss == "":
                break
            self.source_buffer.append(ss.strip("\n").split("\t"))

        # sort by history behavior length
        # if sort_by_length is set, sort by the length of the user history behaviors
        if self.sort_by_length:
            his_length = numpy.array([len(s[4].split("\x02")) for s in self.source_buffer])
            tidx = his_length.argsort()

            _sbuf = [self.source_buffer[i] for i in tidx]
            self.source_buffer = _sbuf
        else:
            self.source_buffer.reverse()

    if len(self.source_buffer) == 0:
        self.end_of_data = False
        self.reset()
        raise StopIteration

    try:

        # actual work here, internal iteration begins
        while True:

            # read from source file and map to word index
            try:
                ss = self.source_buffer.pop()
            except IndexError:
                break

            uid = self.source_dicts[0][ss[1]] if ss[1] in self.source_dicts[0] else 0
            mid = self.source_dicts[1][ss[2]] if ss[2] in self.source_dicts[1] else 0
            cat = self.source_dicts[2][ss[3]] if ss[3] in self.source_dicts[2] else 0

            # map the user's historical movie ids to indices, giving mid_list
            tmp = []
            for fea in ss[4].split("\x02"):
                m = self.source_dicts[1][fea] if fea in self.source_dicts[1] else 0
                tmp.append(m)
            mid_list = tmp

            # map the historical category ids to indices, giving cat_list
            tmp1 = []
            for fea in ss[5].split("\x02"):
                c = self.source_dicts[2][fea] if fea in self.source_dicts[2] else 0
                tmp1.append(c)
            cat_list = tmp1

            # read from source file and map to word index

            # if len(mid_list) > self.maxlen:
            #     continue
            if self.minlen != None:
                if len(mid_list) <= self.minlen:
                    continue
            if self.skip_empty and (not mid_list):
                continue

            # for each pos_mid in mid_list, generate 5 negatively sampled historical behaviors:
            # draw ids from mid_list_for_random, redrawing whenever the id equals pos_mid
            noclk_mid_list = []
            noclk_cat_list = []
            for pos_mid in mid_list:
                noclk_tmp_mid = []
                noclk_tmp_cat = []
                noclk_index = 0
                while True:
                    noclk_mid_indx = random.randint(0, len(self.mid_list_for_random) - 1)
                    noclk_mid = self.mid_list_for_random[noclk_mid_indx]
                    if noclk_mid == pos_mid:
                        continue
                    noclk_tmp_mid.append(noclk_mid)
                    noclk_tmp_cat.append(self.meta_id_map[noclk_mid])
                    noclk_index += 1
                    if noclk_index >= 5:
                        break
                noclk_mid_list.append(noclk_tmp_mid)
                noclk_cat_list.append(noclk_tmp_cat)
            source.append([uid, mid, cat, mid_list, cat_list, noclk_mid_list, noclk_cat_list])
            target.append([float(ss[0]), 1 - float(ss[0])])

            if len(source) >= self.batch_size or len(target) >= self.batch_size:
                break
    except IOError:
        self.end_of_data = True

    # all sentence pairs in maxibatch filtered out because of length
    if len(source) == 0 or len(target) == 0:
        source, target = self.next()

    return source, target

3.2.3 Data processing

After a batch is fetched from the iterator, it needs further processing.

uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, noclk_mids, noclk_cats = prepare_data(src, tgt, return_neg=True)

This can be understood as regrouping the batch (say, 128 records): the 128 values of uids, mids, and cats, and the 128 historical sequences, are each aggregated into their own arrays and finally fed to the model for training.

The important point here is the generation of the mask. Its meaning is as follows:

A mask hides certain values so that they have no effect when parameters are updated. The padding mask is one kind of mask.

  • What is a padding mask? The input sequences within a batch have different lengths, so they have to be aligned: short sequences are padded with zeros, while sequences that are too long are truncated from the left, discarding the excess. Since the padded positions are meaningless, the attention mechanism should not attend to them, and some processing is needed.
  • To do this, a very large negative number (effectively negative infinity) is added at these positions, so that after softmax their probabilities approach 0. The padding mask itself is a tensor of booleans, where false marks the positions to be processed this way (see the sketch after this list).
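
A minimal numpy sketch of this trick (an illustration only, not the attention code of the DIN repository):

# Sketch: suppress padded positions with a huge negative number before softmax.
import numpy as np

scores = np.array([2.0, 1.0, 0.5, 0.0, 0.0])            # attention scores; last two positions are padding
key_mask = np.array([True, True, True, False, False])    # False marks the padded positions

paddings = np.ones_like(scores) * (-2 ** 32 + 1)         # "negative infinity" in practice
masked_scores = np.where(key_mask, scores, paddings)

weights = np.exp(masked_scores - masked_scores.max())
weights = weights / weights.sum()                         # softmax; padded positions get ~0 probability
print(weights)    # approximately [0.63, 0.23, 0.14, 0.0, 0.0]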

In DIN's case, the user behavior sequences within a batch have different lengths; the real length of each sequence is stored in keys_length, and a mask is generated so that only the actual historical behaviors are selected.

  • First, the mask is initialized to all zeros;
  • Then, the positions that hold real data are set to 1 (a small sketch of how such a mask can be used follows this list).
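
As an illustration, a minimal sketch (with assumed shapes, not the model code) of how such a 0/1 mask keeps padded positions out of a sum pooling over the history embeddings:

# Sketch: use the 0/1 mask produced by prepare_data to sum-pool only real (non-padded) history embeddings.
import numpy as np

his_emb = np.random.rand(2, 4, 8)          # (batch=2, padded length=4, embedding dim=8)
mid_mask = np.array([[1., 1., 0., 0.],     # user 0 has 2 real behaviors
                     [1., 1., 1., 1.]])    # user 1 has 4 real behaviors

pooled = (his_emb * mid_mask[:, :, None]).sum(axis=1)   # padded positions contribute nothing
print(pooled.shape)   # (2, 8)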

The specific code is as follows:

def prepare_data(input, target, maxlen=None, return_neg=False):
    # x: a list of sentences
    # s[4] is mid_list; each input item has a history list of different length
    lengths_x = [len(s[4]) for s in input]
    seqs_mid = [inp[3] for inp in input]
    seqs_cat = [inp[4] for inp in input]
    noclk_seqs_mid = [inp[5] for inp in input]
    noclk_seqs_cat = [inp[6] for inp in input]

    if maxlen is not None:
        new_seqs_mid = []
        new_seqs_cat = []
        new_noclk_seqs_mid = []
        new_noclk_seqs_cat = []
        new_lengths_x = []
        for l_x, inp in zip(lengths_x, input):
            if l_x > maxlen:
                new_seqs_mid.append(inp[3][l_x - maxlen:])
                new_seqs_cat.append(inp[4][l_x - maxlen:])
                new_noclk_seqs_mid.append(inp[5][l_x - maxlen:])
                new_noclk_seqs_cat.append(inp[6][l_x - maxlen:])
                new_lengths_x.append(maxlen)
            else:
                new_seqs_mid.append(inp[3])
                new_seqs_cat.append(inp[4])
                new_noclk_seqs_mid.append(inp[5])
                new_noclk_seqs_cat.append(inp[6])
                new_lengths_x.append(l_x)
        lengths_x = new_lengths_x
        seqs_mid = new_seqs_mid
        seqs_cat = new_seqs_cat
        noclk_seqs_mid = new_noclk_seqs_mid
        noclk_seqs_cat = new_noclk_seqs_cat

        if len(lengths_x) < 1:
            return None, None, None, None

    # lengths_x holds the real length of each user's history sequence; maxlen_x is the maximum length in the batch
    n_samples = len(seqs_mid)
    maxlen_x = numpy.max(lengths_x)  # the largest mid_list length, 583 in this example
    neg_samples = len(noclk_seqs_mid[0][0])

    # Since user history lengths vary, the mid_his matrix fixes the sequence length to maxlen_x.
    # Sequences shorter than maxlen_x are padded with 0 (mid_his and the other matrices start as zero matrices).
    mid_his = numpy.zeros((n_samples, maxlen_x)).astype('int64')  # shape (128, 583)
    cat_his = numpy.zeros((n_samples, maxlen_x)).astype('int64')
    noclk_mid_his = numpy.zeros((n_samples, maxlen_x, neg_samples)).astype('int64')  # shape (128, 583, 5)
    noclk_cat_his = numpy.zeros((n_samples, maxlen_x, neg_samples)).astype('int64')  # shape (128, 583, 5)
    mid_mask = numpy.zeros((n_samples, maxlen_x)).astype('float32')
    # zip packs the corresponding elements of the iterables into tuples
    for idx, [s_x, s_y, no_sx, no_sy] in enumerate(zip(seqs_mid, seqs_cat, noclk_seqs_mid, noclk_seqs_cat)):
        mid_mask[idx, :lengths_x[idx]] = 1.
        mid_his[idx, :lengths_x[idx]] = s_x
        cat_his[idx, :lengths_x[idx]] = s_y
        # noclk_mid_his and noclk_cat_his are both (128, 583, 5)
        noclk_mid_his[idx, :lengths_x[idx], :] = no_sx  # direct assignment
        noclk_cat_his[idx, :lengths_x[idx], :] = no_sy  # direct assignment

    uids = numpy.array([inp[0] for inp in input])
    mids = numpy.array([inp[1] for inp in input])
    cats = numpy.array([inp[2] for inp in input])

    # pull uid, mid, cat out of input (a list of up to 128 records), aggregate everything and return it
    if return_neg:
        return uids, mids, cats, mid_his, cat_his, mid_mask, numpy.array(target), numpy.array(lengths_x), noclk_mid_his, noclk_cat_his
    else:
        return uids, mids, cats, mid_his, cat_his, mid_mask, numpy.array(target), numpy.array(lengths_x)

3.2.4 Feeding the Model

Finally, the data enters model training, which is this step in train.py:

loss, acc, aux_loss = model.train(sess, [uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, lr, noclk_mids, noclk_cats])

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

WeChat official account: Rosie's Thoughts

If you want timely updates on my articles, or want to see the technical material I recommend, please follow it.

0xFF Reference

Deep Interest Network interpretation

Deep Interest Network (DIN)

DIN paper official implementation analysis

Ali DIN source code how to model user sequence (1) : Base scheme