00. Environment setup pitfalls

  • Faiss does not officially support installation on Windows 10, so install it on a Linux server or in a VM (see: blog.csdn.net/weixin_4241… ; use Python 3.7)
  • Give the virtual machine more than 8 GB of memory; in a test with only 6.2 GB, running the YouTube DNN recall killed the task outright
  • TensorFlow should be pinned to version 2.0 with pip; if pinning fails, create a new conda environment:
conda create -n tensorflow python=3.7...
pip install --upgrade --ignore-installed tensorflow==2.0
  • Installing faiss needs the Tsinghua mirror, otherwise the download speed is only about 0.5 KB/s; when installing from the Tsinghua mirror, drop the trailing -c pytorch from the install command (reference: blog.csdn.net/yuanzhoulvp…). A quick environment check follows the commands below.
conda config --append channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
conda install pytorch faiss-cpu
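A quick way to sanity-check the environment after the installs above (a minimal sketch; it assumes faiss and TensorFlow were installed into the same conda environment):

import faiss
import tensorflow as tf

print(tf.__version__)                   # expect something like 2.0.0
index = faiss.IndexFlatIP(4)            # tiny inner-product index, just to confirm faiss imports and works
print(index.is_trained, index.ntotal)   # True 0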

01. A brief introduction to the idea of multi-recall

Description: multi-recall means using several different strategies, features, or simple models to recall parts of the candidate set separately, and then mixing the different candidate sets together for the downstream ranking stage.

02. Data reading and preparation

#%% 01 imports

import pandas as pd
import numpy as np
from tqdm import tqdm
from collections import defaultdict
import os, math, warnings, pickle
import faiss
import collections
import random
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from datetime import datetime
from deepctr.feature_column import SparseFeat, VarLenSparseFeat
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.models import Model
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

from deepmatch.models import *
from deepmatch.utils import sampledsoftmaxloss
warnings.filterwarnings('ignore')


#%% 02 prepare to read the data

# Three ways of reading the data:
#   1. Debug mode: randomly sample a small portion of the data
#   2. Offline validation mode: use only the training-set data
#   3. Online mode: use all of the training-set + test-set data

# Flag for recall evaluation; if no evaluation is performed, recall runs directly on the full data
metric_recall = False
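# The debug-mode sampler is not included in this snippet; a minimal sketch of what it could
# look like (the function name and the default sample size are assumptions, not original code):
def get_all_click_sample(data_path, sample_nums=10000):
    """Randomly sample sample_nums users from the training click log for quick debugging"""
    all_click = pd.read_csv(data_path + 'train_click_log.csv')
    all_user_ids = all_click['user_id'].unique()
    sample_user_ids = np.random.choice(all_user_ids, size=sample_nums, replace=False)
    all_click = all_click[all_click['user_id'].isin(sample_user_ids)]
    all_click = all_click.drop_duplicates(['user_id', 'click_article_id', 'click_timestamp'])
    return all_click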

# Read data in a CentOS environment

linux_data_path = '/plus/ Aliyun Developer - Tianchi Match /06_ Tianchi News APP recommended /'
save_path = '/ plus/PycharmProjects TianChiProject maple leaf litters/competitions / 006/00 _ mountain _dw_recommandnews/'


max_min_scaler = lambda x : (x-np.min(x))/(np.max(x)-np.min(x))

# Read click data, which is divided into online and offline. Click data in the test set should be merged into the total data in order to obtain online submission results
# If the purpose is to verify the validity of the model or feature offline, only the training set can be used
def get_all_click_df(data_path, offline=True) :
    if offline:
        all_click = pd.read_csv(data_path + 'train_click_log.csv')
    else:
        trn_click = pd.read_csv(data_path + 'train_click_log.csv')
        tst_click = pd.read_csv(data_path + 'testA_click_log.csv')

        all_click = trn_click.append(tst_click)

    all_click = all_click.drop_duplicates(['user_id', 'click_article_id', 'click_timestamp'])
    return all_click

# Read the basic properties of the articles
def get_item_info_df(data_path) :
    item_info_df = pd.read_csv(data_path + 'articles.csv')
    # Rename article_id to click_article_id so it can be joined with the click log
    item_info_df = item_info_df.rename(columns={'article_id': 'click_article_id'})

    return item_info_df

# Read the article embedding data
def get_item_emb_dict(data_path) :
    pickle_file = save_path + 'model/item_content_emb.pkl'
    if os.path.exists(pickle_file):
        print('pickle_file:',pickle_file,'Already exists, load directly.. ')
        i2i_sim = pickle.load(open(pickle_file, 'rb'))
        return i2i_sim
    print('pickle_file:',pickle_file,'Does not exist, need to recalculate... ')

    item_emb_df = pd.read_csv(data_path + 'articles_emb.csv')

    item_emb_cols = [x for x in item_emb_df.columns if 'emb' in x]
    item_emb_np = np.ascontiguousarray(item_emb_df[item_emb_cols])
    # normalize
    item_emb_np = item_emb_np / np.linalg.norm(item_emb_np, axis=1, keepdims=True)

    item_emb_dict = dict(zip(item_emb_df['article_id'], item_emb_np))
    pickle.dump(item_emb_dict, open(pickle_file, 'wb'))

    return item_emb_dict

#%% 03 read the data

# Full training set
all_click_df = get_all_click_df(linux_data_path, offline=False)

# normalize timestamps to calculate weights in association rules
all_click_df['click_timestamp'] = all_click_df[['click_timestamp']].apply(max_min_scaler)

item_info_df = get_item_info_df(linux_data_path)
item_emb_dict = get_item_emb_dict(linux_data_path)


03. Preparation of utility functions (dictionaries, etc.)

#%% 04 utility functions



#%% 4.0 gets the history and last click

# Used when evaluating recall results, doing feature engineering, and building labels to turn the data into a supervised-learning test set

# Get the historical clicks and the last click from the current data
def get_hist_and_last_click(all_click) :

    all_click = all_click.sort_values(by=['user_id', 'click_timestamp'])
    click_last_df = all_click.groupby('user_id').tail(1)

    # If the user has only one click, the hist will be empty, which will cause the user to be invisible during training
    def hist_func(user_df) :
        if len(user_df) == 1:
            return user_df
        else:
            return user_df[:-1]

    click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True)

    return click_hist_df, click_last_df

#%% 4.1 get the user - article - time function

# Used for user-based collaborative filtering with association rules
# Returns {user1: [(item1, time1), (item2, time2), ...], ...}
def get_user_item_time(click_df) :
    click_df = click_df.sort_values('click_timestamp')
    def make_item_time_pair(df) :
        return list(zip(df['click_article_id'], df['click_timestamp']))

    user_item_time_df = click_df.groupby('user_id')['click_article_id', 'click_timestamp'].apply(lambda x: make_item_time_pair(x))\
                                                            .reset_index().rename(columns={0: 'item_time_list'})
    user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list']))

    return user_item_time_dict

#%% 4.2 Get the article - user - time function

# This is used for collaborative filtering of articles based on association rules
# Returns {item1: [(user1, time1), (user2, time2), ...], ...}
# The time here is when the user clicked the current article; it does not seem to matter directly
def get_item_user_time_dict(click_df) :
    def make_user_time_pair(df) :
        return list(zip(df['user_id'], df['click_timestamp']))

    click_df = click_df.sort_values('click_timestamp')
    item_user_time_df = click_df.groupby('click_article_id')['user_id', 'click_timestamp'].apply(lambda x: make_user_time_pair(x))\
                                                            .reset_index().rename(columns={0: 'user_time_list'})

    item_user_time_dict = dict(zip(item_user_time_df['click_article_id'], item_user_time_df['user_time_list']))
    return item_user_time_dict

#%% 4.3 Get the article attribute characteristics

# Get the basic attributes for each article id and store them as dictionaries, convenient for direct use in the later recall and cold-start stages
def get_item_info_dict(item_info_df) :
    item_info_df['created_at_ts'] = item_info_df[['created_at_ts']].apply(max_min_scaler)

    item_type_dict = dict(zip(item_info_df['click_article_id'], item_info_df['category_id']))
    item_words_dict = dict(zip(item_info_df['click_article_id'], item_info_df['words_count']))
    item_created_time_dict = dict(zip(item_info_df['click_article_id'], item_info_df['created_at_ts']))

    return item_type_dict, item_words_dict, item_created_time_dict

#%% 4.4 Get the information of the article clicked by the user in history

def get_user_hist_item_info_dict(all_click) :

    # Dictionary mapping user_id to the set of article categories the user clicked historically
    user_hist_item_typs = all_click.groupby('user_id') ['category_id'].agg(set).reset_index()
    user_hist_item_typs_dict = dict(zip(user_hist_item_typs['user_id'], user_hist_item_typs['category_id']))

    # Dictionary mapping user_id to the set of articles the user clicked historically
    user_hist_item_ids_dict = all_click.groupby('user_id') ['click_article_id'].agg(set).reset_index()
    user_hist_item_ids_dict = dict(zip(user_hist_item_ids_dict['user_id'], user_hist_item_ids_dict['click_article_id']))

    # Dictionary mapping user_id to the average word count of the user's historically clicked articles
    user_hist_item_words = all_click.groupby('user_id') ['words_count'].agg('mean').reset_index()
    user_hist_item_words_dict = dict(zip(user_hist_item_words['user_id'], user_hist_item_words['words_count']))

    # Creation time of the last article clicked by each user_id
    all_click_ = all_click.sort_values('click_timestamp')
    user_last_item_created_time = all_click_.groupby('user_id') ['created_at_ts'].apply(lambda x: x.iloc[-1]).reset_index()

    user_last_item_created_time['created_at_ts'] = user_last_item_created_time[['created_at_ts']].apply(max_min_scaler)

    user_last_item_created_time_dict = dict(zip(user_last_item_created_time['user_id'],                          user_last_item_created_time['created_at_ts']))

    return user_hist_item_typs_dict, user_hist_item_ids_dict, user_hist_item_words_dict, user_last_item_created_time_dict

#%% 4.5 Get the top-k articles with the most clicks

# Get the most clicked articles recently
def get_item_topk_click(click_df, k) :
    topk_click = click_df['click_article_id'].value_counts().index[:k]
    return topk_click

#%% 4.6 Define a multi-recall dictionary

# Get the article attribute information and store it as dictionaries for easy lookup
item_type_dict, item_words_dict, item_created_time_dict = get_item_info_dict(item_info_df)
# Define a multi-recall dictionary; the result of each recall path will be stored in it
user_multi_recall_dict =  {'itemcf_sim_itemcf_recall': {},
                           'embedding_sim_item_recall': {},
                           'youtubednn_recall': {},
                           'youtubednn_usercf_recall': {},
                           'cold_start_recall': {}}
# Extract the last click for recall evaluation. If recall evaluation is not needed, the full training set is used directly for recall (offline validation mode)
# If not doing recall evaluation, use the full data directly for recall without extracting the last click
trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)


#%% 4.7 Recall effect evaluation

# After recall is done, you sometimes need to tune the current recall method or its parameters to get a better recall result, because the recall result determines the upper bound of the final ranking. A recall evaluation method is provided below
# Evaluate the hit rate of the top 10, 20, 30, 40, 50 recalls in order
def metrics_recall(user_recall_items_dict, trn_last_click_df, topk=5) :
    last_click_item_dict = dict(zip(trn_last_click_df['user_id'], trn_last_click_df['click_article_id']))
    user_num = len(user_recall_items_dict)

    for k in range(10, topk+1, 10):
        hit_num = 0
        for user, item_list in user_recall_items_dict.items():
            # Take the top-k recalled items
            tmp_recall_items = [x[0] for x in user_recall_items_dict[user][:k]]
            if last_click_item_dict[user] in set(tmp_recall_items):
                hit_num += 1

        hit_rate = round(hit_num * 1.0 / user_num, 5)
        print(' topk: ', k, ' : ', 'hit_num: ', hit_num, 'hit_rate: ', hit_rate, 'user_num : ', user_num)


Output content:

(none)

05. Calculate the similarity matrix

5.1 Simply calculate the similarity matrix of items

#%% 5 Compute the similarity matrix

# This part mainly obtains the similarity matrix through collaborative filtering and vector retrieval. The similarity matrix is mainly divided into User2user and Item2Item. The similarity matrix of Item2Item based on itemCF is obtained as follows.

#%% 5.1 Calculate item similarity (itemCF i2i_sim)

# Borrowing from the KDD 2020 debiased product recommendation solution, association rules are used when calculating the Item2Item similarity matrix, so the computed article similarity also considers: 1. the time weight of the user clicks; 2. the order weight of the user clicks; 3. the creation-time weight of the articles
def itemcf_sim(df, item_created_time_dict):
    """ Item-based collaborative filtering (see the earlier recommendation-system team learning for details) + association rules
    :param df: click data table
    :param item_created_time_dict: dictionary of article creation times
    :return: item-to-item similarity matrix
    """
    pickle_file = save_path + 'model/itemcf_i2i_sim.pkl'
    if os.path.exists(pickle_file):
        print('pickle_file:',pickle_file,'Already exists, load directly.. ')
        i2i_sim = pickle.load(open(pickle_file, 'rb'))
        return i2i_sim
    print('pickle_file:',pickle_file,'Does not exist, need to recalculate... ')

    user_item_time_dict = get_user_item_time(df)
    # Calculate item similarity
    i2i_sim = {}
    item_cnt = defaultdict(int)
    for user, item_time_list in tqdm(user_item_time_dict.items()):
        # Time can also be taken into account when optimizing item-based collaborative filtering
        for loc1, (i, i_click_time) in enumerate(item_time_list):
            item_cnt[i] += 1
            i2i_sim.setdefault(i, {})
            for loc2, (j, j_click_time) in enumerate(item_time_list):
                if(i == j):
                    continue

                # Consider forward and reverse order clicks for articles
                loc_alpha = 1.0 if loc2 > loc1 else 0.7
                # Position information weight, where the parameters can be adjusted
                loc_weight = loc_alpha * (0.9 ** (np.abs(loc2 - loc1) - 1))
                # Click-time weight; the parameters here can be tuned
                click_time_weight = np.exp(0.7 ** np.abs(i_click_time - j_click_time))
                # The weight of the creation time of two articles, where the parameters can be adjusted
                created_time_weight = np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))
                i2i_sim[i].setdefault(j, 0)
                # Consider the weight of various factors to calculate the similarity between the final articles
                i2i_sim[i][j] += loc_weight * click_time_weight * created_time_weight / math.log(len(item_time_list) + 1)

    i2i_sim_ = i2i_sim.copy()
    for i, related_items in i2i_sim.items():
        for j, wij in related_items.items():
            i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j])

    # Save the resulting similarity matrix locally
    dump_path = save_path + 'model/itemcf_i2i_sim.pkl'
    print('dump_path:', dump_path)
    pickle.dump(i2i_sim_, open(dump_path, 'wb'))
    print('dump_path done')
    return i2i_sim_

# itemcf_sim(all_click_df, item_created_time_dict)

#%% generates the similarity matrix

i2i_sim = itemcf_sim(all_click_df, item_created_time_dict)

5.2 Using Faiss vector similarity to speed up the computation

The Faiss toolkit is generally used in the vector-recall part of a recommendation system. The candidate set in vector recall is massive, and the cost of an N*N pairwise similarity computation is unbearable, so Faiss is used to speed up retrieving the top-k vectors most similar to each query vector.

  1. PCA dimensionality reduction: for details see the link below, a summary of the principle of principal component analysis (PCA), www.cnblogs.com/pinard/p/62… (in short: find the most important aspects of the data and use them in place of the original data, reducing N dimensions down to 1)
  2. PQ encoding: for details see the link below, a worked example of the product quantization algorithm, www.fabwrite.com/productquan… Quantization essentially decomposes the original high-dimensional space into the Cartesian product of a finite number of low-dimensional subspaces, which are then quantized separately. OPQ tries to find an orthogonal matrix such that, after rotating and decomposing the original matrix, the reconstruction error of the quantized vectors is minimized. (A toy Faiss IVF-PQ example follows this list.)
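As an aside, here is a minimal sketch of how a PQ-based (IVF-PQ) index can be built with Faiss. It only illustrates the idea above and is not part of this project's code (the project uses a flat inner-product index below); the dimensions and parameters are arbitrary assumptions:

import faiss
import numpy as np

d, nlist, m = 64, 100, 8                              # vector dim, number of IVF cells, PQ sub-quantizers (assumed values)
xb = np.random.random((10000, d)).astype('float32')   # toy database vectors

quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer used by the IVF structure
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per PQ code
index.train(xb)                                       # learn the IVF cells and the PQ codebooks
index.add(xb)
dist, idx = index.search(xb[:5], 10)                  # approximate top-10 neighbours of the first 5 vectors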

code:

# topk: for each item, Faiss retrieves the topk most similar items
def embdding_sim(click_df, item_emb_df, save_path, topk) :
    """ For each article, return the topk most similar articles based on embedding similarity. Faiss is used for acceleration because the number of articles is large.
    :param click_df: click data
    :param item_emb_df: article embedding data
    :param save_path: path where results are saved
    :param topk: number of similar articles to keep per article
    """
    pickle_file = save_path + 'model/faiss_emb_i2i_sim.pkl'
    if os.path.exists(pickle_file):
        print('pickle_file:',pickle_file,'Already exists, load directly.. ')
        i2i_sim = pickle.load(open(pickle_file, 'rb'))
        return i2i_sim
    print('pickle_file:',pickle_file,'Does not exist, need to recalculate... ')

    # Mapping from article row index to the raw article id
    item_idx_2_rawid_dict = dict(zip(item_emb_df.index, item_emb_df['article_id']))

    item_emb_cols = [x for x in item_emb_df.columns if 'emb' in x]
    item_emb_np = np.ascontiguousarray(item_emb_df[item_emb_cols].values, dtype=np.float32)
    # Normalize the vectors to unit length
    item_emb_np = item_emb_np / np.linalg.norm(item_emb_np, axis=1, keepdims=True)

    # Build the Faiss index
    item_index = faiss.IndexFlatIP(item_emb_np.shape[1])
    item_index.add(item_emb_np)
    # For each vector in the index, return the topk most similar items and their similarities
    sim, idx = item_index.search(item_emb_np, topk) # returns a list

    # Convert the retrieval results back to correspondences between raw article ids
    item_sim_dict = collections.defaultdict(dict)
    for target_idx, sim_value_list, rele_idx_list in tqdm(zip(range(len(item_emb_np)), sim, idx)):
        target_raw_id = item_idx_2_rawid_dict[target_idx]
        # Start from 1 to skip the item itself, so only topk-1 similar items are kept
        for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]):
            rele_raw_id = item_idx_2_rawid_dict[rele_idx]
            item_sim_dict[target_raw_id][rele_raw_id] = item_sim_dict.get(target_raw_id, {}).get(rele_raw_id, 0) + sim_value

    # Save the i2i similarity matrix
    pickle.dump(item_sim_dict, open(pickle_file, 'wb'))

    return item_sim_dict

#%% 5.4 Calculate and pickle the results
item_emb_df = pd.read_csv(linux_data_path + '/articles_emb.csv')
# Topk is configurable
emb_i2i_sim = embdding_sim(all_click_df, item_emb_df, save_path, topk=10)


Output:

pickle_file: / plus/PycharmProjects/TianChiProject maple leaf litters/competitions / 006/00 _ mountain _dw_recommandnews/model/faiss_emb_i2i_sim.pkl Does not exist, need to recalculate...
364047it [00:12, 28353.99it/s]

06. Recall

Common recall strategies:

  • YouTube DNN recall
  • Article-based recall (see the item-CF recall sketch after this list)
    • Article collaborative filtering
    • Article-embedding-based recall
  • User-based recall
    • User collaborative filtering
    • User embedding
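The recall step that actually consumes the i2i similarity matrices from section 05 is not shown in this excerpt. Below is a minimal sketch of how an item-CF recall could look; the function name, the hot-article back-fill and the sentinel score are assumptions for illustration, not the tutorial's exact implementation:

# For one user: score candidate articles by summing their similarity to the user's
# historically clicked articles, then keep the top recall_item_num candidates.
def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click):
    user_hist_items = {item_id for item_id, _ in user_item_time_dict[user_id]}

    item_rank = {}
    for item_id, _ in user_item_time_dict[user_id]:
        sim_items = sorted(i2i_sim.get(item_id, {}).items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]
        for sim_item, score in sim_items:
            if sim_item in user_hist_items:
                continue
            item_rank[sim_item] = item_rank.get(sim_item, 0) + score

    # If fewer than recall_item_num candidates were found, back-fill with globally hot articles
    for hot_item in item_topk_click:
        if len(item_rank) >= recall_item_num:
            break
        if hot_item not in item_rank and hot_item not in user_hist_items:
            item_rank[hot_item] = -100  # low sentinel score so back-filled items rank last

    return sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]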

6.2 YoutubeDNN recall (this step directly produces each user's list of recalled candidate articles)

# negsample is the number of negative samples drawn when building samples with the sliding window
def gen_data_set(data, negsample=0) :
    data.sort_values("click_timestamp", inplace=True)
    item_ids = data['click_article_id'].unique()

    train_set = []
    test_set = []
    for reviewerID, hist in tqdm(data.groupby('user_id')):
        pos_list = hist['click_article_id'].tolist()

        if negsample > 0:
            candidate_set = list(set(item_ids) - set(pos_list))   # Select negative samples from articles the user has not seen
            neg_list = np.random.choice(candidate_set,size=len(pos_list)*negsample,replace=True)  # For each positive sample, select n negative samples

        # If the sequence length is only 1, the data still needs to go into the training set; otherwise the embedding learned for it would be missing
        if len(pos_list) == 1:
            train_set.append((reviewerID, [pos_list[0]], pos_list[0], 1, len(pos_list)))
            test_set.append((reviewerID, [pos_list[0]], pos_list[0], 1, len(pos_list)))

        # Build positive and negative samples with a sliding window
        for i in range(1, len(pos_list)):
            hist = pos_list[:i]

            if i != len(pos_list) - 1:
                train_set.append((reviewerID, hist[::-1], pos_list[i], 1, len(hist[::-1])))  # [user_id, his_item, pos_item, label, len(his_item)]
                for negi in range(negsample):
                    train_set.append((reviewerID, hist[::-1], neg_list[i*negsample+negi], 0, len(hist[::-1])))  # [user_id, his_item, neg_item, label, len(his_item)]
            else:
                # use the longest sequence length as test data
                test_set.append((reviewerID, hist[::-1], pos_list[i], 1, len(hist[::-1])))

    random.shuffle(train_set)
    random.shuffle(test_set)

    return train_set, test_set

# Pad the input data so that the sequence features have a consistent length
def gen_model_input(train_set,user_profile,seq_max_len) :

    train_uid = np.array([line[0] for line in train_set])
    train_seq = [line[1] for line in train_set]
    train_iid = np.array([line[2] for line in train_set])
    train_label = np.array([line[3] for line in train_set])
    train_hist_len = np.array([line[4] for line in train_set])

    train_seq_pad = pad_sequences(train_seq, maxlen=seq_max_len, padding='post', truncating='post', value=0)
    train_model_input = {"user_id": train_uid, "click_article_id": train_iid, "hist_article_id": train_seq_pad,
                         "hist_len": train_hist_len}

    return train_model_input, train_label

#%% 6.1.2

def youtubednn_u2i_dict(data, topk=20) :
    sparse_features = ["click_article_id", "user_id"]
    SEQ_LEN = 30  # length of the user click sequence; shorter sequences are padded, longer ones truncated

    user_profile_ = data[["user_id"]].drop_duplicates('user_id')
    item_profile_ = data[["click_article_id"]].drop_duplicates('click_article_id')

    # Label-encode the categorical features
    features = ["click_article_id", "user_id"]
    feature_max_idx = {}

    for feature in features:
        lbe = LabelEncoder()
        data[feature] = lbe.fit_transform(data[feature])
        feature_max_idx[feature] = data[feature].max() + 1

    # Extract the portrait of User and item. Further analysis and consideration are needed to select the specific features
    user_profile = data[["user_id"]].drop_duplicates('user_id')
    item_profile = data[["click_article_id"]].drop_duplicates('click_article_id')

    user_index_2_rawid = dict(zip(user_profile['user_id'], user_profile_['user_id']))
    item_index_2_rawid = dict(zip(item_profile['click_article_id'], item_profile_['click_article_id']))

    # Divide training and test sets
    # Since the amount of data required by deep learning is usually very large, training samples are often expanded in the form of sliding window to ensure the effect of recall
    train_set, test_set = gen_data_set(data, 0)
    # Build the model input data; see the functions above
    train_model_input, train_label = gen_model_input(train_set, user_profile, SEQ_LEN)
    test_model_input, test_label = gen_model_input(test_set, user_profile, SEQ_LEN)

    # Embedding dimension
    embedding_dim = 16

    # Organize the features into the form the model expects
    user_feature_columns = [SparseFeat('user_id', feature_max_idx['user_id'], embedding_dim),
                            VarLenSparseFeat(SparseFeat('hist_article_id', feature_max_idx['click_article_id'], embedding_dim,
                                                        embedding_name="click_article_id"), SEQ_LEN, 'mean', 'hist_len'),]
    item_feature_columns = [SparseFeat('click_article_id', feature_max_idx['click_article_id'], embedding_dim)]

    # Model definition
    # num_sampled: the number of negative samples drawn
    model = YoutubeDNN(user_feature_columns, item_feature_columns, num_sampled=5, user_dnn_hidden_units=(64, embedding_dim))
    # model compilation
    model.compile(optimizer="adam", loss=sampledsoftmaxloss)

    # Train the model; validation_split sets the proportion of validation data, 0 means training directly on the full data
    history = model.fit(train_model_input, train_label, batch_size=256, epochs=1, verbose=1, validation_split=0.0)

    # After training, extract the learned embeddings for both the user side and the item side
    test_user_model_input = test_model_input
    all_item_model_input = {"click_article_id": item_profile['click_article_id'].values}

    user_embedding_model = Model(inputs=model.user_input, outputs=model.user_embedding)
    item_embedding_model = Model(inputs=model.item_input, outputs=model.item_embedding)

    # The user/item embeddings computed here may be useful later (e.g. for ranking); note that when saving them they must be mapped back to the original raw ids
    user_embs = user_embedding_model.predict(test_user_model_input, batch_size=2 ** 12)
    item_embs = item_embedding_model.predict(all_item_model_input, batch_size=2 ** 12)

    # Normalize the embeddings before saving
    user_embs = user_embs / np.linalg.norm(user_embs, axis=1, keepdims=True)
    item_embs = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)

    # Convert the embeddings into dictionaries keyed by raw id for easy lookup
    raw_user_id_emb_dict = {user_index_2_rawid[k]: \
                                v for k, v in zip(user_profile['user_id'], user_embs)}
    raw_item_id_emb_dict = {item_index_2_rawid[k]: \
                                v for k, v in zip(item_profile['click_article_id'], item_embs)}
    # Save the embeddings locally
    pickle.dump(raw_user_id_emb_dict, open(save_path + 'user_youtube_emb.pkl', 'wb'))
    pickle.dump(raw_item_id_emb_dict, open(save_path + 'item_youtube_emb.pkl', 'wb'))

    # Use Faiss to retrieve, for each user embedding, the topk items with the highest similarity
    index = faiss.IndexFlatIP(embedding_dim)
    # this is already normalized, so I don't need to normalize here
# faiss.normalize_L2(user_embs)
# faiss.normalize_L2(item_embs)
    index.add(item_embs) # index the item vector
    sim, idx = index.search(np.ascontiguousarray(user_embs), topk) # Query topk most similar items via user

    user_recall_items_dict = collections.defaultdict(dict)
    for target_idx, sim_value_list, rele_idx_list in tqdm(zip(test_user_model_input['user_id'], sim, idx)):
        target_raw_id = user_index_2_rawid[target_idx]
        # 1 is used to get rid of the item itself, so you only get topk-1 for similar items
        for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]):
            rele_raw_id = item_index_2_rawid[rele_idx]
            user_recall_items_dict[target_raw_id][rele_raw_id] = user_recall_items_dict.get(target_raw_id, {})\
                                                                    .get(rele_raw_id, 0) + sim_value

    user_recall_items_dict = {k: sorted(v.items(), key=lambda x: x[1], reverse=True) for k, v in user_recall_items_dict.items()}
    # The recall results are sorted by similarity

    # Save the recall results
    # Unlike the earlier methods, which only produce i2i / u2u similarity matrices and still need a collaborative-filtering recall step, this directly yields recall results
    # This recall result can be directly evaluated. For convenience, an evaluation function can be uniformly written to evaluate all recall results
    pickle.dump(user_recall_items_dict, open(save_path + 'youtube_u2i_dict.pkl'.'wb'))
    return user_recall_items_dict

# For recall evaluation, the last click has already been extracted from the training set

if not metric_recall:
    user_multi_recall_dict['youtubednn_recall'] = youtubednn_u2i_dict(all_click_df, topk=20)
else:
    trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)
    user_multi_recall_dict['youtubednn_recall'] = youtubednn_u2i_dict(trn_hist_click_df, topk=20)
    # Recall effect evaluation
    metrics_recall(user_multi_recall_dict['youtubednn_recall'], trn_last_click_df, topk=20)

#%% user_multi_recall_dict

user_multi_recall_dict

# % %

pickle.dump(user_multi_recall_dict, open(save_path + 'model/user_multi_recall_dict.pkl', 'wb'))

# % %

user_multi_recall_dict['youtubednn_recall']

Output:

100%|██████████| 250000/250000 [00:30<00:00, 8261.42it/s]
250000it [00:20, 12127.56it/s]
Train on 1149673 samples
1149673/1149673 [==============================] - 622s 541us/sample - loss: 0.1361

07. Cold start problem

Cold start problems can be divided into three categories: article cold start, user cold start, and system cold start.

  • Article cold start: how to recommend an article that has just been added to the platform and has no interaction records yet. (For our scenario, articles that do not appear in the click-log data can be treated as cold-start articles; see the sketch after this list.)
  • User cold start: how to recommend to a new user of the platform who has no article-interaction history. (For our scenario, a user in the test set who does not appear in the click-log data can be treated as a cold-start user. In practice the definition need not be that strict; we can also define our own criteria for which users count as cold-start users, such as usage duration, click-through rate, retention rate, etc.)
  • System cold start: a platform that has just launched has no historical data at all; this is system cold start, which is essentially a combination of the two cases above.
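A minimal sketch of how cold-start articles could be identified under the definition above (the helper name is an assumption for illustration, not code from the tutorial):

# Articles present in the article table but absent from the click log are treated as cold-start candidates
def get_cold_start_items(all_click_df, item_info_df):
    clicked_items = set(all_click_df['click_article_id'].unique())
    all_items = set(item_info_df['click_article_id'].unique())
    return all_items - clicked_items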

08. Multiple recall consolidation

Multi-recall merging combines the per-user article lists produced by the recall strategies above. The recall results to be merged are summarized below; a minimal merge sketch follows the list.

  1. Recall based on item-to-item similarity computed by itemCF
  2. Recall based on item-to-item similarity obtained from embedding retrieval
  3. YoutubeDNN recall
  4. Recall based on user-to-user similarity derived from the YoutubeDNN user embeddings
  5. Recall based on the cold-start strategy
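A minimal sketch of how the different recall results could be merged with per-path weights; the min-max normalization, the default weights and the topk cutoff are assumptions for illustration, not necessarily the tutorial's exact implementation:

def combine_recall_results(user_multi_recall_dict, weight_dict=None, topk=25):
    final_recall_items_dict = {}

    # Min-max normalize the scores of one recall path so that different paths are comparable
    def norm_user_recall_items_sim(sorted_item_list):
        if len(sorted_item_list) < 2:
            return sorted_item_list
        max_sim = sorted_item_list[0][1]
        min_sim = sorted_item_list[-1][1]
        norm_list = []
        for item, score in sorted_item_list:
            norm_score = 1.0 * (score - min_sim) / (max_sim - min_sim) if max_sim > min_sim else 1.0
            norm_list.append((item, norm_score))
        return norm_list

    for method, user_recall_items in user_multi_recall_dict.items():
        weight = 1.0 if weight_dict is None else weight_dict.get(method, 1.0)
        for user_id, sorted_item_list in user_recall_items.items():
            sorted_item_list = norm_user_recall_items_sim(sorted_item_list)
            final_recall_items_dict.setdefault(user_id, {})
            for item, score in sorted_item_list:
                final_recall_items_dict[user_id].setdefault(item, 0)
                final_recall_items_dict[user_id][item] += weight * score

    # Keep the topk candidates per user, sorted by the combined score
    return {user_id: sorted(items.items(), key=lambda x: x[1], reverse=True)[:topk]
            for user_id, items in final_recall_items_dict.items()}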

09. Summary of the tutorial

The recall strategies used are as follows:

  1. Itemcf based on association rules
  2. Usercf based on association rules
  3. Youtubednn recall
  4. Cold start recall

In fact, none of the recall strategies above gives an optimal result; this was just a simple first attempt, and there is plenty of room for optimization, including tuning the parameters of the strategies already implemented, adding new strategies, or modifying the association rules. More recall strategies could also be tried, for example a popularity (hot-news) recall, sketched below.
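A minimal sketch of what such a popularity (hot-news) recall could look like, reusing the get_item_topk_click helper from section 4.5 (the uniform per-user hot list and the rank-based scoring are assumptions for illustration):

def hot_recall(click_df, user_ids, k=20):
    # The k most-clicked articles overall, via the helper defined in section 4.5
    topk_click = get_item_topk_click(click_df, k)
    # Give every user the same hot list, scored by rank (earlier in the list = more clicks = higher score)
    hot_items = [(item, k - rank) for rank, item in enumerate(topk_click)]
    return {user_id: hot_items for user_id in user_ids}

# Example usage:
# user_multi_recall_dict['hot_recall'] = hot_recall(all_click_df, all_click_df['user_id'].unique(), k=20)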

10. Article sources

  • Datawhale GitHub: recommendation system (RS) introductory tutorial