Today we introduce our open-source project DeepMatch, which implements several mainstream deep-learning recall/matching algorithms and supports quickly exporting user and item vectors for ANN retrieval. It is well suited for students who want to run quick experiments and learn, and it frees up the hands of algorithm engineers!

Currently supported algorithms

Below is a brief introduction to the project, covering the development background, installation and usage, and how to contribute and get in touch. Details of the discussion group are given at the end of the article, so interested readers should not miss them. Bug reports and suggestions are welcome ~

Background

As is well known, the mainstream recommendation and advertising architecture today is a two-stage process of recall followed by ranking. The recall module retrieves candidate items of various kinds from a massive candidate pool, and the ranking module produces an ordered list of the items the user is most likely to be interested in, based on user preferences and context information.
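To make the division of labor concrete, here is a minimal, hypothetical sketch of the two-stage flow in Python. The `recall_model` / `rank_model` objects and their methods are illustrative placeholders only, not DeepMatch APIs:

# Hypothetical two-stage flow: recall narrows millions of candidates down to a
# few hundred, then ranking orders that short list by predicted interest.
def recommend(user, context, candidate_pool, recall_model, rank_model, k=100):
    # Stage 1 (recall): cheaply retrieve the top-k candidates from the massive pool
    candidates = recall_model.retrieve(user, candidate_pool, top_k=k)
    # Stage 2 (ranking): score each candidate using user preference and context
    scored = [(item, rank_model.score(user, item, context)) for item in candidates]
    return [item for item, score in sorted(scored, key=lambda p: p[1], reverse=True)]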

With the spread of deep learning, more and more deep-learning algorithms have been applied in industry. After graduating and joining the company last year, I was fortunate to take part in building the recommendation system for a new business and in improving its user experience and business metrics. On the recall side, I also explored vector-based recall and saw real gains.

Previously, I developed a deep-learning-based CTR algorithm library, DeepCTR (github.com/shenweichen…

Compared with the various click-through-rate models used in ranking, I still have a lot to learn about the recall module. Taking this opportunity, and in the spirit of learning, I built the DeepMatch project together with several excellent and enthusiastic partners, hoping it can help everyone!

Here is a brief introduction to installation and usage.

Installation and use

  • Install via pip

pip install -U deepmatch

  • Documentation

deepmatch.readthedocs.io/en/latest/

  • Usage example

This article is based on v0.1.0. If you upgrade later and find that the example no longer works, you can either roll back to v0.1.0 or run the latest code from the examples directory of the Git repository. (github.com/shenweichen…
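If you want to reproduce this article exactly, you can pin the install to the version it targets (assuming, as with any PyPI package, that v0.1.0 is available there):

pip install deepmatch==0.1.0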

Using YoutubeDNN as an example, we’ll show how to use DeepMatch to train a recall model, export user and item vectors, and run approximate nearest-neighbor search with faiss. The interface and workflow of the other algorithms are basically the same ~

The whole example is under 100 lines of code, so it is very easy to learn and use.

The full code can be found at github.com/shenweichen…

import pandas as pd
from deepctr.inputs import SparseFeat, VarLenSparseFeat
from preprocess import gen_data_set, gen_model_input
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.models import Model

from deepmatch.models import *
from deepmatch.utils import sampledsoftmaxloss

# Take the MovieLens sample data (200 rows) as an example
data = pd.read_csv("./movielens_sample.txt")
sparse_features = ["movie_id", "user_id", "gender", "age", "occupation", "zip", ]
SEQ_LEN = 50
negsample = 0

# 1. First, label-encode the features in the data, then use `gen_data_set` and `gen_model_input`
# to generate feature data that carries the user history behavior sequence.
features = ['user_id', 'movie_id', 'gender', 'age', 'occupation', 'zip']
feature_max_idx = {}
for feature in features:
    lbe = LabelEncoder()
    data[feature] = lbe.fit_transform(data[feature]) + 1
    feature_max_idx[feature] = data[feature].max() + 1
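# The +1 shifts above keep index 0 unused (commonly reserved for padding/masking in sequence features),
# and feature_max_idx records each feature's vocabulary size for its embedding table below.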

user_profile = data[["user_id", "gender", "age", "occupation", "zip"]].drop_duplicates('user_id')

item_profile = data[["movie_id"]].drop_duplicates('movie_id')

user_profile.set_index("user_id", inplace=True)

user_item_list = data.groupby("user_id")['movie_id'].apply(list)

train_set, test_set = gen_data_set(data, negsample)

train_model_input, train_label = gen_model_input(train_set, user_profile, SEQ_LEN)
test_model_input, test_label = gen_model_input(test_set, user_profile, SEQ_LEN)
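# train_model_input / test_model_input are dicts mapping feature names to arrays, including the history
# sequence 'hist_movie_id' (padded/truncated to SEQ_LEN) and its true length 'hist_len'.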

# 2. Configure the feature columns required by the model definition, mainly each feature's name and the vocabulary size of its embedding table
embedding_dim = 16

user_feature_columns = [SparseFeat('user_id', feature_max_idx['user_id'], embedding_dim),
                        SparseFeat("gender", feature_max_idx['gender'], embedding_dim),
                        SparseFeat("age", feature_max_idx['age'], embedding_dim),
                        SparseFeat("occupation", feature_max_idx['occupation'], embedding_dim),
                        SparseFeat("zip", feature_max_idx['zip'], embedding_dim),
                        VarLenSparseFeat(SparseFeat('hist_movie_id', feature_max_idx['movie_id'], embedding_dim,
                                                    embedding_name="movie_id"), SEQ_LEN, 'mean', 'hist_len'),
                        ]

item_feature_columns = [SparseFeat('movie_id', feature_max_idx['movie_id'], embedding_dim)]

# 3. Define a YoutubeDNN model, passing in the user-side feature list `user_feature_columns` and the
# item-side feature list `item_feature_columns`, then configure the optimizer and loss function and start training.
K.set_learning_phase(True)

model = YoutubeDNN(user_feature_columns, item_feature_columns, num_sampled=5, user_dnn_hidden_units=(64, 16))
# model = MIND(user_feature_columns, item_feature_columns, dynamic_k=True, p=1, k_max=2, num_sampled=5, user_dnn_hidden_units=(64, 16), init_std=0.001)

model.compile(optimizer="adagrad", loss=sampledsoftmaxloss)  # "binary_crossentropy")

history = model.fit(train_model_input, train_label,
                    batch_size=256, epochs=1, verbose=1, validation_split=0.0, )

# 4. After training, in actual use we need to generate the user-side vector in real time from the current
# user features, and build an index over the item-side vectors for approximate nearest-neighbor search.
# Since this is an offline simulation, we export the representation vectors of all test users and of all items.
test_user_model_input = test_model_input
all_item_model_input = {"movie_id": item_profile['movie_id'].values, "movie_idx": item_profile['movie_id'].values}

# The following two lines are the standard DeepMatch usage for extracting the user and item embedding sub-models
user_embedding_model = Model(inputs=model.user_input, outputs=model.user_embedding)
item_embedding_model = Model(inputs=model.item_input, outputs=model.item_embedding)

user_embs = user_embedding_model.predict(test_user_model_input, batch_size=2 ** 12)
# user_embs = user_embs[:, i, :]  i in [0,k_max) if MIND
item_embs = item_embedding_model.predict(all_item_model_input, batch_size=2 ** 12)

print(user_embs.shape)
print(item_embs.shape)

# 5. [Optional] If you have the faiss library installed, you can build an index over the item vectors exported
# in the previous step, then use the user vectors to run an ANN lookup and evaluate the recall quality.
test_true_label = {line[0]: [line[2]] for line in test_set}
import numpy as np
import faiss
from tqdm import tqdm
from deepmatch.utils import recall_N
index = faiss.IndexFlatIP(embedding_dim)
# faiss.normalize_L2(item_embs)
index.add(item_embs)
# faiss.normalize_L2(user_embs)
D, I = index.search(user_embs, 50)
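# D holds the similarity scores and I the row indices (into item_embs) of the top 50 items per user;
# uncomment the normalize_L2 calls above to search by cosine similarity instead of raw inner product.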
s = []
hit = 0
for i, uid in tqdm(enumerate(test_user_model_input['user_id'])):
    try:
        pred = [item_profile['movie_id'].values[x] for x in I[i]]
        filter_item = None
        recall_score = recall_N(test_true_label[uid], pred, N=50)
        s.append(recall_score)
        if test_true_label[uid][0] in pred:  # hit if the held-out item appears among the top-50 candidates
            hit += 1
    except Exception:
        print(i)
print("recall", np.mean(s))
print("hr", hit / len(test_user_model_input['user_id']))

Contributors

One person’s power is limited, so many thanks to the partners who took part in the development ~ They are:

  • Zhe Wang, advertising algorithm engineer at JD.com
    • Blog: zhuanlan.zhihu.com/c_121884503…
    • GitHub: github.com/wangzhegeek
  • Qingliang Cai, senior advertising algorithm engineer at ByteDance
    • Blog: blog.csdn.net/cqlboat
    • GitHub: github.com/LeoCai
  • Yang Jieyu, a second-year graduate student at Zhejiang University, currently job hunting and eagerly hoping for offers from major companies

Finally

In fact, quite a long time passed between starting this project and its first release. On the one hand, most of us work or study full time and could only develop on weekends, so turnaround was slow. On the other hand, I didn’t know exactly what to do at the beginning and was crossing the river by feeling for the stones, so various interfaces were changed many times along the way.

We still hope you can give us a star! github.com/shenweichen…

One more teaser: we have several more algorithms already developed, which will be announced once testing is complete!
