Today we analyze another classic paper in the recommendation field: Wide & Deep Learning for Recommender Systems. It was published in 2016 by the team behind Google's app store (Google Play), the year deep learning took off. The paper discusses how to use a deep learning model to predict CTR in a recommendation system, and it can be considered an early successful application of deep learning to recommender systems.

The famous Wide & Deep recommendation model comes from this paper. It is widely used across the industry because it is simple to implement and works well, so it can also be considered one of the must-read papers in the recommendation field.

Abstract

In large-scale, high-dimensional feature scenarios, the usual approach (prior to 2016) was to feed very sparse vectors into a linear model. Nonlinearity can be injected through feature transformations and feature crosses, but doing so consumes a great deal of human effort and engineering resources.

In fact, we mentioned this problem before when introducing the FM model, which addresses the same issue; only the solution is different. The FM model introduces an N x K parameter matrix V to compute the weights of all pairwise feature crosses, which reduces the number of parameters and improves training and prediction efficiency. This paper instead discusses using a neural network to solve the problem.
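
As a quick refresher on that earlier article, the FM prediction can be written as:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{N} w_i x_i + \sum_{i=1}^{N} \sum_{j=i+1}^{N} \langle v_i, v_j \rangle x_i x_j$$

where $v_i \in \mathbb{R}^K$ is the $i$-th row of $V$, so the weight of every pairwise cross is the inner product $\langle v_i, v_j \rangle$ rather than an independent parameter.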

The core of the solution is the embedding, which literally translates as "embedding into a space" but is easier to understand as a vector representation of a feature. For example, in Word2Vec we represent each word as a vector; these vectors are called word embeddings. An embedding has a fixed length, and its values are generally learned by a neural network.
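
To make this concrete, here is a minimal PyTorch sketch (the same framework used for the implementation at the end of this article); the vocabulary size and dimension are arbitrary choices for illustration.

import torch
from torch import nn

# An embedding table maps a categorical ID (a word, an app category, etc.)
# to a fixed-length vector whose values are learned during training.
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=8)

feature_id = torch.tensor([42])      # one categorical feature value, as an integer ID
vector = embedding(feature_id)       # shape (1, 8); these 8 values are trainable parameters
print(vector.shape)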

We can train feature embeddings inside the neural network in the same way, which greatly reduces the feature-engineering workload. However, relying on embeddings alone is also a mistake, because in some scenarios the model will over-generalize. Therefore we need to combine linear features with the sparse embedded features, so that the model does not over-generalize while still having enough capacity to learn well.

Introduction

As we shared in previous articles, a recommendation system can be thought of as a ranking system similar to search. The input is a user together with the context the user is browsing in, and the output is a ranked sequence.

Because of this, recommendation systems face a challenge similar to search ranking: the tradeoff between memorization and generalization. Memorization can be understood simply as learning the co-occurrence of pairs of items or features. Since a user's historical behaviors are very strong features, memorization can bring good results. But it also brings problems, the most typical being that the model's generalization ability is insufficient.

Generalization, on the other hand, comes mainly from correlation and transitivity between features. Feature A may be directly correlated with the label, or feature A may be correlated with feature B while feature B is correlated with the label, which is what we call transitivity. By exploiting transitivity we can explore feature combinations that rarely appear in the historical data and thus obtain stronger generalization.

In large-scale online recommendation and ranking systems, linear models such as LR are widely used because they are simple, scalable, fast, and interpretable. These models are usually trained on binarized (one-hot) data; for example, user_installed_app=netflix is 1 if the user has installed Netflix and 0 otherwise. Some second-order cross features built on top of these are also quite interpretable.

For example, if the user has also browsed Pandora, then AND(user_installed_app=netflix, impression_app=pandora) is 1, and the weight of this combined feature is effectively the correlation between the two. However, such features require a lot of manual work, and because samples are sparse the model cannot learn weights for combinations that never appear in the training data.

This problem can be solved by embedding-based models, such as the FM model introduced earlier or a deep neural network, which learn a low-dimensional embedding for each feature and compute cross-feature weights from the embedding vectors. However, when the features are very sparse it is hard to guarantee good embeddings, for example when users have very specific preferences or items are niche with narrow appeal. In these cases most query-item pairs have no behavior at all, yet the weight computed from the embeddings may still be greater than 0, which leads to over-generalization and inaccurate recommendations. For these special cases, a linear model actually fits and generalizes better.

In this paper, the authors introduce the Wide & Deep model, which achieves both memorization and generalization in a single model by jointly training a linear model and a neural network, thereby obtaining better results.

The main contents of this paper are as follows:

  1. The Wide & Deep learning framework for generic recommender systems, which jointly trains a feedforward neural network with embeddings and a linear model with feature transformations
  2. The implementation and evaluation of the Wide & Deep model on Google Play, a mobile app store with over one billion active users and over one million apps

Introduction to the recommendation system

Here is a typical architecture diagram of a recommendation system:

When a user visits the app store, a request containing user and context features is generated. The recommendation system returns a list of apps that the model predicts the user is likely to click on or install. When users see this list they take some action, such as browsing (no action), clicking, or purchasing. These behaviors are recorded in the logs and become training data.

Let's look at the top section, the path from Database to Retrieval. The database contains a very large number of apps, on the order of millions, so it is impossible to score every app with the model and sort the results within the required serving latency (around 10 milliseconds). We therefore need a Retrieval, or recall, step for each request. The retrieval system narrows down the candidates using machine learning models, rules, or both; generally, fast rule-based filtering comes first, followed by model-based filtering.

After this retrieval step, the Wide & Deep model is called for CTR prediction, and the remaining apps are ranked by predicted CTR. The paper ignores the other technical details and focuses only on the implementation of the Wide & Deep model, and so will we.

Principle of Wide & Deep

First of all, let’s take a look at the structure diagram of commonly used models in the industry:

This figure is from the paper and shows, from left to right, the Wide model, the Wide & Deep model, and the Deep model. As the figure makes clear, the so-called Wide model is simply a linear model, while the Deep model is a deep neural network. Both parts are described in detail below with reference to this figure.

Wide part

The Wide part is a generalized linear model of the form $y = w^T x + b$, as shown in the left part of the figure above. Here $y$ is the prediction, $x = [x_1, x_2, \cdots, x_d]$ is a vector of $d$ features, $w = [w_1, w_2, \cdots, w_d]$ is the weight vector, and $b$ is the bias. We have seen all of this before in the linear regression model, so it should look familiar.

The features include two parts: raw features taken directly from the data, and features obtained through feature transformations. The most important transformation is the cross-product transformation, defined as follows:


$$\phi_k(x) = \prod_{i=1}^{d} x_i^{c_{ki}}, \qquad c_{ki} \in \{0, 1\}$$

Here $c_{ki}$ is a boolean variable that is 1 if the $i$-th feature is part of the $k$-th transformation $\phi_k$ and 0 otherwise. Since the transformation is a product, the result is 1 only if every feature involved is 1; otherwise it is 0. For example, AND(gender=female, language=en) is a cross feature whose value is 1 only if the user is female and the language is English. In this way we capture interactions between features and add nonlinearity to the linear model.
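
To make the formula concrete, here is a minimal sketch (the function and variable names are mine, not from the paper) that evaluates one cross-product transformation over a binary feature vector:

def cross_product_transform(x, mask):
    # phi_k(x) = prod_i x_i^{c_ki}: equals 1 only if every selected feature is 1
    return int(all(xi == 1 for xi, c in zip(x, mask) if c == 1))

x = [1, 1, 0]      # features: [gender=female, language=en, installed_netflix]
mask = [1, 1, 0]   # c_ki indicators for AND(gender=female, language=en)
print(cross_product_transform(x, mask))   # -> 1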

Deep part

The Deep part is a feedforward neural network, which is shown on the right in the figure above.

Looking at this part of the figure, we see a sparse feature input, which can be understood as a multi-hot array. At the first layer of the network this input is converted into a low-dimensional dense embedding, which is then fed through the network and trained. This module is mainly designed to handle categorical features, such as the category of an item or the gender of the user.

Compared with traditional one-hot encoding, an embedding represents a discrete variable with a dense vector, which has much stronger expressive power. Moreover, the values of the vector are learned by the model itself, so the generalization ability is greatly improved. This is also common practice in deep neural networks.

Combining Wide & Deep

Once the Wide and Deep parts are ready, their outputs are combined with a weighted sum; this is the middle of the figure above.

The topmost output is a sigmoid layer (or a linear layer), applied to a simple weighted sum of the two parts. The paper also explains the difference between joint training and ensembling: in an ensemble, each part is trained independently and the parameters of one part do not affect the other, whereas in joint training the parameters of both parts are optimized at the same time.
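
For a binary label, the paper writes the combined prediction of the jointly trained model as:

$$P(Y=1 \mid x) = \sigma\left(w_{wide}^T [x, \phi(x)] + w_{deep}^T a^{(l_f)} + b\right)$$

where $\sigma$ is the sigmoid function, $\phi(x)$ are the cross-product transformations of the Wide part, $a^{(l_f)}$ is the final activation of the Deep part, and $b$ is the bias term.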

As a consequence, because the parts of an ensemble are trained separately, each sub-model needs a large parameter space of its own to reach good results. Joint training does not have this problem: the linear part and the deep part compensate for each other's weaknesses, so better results can be achieved without artificially inflating the number of parameters.

System implementation

The data flow of the app recommendation pipeline consists of three parts: data production, model training, and model serving. As a picture, it looks something like this:

Data production

In the data production stage, a sample is an app impression shown to a user within a time window: if the user installs the app, the sample is labeled 1, otherwise 0. This is also how labels are defined in most recommendation scenarios.

In this stage, the system also uses a lookup table to convert string categorical features into integer IDs, for example 1 for entertainment and 2 for photography, or 0 for paid and 1 for free. Numeric features are normalized at the same time, scaled into the range [0, 1].
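
As a rough sketch of these two steps (the vocabulary and value ranges below are made up for illustration):

# Map string categories to integer IDs with a lookup table built from the data,
# and min-max scale numeric features into [0, 1].
category_vocab = {"entertainment": 1, "photography": 2}

def encode_category(value):
    return category_vocab.get(value, 0)        # unseen categories fall back to 0

def min_max_scale(value, lo, hi):
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

print(encode_category("photography"))          # -> 2
print(min_max_scale(30.0, 0.0, 100.0))         # -> 0.3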

Model training

The paper provides a structure diagram of the model:

From the figure above, we can see continuous features on the left, such as age and the number of installed apps, and categorical features on the right, such as device information and installed apps. The categorical features are converted into embeddings and then fed into the neural network together with the continuous features. The paper uses 32-dimensional embeddings.

The model is trained on more than 500 billion samples each time, and it is retrained whenever new training data is collected. If every training run started from scratch it would obviously be very slow and waste a lot of computing resources. The paper therefore chooses an incremental update scheme: when the model is updated, the parameters of the old model are loaded first and training continues on the latest data. Before the new model goes online, its quality is verified, and the update is rolled out only once the results look good.
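
A hedged sketch of this warm-start idea in PyTorch (the file names and tiny placeholder model are mine, purely for illustration):

import torch
from torch import nn

# Stand-in for the previously trained model: save its weights once.
old_model = nn.Linear(19, 1)
torch.save(old_model.state_dict(), "previous_model.pt")

# Incremental update: load the old weights instead of starting from scratch,
# then continue training only on the newly collected samples.
new_model = nn.Linear(19, 1)
new_model.load_state_dict(torch.load("previous_model.pt"))
optimizer = torch.optim.SGD(new_model.parameters(), lr=0.01)
# ... train on the new data, validate, then roll out ...
torch.save(new_model.state_dict(), "updated_model.pt")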

The model service

Once the model is trained and loaded, the server receives, for each request, the list of candidate apps from the retrieval system along with the user's features. The model is then called to score each app, and the server sorts the candidate apps by score from high to low.

To keep the server responsive and return results within 10 ms, the paper uses multi-threaded concurrent execution. To be honest, I think this number is a bit optimistic: no serving system today runs without concurrency, yet even with concurrency it is hard for deep learning inference to reach this level of efficiency. There were probably other optimizations that simply did not make it into the paper.

The model results

To verify the effect of the Wide & Deep model, the paper ran extensive tests in a real production setting and evaluated it from two angles: app acquisitions and serving performance.

App acquisitions

An A/B test was run online for 3 weeks: one bucket served as the control, using the previous linear model; a second bucket used the Wide & Deep model; and a third bucket used only the Deep part, with the linear component removed. Each bucket received 1% of the traffic, and the results were as follows:

The Wide & Deep model not only had a higher AUC, but also achieved a 3.9% increase in online app acquisitions.

Service performance

Server-side performance has always been a big issue for recommendation systems, because they must handle heavy traffic while keeping latency very low, and scoring with machine learning or deep learning models is expensive. According to the paper, their servers handle on the order of 10 million queries per second at peak.

Scoring one batch of candidates with a single thread takes 31 milliseconds. To speed this up, they implemented a multi-threaded scoring mechanism, splitting each batch into smaller chunks that are computed concurrently. This reduces client-side latency to 14 milliseconds.
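
The idea can be sketched as follows (the chunk count and the placeholder scoring function are mine; real serving code would call the actual model):

from concurrent.futures import ThreadPoolExecutor

def score_chunk(chunk):
    # Placeholder for model inference on one slice of the candidate batch.
    return [sum(features) for features in chunk]

candidates = [[0.1] * 19 for _ in range(1024)]      # one request's candidates (illustrative)
chunk_size = len(candidates) // 8
chunks = [candidates[i:i + chunk_size] for i in range(0, len(candidates), chunk_size)]

with ThreadPoolExecutor(max_workers=8) as pool:
    scores = [s for part in pool.map(score_chunk, chunks) for s in part]

ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)   # high to low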

Code implementation

Wide & Deep performed well in the recommendation space, and the model itself is not complicated to implement. I implemented a simple version in PyTorch and include it here as a reference.

import torch
from torch import nn

class WideAndDeep(nn.Module):
    def __init__(self, dense_dim=13, site_category_dim=24, app_category_dim=32):
        super(WideAndDeep, self).__init__()
        # Linear part: 13 dense features + 6-dim fused embedding = 19 inputs
        self.logistic = nn.Linear(19, 1, bias=True)
        # Embedding part: one table per categorical feature, 6 dimensions each
        self.site_emb = nn.Embedding(site_category_dim, 6)
        self.app_emb = nn.Embedding(app_category_dim, 6)
        # Fusion part: compress the two concatenated embeddings (12 dims) down to 6
        self.fusion_layer = nn.Linear(12, 6)

    def forward(self, x):
        # The last two columns of x are the categorical IDs (site, app)
        site = self.site_emb(x[:, -2].long())
        app = self.app_emb(x[:, -1].long())
        emb = self.fusion_layer(torch.cat((site, app), dim=1))
        # Concatenate the fused embedding with the 13 dense features, then predict
        return torch.sigmoid(self.logistic(torch.cat((emb, x[:, :-2]), dim=1)))

Because my application scenario at the time was relatively simple, the network only has a few layers, but the principle is the same; to apply it in more complex scenarios, you only need to add features and network layers.
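
For completeness, here is a quick smoke test of the model above with random inputs; the batch size is arbitrary, and the input layout (13 continuous columns followed by the site and app category IDs) matches how forward() slices x:

model = WideAndDeep()

dense = torch.rand(4, 13)                         # continuous features
site = torch.randint(0, 24, (4, 1)).float()       # site category IDs
app = torch.randint(0, 32, (4, 1)).float()        # app category IDs
x = torch.cat([dense, site, app], dim=1)          # shape (4, 15)

probs = model(x)                                  # predicted click probabilities, shape (4, 1)
print(probs.squeeze())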

That's all for today's article. I sincerely wish you all a fruitful day. If you liked today's content, please like, comment, and share to show your support.
