
01

Introduction

In the ranking stage of a recommendation system, we usually build the ranking model with CTR (click-through rate) prediction. In industrial practice, how to extract effective generalizing features from massive user-behavior data has long been a working direction for researchers, because the data in real recommendation systems are usually very sparse, and extracting effective generalizations from large-scale sparse features is a major challenge for CTR prediction models. In this article we first review the evolution of CTR prediction models, then introduce how DNNs are used for CTR prediction in the ranking scenario of a recommendation system, and how special network structures obtain efficient generalizing features to improve the predictive power of the model.

02

Application of DNN in recommendation system

2.1 CTR prediction model

In a news recommendation system, we want to find the news a user is most likely to click. Given a user and a set of candidate news items, feature extraction yields three groups of features:

  1. User features (interests, age, gender, etc.)

  2. Context features (device model, network type, etc.)

  3. Features of the candidate news (categories, tags, etc.)

We then compute the user's click-through rate on each candidate news item and obtain the top K stories the user is most likely to click by sorting the estimated click-through rates in descending order.

The earliest CTR prediction model was LR (Logistic Regression); substituting the above features into an LR model gives a simple CTR predictor. In the recommendation scenario, however, user behavior is sparse: the news a user has seen is limited, and inferring from those limited browses which unseen news the user might like is the core of recommendation. Simply feeding the above features into LR cannot achieve particularly good results, because the LR model itself cannot generalize across features; it can only learn a weight 𝔀 directly for each individual feature. For LR to have generalization ability, we need to "manually generalize" on the data side. For example:

We can cross the user's interests with the candidate news categories to get interaction features. If a user likes to read news about Kobe Bryant, a candidate news item in the sports category should be worth more to this user than one in the entertainment category. So we construct the cross feature "user likes Kobe Bryant AND candidate news is sports", and from our understanding of the business we know that this feature should end up with a higher weight in the LR model than "user likes Kobe Bryant AND candidate news is entertainment".

Then, after the model converges, the weight of the "likes Kobe Bryant × sports" cross feature should indeed be higher than that of the "likes Kobe Bryant × entertainment" cross feature.

Through this kind of feature engineering, we can make LR predict the user's preference for news they have never seen, giving it a certain generalization ability. At the same time, because the structure of the LR model is simple, we can use business understanding to judge whether the model was trained correctly and whether the features were extracted correctly, which improves our ability to debug.
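The idea above can be sketched in a few lines. This is a minimal numpy illustration, not a trained model: the feature values and weights are invented for the example, with the "likes Kobe × sports" cross deliberately given the larger weight the text argues a converged LR should learn.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-hot base features for one (user, news) sample.
likes_kobe = 1.0          # user interest feature
is_sports = 1.0           # candidate news category: sports
is_entertainment = 0.0    # candidate news category: entertainment

# Manual "generalization": cross the user interest with the news category.
cross_sports = likes_kobe * is_sports
cross_entertainment = likes_kobe * is_entertainment

x = np.array([likes_kobe, is_sports, is_entertainment,
              cross_sports, cross_entertainment])

# Illustrative weights a converged LR might learn: the sports cross
# feature carries a higher weight than the entertainment cross feature.
w = np.array([0.2, 0.1, 0.1, 1.5, -0.5])
b = -1.0

ctr = sigmoid(w @ x + b)  # estimated click-through rate in (0, 1)
print(round(float(ctr), 4))
```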

2.2 Stronger generalization ability

From the above practice we can easily see the advantages and disadvantages of this approach. The advantage is that we can quickly debug a usable model: if training does not meet expectations, we can identify problems directly from the learned weights, and analyzing feature weights helps feature engineering produce features with stronger generalization. The disadvantage is that, because the expressive power of the model itself is limited, generalization can only come from manually engineered features, and feature engineering usually requires a great deal of effort in research, analysis, and experiments; any careless step may produce results that fall short of expectations. We therefore hope to enhance the generalization ability of the model itself, in order to reduce the complexity of later feature engineering.

2.2.1 FM (Factorization Machine)

The function f(x) of logistic regression contains only linear terms, which limits the expressive power of the model; adding nonlinear terms can enhance it. Specifically, the formula of the FM model is as follows:
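The original image of the formula was lost; restored from the standard Factorization Machines definition, it reads:

```latex
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
           + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j
```

The first two terms are exactly LR; the third term adds the pairwise interactions through inner products of latent vectors.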

The FM model introduces a matrix 𝙑 as implicit features: each feature 𝔁 corresponds to a 𝓴-dimensional latent vector 𝓿. By computing the inner product of the latent vectors 𝓿 of two features, we obtain the correlation between them. If a user likes Kobe Bryant and a news tag contains NBA, the inner product of the latent vectors of these two features should be large after the model converges.

The model parameters are obtained by updating 𝙒 and 𝙑 with gradient descent. Compared with LR, an implicit matrix is added to represent the pairwise cross relations of features. If there are 𝙉 features and the latent dimension is 𝘒, the space complexity of the matrix is 𝙊(𝙉𝘒). Through a simplification of the computation, the time complexity is 𝙊(𝘒𝙣), where 𝙣 is the number of non-zero features, so there is no significant increase in computation. The inner product gives the model predictive ability: even if two features never appear together in any training sample, we can still learn their correlation through the latent vectors.
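The simplification mentioned above rewrites the pairwise sum as ½Σ_f[(Σᵢ v_{if}xᵢ)² − Σᵢ v_{if}²xᵢ²], dropping the cost from quadratic to linear in the number of features. A small numpy sketch with random (illustrative) data can verify that both forms agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4                      # n features, k-dimensional latent vectors
x = rng.random(n)                # feature values (dense here for simplicity)
V = rng.random((n, k))           # latent matrix: row i is v_i

# Naive O(k * n^2): sum over all pairs <v_i, v_j> * x_i * x_j
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Simplified O(k * n): 0.5 * ((sum_i v_i x_i)^2 - sum_i (v_i x_i)^2)
vx = V * x[:, None]              # shape (n, k): each row is v_i * x_i
fast = 0.5 * ((vx.sum(axis=0) ** 2).sum() - (vx ** 2).sum())

print(np.isclose(naive, fast))
```

With sparse inputs, only the non-zero rows of `vx` contribute, which is where the 𝙊(𝘒𝙣) bound on non-zero features comes from.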

The FM model can actually be extended to higher orders, but only the second order admits the simplified calculation; third order and above are too expensive to use in industry. FM nevertheless illustrates an important idea: introduce an implicit matrix and express the relationship between features as inner products, thereby improving the generalization ability of the model. Because of this generalization idea and its concise, effective design, the FM model is very popular in practice, and many subsequent DNN-based CTR prediction models also borrow FM's idea.

2.2.2 DNN

With the improvement of computing power in recent years, DNNs have returned to the spotlight. Thanks to their very strong fitting ability, DNNs have achieved excellent performance gains in many fields. Researchers first studied DNNs in NLP and image processing, and later moved to recommender systems. The Wide & Deep Learning for Recommender Systems algorithm proposed by Google in 2016 provides one line of thinking: define a multi-layer fully connected network as the Deep model, add an extra Wide model similar to LR, and fuse the two to exploit their respective advantages.

A DNN has a large number of parameters, and most features in the recommendation scenario are sparse. If the raw features were concatenated and fed directly into the DNN as the input layer, there would be far too many parameters to train. An Embedding layer is therefore needed to first map the large-scale sparse features into a dense space, after which they are concatenated with the other features to form the DNN's input layer.
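The embedding step can be sketched as a table lookup followed by concatenation. The vocabulary size, embedding dimension, IDs, and dense features below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 10000 sparse IDs embedded into 8 dimensions each.
vocab_size, emb_dim = 10000, 8
embedding = rng.normal(size=(vocab_size, emb_dim))  # the Embedding table

user_id, item_id = 42, 1337           # two sparse categorical features
dense = np.array([0.5, 1.2])          # e.g. numerical context features

# Map each sparse ID to its dense vector, then concatenate with the
# numerical features to build the DNN's input layer.
x0 = np.concatenate([embedding[user_id], embedding[item_id], dense])
print(x0.shape)   # input dimension: 8 + 8 + 2 = 18
```

The input layer is now 18-dimensional instead of the tens of thousands of dimensions a raw one-hot encoding would require.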

In addition to the DNN, the Wide & Deep model adds a wide model: the LR feature-combination process described above is added to the model as an independent module. For example, as shown in the figure above, a cross feature between the apps a user has installed and the app currently being shown serves as one of the wide model's features. Here we can use business knowledge to build combination features with strong generalization ability.

Finally, the last hidden layer of the DNN is concatenated with the wide features, a logistic unit maps the result to a value in [0, 1], and the CTR prediction model is obtained after gradient updates converge. The Wide & Deep model obtains high-order nonlinear relations between features through the DNN, which improves the generalization ability of the model; however, a DNN structure is relatively inefficient at learning simple low-order crosses. By adding a wide layer alongside the DNN and manually introducing business-related combination features, the model can quickly learn effective low-order cross features. Because of our understanding of the business, these features often fit the business scenario very well, and the combination of wide and deep performs better than a DNN alone.
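The fusion step described above amounts to a single logistic unit over the concatenation of the wide features and the deep part's last hidden layer. A toy sketch with assumed dimensions and random (untrained) weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

wide_x = rng.random(5)        # manually crossed wide features (assumed dim 5)
deep_h = rng.random(16)       # last hidden layer of the deep part (assumed dim 16)

w_wide = rng.normal(size=5)   # weights for the wide part
w_deep = rng.normal(size=16)  # weights for the deep part
b = 0.0

# One logistic unit over both parts gives the fused CTR estimate.
ctr = sigmoid(w_wide @ wide_x + w_deep @ deep_h + b)
print(0.0 < ctr < 1.0)
```

In training, the gradient of the shared logistic loss flows back into both branches jointly.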

At this point we naturally want to go further: can the model learn low-order cross features by itself, like FM, without manual feature combination? This is exactly what the Deep & Cross Network does.

03

Deep & Cross Network

Deep & Cross Network (DCN) is a recommendation model proposed by Google in 2017. Compared with an ordinary multi-layer fully connected network, it adds a Cross Network, which sounds scary at first glance; in fact it performs feature crossing with matrices and uses the residual idea to build a deep network module, obtaining high-order combination features from the model rather than from manual combination. Next we introduce the DCN model in detail.

3.1 Model Structure

The structure of the DCN model is shown in the figure above. Its basic structure is similar to that of the Wide & Deep network: the input layer maps sparse categorical features to low-dimensional dense vectors via Embedding and concatenates them directly with the numerical features as the network's input. Since this is a CTR prediction model with 0/1 binary supervision, the output layer uses a sigmoid to constrain the output to [0, 1], representing the model's estimated click-through rate.

The network is divided into two parts: the cross network and the deep network. The deep network is the familiar multi-layer fully connected network, while the cross network outputs its result through a series of cross-layer computations.

3.2 Cross Network

The specific calculation formula of Cross Network is:
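The formula image was lost; restored from the DCN paper [2], the 𝑙-th cross layer computes:

```latex
x_{l+1} = x_0 x_l^{\top} w_l + b_l + x_l = f(x_l, w_l, b_l) + x_l
```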

Use a graph to represent this formula:

In the formula, 𝔁ₗ and 𝔁ₗ₊₁ are column vectors denoting the outputs of the 𝑙-th and (𝑙+1)-th cross layers; the function 𝑓(𝔁ₗ, 𝔀ₗ, 𝒃ₗ) can then be thought of as fitting the residual 𝔁ₗ₊₁ − 𝔁ₗ between the two layers.

This special structure makes the degree of the cross features grow with the depth of the cross layers: relative to 𝔁₀, the features at layer 𝑙 have crossing order 𝑙+1. The number of parameters in the entire cross network is 𝓭✕𝐿✕2, where 𝓭 is the dimension of the input column vector 𝔁₀ and 𝐿 is the number of layers of the cross network; each layer contains one weight vector and one bias vector of dimension 𝓭, hence the factor of 2. In a fully connected DNN, by contrast, the first hidden layer's weight matrix pairs every hidden unit with every input dimension. Compared with a fully connected network, the number of parameters introduced by the cross network is therefore very small and grows only linearly with the input feature dimension.
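A cross layer never needs to materialize the 𝓭×𝓭 matrix 𝔁₀𝔁ₗᵀ: since 𝔁ₗᵀ𝔀 is a scalar, the layer reduces to scaling 𝔁₀. A minimal numpy sketch with assumed dimensions and random (untrained) parameters:

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 (xl . w) + b + xl.

    x0 xl^T w collapses to x0 scaled by the scalar xl . w, so the
    layer costs O(d) rather than O(d^2)."""
    return x0 * (xl @ w) + b + xl

rng = np.random.default_rng(0)
d, L = 6, 3                       # input dimension, number of cross layers
x0 = rng.random(d)

# Each layer holds one weight and one bias vector -> d * L * 2 parameters.
ws = rng.normal(size=(L, d))
bs = rng.normal(size=(L, d))

xl = x0
for w, b in zip(ws, bs):
    xl = cross_layer(x0, xl, w, b)
print(xl.shape)                   # output keeps the input dimension d
```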

3.2.1 In-depth understanding of Cross Network

We can view each cross layer as modeling the pairwise interactions between 𝔁ₗ and 𝔁₀, then efficiently projecting the result back to the dimension of the input layer. Analyzing the feature-crossing process in the figure above, the formula can actually be read like this:

Multiplying the first two terms of the formula, 𝔁₀𝔁ₗᵀ, gives every bit-wise product of the current cross layer's features and the input features, 𝓭² values in total; multiplying by the weight vector then projects this result back to 𝓭 dimensions. This process performs a feature cross while compressing the feature dimension from 𝓭² to 𝓭, greatly reducing the space complexity.
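The compression can be checked numerically: building the full 𝓭² outer product and projecting it with 𝔀 gives exactly the same vector as the compressed 𝙊(𝓭) form. Random illustrative vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
x0, xl, w = rng.random(d), rng.random(d), rng.random(d)

# Explicit crossing: all d^2 pairwise products, then project back to d dims.
explicit = np.outer(x0, xl) @ w          # (x0 xl^T) w
# Compressed form: never build the d x d matrix.
compressed = x0 * (xl @ w)

print(np.allclose(explicit, compressed))
```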

3.2.2 DCN and FM

In the FM model, every feature 𝔁ᵢ has a latent vector 𝓿ᵢ, and the interaction of two features is obtained through the inner product of their latent vectors. In DCN, every feature 𝔁ᵢ is associated with scalar weights, and the weight of the cross term 𝔁ᵢ𝔁ⱼ is the product of the corresponding scalars. Both therefore have the ability to generalize feature crosses.

This parameter sharing not only enables the model to generalize to feature interactions that never occurred in the training samples, but also makes the model more robust and less affected by noisy data. In the recommendation scenario features are sparse; suppose two features 𝔁ᵢ and 𝔁ⱼ almost never appear together in the training samples. Then the direct combination feature 𝔁ᵢ𝔁ⱼ brings no benefit, yet through parameter sharing we can still get the effect of the feature combination from the product of the parameters.

In addition, FM is generally limited to second-order combinations, while DCN, through clever design, extends feature crossing from a single layer to multiple layers, obtaining multi-order interactions while the amount of computation still grows linearly.

3.2.3 Is it the best?

From the feature-crossing calculation described above we can see a method that efficiently computes high-order crosses between features without a huge amount of computation. Is this method perfect? Is there anything to improve? In the recommendation scenario, many sparse features such as user IDs and item IDs are embedded into low-dimensional dense features in the DNN model. We then concatenate the features to obtain the input feature 𝔁₀, in which several consecutive elements are local features belonging to the user, together representing one user attribute.

Returning to the cross network calculation, feature crossing is expressed as 𝔁₀𝔁ₗᵀ, i.e. the product of every pair of elements of the two vectors (bit-wise). However, we know that the elements of an embedding jointly represent one field. In this respect FM handles embeddings better: if we regard FM's latent vectors as feature embeddings, the inner product of two latent vectors is actually a product of the vectors of two fields (vector-wise), and it does not cross-combine the individual elements inside an embedding.

04

Conclusion

This article mainly introduced a key stage of the recommendation system: ranking. We described a common industry approach, the CTR prediction model. The problem it studies is: given the user to recommend to, the current context, and a candidate item, compute the probability that the user clicks the item; compute this estimated click-through rate for all candidate items; and output them sorted from high to low.

After years of development, CTR prediction models have evolved from the original LR to FM and now to DNN-based models. In general this evolution follows one line of thought: we want to reduce manual feature combination and increase the generality of the model, moving generalization from the features into the model itself, and studying how to make the model's own generalization ability stronger.

At the beginning, the LR model itself had no generalization ability; based on business understanding, a large amount of complex feature engineering gave the features fed into LR their generalization ability. FM then added a 𝓴-dimensional latent vector for each feature compared with LR; through the inner products of the latent vectors, pairwise feature combination is realized inside the model, giving it a preliminary generalization ability, so that after training to convergence the model can also infer relationships between features that never appear together. Then DNNs were applied to CTR prediction; in the same spirit, to let the model learn cross features quickly and trainably, a cross network was added alongside the multi-layer fully connected network. With its clever design, the cross network extends feature combination to higher orders while the computational complexity grows only linearly. In practice we only need to add a cross network to an existing DNN model to achieve better results: the input layer can skip feature engineering, and the model computes the combination relations between features inside the cross network.

However, although the cross network achieves high-order feature crossing at linear complexity, its design can still be improved. Because it uses direct element products, i.e. bit-wise products, the model cannot distinguish whether vector elements belong to the same feature, which leads to some invalid feature crosses. How to optimize the model structure so that high-order crossing is computed vector-wise between fields is also a current frontier of research; the xDeepFM algorithm proposed by Microsoft last year achieves this goal and is worth studying.

References:

[1] https://arxiv.org/pdf/1606.07792.pdf

[2] https://arxiv.org/pdf/1708.05123.pdf

[3] https://arxiv.org/pdf/1803.05170.pdf

