Preface

Compared with traditional shallow machine learning, deep learning excels at automatically extracting abstract features. Together with the rapid development of distributed computing platforms, which has greatly improved large-scale data processing capacity, this has led to the wide adoption of deep learning (DL) in engineering in recent years, including image recognition, speech recognition, natural language processing, and other fields, where it achieves better results than traditional machine learning. A recommendation system, in essence, abstracts users' interest factors from seemingly chaotic raw behavior data and mines user preferences. The combination of deep learning and recommendation systems has therefore become a new hot spot in the development of DL in recent years. In fact, Spotify, Netflix, Facebook, Google, and other companies have done extensive research on applying deep learning to recommendation systems and have achieved good improvements in practical applications [1].

This article is the first in a series on the practical application of deep learning in recommendation systems. It introduces in detail how we apply the Restricted Boltzmann Machine (RBM) to our current online recommendation system, covering the principle of RBM, the details of its application to recommendation, and its parallelized implementation. The next two articles will introduce in detail the other two deep neural networks we are currently studying and using, the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN): their principles, how they combine with intelligent recommendation, and their online model performance.

We start with RBM for two reasons. On the one hand, RBM is relatively simple in structure: it is a network with only a visible layer and a hidden layer. On the other hand, strictly speaking, RBM does not belong to deep neural networks; with only two layers it is not "deep". However, it is a common building block of other, deeper neural networks, so understanding how to apply RBM to a recommendation system helps in understanding the more complex deep learning algorithms covered later.

One: Restricted Boltzmann Machines and Recommendation Systems

1.1 RBM network structure definition

We first define the network structure of RBM. An RBM is a complete bipartite graph connecting a visible layer and a hidden layer, as shown in the figure below. Each neuron is a binary unit, that is, the value of each neuron can only be 0 or 1:

The network structure of RBM has the property that every neuron in the visible layer is connected to every neuron in the hidden layer, while there are no connections between neurons within the same layer. This provides a convenient assumption for training later: neurons in the same layer are conditionally independent of each other. For the above network structure, we first define the following parameters:

$n_v, n_h$: the number of neurons contained in the visible layer and the hidden layer, respectively

$v = (v_1, v_2, \ldots, v_{n_v})$: state vector of the visible layer; $v_i$ represents the state of the $i$-th visible neuron and can only be 0 or 1

$h = (h_1, h_2, \ldots, h_{n_h})$: state vector of the hidden layer; $h_j$ represents the state of the $j$-th hidden neuron and can only be 0 or 1

$b = (b_1, b_2, \ldots, b_{n_h})$: bias vector of the hidden layer; $b_j$ represents the bias of the $j$-th hidden neuron

$a = (a_1, a_2, \ldots, a_{n_v})$: bias vector of the visible layer; $a_i$ represents the bias of the $i$-th visible neuron

$W = (w_{ji}) \in \mathbb{R}^{n_h \times n_v}$: weight matrix between the hidden layer and the visible layer; $w_{ji}$ represents the connection weight between the $i$-th visible neuron and the $j$-th hidden neuron

RBM is an energy-based model: we need to define its energy function and then define the corresponding probability distributions based on it. (Only the formulas are listed below; for the derivations, please refer to related materials, as this article does not describe them in detail.)

Energy function:

$$E(v, h) = -\sum_{i=1}^{n_v} a_i v_i - \sum_{j=1}^{n_h} b_j h_j - \sum_{j=1}^{n_h} \sum_{i=1}^{n_v} h_j w_{ji} v_i$$

Joint probability distribution:

$$P(v, h) = \frac{1}{Z} e^{-E(v, h)}, \quad \text{where } Z = \sum_{v, h} e^{-E(v, h)} \text{ is the partition function}$$

Conditional probability distributions:

$$P(h_j = 1 \mid v) = \sigma\left(b_j + \sum_{i=1}^{n_v} w_{ji} v_i\right)$$

where $h_j$ is the state of the $j$-th hidden neuron, $v$ is the state vector of the visible layer, and $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function; symmetrically,

$$P(v_i = 1 \mid h) = \sigma\left(a_i + \sum_{j=1}^{n_h} w_{ji} h_j\right)$$

where $v_i$ is the state of the $i$-th visible neuron and $h$ is the state vector of the hidden layer.

Marginal distribution:

$$P(v) = \sum_h P(v, h) = \frac{1}{Z} \sum_h e^{-E(v, h)}$$
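To make these conditional distributions concrete, here is a minimal NumPy sketch of sampling between the two layers. The layer sizes, initialization, and example vector are illustrative assumptions, not our production code:

```python
import numpy as np

rng = np.random.default_rng(0)

n_v, n_h = 6, 4                        # visible / hidden layer sizes (illustrative)
W = rng.normal(0.0, 0.01, (n_h, n_v))  # weight matrix, one row per hidden neuron
a = np.zeros(n_v)                      # visible-layer biases
b = np.zeros(n_h)                      # hidden-layer biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v):
    """P(h_j = 1 | v) = sigmoid(b_j + sum_i w_ji * v_i); return probs and a sample."""
    p_h = sigmoid(b + W @ v)
    return p_h, (rng.random(n_h) < p_h).astype(float)

def sample_v_given_h(h):
    """P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ji * h_j); return probs and a sample."""
    p_v = sigmoid(a + W.T @ h)
    return p_v, (rng.random(n_v) < p_v).astype(float)

v0 = np.array([1, 0, 1, 1, 0, 0], dtype=float)  # an example binary visible vector
_, h0 = sample_h_given_v(v0)   # encode: visible -> hidden
_, v1 = sample_v_given_h(h0)   # decode: hidden -> visible
```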

1.2 RBM and collaborative filtering

In the previous section, we briefly described the structure of RBM. So how do we apply this model to a recommendation system? RBM is essentially an encoder-decoder. Specifically, with RBM we can map the raw input data from the visible layer to the hidden layer, obtaining a latent factor vector representation of the raw input; this is known as the encoding process. We then use the resulting hidden-layer vector to map back to the visible layer, obtaining new visible-layer data; this is called the decoding process. Our goal is to make the decoded result as close as possible to the original data. In this way, during decoding we not only reproduce the scores of items the user has already rated, but also predict scores for items the user has not rated, and these predicted scores, sorted from high to low, form the recommendation list.

As can be seen from the above analysis, applying RBM to recommendation requires solving the following two problems:

  1. How do we represent a user's raw behavior data with the visible layer?

    The solution proposed by Hinton et al. is to extend the original RBM [2]. Taking music recommendation as an example, each neuron in the visible layer represents the score of one song, and each visible neuron is extended to K softmax units, indicating that the song's score is 0 to (K-1), respectively. If the score for a song is 3, the value of the fourth softmax unit of the corresponding neuron is 1, and the values of the remaining (K-1) units are 0, as shown in the figure below and in the encoding sketch after this list. The neurons marked "missing" in the figure mean that the user has not listened to those songs.

  2. How do we deal with missing data?

As we mentioned in the previous paragraph, a user's raw input data contains scores for only a few songs, and the neurons corresponding to unscored songs are the missing neurons in the figure above. We do not consider this part of the data when training the weights: each user's data constitutes an independent RBM sub-model, and each user's sub-model contributes only to adjusting the weights connected to it. Equivalently, the update of each weight is determined only by the user data associated with it.
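To make the extended visible layer of point 1 concrete, here is a small sketch that one-hot encodes a user's ratings into K softmax units per item, leaving unrated items as all-zero "missing" rows. The rating scale and helper name are illustrative assumptions:

```python
import numpy as np

K = 5  # scores 0..4, so each visible neuron expands to K softmax units

def encode_ratings(ratings, n_items):
    """Return an (n_items, K) one-hot matrix; rows of unrated items stay all-zero.

    `ratings` maps item index -> integer score in [0, K-1]. Items absent from
    the dict are the 'missing' neurons and do not take part in weight updates.
    """
    V = np.zeros((n_items, K))
    for item, score in ratings.items():
        V[item, score] = 1.0
    return V

# A user who gave item 0 a score of 3 and item 2 a score of 1; items 1, 3 missing.
V = encode_ratings({0: 3, 2: 1}, n_items=4)
```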

Let’s take a look at the process of data encoding and decoding in the modified RBM model:

Encoding process: using the original data $V$, we obtain the hidden-layer vector representation through the conditional probability formula, which for the extended model sums over all K softmax units of each visible neuron:

$$P(h_j = 1 \mid V) = \sigma\left(b_j + \sum_{i=1}^{n_v} \sum_{k=0}^{K-1} v_i^k w_{ji}^k\right)$$

The following is an animated diagram of the encoding process:

Decoding process: in contrast to the encoding process, we use the hidden-layer vector obtained in the previous step to compute the visible-layer probabilities in reverse. Unlike the conditional probability of Section 1.1, however, each visible neuron of the recommendation RBM, being an extension of the basic RBM, is not a single unit but contains K softmax units, so we need to adjust the conditional probability. Specifically, the conditional probability changes from the sigmoid function to the softmax function:

$$P(v_i^k = 1 \mid h) = \frac{\exp\left(a_i^k + \sum_{j=1}^{n_h} h_j w_{ji}^k\right)}{\sum_{l=0}^{K-1} \exp\left(a_i^l + \sum_{j=1}^{n_h} h_j w_{ji}^l\right)}$$

The following is an animated diagram of the decoding process:
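Continuing the notation of the earlier sketches, here is a minimal version of this softmax decoding, where the weight tensor W now has one slice per score value. Shapes and names are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode(h, W, a):
    """P(v_i = k | h) via a softmax over the K units of each visible neuron.

    h: (n_h,) hidden vector, W: (K, n_h, n_items) weights, a: (n_items, K) biases.
    Returns an (n_items, K) matrix of score probabilities per item.
    """
    # scores[i, k] = a[i, k] + sum_j h[j] * W[k, j, i]
    scores = a + np.einsum('j,kji->ik', h, W)
    return softmax(scores, axis=1)

# The expected score of each item can then be used for ranking:
# expected = (decode(h, W, a) * np.arange(K)).sum(axis=1)
```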

Two: Model Optimization with Contrastive Divergence

From the first part, we now know the network structure of RBM and how it combines with the recommendation system. The remaining problem is how to train the model. For RBM, this means training three groups of parameters: the weight matrix W connecting the visible layer and the hidden layer, the bias visbias of the visible-layer nodes, and the bias hidbias of the hidden-layer nodes.

For a machine learning model, we first need to determine the training objective. An RBM essentially performs a process of encoding and decoding, so it is natural to expect the data encoded and then decoded by the RBM to be as close as possible to the original input. This is also the idea of maximum likelihood estimation; that is, our optimization objective is:

$$\max_{W, a, b} \; \sum_{s=1}^{M} \ln P(V^{(s)})$$

where $M$ is the number of training samples. Substituting the marginal distribution of Section 1.1 into the formula above and taking partial derivatives with respect to W, visbias, and hidbias, we get:

$$\frac{\partial \ln P(v)}{\partial w_{ji}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$$

$$\frac{\partial \ln P(v)}{\partial a_i} = \langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{model}}$$

$$\frac{\partial \ln P(v)}{\partial b_j} = \langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{model}}$$

The right-hand side of each of the three gradient formulas consists of two terms: the first is called the positive gradient, and the second the negative gradient. Solving for the positive gradient is very simple, since it depends only on the current input data. Computing the negative gradient, however, requires considering all possible configurations of the visible layer: if the visible layer has $n_v$ neurons, there are $2^{n_v}$ configurations in total. Obviously, this exponential time complexity is unacceptable, and we need further optimization.

The biggest difficulty in training an RBM lies in computing the negative gradient. Professor Hinton proposed the contrastive divergence algorithm in 2002, which effectively solves the RBM training-speed problem and is the current standard training algorithm for RBM. The idea of contrastive divergence is to use a reasonable sampling method to approximate the full configuration space of the negative gradient with a small number of samples. The concrete procedure is: starting from the original input data $v^{(0)}$, we encode and decode it to obtain a new visible-layer input $v^{(1)}$; this is called one step of Gibbs sampling. We then put $v^{(1)}$ through the same process to obtain $v^{(2)}$, repeat this k times, and finally obtain $v^{(k)}$, which serves as the approximation of the final negative gradient. This is called the k-step Gibbs sampling process, and the following is an animated diagram of it:

In practice, a value of k between 1 and 10 generally achieves good results, and the value of k can also change dynamically with the number of iterations [3].

In this way, we modify the gradient formulas above as follows, where $S^0$ represents the original input data of the $s$-th user and $S^k$ represents the $s$-th user's data after $k$ steps of Gibbs sampling:

$$\Delta w_{ji} \propto \frac{1}{M} \sum_{s=1}^{M} \left( \langle v_i h_j \rangle_{S^0} - \langle v_i h_j \rangle_{S^k} \right), \quad \Delta a_i \propto \frac{1}{M} \sum_{s=1}^{M} \left( \langle v_i \rangle_{S^0} - \langle v_i \rangle_{S^k} \right), \quad \Delta b_j \propto \frac{1}{M} \sum_{s=1}^{M} \left( \langle h_j \rangle_{S^0} - \langle h_j \rangle_{S^k} \right)$$
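For the plain binary RBM of Section 1.1, a minimal CD-k update reusing sample_h_given_v and sample_v_given_h from the earlier sketch might look as follows. The learning rate and in-place update style are illustrative assumptions:

```python
def cd_k_update(v0, k=1, lr=0.1):
    """One contrastive-divergence update from a single training vector v0."""
    global W, a, b
    # Positive phase: statistics under the data distribution.
    p_h0, h = sample_h_given_v(v0)
    # Negative phase: k steps of Gibbs sampling approximate the model distribution.
    vk = v0
    for _ in range(k):
        _, vk = sample_v_given_h(h)
        p_hk, h = sample_h_given_v(vk)
    # Gradient approximations: <v h>_{S^0} - <v h>_{S^k}, etc.
    W += lr * (np.outer(p_h0, v0) - np.outer(p_hk, vk))
    a += lr * (v0 - vk)
    b += lr * (p_h0 - p_hk)
```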

Three: Parallel Implementation of Contrastive Divergence

There are currently mature platforms for parallel RBM training, such as Theano, Caffe, and TensorFlow. In our implementation we did not use these frameworks; instead, we train on a Spark cluster. Training an RBM with contrastive divergence is essentially a gradient descent optimization. The following is the execution flow chart of the algorithm on Spark:

As described earlier, each user's data forms an RBM subnetwork. The cluster first partitions the data and the sub-models; during execution the models and data are distributed to different executors, which compute each sub-model's positive gradient and the negative gradient after k steps of Gibbs sampling. The results are then merged and sent back to the driver. To prevent the network and memory pressure of sending all data back to the driver at once, we can use tree aggregation, as sketched below. The figures below illustrate trees of depth 2 and of greater depth, where different line colors represent data transfers in different units. As the figures show, greater depth makes the computation take more steps, but the amount of data transferred to the driver at each step decreases, preventing memory overflow on the driver side.
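A sketch of what one update round could look like with PySpark's treeAggregate, which performs exactly this tree-shaped merge. The per-user gradient function, data, and parameter shapes here are hypothetical placeholders, not our production job:

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="rbm-cd")

users = sc.parallelize(training_data)   # one record per user (placeholder data)
bc_weights = sc.broadcast(weights)      # current model parameters (placeholder)
zero = np.zeros_like(weights)

def seq_op(acc, user_data):
    # Positive gradient minus k-step negative gradient for this user's sub-model;
    # user_gradient is a hypothetical per-user CD-k function.
    return acc + user_gradient(user_data, bc_weights.value)

def comb_op(acc1, acc2):
    # Merge partial gradient sums from two branches of the aggregation tree.
    return acc1 + acc2

# depth=2 inserts an intermediate aggregation layer between executors and the
# driver, so less data reaches the driver in any single transfer.
grad_sum = users.treeAggregate(zero, seq_op, comb_op, depth=2)
weights += learning_rate * grad_sum / users.count()
```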

Four: Online Model Fusion

Through the previous three parts, we now have a deeper understanding of how RBM works in a recommendation system and how the RBM model is trained. Finally, we need to use the trained model to generate recommendation data and fuse it with other algorithm models. Computing the recommendation results means using the trained weight parameters to encode and decode the input data: the new decoded data not only regenerates the scores of items the user has interacted with, but also predicts the scores of items the user has not interacted with, that is, the missing data of Section 1.2. The predicted scores of the missing data help us rank the recommended items.

To realize this process, we have two options: one is to generate all users' recommendation data in batch offline, but this requires a huge amount of computation; the other is to save the trained weights separately and move the generation of recommendation data to the online layer for real-time computation, as shown in the figure below:

In our application we adopt the second method, which has two advantages over the first: first, the results do not depend on an offline task, so a failed offline task cannot affect the online data; second, the result data computed online in real time can be fused with the data of existing algorithm models in a more flexible way, including rule filtering and reordering in subsequent data processing.
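A minimal sketch of the online scoring step under this second option, reusing encode_ratings, sigmoid, and decode from the earlier sketches. The service wiring is omitted, and names and shapes are illustrative:

```python
def recommend(user_ratings, W, a, b_h, n_items, top_n=10):
    """Encode a user's ratings, decode, and rank unrated items by expected score."""
    V = encode_ratings(user_ratings, n_items)   # (n_items, K) one-hot matrix
    # Encode: hidden activations from the observed softmax units.
    h = sigmoid(b_h + np.einsum('ik,kji->j', V, W))
    probs = decode(h, W, a)                     # (n_items, K) score probabilities
    expected = (probs * np.arange(probs.shape[1])).sum(axis=1)
    unrated = [i for i in range(n_items) if i not in user_ratings]
    return sorted(unrated, key=lambda i: expected[i], reverse=True)[:top_n]
```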

Five: Summary

This article analyzes the application of RBM to recommendation systems in detail. As the analysis shows, the improvement RBM brings to the recommendation system mainly comes from its ability to automatically extract abstract features, which is also the foundation of deep learning in recommendation. The following articles on RNN and CNN will continue to analyze how to further improve recommendation performance on the basis of abstract feature extraction.

References

[1] www.quora.com/Has-there-b…

[2] www.cs.toronto.edu/~rsalakhu/p…

[3] www.cs.toronto.edu/~hinton/abs…

The second article in this series was published on October 11 on the Tencent Big Data official account. Interested readers can follow the Tencent Big Data official account for the following articles. Thank you.