“This is the third day of my participation in the First Challenge 2022, for more details: First Challenge 2022”. This article translation proceedings. Neurips. Cc/paper / 2007 /… zhuanlan.zhihu.com/p/146853803 zhuanlan.zhihu.com/p/76298058

The author information

Ruslan Salakhutdinov and Andriy Mnih, Department of Computer Science, University of Toronto.

The paper was published in 2008.

Abstract

Many existing collaborative filtering algorithms cannot handle large data sets, nor can they easily handle users with low ratings. The probability matrix decomposition model proposed in this paper is such a model: as the observation data set grows larger, the complexity of this algorithm expands linearly, especially in the large, sparse and very unbalanced Netflix data set (movie data), this model can work well. We then extend the PMF model further: adaptive model parameters, and demonstrate how model capabilities can be automatically controlled by the parameters. Finally, based on the fact that users with similar movie ratings may have the same preferences, we introduce a constrained version of the PMF model that may work better for users with fewer ratings. When the predictions of multiple PMF models were linearly combined through the limited Boltzmann machine model, we achieved an error rate of 0.8861, which was almost 7% higher than the score of Netflix’s recommendation system.

1 introduction

One of the most popular collaborative filtering algorithms is based on low-dimensional eigenmatrix models. The idea behind this model is that user attitudes or preferences are determined by a few hidden factors. In the linear feature model, the user’s preference is modeled as the product of a linear combination of item feature vectors and the user’s preference coefficient. For example, for N users and M movies, the preference matrix R of N× M is obtained by multiplying the user coefficient matrix U of N×D by the factor matrix V of D ×M [7]. Training such a model is equivalent to the observed N × M target matrix R under a given loss function, and finding the rank D that minimizes the loss.


R = U T V R=U^TV

Various models based on probabilistic features have been proposed recently [2, 3, 4]. All of these models can be viewed as graphical models in which hidden characteristic variables are directly related to variables representing user ratings. The main disadvantage of such models is that it is difficult to infer accurately [12], which means that calculating the posterior distribution of hidden features in such models may be slow or not accurate approximations.

You can use singular value decomposition (SVD) to find low-rank approximations based on minimizing the sum of squares and distances. SVD finds a matrix R^=UTV\widehat{R}=U^TVR =UTV for a given rank and minimizes the sum of squares from the target matrix R. Since most real-world datasets are sparse, most entries in R will be lost. In these cases, the squares and distances can only be calculated for the observed entries in the matrix. As shown in [9], this seemingly small problem makes it difficult to solve non-convex optimization problems using standard SVD.

Unconstrained approximation matrices R^\widehat{R}R, such as the number of features, [10] proposed a method of U and V punishment paradigm. However, learning in this model requires solving a sparse semidefinite program (SDP), which makes this approach infeasible for involving millions of observations.

Many of the collaborative filtering algorithms mentioned above have been applied to model user ratings on the Netflix Prize dataset, which contains 480,189 users, 17,770 movies, and more than 100 million views (user/movie/rating triads). However, none of these methods has proved particularly successful for two reasons: first, none of the above, except for algorithms based on matrix factorization, scale well to large data sets; Second, most existing algorithms are difficult to make accurate predictions for users with low ratings. Common practice in collaborative filtering communities is to remove users with less than some minimum rating score. However, the results reported on standard datasets such as MovieLens are actually quite useful. Netflix’s data set is very lopsided, with “infrequent” users rating fewer than five movies and “frequent” users rating more than 10,000 movies. However, because the standardized test set includes the full range of users, the Netflix data set provides a more realistic and useful benchmark for collaborative filtering algorithms.

The aim of this paper is to propose a probabilistic algorithm that is linearly dependent on quantity. Observe and perform well on very sparse and unbalanced datasets, such as Netflix datasets. In Section 2, we introduce a probabilistic matrix decomposition (PMF) model for modeling users: the preference matrix as the product of two lower-level user and movie matrices. In Section 3, we extend the PMF model to include adaptive priors for movie and user feature vectors, and show how these priors can be used to automatically control model complexity. In Section 4, we introduce a constrained version of the PMF model, which is based on the assumption that user movie sets with similar ratings have similar preferences. In Section 5, we report the experimental results shown that PMF significantly outperforms the standard SVD model. We also show that constrained PMFS and PMFS with learnable priors significantly improve model performance. Our results suggest that constrained PMF is particularly effective at making better predictions for users with low ratings.

2 probability matrix decomposition

M movies, N users, rating 1-K. RijR_{ij}Rij represents the rating given by the ith user to the JTH movie, U∈RD×NU \in R^{D\times N}U∈RD×N is the characteristic matrix of the user, V∈RD×MV \in R^{D\times M}V∈RD×M is the characteristic matrix of the movie, UiU_iUi represents the feature vector of the ith user, and VjV_jVj represents the feature vector of the JTH movie. The performance of the model is expressed by the root mean square difference of the test set. We first adopt a probabilistic linear model using Gaussian observation noise, and define the probability distribution of the observation value as:


p ( R U . V . sigma 2 ) = i = 1 N j = 1 M [ N ( R i j U i V j . sigma 2 ) ] I i j p(R|U,V,\sigma^2)= \prod\limits_{i=1}^N \coprod\limits_{j=1}^M {[N(R_{ij}|\mathbf{U_i}^\top V_j,\sigma^2)]}^{I_{ij}}

N (x ∣ u, sigma 2) N (x | u, \ sigma ^ 2) N (x ∣ u, sigma 2) is the mean for uuu, variance of sigma 2 \ sigma ^ 2 sigma 2 gaussian probability density function. IijI_{ij} if Iij is 1, it means that user I’s evaluation of user J exists. We also define the spherical Gaussian distribution prior of 0 mean values of two user and movie features:


p ( U sigma U 2 ) = i = 1 N N ( U i 0 . sigma U 2 I ) . p ( V sigma V 2 ) = i = 1 M N ( V i 0 . sigma V 2 I ) . p(U|\sigma_U^2)=\prod\limits_{i=1}^NN(U_i|0,\sigma_U^2I), p(V|\sigma_V^2)=\prod\limits_{i=1}^MN(V_i|0,\sigma_V^2I),

Then the logarithm of the posterior probability distribution of movie and user features can be calculated as


l n p ( U . V R . sigma 2 . sigma U 2 . sigma V 2 ) = 1 2 sigma 2 i = 1 N j = 1 M I i j ( R i j U i T V j ) 2 1 2 sigma U 2 i = 1 N U i T U i 1 2 sigma V 2 j = 1 M V j T V j lnp(U,V|R,\sigma^2,\sigma_U^2,\sigma_V^2)=-\frac1{2\sigma^2}\sum\limits_{i=1}^N\sum\limits_{j=1}^MI_{ij}(R_ij-U_i^TV_j)^ 2-\frac1{2\sigma_U^2}\sum\limits_{i=1}^NU_i^TU_i -\frac1{2\sigma_V^2}\sum\limits_{j=1}^MV_j^TV_j

1 2 ( ( i = 1 N j = 1 M I i j ) l n sigma 2 + N D l n sigma U 2 + M D l n sigma V 2 ) + C -\frac12((\sum\limits_{i=1}^N\sum\limits_{j=1}^MI_{ij})ln\sigma^2+NDln\sigma_U^2+MDln\sigma_V^2)+C

In order to maximize the posterior probability, log value ln should also be the largest, and E should be the smallest, where E is:


E = 1 2 i = 1 N j = 1 M I i j ( R i j U i T V j ) 2 + Lambda. U 2 U i F r o 2 + Lambda. V 2 V j F r o 2 E=\frac12\sum\limits_{i=1}^N\sum\limits_{j=1}^MI_{ij}(R_ij-U_i^TV_j)^2+\frac{\lambda_U}2||U_i||_{Fro}^2 +\frac{\lambda_V}2||V_j||_{Fro}^2

The lambda U = 2 / sigma sigma U2, lambda V = sigma/sigma 2 V2, ∣ ∣. ∣ ∣ Fro2 \ lambda_U = {\ sigma ^ 2} / {\ sigma_U ^ 2}, \ lambda_V = {\ sigma ^ 2} / {\ sigma_V ^ 2}, | |. | | _ {Fro} ^ 2 Lambda U = 2 / sigma sigma U2, lambda V = sigma/sigma 2 V2, ∣ ∣. ∣ ∣ Fro2 said Frobenius norm Frobenius norm. So our mother is minimizing E, and we can find U and V by gradient descent. This model can be seen as a probabilistic extension of the SVD model, since all ratings are observed. If the range of prior variances can go up to infinity, this goal degrades to an SVD model goal.

In our experiment, the linear Gaussian model was not used, which would make the prediction beyond the range of valid evaluation values. The dot product of the user and the movie feature vector is bound to the range of prediction by the regression equation G (x)=1/(1+exp(−x)) G (x)=1/(1+exp(-x))g(x)=1/(1+exp(−x)) :


p ( R U . V . sigma 2 ) = i = 1 N j = 1 M [ N ( R i j g ( U i T V j ) . sigma 2 ) ] I i j p(R|U,V,\sigma^2)=\prod\limits_{i=1}^N \coprod\limits_{j=1}^M {[N(R_{ij}|g(U_i^TV_j),\sigma^2)]}^{I_{ij}}

We use a mapping function t (x) = (x – 1)/(K – 1) t (x) = (x – 1)/(K – 1) t (x) = (x – 1)/(K – 1) will score 1,… ,K maps to [0,1], so the valid score corresponds to the predicted value. The objective function of the minimum gradient method that minimizes the above formula is that the computational complexity increases linearly.

A simple implementation of this algorithm in Matlab allowed us to do a scan on the Netflix dataset (using 30 features for training) in less than an hour.

Automatic complexity control for PMF models

For PMF models to perform well, capability control is necessary. Given enough features, we can better approximate any given scoring matrix. The simplest way to control the model is to change the number of features. However, this approach fails when the data set is unbalanced, meaning that the number of observations for different rows or columns varies greatly. Because a single value of a feature dimension can be too high for some feature vectors and too low for others. The regularization parameters λU\lambda_UλU and λV\lambda_VλV provide better normalization methods. Perhaps the easiest way to find the most appropriate parameter is to consider a reasonable set of parameter values and train the model to find the parameter that makes the model perform best in the validation set. The downside of this approach is that it is computationally expensive because we have to train multiple models. We will demonstrate how the method proposed in [6] (originally for neural networks) does not take too much time to find a suitable parameter value.

As mentioned above, the problem of the method of approximating a matrix by the dot product of two low-latitude matrices (regularized by penalizing their Frobenius normal form) can be considered a MAP estimation problem in the sense of L2L_2L2, a MAP estimation of probability models with characteristic rows having spherical Gaussian distributions. The complexity of the model is controlled by hyperparameters: noise σ2\sigma^2σ2, and prior parameters σU2\sigma_U^2σU2 and σV2\sigma_V^2σV2. By introducing the prior values of these hyperparameters, the method [6] maximizes the log posterior value of the model so that the complexity of the model can be controlled automatically according to the training data. Using the framework of the user and the spherical priors of the movie eigenvectors, the parameters λU\lambda_UλU and λV\lambda_VλV of PMF can be automatically selected. This regularization method allows us to use more complex methods than the Frobenius norm punishment method. For example, we can use diagonal or even full covariance matrices and adjustable eigenvector mean as priors. The comprehensive processing of gaussian distribution prior is also very convenient.

In summary, by maximizing the log value of the target probability, we find a key point for estimating parameters and hyperparameters:


l n p ( U . V . sigma 2 . Θ U . Θ V R ) = l n p ( R U . V . sigma 2 ) + l n p ( U Θ U ) + l n p ( V Θ V ) + l n p ( Θ U ) + l n p ( Θ V ) + C . LNP (U, V, sigma 2, Θ _U, Θ _V | R) = ln the p (R | U, V, sigma 2) + ln p (U | Θ _U) + ln p (V | Θ _V) + ln p (Θ _U) + ln p (Θ _V) + C,

θ U θ _U θ U and θ V θ _V θ V are hyperparameters of user and movie feature vectors, and C is a constant.

When prior is gaussian, this optional hyperparameter can be found in a closed form if the user and movie feature matrices remain unchanged. To minimize the learning process, we switch between selecting hyperparameters and updating the eigenmatrix (gradient descent fitted with fixed hyperparameters). This optional hyperparameter can be found in a closed form if prior is gaussian, if prior is Gaussian, and if the user and movie feature matrices remain unchanged. To minimize the learning process, we switch between selecting hyperparameters and updating the eigenmatrix (gradient descent fitted with fixed hyperparameters). When the prior is a mixed Gaussian model, the hyperparameters can be obtained by EM single-step method. In all of the experiments, the hyperparameters we chose were less appropriate priors, but it was easy to extend the closed form update to handle the combined prior hyperparameters.

4. Restricted PMF

When a PMF model has been trained, the feature vectors of users with few ratings may be close to the prior average, or the rating users, so the predicted ratings for these users will be close to the average movie ratings. In this section we will introduce a method of constraining user feature vectors, which has a strong impact on infrequent users.

Introducing W∈RD×MW \in R^{D\times M}W∈RD×M is an implicit similarity constraint matrix, we define the user eigenmatrix vector as follows:


U i = Y i + k = 1 M I i k W k k = 1 M I i k U_i=Yi+\frac{ \sum\limits_{k=1}^M I_{ik}W_k}{ \sum\limits_{k=1}^M I_{ik}}

III is an observation matrix, Iij=1I_{ij}=1Iij=1 indicates that if user I has rated movie J, otherwise it is 0. Intuitively, the i-th column of the W-matrix will have a greater prior influence on the user feature vector if a user has a rating for this particular movie. As a result, users who rated the same movie had a similar prior distribution. W is also a spherical Gaussian prior of the mean of 0:


p ( W sigma W 2 ) = k = 1 M N ( W k 0 . sigma W 2 I ) . p(W|\sigma_W^2)=\prod\limits_{k=1}^MN(W_k|0,\sigma_W^2I),

, then the conditional probability distribution of the observed score conforms to


p ( R U . V . W . sigma 2 ) = i = 1 N j = 1 M [ N ( R i j g ( Y i + k = 1 M I i k W k k = 1 M I i k ) T V j . sigma 2 ) ] I i j p(R|U,V,W,\sigma^2)= \prod\limits_{i=1}^N \coprod\limits_{j=1}^M {[N(R_{ij}| g(Yi+\frac{ \sum\limits_{k=1}^M I_{ik}W_k}{ \sum\limits_{k=1}^M I_{ik}})^T}V_j,\sigma^2)]^{I_{ij}}

Same thing to calculate


E = 1 2 i = 1 N j = 1 M I i j ( R i j g ( Y i + k = 1 M I i k W k k = 1 M I i k ) T V j ) 2 + Lambda. U 2 U i F r o 2 + Lambda. V 2 V j F r o 2 + Lambda. W 2 W j F r o 2 E=\frac12\sum\limits_{i=1}^N\sum\limits_{j=1}^MI_{ij}(R_{ij}-g(Yi+\frac{ \sum\limits_{k=1}^M I_{ik}W_k}{ \sum\limits_{k=1}^M I_{ik}})^TV_j)^2+\frac{\lambda_U}2||U_i||_{Fro}^2 +\frac{\lambda_V}2||V_j||_{Fro}^2 +\frac{\lambda_W}2|| W_j||_{Fro}^2

It is also a linearly increasing time complexity, and the relevant experimental results will be given later. For infrequent users, this model will perform better than the simple PMF model.

5 Results

5.1 Describing Netflix Data

Netflix data was collected in 1998, up to 2005. The training data included more than 10 million ratings from 480,189 randomly selected anonymous users on 17,770 movies. Netflix also provided more than 1.4 million validation sets as part of the training data. In addition to training and validation sets, Netflix offers 2.82 million test sets. The validation and test sets are selected from the set of the most frequently rated users of the training set. To prevent overfitting (experience in machine learning), Netflix announced that predictive performance was evaluated by RMSE. As a benchmark, Netflix provided its own system’s training score of 0.9514.

We selected 50,000 users and 1,850 movies to test our model algorithm, the training set protected 1.08 million ratings, and the validation set contained 2,462 ratings. This is a data set where 50% of users have less than 10 ratings.

5.2 Details of training

To speed up training rather than batch learning. We divided the Netflix data into 10 thousand groups and updated the characteristic matrix after each small batch. Using different learning rate and milestone value, we experimented with different D, and finally chose 0.005 learning rate and 0.9 milestone value. Because these Settings work well for all D’s.

5.3 Results of adaptive prior

To evaluate the adaptive priori PMF model, we use a 10D eigenmatrix PMF model. This dimension is chosen to show that even though the dimension of feature matrix is very small, SVD-type models will overfit and merge and it is difficult to improve performance. We compare one SVD model, two PMF models with fixed priors and two PMF models with adaptive priors. SVD models are trained by minimizing the scoring root distance to the observation target matrix, and the feature vectors of SVD models cannot be regularized. The two PMF models with fixed priors have different regularization parameters: PMF1 lambda U = 0.01, lambda \ lambda_U = V = 0.001 0.01, \ lambda_V = 0.001 lambda U = 0.01, lambda = 0.001 V, PMF2 parameter is Lambda U = 0.001, lambda \ lambda_U = V = 0.0001 0.001, \ lambda_V = 0.0001 lambda U = 0.001, lambda V = 0.0001. For adaptive, the user and movie feature vectors of PMFA1 are gaussian priors of the spherical covariance matrix. PMFA2 are diagonal variance matrices. Here, adaptive priors all have adjustable mean values. The prior parameters and noise variance are updated every 10 and 100 times after the eigenmatrix is updated. The validation set of this model is also measured by RMSE.

The results are compared on the left side of Figure 2. The curves of the spherical skew variance PMF model are not shown because they coincide with the diagonal skew variance.

With the increase of training time, the effect of the model based on the minimum RMSE is better. The SVD model we looked at had almost the same performance as PMF2 before over-fitting, with the best RMSE distribution being 0.9258 and 0.9253. PMF1 could not over-fit, and the fitting effect of PMF1 was poor, with RMSE only reaching 0.9434. Adaptive prior models performed significantly better. RMSE of 0.9197 (diagonal skew variance) and 0.9204 (spherical skew variance) are obtained. These results show that automatic regularization performed best in our experimental data. Moreover, our preliminary model (feature vector at high latitude) experimental results show a performance gap, because the adaptive priors used may increase as the dimension of feature vector increases. Diagonal skew variance matrix performance is worse than spherical skew variance matrix performance, diagonal skew variance may be more suitable for greedy version of the PMF model training, let feature vectors learn only one dimension at a time.

5.4 Results of constrained PMF

For the constrained PMF model we used a 30-day experimental test because this option works best in the validation set. D values are about the same between [20,60]. The performance results of SVD, PMF and constrained PMF are shown in Figure 3. The initial eigenmatrices of these three models are the same. PMF and constraints of PMF model regularization parameter is set to lambda U equals lambda Y = lambda V = lambda W = 0.002 \ lambda_U = \ lambda_Y = \ lambda_V = \ lambda_W = 0.002 lambda U equals lambda Y = lambda V = lambda W = 0.002. It is obvious that the SVD model is seriously overfitted. Constrained PMF model has better performance and faster convergence.

Figure 3 (right) shows the influence of constrained user characteristics on infrequent voting users. For users with less than 5 votes, the average score of PMF model and movie is the same. Constrained PMF models, however, perform better. As user ratings increase, PMF and constrained PMF realization become closer and closer.

Another interesting point about constrained PMF models is that they can make better predictions even if we only know how users rate movies, not how they rate them. For the training data, we randomly sampled the data of 50,000 additional users. For each user, there were rated movies, but we discarded the specific rating data, and the RMSE of constrained PMF model reached 1.0510, while that of ordinary PMF model was 1.0726. This experiment confirmed that even if only one user voted for I movies without knowing the actual score, the model could also better know the user’s preference degree.

The performance structure on the full Netflix data set is similar to the toy data set we chose. For both PMF and constrained PMF models, the regularization parameters are set to λU=λY=λV=λW=0.001\lambda_U=\lambda_Y=\lambda_V=\lambda_W=0.001, Figure 2 (right) shows that the constrained PMF model performs better than the simple PMF model, achieving an RMSE of 0.9016. A simple SVD model achieved 0.9280 RMSE, but was overfitted after 10 iterations. Figure 4 (left) shows that the constrained PMF model has better performance for users with lower ratings. More than 10% of the users in the training set had less than 20 ratings. As the rating score increases, the influence of Formula 7 gradually decreases, and PMF and constraint PMF can achieve approximate performance.

The Netflix data set has more subtle sources of information. Netflix tells us in advance which user/movie pairs the test set will have, so we have an additional category: movies we’ve seen but their ratings are unknown. This is a valuable source of information for users who appear several times in the test set, especially if they have only a few ratings in the training set. Constrained PMF models can easily take advantage of this information. Figure 4 (right) shows that this additional information further improves the performance of the model.

When we combine the prediction linearity of several PMF models, adaptive PMF and constrained PMF, we get an error rate of 0.8970 in the test set. When combined with the predictions from the RBM model, we get an error rate of 0.8861, which is 7% higher than Netflix’s own score.

6 summarizes

In this paper, we propose a probability matrix decomposition model and its two extensions: priori PMF with self-learning and constrained PMF. We have also shown that such a model can be used well to train large data sets of over 100 million ratings, such as movie ratings. The efficiency of PMF models comes from finding point estimates and hyperparameters of model parameters rather than infering its full posterior distribution. If we adopt a complete Bayesian approach, we place the superpriors above the hyperparameters and resort to the MCMC method [5] to perform reasoning. Although the computational cost of this approach is higher, preliminary results strongly suggest that a fully Bayesian treatment of the PMF models presented will lead to a significant improvement in prediction accuracy.

reference

  • [1] Delbert Dueck and Brendan Frey. Probabilistic sparse matrix factorization. Technical Report PSI TR 2004-023, Dept. of Computer Science, University of Toronto, 2004.
  • [2] Thomas Hofmann. Probabilistic latent semantic analysis. In Proceedings of the 15th Conference on Uncertainty in AI, Pages 289 — 296, San Fransisco, California, 1999. Morgan Kaufmann.
  • [3] Benjamin Marlin. Modeling user rating profiles for collaborative filtering. In Sebastian Thrun, Lawrence K. Saul, And Bernhard Sch¨ Olkopf, editors, NIPS. MIT Press, 2003.
  • [4] Benjamin Marlin and Richard S. Zemel. The multiple multiplicative factor model for collaborative filtering. In Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004. ACM, 2004.
  • [5] Radford M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, September 1993.
  • [6] S. J. Nowlan and G. E. Hinton. Simplifying Neural Networks by Soft Weight-sharing. Neural Computation, 4:473 — 493, 1992.
  • [7] Jason D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Luc De Raedt and Stefan Wrobel, editors, Machine Learning, Proceedings of the TwentySecond International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, pages 713–719. ACM, 2005.
  • [8] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for collaborative filtering. In Machine Learning, Proceedings of the Twenty-fourth International Conference (ICML 2004). ACM, 2007.
  • [9] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In Tom Fawcett and Nina Mishra, editors, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, Pages 720 — 727. AAAI Press, 2003.
  • [10] Nathan Srebro, Jason D. M. Rennie, and Tommi Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, 2004.
  • [11] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Technical Report NCRG/97/010, Neural Computing Research Group, Aston University, September 1997.
  • [12] Max Welling, Michal Rosen-Zvi, and Geoffrey Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS 17, Pages 1481 — 1488, Cambridge, MA, 2005. MIT Press.