• How to build a Recommendation System in a graph database using a Latent Factor Model
  • Changran Liu
  • The Nuggets translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: Prince Stuart
  • Proofreader: TrWestdoor, LSvih

How to Build a Recommendation System in a Graph Database Using a Latent Factor Model

What is a recommendation system?

A recommendation system is any rating-prediction system that infers an individual's preferences from the available data. Recommendation systems are used in services as diverse as video streaming, online shopping, and social media. Typically, the system recommends items to a user based on its prediction of how that user would rate them. Recommendation systems can be classified along two dimensions: the information they use and the prediction model they employ.

Figure 1 (image courtesy of the author). This sample dataset includes two users, Alice and Bob; two movies, Toy Story and Iron Man; and three ratings (solid lines). Each user and movie is labeled with its attributes.

Content filtering vs collaborative filtering

Content filtering and collaborative filtering are the two most common recommendation methods; they differ in the information they need to make predictions. Figure 1 shows a set of movie-rating data together with some tags for the users and movies. Note that movie-rating data naturally forms a graph structure: the users and movies are the vertices of the graph, and the ratings are its edges. In this setting, content filtering leverages the tag attributes of the movies and users. From the tags we learn that Alice is a huge fan of Marvel movies and that Iron Man is a Marvel movie, so the Iron Man series is a good recommendation for her. A classic example of a content-filtering scoring scheme is TF-IDF, which is used to rank document search results.
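To make the TF-IDF idea concrete, here is a minimal Python sketch. The documents and the query term are invented for illustration, and `tf_idf` is a toy scoring function rather than a production implementation:

```python
# A toy TF-IDF sketch (illustrative only, not code from the original article).
import math
from collections import Counter

docs = [
    "iron man is a marvel action movie",
    "toy story is an animated family movie",
    "the avengers is a marvel superhero movie",
]

def tf_idf(term, doc, all_docs):
    words = doc.split()
    tf = Counter(words)[term] / len(words)         # term frequency in this document
    df = sum(term in d.split() for d in all_docs)  # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)       # weight down very common terms

# Rank the documents for the query term "marvel".
for doc in sorted(docs, key=lambda d: -tf_idf("marvel", d, docs)):
    print(f"{tf_idf('marvel', doc, docs):.3f}  {doc}")
```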

User and movie tags are not always available, and when tag data is sparse, content filtering can be unreliable. Collaborative filtering, by contrast, relies mainly on user behavior data (such as ratings or viewing history). In the example above, Alice and Bob both liked Toy Story, rating it 5 and 4.5 respectively. From these rating records we can infer that the two users have similar preferences. Given that Bob likes Iron Man, we predict that Alice will too and recommend Iron Man to her. The k-nearest neighbors algorithm (KNN) is a typical collaborative filtering method. Collaborative filtering suffers from the so-called cold-start problem, however: it cannot make recommendations for users who have no rating records.
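Below is a toy user-based collaborative filtering sketch in Python, using the Alice/Bob ratings from Figure 1. The `similarity` and `predict` helpers are hypothetical names chosen for illustration; real KNN recommenders use far more data and more careful similarity measures:

```python
# A toy user-based collaborative filtering sketch (illustrative only).
import numpy as np

ratings = {
    "Alice": {"Toy Story": 5.0},
    "Bob":   {"Toy Story": 4.5, "Iron Man": 4.0},
}

def similarity(u, v):
    # Cosine similarity over the movies both users have rated.
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if not common:
        return 0.0
    a = np.array([ratings[u][m] for m in common])
    b = np.array([ratings[v][m] for m in common])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(user, movie, k=1):
    # Average the ratings of the k most similar users who rated the movie.
    neighbors = sorted(
        ((similarity(user, v), r[movie]) for v, r in ratings.items()
         if v != user and movie in r),
        reverse=True,
    )[:k]
    if not neighbors:
        return None  # no neighbors: the cold-start problem described above
    return sum(rating for _, rating in neighbors) / len(neighbors)

print(predict("Alice", "Iron Man"))  # -> 4.0, carried over from Bob
```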

Memory-based vs model-based

Recommendation systems can also be divided into memory-based and model-based approaches, depending on whether they use explicit or implicit features. In the example above, all user and movie features are given explicitly, which allows movies and users to be matched directly by their tags. But sometimes a deep learning model is needed to extract implicit features from the user and movie data (for example, an NLP model that classifies movies by their plot summaries). Model-based recommendation systems use machine learning models for prediction, while memory-based systems rely mainly on explicit features.

Here are some typical use cases for different types of recommendation systems:

How does factorization work in a recommendation system?

As mentioned above, collaborative filtering methods such as KNN can predict movie ratings without knowing user or movie attributes. However, these methods may not produce accurate predictions when the rating data is sparse, i.e., when a typical user has rated only a few movies. In Figure 1, if Bob had not rated Toy Story, KNN could not predict Alice's rating of Iron Man: there would be no connection between Alice and Bob, so Alice would have no neighbors. To address this challenge, factorization combines the model-based approach with collaborative filtering to improve prediction accuracy when rating records are sparse.

Figure 2 illustrates the intuition behind the factorization approach. Factorization is also known as the latent factor or matrix factorization method. Its goal is to obtain a vector for each user and each movie that represents their latent features. In this example (Figure 2), the dimension of the vectors is set to 2. The two elements of a movie vector $x^{(i)}$ represent how romantic the movie is and how much action it contains, and the two elements of a user vector $\theta^{(j)}$ represent the user's preference for romance and for action, respectively. The predicted rating of user $j$ for movie $i$ is the dot product $(\theta^{(j)})^{\mathrm{T}} x^{(i)}$: as expected, the more aligned the two vectors are, the more the user likes the movie.

Figure 2 (image courtesy of the author). The table shows the rating records of 4 users for 6 movies, with ratings ranging from 0 to 5. $\theta^{(j)}$ and $x^{(i)}$ are the latent vectors; Alice's predicted ratings are computed from the latent factors and shown in orange next to the true values.
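As a quick illustration of the dot-product prediction, here is a minimal sketch; the latent vector values below are made up rather than taken from Figure 2:

```python
# Latent factor prediction in two dimensions (illustrative values).
import numpy as np

# Element 0 = "romance", element 1 = "action", following Figure 2's intuition.
x_iron_man = np.array([0.1, 1.0])    # movie latent vector x^(i)
theta_alice = np.array([0.0, 1.5])   # user latent vector theta^(j)

# The predicted rating is the dot product (theta^(j))^T x^(i).
print(theta_alice @ x_iron_man)      # -> 1.5; larger when the vectors align
```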

The example in Figure 2 merely illustrates the intuition behind the factorization approach. In practice, the meaning of each vector element is usually unknown. The vectors are randomly initialized and then "trained" by minimizing the following loss function [1]:

$$J=\frac{1}{2} \sum_{(i, j): r(i, j)=1}^{M}\left(\left(\theta^{(j)}\right)^{\mathrm{T}} x^{(i)}-y^{(i, j)}\right)^{2}+\frac{\lambda}{2}\left(\sum_{j, k}\left(\theta_{k}^{(j)}\right)^{2}+\sum_{i, k}\left(x_{k}^{(i)}\right)^{2}\right)$$

where $M$ is the number of rating records, $N$ is the dimension of the latent vectors, $y^{(i,j)}$ is user $j$'s rating of movie $i$, and $\lambda$ is the regularization factor. The first term is essentially the squared prediction error (the prediction being $\sum_{k=1}^{N} \theta_{k}^{(j)} x_{k}^{(i)}$), and the second term is regularization to avoid overfitting. The first term of the loss function can also be written in matrix form,


$$J=\left\|R-U V^{\mathrm{T}}\right\|_{2}$$

where $R$ is the rating matrix whose entries are the $y^{(i,j)}$, and $U$ and $V$ are the matrices whose rows are the user and movie latent vectors, respectively. Minimizing the squared prediction error is equivalent to minimizing the Frobenius norm of $R-UV^{\mathrm{T}}$. This is why the method is also called matrix factorization.
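A short NumPy sketch of this matrix form, assuming the convention that a zero entry of $R$ means "not rated":

```python
# Masked matrix-factorization loss (illustrative data and shapes).
import numpy as np

R = np.array([[5.0, 0.0, 4.5],
              [4.0, 2.0, 0.0]])   # rating matrix (users x movies), 0 = not rated
mask = R > 0                       # only observed ratings enter the loss
U = np.random.randn(2, 2)          # rows are user latent vectors theta^(j)
V = np.random.randn(3, 2)          # rows are movie latent vectors x^(i)

# Squared error over the observed entries of R - U V^T.
J = np.sum(((R - U @ V.T) * mask) ** 2)
print(J)
```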

You might be wondering where graphs come in. As discussed earlier, the rating matrix $R$ is likely to be sparse, since not every user rates every movie. For storage efficiency, such sparse matrices are usually stored as graphs, where each vertex represents a user or a movie, each edge represents a rating record, and the weight of the edge is the rating. As shown in the figure below, the gradient descent algorithm that obtains the latent vectors by minimizing the loss function can also be expressed as a graph algorithm.

The partial derivatives of the loss with respect to $\theta_{k}^{(j)}$ and $x_{k}^{(i)}$ can be written as follows:


$$\begin{array}{l} \frac{\partial J}{\partial \theta_{k}^{(j)}}=\sum_{i: r(i, j)=1}^{M}\left(\left(\theta^{(j)}\right)^{\mathrm{T}} x^{(i)}-y^{(i, j)}\right) x_{k}^{(i)}+\lambda \theta_{k}^{(j)} \\ \frac{\partial J}{\partial x_{k}^{(i)}}=\sum_{j: r(i, j)=1}^{M}\left(\left(\theta^{(j)}\right)^{\mathrm{T}} x^{(i)}-y^{(i, j)}\right) \theta_{k}^{(j)}+\lambda x_{k}^{(i)} \end{array}$$

The formula above shows that the partial derivative of a latent vector at a vertex depends only on its edges and neighboring vertices. In our example, the partial derivative of a user vector is determined only by the latent vectors of the movies the user has rated and by those ratings. This property lets us store the rating data as a graph (as shown in Figure 5) and use a graph algorithm to compute the latent vectors of the movies and users. The user and movie features can be obtained with the following steps:

1. Initialize the latent vectors with random numbers.
2. For each edge, compute the partial derivatives of the loss with respect to the latent vectors at its two endpoints.
3. For each vertex, update its latent vector by gradient descent; repeat steps 2 and 3 until convergence.
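Here is a compact NumPy sketch of this training procedure on a toy rating graph. The edges and hyperparameters are illustrative; the in-database version described in the next section does the same work with GSQL queries:

```python
# Gradient descent for matrix factorization on a small rating graph
# (a self-contained sketch; data and hyperparameters are illustrative).
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 4, 6, 2
edges = [(0, 0, 5.0), (0, 2, 4.5), (1, 0, 4.0), (2, 3, 2.0), (3, 5, 3.0)]  # (user, movie, rating)

theta = rng.normal(0.0, 0.1, (n_users, n_factors))   # user latent vectors
x = rng.normal(0.0, 0.1, (n_movies, n_factors))      # movie latent vectors
lr, lam = 0.05, 0.1                                   # learning rate, regularization

for it in range(200):
    grad_theta = lam * theta
    grad_x = lam * x
    sq_err = 0.0
    # Each edge contributes to the gradients of its two endpoints only,
    # which is what makes the per-edge / per-vertex parallelism possible.
    for u, m, y in edges:
        err = theta[u] @ x[m] - y
        grad_theta[u] += err * x[m]
        grad_x[m] += err * theta[u]
        sq_err += err ** 2
    theta -= lr * grad_theta
    x -= lr * grad_x

print(f"final RMSE on training edges: {np.sqrt(sq_err / len(edges)):.3f}")
print("predicted rating, user 0 / movie 0:", theta[0] @ x[0])
```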

It is worth mentioning that computing the derivatives and updating the latent vectors fit naturally into the Map-Reduce framework: the second step can be performed in parallel over the edges $edge_{(i,j)}$ (i.e., the ratings), and the third step in parallel over the vertices $vertex_{i}$ (i.e., the users and movies).

Why use a graph database?

In industrial applications, a database can hold hundreds of millions of users and movies and billions of rating records, so the rating matrix $R$, the feature matrices $U$ and $V$, and other intermediate variables can consume terabytes of memory during model training. Training the latent features in a graph database addresses this: the computation can be distributed over a multi-node cluster, with parts of the rating graph stored on disk. Moreover, the graph-structured user data (i.e., the rating records) is often already stored in a database management system. Training the model inside the database avoids migrating the graph data from the DBMS to a separate machine learning platform, which makes it much easier to keep the model up to date as the training data evolves.

How to build a recommendation system in TigerGraph

In this part, we load a graph of movie ratings into a free graph database on TigerGraph Cloud and then train the recommendation model inside the database. Follow these steps and you will have a movie recommendation system in about 15 minutes.

Create a free TigerGraph instance on TigerGraph Cloud by following the first three steps of the guide to creating your first TigerGraph instance. In step 1, select In-Database Machine Learning Recommendation as the starter kit. In step 3, select TG.free.

Refer to the getting-started guide for the TigerGraph Cloud Starter Kit and log in to GraphStudio. On the Map Data To Graph page, you will see how the data files are mapped to the graph structure. In this starter kit, the MovieLens-100K files have already been uploaded to the instance. The MovieLens-100K dataset has two files:

• The movielist.csv file has two columns: the ID and the title (with release year) of each movie.
• The rating.csv file is a list of users' movie ratings, with three columns: user ID, movie ID, and rating.
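If you want to inspect the same data outside GraphStudio, a pandas sketch like the following works, assuming the file names and column layouts described above (the column names here are chosen for illustration):

```python
# Inspect the MovieLens-100K files described above (assumed local copies;
# column names are illustrative, matching the layouts in the two bullets).
import pandas as pd

movies = pd.read_csv("movielist.csv", names=["movie_id", "title"])
ratings = pd.read_csv("rating.csv", names=["user_id", "movie_id", "rating"])

print(movies.head())
print(ratings["rating"].describe())   # quick sanity check of the score range
```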

Go to the Load Data page and click the Start/Resume loading button. After the dataset is loaded, you can see the statistics of the graph on the right. The MovieLens-100K dataset contains 10,011 ratings from 944 users on 1,682 movies.

On the Explore Graph page, you can see the bipartite graph we just created: information about each user and movie is stored in the vertices, and the ratings are stored on the edges.

On the Write Queries page, you will find that the queries needed by the recommendation system have already been added to the database. These queries are written in GSQL, the TigerGraph query language. Click Install all queries to compile all of the GSQL queries into C++ code. You will also see a README on this page. Follow the steps below to build a recommendation system from the movie rating records.

Run the split data query

This query splits the rating data into a training set and a validation set. By default the validation fraction is 30% (that is, 30% of the rating data is used to validate the model and the remaining 70% is used to train it). The query also outputs the sizes of the full, training, and validation sets.
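Conceptually, the split works like the following Python sketch; the GSQL query operates on rating edges in the graph rather than rows in a data frame:

```python
# Hold out ~30% of the rating records for validation (illustrative sketch).
import numpy as np
import pandas as pd

ratings = pd.read_csv("rating.csv", names=["user_id", "movie_id", "rating"])
rng = np.random.default_rng(42)
is_validation = rng.random(len(ratings)) < 0.3   # default fraction from the article

train, validation = ratings[~is_validation], ratings[is_validation]
print(len(ratings), len(train), len(validation))  # full / training / validation sizes
```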

Run the normalization query

This query normalizes the ratings by subtracting from each rating the average rating of the movie, where the average is computed on the training set.
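In Python terms, the normalization step does something like this (toy data; the real query updates the rating edges in the graph):

```python
# Subtract each movie's average training rating from its ratings.
import pandas as pd

train = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "movie_id": [10, 10, 20, 20],
    "rating": [5.0, 4.0, 3.0, 4.0],
})

movie_mean = train.groupby("movie_id")["rating"].transform("mean")
train["rating_normalized"] = train["rating"] - movie_mean
print(train)
```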

Run the initialization query

This query initializes the latent vectors of the users and movies. The elements of the latent vectors are initialized by a normally distributed random number generator. The inputs to this query are the standard deviation and mean of the normal distribution.
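A Python sketch of this initialization, using the dataset sizes quoted above and example values for the mean and standard deviation:

```python
# Draw each latent-vector element from a normal distribution
# (mean and std are the query's two inputs; values here are examples).
import numpy as np

rng = np.random.default_rng(7)
n_users, n_movies, n_factors = 944, 1682, 19   # sizes from the article
mean, std = 0.0, 0.1

theta = rng.normal(mean, std, (n_users, n_factors))   # user latent vectors
x = rng.normal(mean, std, (n_movies, n_factors))      # movie latent vectors
print(theta.shape, x.shape)
```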

Run the training query

This query trains the recommendation model using gradient descent. By default, the number of features is set to 19; this must match the dimension of the latent vectors in the initialization query. The inputs are the learning rate, the regularization factor, and the number of training iterations. The query outputs the root mean square error (RMSE) after each iteration.
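The RMSE reported at each iteration is computed over the training rating edges, roughly like this sketch (toy latent vectors for illustration):

```python
# Root mean square error over observed rating edges (illustrative sketch).
import numpy as np

def rmse(theta, x, edges):
    # edges: (user index, movie index, observed rating) triples
    errors = [theta[u] @ x[m] - y for u, m, y in edges]
    return float(np.sqrt(np.mean(np.square(errors))))

theta = np.array([[0.5, 1.0]])            # one user latent vector
x = np.array([[1.0, 0.5], [2.0, 1.0]])    # two movie latent vectors
print(rmse(theta, x, [(0, 0, 1.5), (0, 1, 2.5)]))  # -> 0.5
```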

Once the latent vectors have been computed, we can test the model and use it to make recommendations.

Run the test query

This query outputs a user's actual ratings together with the model's predicted ratings. Its input is a user ID, and it outputs all of that user's ratings along with the corresponding predictions.

Run the recommend query

This query outputs the top-10 recommended movies for a user. Movies are ranked by their predicted ratings.
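Conceptually, the recommendation step ranks the movies the user has not yet rated by predicted rating, as in this sketch (random latent vectors stand in for the trained ones):

```python
# Rank unseen movies by predicted rating and keep the top 10 (illustrative).
import numpy as np

rng = np.random.default_rng(1)
theta_user = rng.normal(size=19)   # the user's latent vector
x = rng.normal(size=(1682, 19))    # all movie latent vectors
rated = {0, 5, 42}                 # movies the user has already rated

scores = x @ theta_user            # predicted rating for every movie
candidates = [m for m in np.argsort(-scores) if m not in rated]
print(candidates[:10])             # top-10 recommended movie indices
```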

Conclusion

Training machine learning models in graph databases has great potential for real-time updating of recommendation models. In this article, we introduced the mechanics of factorization and showed step by step how to build a movie recommendation system using the TigerGraph cloud service. Once you are familiar with this example, you should feel confident customizing a recommendation system for your own use case.

References

[1] Ahmed, Amr, et al. "Distributed large-scale natural graph factorization." Proceedings of the 22nd International Conference on World Wide Web (WWW 2013).
