The previous articles introduced matrix decomposition algorithms, the Spark big data processing framework, and Spark’s distributed matrix implementations and related operations. This article focuses on the most popular collaborative filtering algorithms and simulates collaborative filtering recommendation on Spark distributed matrices.


Introduction to recommendation systems

Personalized recommendation suggests information and goods that users are likely to be interested in, based on their interest characteristics and purchasing behavior. A personalized recommendation system is an advanced business intelligence platform built on massive data mining; it provides customers with fully personalized decision support and information services. Many mature algorithms have been developed for personalized recommendation, and some commonly used ones are listed below.


▲ Collaborative filtering recommendation algorithm

Collaborative filtering is one of the earliest and most successful techniques in recommendation systems. It generally uses nearest-neighbor methods: it calculates the distance between users from their historical preference information, predicts how much the target user will like a specific item from the weighted ratings of that user’s nearest neighbors, and then makes recommendations to the target user according to this predicted preference. Collaborative filtering can be divided into three types: user-based, item-based, and model-based recommendation.


▲ Recommendations based on association rules

Recommendation based on association rules takes purchased goods as the rule head and the recommended object as the rule body. Association rule mining finds, in a transaction database, what proportion of the transactions that purchase item set X also purchase item set Y.
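The proportion described above is usually called the confidence of the rule X → Y. As a minimal sketch in plain Python (the function name `confidence` and the purchase records are hypothetical, invented only for illustration), it could be computed like this:

```python
def confidence(transactions, x, y):
    """Fraction of the transactions containing all of x that also contain all of y."""
    x, y = set(x), set(y)
    with_x = [t for t in transactions if x <= set(t)]
    if not with_x:
        return 0.0  # rule head never occurs, so no evidence for the rule
    with_both = [t for t in with_x if y <= set(t)]
    return len(with_both) / len(with_x)


# Hypothetical purchase records, one set of goods per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "eggs"},
    {"milk"},
]

# Of the three transactions that contain bread, two also contain milk.
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

A rule with high confidence suggests recommending Y to users who have already purchased X.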


▲ Recommendations based on utility

Utility-based recommendation is computed from the utility that a user derives from an item. The core problem is how to create a utility function for each user; the user profile model is therefore largely determined by the utility function the system adopts.


▲ Recommendations based on knowledge

Knowledge-based recommendation can be viewed, to some extent, as an inference technique. Utility knowledge is knowledge about how an item satisfies a particular user, and it can therefore explain the relationship between a need and a recommendation. Consequently, the user profile can be any knowledge structure that supports reasoning: it can be a normalized query from the user, or a more detailed representation of the user’s needs.


▲ Combined recommendation


Because each recommendation method has its own advantages and disadvantages, combined recommendation is often adopted in practice. Combining content-based recommendation with collaborative filtering is an active topic in both industry and academia.


Collaborative filtering recommendation system

Collaborative filtering is one of the most widely used algorithms in recommendation systems. Its essence is to predict a user’s preference for an item by predicting the missing ratings in the user-item matrix. More specifically, collaborative filtering algorithms are mainly divided into memory-based CF and model-based CF, and memory-based CF in turn includes user-based CF and item-based CF.


1. User-based collaborative filtering algorithm (user-based CF)

The user-based collaborative filtering algorithm generates recommendations for a target user from the preference information of similar users. It is based on the assumption that if users give similar ratings to some items, they will also give similar ratings to other items. The recommendation system uses statistical methods to search for users similar to the target user and predicts the target user’s rating of a given item from the ratings that those similar users gave to it. Finally, the ratings of the top several most similar users are selected as the recommendation result and returned to the user.

This algorithm is simple and accurate, and it is widely used in existing collaborative filtering recommendation systems. The core of user-based collaborative filtering is to compute the nearest-neighbor set with a similarity measure and to return the neighbors’ ratings to the user as the predicted recommendation result.

For example, in the user-item rating matrix shown in the following table, rows represent users, columns represent movies, and each value is a user’s rating of a particular movie. Now we need to predict user Xiao Li’s rating of the movie “Ghost Blows Out the Light”.

As the table shows, Xiao Zhuang and Xiao Li rate movies similarly: Xiao Zhuang rates “Resident Evil”, “The Avengers”, and “Wolverine” 3, 4, and 4 respectively, while Xiao Li rates them 3, 5, and 4. Their similarity is the highest, so Xiao Zhuang is Xiao Li’s nearest neighbor; the two have similar taste in movies. Therefore, Xiao Zhuang’s rating of “Ghost Blows Out the Light” has the greatest influence on the predicted value.

In an actual prediction, the recommendation system searches only the top few nearest neighbors and predicts the target user’s rating of the item from the ratings given by these neighbors.
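To make the neighbor search concrete, here is a minimal plain-Python sketch. It assumes cosine similarity as the similarity measure (the text does not name one), reuses the two rating vectors quoted above for Zhuang and Xiao Li, and adds one hypothetical third user purely for contrast.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Ratings of the three commonly rated movies, as quoted in the text.
zhuang = [3, 4, 4]
xiao_li = [3, 5, 4]
# Hypothetical third user, invented only to have a dissimilar candidate.
other_user = [5, 1, 2]

candidates = {"Zhuang": zhuang, "other user": other_user}
# The nearest neighbor is the candidate most similar to the target user.
nearest = max(candidates, key=lambda name: cosine_similarity(xiao_li, candidates[name]))
```

With these numbers `nearest` is `"Zhuang"`, which matches the reasoning in the example: the neighbor with the closest rating pattern dominates the prediction.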

2. Item-based collaborative filtering algorithm (item-based CF)

Item-based collaborative filtering predicts the rating of a target item from the rating data of similar items. It is based on the assumption that if most users give similar ratings to certain items, the current user will also rate those items similarly. Unlike user-based collaborative filtering, the item-based algorithm examines the set of items the target user has already rated, calculates the similarity between each of these items and the target item, and then outputs the K items with the highest similarity.

Take the previous user-item rating matrix as an example and again predict user Xiao Li’s rating of the movie “Ghost Blows Out the Light”. The data shows that the ratings of “Resident Evil” and “Ghost Blows Out the Light” are very similar: the first three users rate “Resident Evil” 4, 3, and 2 respectively, and rate “Ghost Blows Out the Light” 4, 3, and 3. The similarity between the two movies is the highest, so “Resident Evil” is the nearest neighbor of “Ghost Blows Out the Light” and has the greatest influence on its predicted rating. In an actual prediction, the recommendation system searches only the top few nearest neighbors and predicts the target user’s rating of the item from the ratings of these neighbors.

Item-based collaborative filtering can be divided into two stages: nearest-neighbor query and recommendation generation. The nearest-neighbor query stage calculates the similarity between items and searches for the nearest neighbors of the target item. The recommendation generation stage predicts the rating of the target item from the ratings of its nearest neighbors, and finally generates the top-N recommendations.
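The two stages can be sketched in plain Python as follows. This is only an illustration under stated assumptions: cosine similarity is assumed as the measure, the user labels `u1` to `u3` and the third item are hypothetical, and only the ratings quoted in the text (4, 3, 2 and 4, 3, 3, plus Xiao Li’s 3 for “Resident Evil”) come from the article.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def predict_item_based(ratings, user, target, k=2):
    """ratings maps item -> {user: rating}; returns a predicted rating or None."""
    # Stage 1: nearest-neighbor query among items the user has already rated.
    scored = []
    for item, by_user in ratings.items():
        if item == target or user not in by_user:
            continue
        common = [u for u in by_user if u in ratings[target]]
        if not common:
            continue
        sim = cosine([by_user[u] for u in common],
                     [ratings[target][u] for u in common])
        scored.append((sim, item))
    neighbors = sorted(scored, reverse=True)[:k]
    # Stage 2: similarity-weighted average of the user's ratings of the neighbors.
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None
    return sum(sim * ratings[item][user] for sim, item in neighbors) / total


ratings = {
    "Resident Evil": {"u1": 4, "u2": 3, "u3": 2, "Xiao Li": 3},
    "Ghost Blows Out the Light": {"u1": 4, "u2": 3, "u3": 3},
    "hypothetical item": {"u1": 1, "u2": 5, "u3": 4, "Xiao Li": 5},
}
predicted = predict_item_based(ratings, "Xiao Li", "Ghost Blows Out the Light", k=1)
```

With `k=1` the only neighbor is “Resident Evil”, so the prediction equals Xiao Li’s rating of that movie. Repeating the prediction for every unrated item and sorting by predicted rating would give a top-N list.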

In reality, the user-item matrix is very large, while each user’s interests and spending power are limited. A single user generates rating records for only a very small number of items, so the user-item matrix contains a large number of empty values and the data is extremely sparse. Assuming that a user’s interest is affected by only a few factors, the sparse, high-dimensional user-item rating matrix can be decomposed into two low-dimensional matrices, which hold the users’ feature vectors and the items’ feature vectors respectively.

A user’s feature vector represents the user’s interests, an item’s feature vector represents the item’s characteristics, and the dimensions of the two vectors correspond one to one. The inner product of the two vectors represents the user’s preference for the item. Matrix decomposition is one of the key techniques in model-based collaborative filtering: it predicts users’ ratings of items by reconstructing the rating matrix from the low-dimensional user feature matrix U and item feature matrix V. Commonly used matrix decomposition algorithms for collaborative filtering include singular value decomposition, regularized matrix decomposition, and biased matrix decomposition.
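As an illustration of the inner-product idea, the following plain-Python sketch fits U and V to the observed ratings by stochastic gradient descent with L2 regularization. This is one common way to do regularized matrix decomposition, not necessarily the method the article has in mind; the function names, hyperparameters, and the small rating list are all assumptions made for the example.

```python
import random


def predict(U, V, user, item):
    """Predicted rating: inner product of the user and item feature vectors."""
    return sum(a * b for a, b in zip(U[user], V[item]))


def factorize(entries, n_users, n_items, rank=2, steps=5000,
              lr=0.01, reg=0.02, seed=0):
    """Fit U (n_users x rank) and V (n_items x rank) to (user, item, rating) entries."""
    rng = random.Random(seed)
    U = [[rng.uniform(0, 0.5) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(0, 0.5) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for user, item, rating in entries:
            err = rating - predict(U, V, user, item)
            for f in range(rank):
                uf, vf = U[user][f], V[item][f]
                # Gradient step on the squared error plus the L2 penalty.
                U[user][f] += lr * (err * vf - reg * uf)
                V[item][f] += lr * (err * uf - reg * vf)
    return U, V


# Hypothetical observed ratings; every other cell of the 3 x 3 matrix is missing.
entries = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]
U, V = factorize(entries, n_users=3, n_items=3)
missing_cell = predict(U, V, 0, 2)  # a cell that was never observed
```

Only the observed cells drive the updates, which is what lets the factor model fill in the missing ones.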


Collaborative filtering recommendation algorithm based on Spark Distributed Matrix

1. Example of implementing user-based collaborative filtering




1. Read the scoring data into the CoordinateMatrix.

// Read the file

// Format the data

// Create the coordinate matrix
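The data-loading step can be pictured with a plain-Python sketch that does not use Spark. It assumes each line of the file has the form `user,item,rating` with 1-based ids; that layout, the function name `parse_entries`, and the sample lines are assumptions. Each resulting `(row, column, value)` triple plays the role of one entry of a coordinate matrix.

```python
def parse_entries(lines):
    """Turn 'user,item,rating' lines into (row, col, value) coordinate entries."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, item, rating = line.split(",")
        # Shift the assumed 1-based ids to 0-based matrix indices.
        entries.append((int(user) - 1, int(item) - 1, float(rating)))
    return entries


# Hypothetical file contents standing in for the real rating file.
sample_lines = ["1,1,5.0", "1,2,3.0", "2,1,4.0", "2,3,1.0"]
entries = parse_entries(sample_lines)
```

Storing only the observed cells in this way suits a sparse rating matrix, since the missing ratings take no space.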


2. CoordinateMatrix is converted into RowMatrix to calculate the similarity of two users.

Because RowMatrix can only compute similarities between columns, while users are represented by rows, the CoordinateMatrix must be transposed first:

// Compute the transpose matrix and convert it to a row matrix

// find the similarity
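The effect of this step can be sketched without Spark in plain Python: after transposing, each column holds one user, and cosine similarity is computed for every pair of columns. The helper names and the sample entries are hypothetical; the sketch keeps only pairs `(a, b)` with `a < b`, so each pair of users appears once.

```python
import math
from collections import defaultdict


def transpose(entries):
    """Swap row and column indices of (row, col, value) entries."""
    return [(j, i, v) for i, j, v in entries]


def column_similarities(entries):
    """Cosine similarity for every column pair (a, b) with a < b."""
    columns = defaultdict(dict)
    for i, j, v in entries:
        columns[j][i] = v
    norms = {j: math.sqrt(sum(v * v for v in col.values()))
             for j, col in columns.items()}
    ids = sorted(columns)
    sims = {}
    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            dot = sum(v * columns[b][i]
                      for i, v in columns[a].items() if i in columns[b])
            if dot:
                sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims


# Hypothetical (user, item, rating) entries: rows are users, columns are items.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 1, 3.0), (2, 1, 1.0), (2, 2, 5.0)]
user_sims = column_similarities(transpose(ratings))
```

Here users 0 and 1 rate the same items almost identically, so `user_sims[(0, 1)]` is close to 1, while both are far from user 2.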


3. Suppose we need to predict user 1’s rating of item 1. The prediction is user 1’s average rating plus the similarity-weighted average of the other users’ ratings of item 1.

// Calculate the average score of user 1

// Calculate the weighted average score of other users for item 1

// Sum to output forecast results
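The prediction step might look like the plain-Python sketch below. Read literally, adding a weighted average of raw ratings to user 1’s average can push the result outside the rating scale, so this sketch assumes the common mean-centered form, in which each neighbor contributes the deviation of its rating from its own average. That centering, the function names, and all the numbers are assumptions made for illustration.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def predict_user_based(ratings, sims, user, item):
    """ratings: {user: {item: rating}}; sims: {other_user: similarity to `user`}."""
    base = mean(ratings[user].values())  # average rating of the target user
    num = den = 0.0
    for other, weight in sims.items():
        other_ratings = ratings.get(other, {})
        if item in other_ratings:
            # Assumed mean-centering: use the neighbor's deviation from its own mean.
            num += weight * (other_ratings[item] - mean(other_ratings.values()))
            den += abs(weight)
    return base if den == 0 else base + num / den


# Hypothetical data: user 1 has not rated item 1.
ratings = {
    1: {2: 4.0, 3: 2.0},
    2: {1: 5.0, 2: 4.0, 3: 3.0},
    3: {1: 2.0, 2: 3.0, 3: 1.0},
}
sims_to_user1 = {2: 0.9, 3: 0.6}  # hypothetical similarities from the previous step
prediction = predict_user_based(ratings, sims_to_user1, user=1, item=1)
```

User 1’s average is 3.0, user 2 rates item 1 one point above its own average, and user 3 rates it exactly at its average, so the prediction is 3.0 + 0.9 / 1.5 = 3.6.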


2. An example of implementing item-based collaborative filtering

Similar to the user-based collaborative filtering algorithm, the following example uses Spark to implement item-based collaborative filtering on the test data set:

// Read the score data

// Format the data

// Create the coordinate matrix

// Calculate Item similarity

// Calculate the average score of item 1

// Calculate the weighted average score of user 1 for other items

// Sum to output forecast results
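Because items are already the columns of the rating matrix, no transpose is needed in this case. The plain-Python sketch below follows the commented steps without Spark; as an assumption, it again uses cosine similarity and mean-centers each neighboring item’s rating so that the sum stays on the rating scale. The function names and the entries are hypothetical.

```python
import math
from collections import defaultdict


def item_columns(entries):
    """Group (user, item, rating) entries into item -> {user: rating}."""
    columns = defaultdict(dict)
    for user, item, rating in entries:
        columns[item][user] = rating
    return columns


def cosine(col_a, col_b):
    dot = sum(v * col_b[u] for u, v in col_a.items() if u in col_b)
    na = math.sqrt(sum(v * v for v in col_a.values()))
    nb = math.sqrt(sum(v * v for v in col_b.values()))
    return dot / (na * nb) if na and nb else 0.0


def predict_item_based(entries, user, item):
    columns = item_columns(entries)
    target = columns[item]
    item_mean = sum(target.values()) / len(target)  # average rating of the item
    num = den = 0.0
    for other, col in columns.items():
        if other == item or user not in col:
            continue
        weight = cosine(target, col)
        other_mean = sum(col.values()) / len(col)
        # Assumed mean-centering of the user's rating of the neighboring item.
        num += weight * (col[user] - other_mean)
        den += abs(weight)
    return item_mean if den == 0 else item_mean + num / den


# Hypothetical 0-based entries; user 0 has not rated item 0.
entries = [(0, 1, 4.0), (0, 2, 2.0), (1, 0, 5.0), (1, 1, 4.0),
           (1, 2, 3.0), (2, 0, 2.0), (2, 1, 3.0), (2, 2, 1.0)]
prediction = predict_item_based(entries, user=0, item=0)
```

The result starts from the item’s average rating of 3.5 and is nudged upward because user 0 rated a similar item slightly above that item’s average.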


3. Examples of implementing Model-based collaborative filtering

Singular value decomposition (SVD) is the simplest method of model-based collaborative filtering. The SVD operation that Spark provides for distributed matrices makes it easy to implement an SVD-based collaborative filtering algorithm, as shown in the following example:

1. First, the scoring data is read into CoordinateMatrix:

// Read file data

// Format the data

// Create the coordinate matrix


2. Convert the CoordinateMatrix to a RowMatrix and call computeSVD to calculate the rank-2 singular value decomposition of the rating matrix:


3. Assume that the score of user 1 on item 1 needs to be predicted:


4. Output prediction results:
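To show what steps 2 to 4 compute, here is a small plain-Python sketch that does not use Spark. It approximates the top two singular triplets with power iteration and deflation, a simple textbook method chosen only to keep the example self-contained, and then predicts a rating as the sum of sigma * u[user] * v[item] over the kept triplets. The function names and the rating matrix are hypothetical.

```python
import math


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def truncated_svd(A, k, iters=200):
    """Top-k singular triplets (sigma, u, v) by power iteration with deflation."""
    m, n = len(A), len(A[0])
    residual = [row[:] for row in A]
    triplets = []
    for _ in range(k):
        v = [float(j + 1) for j in range(n)]  # fixed, non-uniform start vector
        for _ in range(iters):
            w = matvec(transpose(residual), matvec(residual, v))
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        Av = matvec(residual, v)
        sigma = math.sqrt(sum(x * x for x in Av))
        if sigma == 0.0:
            break  # nothing left to decompose
        u = [x / sigma for x in Av]
        triplets.append((sigma, u, v))
        # Deflate: remove the component just found before the next round.
        residual = [[residual[i][j] - sigma * u[i] * v[j] for j in range(n)]
                    for i in range(m)]
    return triplets


def predict(triplets, user, item):
    """Entry (user, item) of the rank-k reconstruction."""
    return sum(sigma * u[user] * v[item] for sigma, u, v in triplets)


# Hypothetical dense rating matrix (rows: users, columns: items) of rank 2.
A = [[5.0, 5.0, 1.0, 1.0],
     [4.0, 4.0, 1.0, 1.0],
     [1.0, 1.0, 5.0, 5.0],
     [1.0, 1.0, 4.0, 4.0]]
triplets = truncated_svd(A, k=2)
prediction = predict(triplets, user=0, item=0)
```

Because this example matrix has rank exactly 2, the rank-2 reconstruction reproduces it and `prediction` is approximately 5.0. On real, sparse rating data the reconstruction is only an approximation, and that approximation is what supplies values for the missing ratings.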