Introduction

Like search engines before them, recommendation systems have changed the way users interact with websites, and they play an important role in increasing user engagement and diversifying the products users discover. Amazon reportedly attributes 35% of its revenue to recommendations, and 75% of what Netflix users watch comes from recommendations.

Recommendation systems are a very large topic. This article introduces a common model-based collaborative filtering algorithm, SVD (Singular Value Decomposition), and shows how to use it in Python.

Suppose we have m users and n items; each user's ratings of the items form an m-by-n two-dimensional matrix. Of course, many values in this matrix will be unknown, either because a user has never used a product, or because the user used it but did not rate it. As shown in the figure below

Blank spaces in the figure are unknown values. What we need to do next is predict those unknown values, that is, how each user would rate each item, based on the known values in this incomplete two-dimensional matrix.

Once the matrix is completed with predicted values, each row represents one user's scores for all items. We can then extract the highest-scoring items from a user's row and recommend them to that user. With that, we have a complete recommendation model.

Now, how do we predict the unknown values from the known ones? With matrix factorization, as shown here

The matrix in the middle above can be decomposed into the product of the two matrices beside it; this is singular value decomposition. Any matrix can be factored into such a product. For the mathematical principles of SVD, see this blog post; for how SVD is applied in recommendation systems, see this other post. Here we focus mainly on how to use it in Python.
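To make the idea concrete, here is a minimal NumPy sketch (not part of the original article's code) that decomposes a small ratings matrix with SVD and then keeps only the k largest singular values to get a low-rank approximation:

```python
import numpy as np

# A small ratings matrix: 4 users x 3 items (0 marks an unknown rating).
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [1.0, 1.0, 5.0],
    [0.0, 1.0, 4.0],
])

# Full SVD: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the k largest singular values for a rank-k approximation,
# i.e. the product of two small matrices as in the figure.
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(R_approx, 2))
```

The entries of `R_approx` that were 0 in `R` are, in spirit, the "predicted" ratings; the Surprise library used below does this more carefully by fitting only to the known entries.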

Initialization and modeling

First you need to install the Surprise library with this command

pip install scikit-surprise

Suppose we now have this data.

movielens_df: pd.DataFrame = load_movielens()
movielens_df.head(5)

        user_id     movie_title                                     rating
36649   User 742    Jerry Maguire (1996)                            4
2478    User 908    Usual Suspects, The (1995)                      3
82838   User 758    Real Genius (1985)                              4
69729   User 393    Things to Do in Denver when You're Dead (1995)  3
36560   User 66     Jerry Maguire (1996)                            4

That is, user-movie rating records. Import the required modules below

from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate, train_test_split

Let’s begin the formal modeling

Step 1: Initialize the reader, specifying a rating scale of 1 to 5

reader = Reader(rating_scale=(1, 5))

Step 2: Load the data. The DataFrame's columns must be in the order [user_id, product_id, rating]

data = Dataset.load_from_df(movielens_df, reader)

Note here: The data variable is no longer a DataFrame type, but a data type in the Surprise library. As can be seen from the movielens_df result above, our data is not in the matrix form originally mentioned in this paper, so this step of transformation will transform the data into the form required by the Surprise library, which is convenient for the subsequent algorithm to solve.

Step 3: Split the training set and the test set, with 75% of the samples as the training set and 25% as the test set

trainset, testset = train_test_split(data, test_size=.25)

Here the type of trainset is surprise.trainset.Trainset, and we can inspect basic information about the data

trainset.n_users # 943
trainset.n_items # 596

This means the sample we are going to train on contains 943 users and 596 items.

Step 4: Training model, specify 100 hidden features, use training set for training

model = SVD(n_factors=100)
model.fit(trainset)

To explain: 100 hidden features (latent factors) means the original 943*596 matrix will be factored into the product of a 943*100 matrix and a 100*596 matrix. The value of n_factors can be specified freely as long as it is no more than 596. However, different values fit different models, so the optimal value should be chosen by evaluation.

We can also look at the split matrices

model.pu.shape # (943, 100)
model.qi.shape # (596, 100)

Make recommendations based on model results

Predict a user’s rating of a movie

Specify the user name and movie name

a_user = "User 196"
a_product = "Toy Story (1995)"
model.predict(a_user, a_product)

# Prediction(uid='User 196', iid='Toy Story (1995)', r_ui=None, est=3.93380711688207, details={'was_impossible': False})

Correlation between movies

Here we need to write two helper functions, get_vector_by_movie_title and cosine_distance.

Then we can enter two movie names and get the correlation between them

toy_story_vec = get_vector_by_movie_title('Toy Story (1995)', model)
wizard_of_oz_vec = get_vector_by_movie_title('Wizard of Oz, The (1939)', model)

similarity_score = cosine_distance(toy_story_vec, wizard_of_oz_vec)
similarity_score
# 0.9461284008856982

This is a movie similarity calculated without taking into account the director and other characteristics of the movie, because we only used rating data.

Find the movie that most closely resembles a movie

First, we need to implement a get_top_similarities function that returns the five most similar movies. The final effect is as follows

get_top_similarities('Star Wars (1977)')

0              Star Wars (1977)
1    0.262668  Empire Strikes Back, The (1980)
2    0.295667  Return of the Jedi (1983)
3    0.435423  Raiders of the Lost Ark (1981)

Resources

1. Video: Daniel Pyrathon – A Practical Guide to Singular Value Decomposition in Python, PyCon 2018

2. Surprise documentation

3. Code accompanying the video