The recommendation problem is a typical big-data application: from known user behavior such as browsing history and ratings, the system infers each user's interests and makes personalized recommendations.

The dataset

User ratings of movies (1-5), with 10,000 users and 10,000 movies.

  • User ID list (Users): 10000 rows, one column, giving the user IDs
  • Movie list (Movie_titles): 17770 rows, 3 columns, in the format movie ID, year, title
  • Rating data (Netflix_data): 8.61 million lines, one rating per line, in the format user ID, movie ID, score, rating date, separated by spaces; every user ID appears in users.txt, and movie IDs are integers from 1 to 10000

80% of the rating data goes to netflix_train (6.89 million lines) and the remaining 20% to netflix_test (1.72 million lines).

Data processing

The experiment uses the full data. The input files are organized into a user × movie matrix X, where X_ij is user i's rating of movie j. Entries with unknown scores are filled with zero.

  1. Build the mapping between user ID and row index i from users.txt:

```matlab
data_user = importdata('users.txt');
user_map = containers.Map('KeyType','double','ValueType','double');
for i = 1:numel(data_user)
    user_map(data_user(i)) = i;
end
```
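The same ID-to-index mapping can be sketched in Python for illustration (the sample IDs below are made up, not taken from users.txt):

```python
def build_user_map(user_ids):
    """Map each raw user ID to a 1-based row index,
    mirroring user_map(data_user(i)) = i in the MATLAB step."""
    return {uid: i + 1 for i, uid in enumerate(user_ids)}

user_ids = [305344, 387418, 2439493]   # hypothetical sample IDs
user_map = build_user_map(user_ids)
print(user_map[387418])  # -> 2
```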
  2. Read the Netflix data

Read netflix_train.txt and organize it into a user × movie matrix X_train:

```matlab
X_train = zeros(10000,10000);
[u,m,f,d] = textread('netflix_train.txt','%d %d %d %s');
for i = 1:numel(u)
    X_train(user_map(u(i)),m(i)) = f(i);
end
```

Read netflix_test.txt and organize it into a user × movie matrix X_test:

```matlab
X_test = zeros(10000,10000);
[u,m,f,~] = textread('netflix_test.txt','%d %d %d %s');
for i = 1:numel(u)
    X_test(user_map(u(i)),m(i)) = f(i);
end
```
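The fill-the-matrix logic above, sketched in Python/NumPy with a tiny made-up triplet list (the real files hold millions of lines; the helper name is my own):

```python
import numpy as np

def fill_rating_matrix(triplets, user_map, n_users, n_movies):
    """Place each (user_id, movie_id, rating) triplet into a dense
    users x movies matrix; unrated entries stay 0, as in the article."""
    X = np.zeros((n_users, n_movies))
    for uid, mid, rating in triplets:
        X[user_map[uid] - 1, mid - 1] = rating  # MATLAB is 1-based, NumPy 0-based
    return X

user_map = {305344: 1, 387418: 2}            # hypothetical IDs
triplets = [(305344, 3, 4), (387418, 1, 5)]  # (user ID, movie ID, score)
X = fill_rating_matrix(triplets, user_map, 2, 3)
print(X[0, 2], X[1, 0])  # -> 4.0 5.0
```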
  3. Convert to sparse matrices

Since X is very sparse in this experiment, sparse matrix operations are used to avoid for loops. In MATLAB, spones() produces an indicator matrix that records which entries each user has rated.

S = sparse(X) converts matrix X to sparse form: zero elements of X are dropped, and the non-zero elements together with their subscripts (indices) form S. If X is already sparse, sparse(X) returns S unchanged.

For example:

```matlab
>> a = [1,0,2; 0,0,1; 0,0,6];
>> a
a =
     1     0     2
     0     0     1
     0     0     6
>> b = sparse(a)
b =
   (1,1)        1
   (1,3)        2
   (2,3)        1
   (3,3)        6
```

Experimental code: R = spones(S) generates a matrix R with the same sparsity structure as S, but with 1 in every non-zero position.

```matlab
X_test = sparse(X_test);
X_train = sparse(X_train);
% indicator matrices
A_test = spones(X_test);
A_train = spones(X_train);
```
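A NumPy sketch of the spones() idea: build an indicator matrix with 1 wherever a rating exists. (For truly sparse storage, scipy.sparse.csr_matrix would be the analogue of MATLAB's sparse(); the dense version below just shows the indicator logic.)

```python
import numpy as np

# same example matrix as above
X = np.array([[1., 0., 2.],
              [0., 0., 1.],
              [0., 0., 6.]])
A = (X != 0).astype(float)   # like spones(): 1 at every rated position
print(int(A.sum()))          # -> 4 non-zero (rated) entries
```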
  4. Save the data

Save the variables X_train, X_test, A_train and A_test to the file X_matrix_sparse.mat:

```matlab
save('X_matrix_sparse','X_train','X_test','A_test','A_train');
clear u m f d i data_user user_map;
```

Collaborative filtering algorithm

Collaborative filtering is one of the most classical recommendation algorithms, and comes in user-based and item-based variants. This article implements user-based collaborative filtering. The idea is simple: to judge whether user i likes movie j, we look at users similar to i, check whether they like movie j, and take a weighted average of their scores according to the similarity. Expressed as a formula, where the average runs over users u who have rated movie j:

$$\hat{X}_{ij} = \frac{\sum_{u} \mathrm{sim}(i,u)\, X_{uj}}{\sum_{u:\, X_{uj} \ne 0} \mathrm{sim}(i,u)}$$

  1. Construct the similarity matrix

MATLAB has built-in support for these distance computations: pdist with the 'euclidean' metric gives the Euclidean distance, and 'seuclidean' gives the standardized Euclidean distance.

The cosine distance is obtained with the pdist() function; cos_sim = 1 - pdist(...) gives the similarities, which are then unpacked into the similarity matrix X_sim, where X_sim(i,j) is the similarity between user i and user j. Code:

```matlab
load('X_matrix_sparse.mat');
N = 10000;  % number of users
% Calculate cosine similarity matrix
cos_sim = 1 - pdist(X_test,'cosine');

X_sim = zeros(N,N);
cursor = 1;
for i = 1:N
    X_sim(i,i) = 0;
    for j = i+1:N
        X_sim(i,j) = cos_sim(cursor);
        X_sim(j,i) = cos_sim(cursor);
        cursor = cursor + 1;
    end
end
```
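The similarity step can be sketched in Python/NumPy; this reproduces 1 - pdist(X,'cosine') plus the symmetrization loop in one shot (the zero-row guard is my addition, an edge case the MATLAB code leaves to pdist):

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between rows of X, with the diagonal
    zeroed as in the MATLAB loop (X_sim(i,i) = 0)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty rows
    Xn = X / norms
    S = Xn @ Xn.T
    np.fill_diagonal(S, 0.0)
    return S

X = np.array([[1., 0., 1.],
              [1., 0., 1.],
              [0., 1., 0.]])
S = cosine_similarity_matrix(X)
print(round(S[0, 1], 6), S[0, 2])  # identical rows -> 1.0; disjoint rows -> 0.0
```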
  2. Scoring result matrix

Collaborative filtering is used to obtain the predicted scores: the similarity matrix X_sim is multiplied by the test set X_test and divided by the row sums of X_sim:

X_score = (X_sim*X_test)./sum(X_sim,2);

For the weighted average in this step, since many users' scores for a movie are unknown, only users who have actually rated movie j are counted in the average. Modified formula:

```matlab
X_score = (X_sim*X_test)./(X_sim*A_test);
X_score = X_score.*(1-A_test) + X_test;   % keep known ratings, fill only unknown ones
X_score(isnan(X_score)) = 0;
```
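A compact NumPy sketch of the same prediction rule, on a hypothetical 2×2 example (the helper name cf_predict is my own):

```python
import numpy as np

def cf_predict(X, S):
    """User-based CF: X_score = (S @ X) / (S @ A), averaging only over
    users who rated each movie, then keeping known ratings unchanged."""
    A = (X != 0).astype(float)
    num = S @ X
    den = S @ A
    score = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return score * (1 - A) + X      # fill only the unknown cells

X = np.array([[5., 0.],
              [4., 2.]])
S = np.array([[0., 1.],             # user 0 and user 1 are similar
              [1., 0.]])
pred = cf_predict(X, S)
print(pred[0, 1])  # unknown cell taken from the similar user's rating -> 2.0
```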

View the prediction results (first 10 rows, first 10 columns):

RMSE (Root Mean Square Error) is adopted as the evaluation metric, computed as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathrm{Test}|}\sum_{(i,j)\in \mathrm{Test}} \left(X_{ij} - \tilde{X}_{ij}\right)^2}$$

where Test is the set of all test samples, $X_{ij}$ is the predicted value and $\tilde{X}_{ij}$ is the actual value. Only rated records enter the computation; here the RMSE is computed over the records in Train:

```matlab
n = sum(sum(A_train==1));
RMSE = sqrt(sum(sum((X_score.*A_train-X_train).^2))/n);
```
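The masked-RMSE computation in Python/NumPy, with a tiny made-up example (the predicted 9 lies outside the mask and is ignored, just as X_score.*A_train discards unrated cells):

```python
import numpy as np

def masked_rmse(X_pred, X_true, mask):
    """RMSE over rated entries only, matching
    sqrt(sum((X_score.*A - X).^2)/n) with n = number of rated cells."""
    n = mask.sum()
    err = (X_pred * mask - X_true) ** 2
    return np.sqrt(err.sum() / n)

X_true = np.array([[5., 0.], [4., 2.]])
mask = (X_true != 0).astype(float)
X_pred = np.array([[4., 9.], [4., 2.]])        # the 9 is ignored by the mask
print(round(masked_rmse(X_pred, X_true, mask), 4))  # -> 0.5774, i.e. sqrt(1/3)
```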

Output: RMSE= 1.0225

Matrix decomposition algorithm based on gradient descent

  (1) Fixed k and λ. For k = 50 and λ = 0.01, plot how the objective value and the RMSE on the test set change during the iterations, report the final RMSE, and briefly analyze the results. The rating matrix X is decomposed into the product of two matrices U and V so that the product approximates X at the known entries, i.e. $X_{m\times n} \approx U_{m\times k} V_{n\times k}^{T}$. The product matrix can then predict the unknown part of X. Gradient descent is used to solve the optimization. The objective function is:

$$J(U,V) = \frac{1}{2}\left\|A \odot (X - UV^T)\right\|_F^2 + \lambda\left(\|U\|_F^2 + \|V\|_F^2\right)$$

where A is the indicator matrix of observed entries and ⊙ denotes element-wise multiplication.

The algorithm reaches the optimal solution when the objective attains its minimum. Taking partial derivatives with respect to U and V gives:

$$\frac{\partial J}{\partial U} = \left(A \odot (UV^T - X)\right)V + 2\lambda U,\qquad \frac{\partial J}{\partial V} = \left(A \odot (UV^T - X)\right)^{T}U + 2\lambda V$$

```matlab
% U, V are assumed to be initialized beforehand, e.g. U = rand(10000,k); V = rand(10000,k);
i = 1;
test_len = sum(sum(A_test ~= 0));
times = 300;  % maximum number of iterations
RMSE = zeros(times,1);
J = zeros(times,1);
time_toc = zeros(times,1);
J(i) = 0.5*norm(A_train.*(X_train - U*V'),'fro')^2 + lambda*(norm(U,'fro')^2 + norm(V,'fro')^2);
RMSE(i) = sqrt(sum(sum((U*V'.*A_test - X_test).^2))/test_len);
while i < times
    tic
    i = i + 1;
    j = 0.5*norm(A_train.*(X_train - U*V'),'fro')^2 + lambda*(norm(U,'fro')^2 + norm(V,'fro')^2);
    J(i) = j;
    if j > J(i-1)
        break;
    end
    U = U - alpha*( A_train.*(U*V'-X_train)*V + 2*lambda*U );
    V = V - alpha*( (A_train.*(U*V'-X_train))'*U + 2*lambda*V );
    RMSE(i) = sqrt(sum(sum((U*V'.*A_test - X_test).^2))/test_len);
    time_toc(i) = time_toc(i-1) + toc;
end
```
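The same gradient updates in Python/NumPy on a toy matrix, as a sketch only: the hyperparameters, random initialization, and simultaneous-update variant (both gradients evaluated at the same U, V) are my assumptions, not the article's settings.

```python
import numpy as np

def factorize(X, k=2, alpha=0.05, lam=0.01, iters=2000, seed=0):
    """Gradient-descent factorization X ~ U V^T over observed entries,
    using the same gradients as the MATLAB loop above."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    A = (X != 0).astype(float)            # observed-entry indicator
    U = 0.1 * rng.random((m, k))          # small random init (illustrative)
    V = 0.1 * rng.random((n, k))
    for _ in range(iters):
        E = A * (U @ V.T - X)             # residual on observed entries only
        U, V = (U - alpha * (E @ V + 2 * lam * U),
                V - alpha * (E.T @ U + 2 * lam * V))
    return U, V

X = np.array([[5., 3., 0.],
              [4., 0., 1.],
              [1., 1., 5.]])
U, V = factorize(X)
err = np.abs(U @ V.T - X)[X != 0]
print("max reconstruction error on known cells:", float(err.max()))
```

The zero cells of X are never penalized, so U @ V.T is free to fill them with predicted scores, which is exactly the point of the decomposition.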

Save the prediction results:

```matlab
i = i - 1;
X_score = U*V';
iterations = i - 1;
save('X_md.mat','k','alpha','lambda','X_score','RMSE','J','iterations');

Results: minimum RMSE = 0.91865, J = 2.8920e+06, reached after 137 iterations in 727.826 seconds.

The difference between the actual and predicted data for the top 100 movies is shown in the chart.

Taking the intersection of the RMSE and time curves, 40 iterations was chosen as the termination condition of the algorithm.

  (2) Tuning k and λ. Adjust the values of k (e.g. 20, 50) and λ (e.g. 0.001, 0.1), compare the resulting final RMSE, and select the best parameter combination.

Comparison

Compared with the matrix decomposition algorithm, collaborative filtering has a larger RMSE but a smaller runtime, and when new user or movie information arrives it needs only a relatively small amount of recomputation. The matrix decomposition algorithm maps the high-dimensional user-item rating matrix onto two lower-dimensional matrices of users and items. This reduces the dimensionality, yields the users' preference features and the movies' feature matrices, alleviates data sparsity, and makes it convenient to add other factors into the user and item feature vectors.

Both algorithms suffer from the cold-start problem: they cannot predict scores for new users and therefore cannot recommend movies to them.

Data and code resources: they can be downloaded and run from my resources.