“This is the first day of my participation in the Gwen Challenge in November. See details of the event: The last Gwen Challenge in 2021”.

I. Introduction to recommendation system

1.1 Concepts and Background

  • What is a recommendation system

    Users often arrive at a service without a clear need, and the sheer number of items in the service overwhelms them with information. A recommendation system ranks the items according to certain rules and shows the top-ranked items to the user.

  • Information overload & unclear user needs

    • Directory (portal) sites: cover only a small number of popular sites. Typical applications: Hao123, Yahoo
    • Search engines: identify the user's need from search terms. Typical applications: Google, Baidu
    • Recommendation systems: do not require users to state a clear need; they model users' interests by analyzing historical behavior and proactively recommend information that matches those interests and needs
  • Recommendation systems & search engines

                          Search                   Recommendation
    Behavior              Active                   Passive
    Intent                Clear                    Fuzzy
    Personalization       Weak                     Strong
    Traffic distribution  Matthew effect           Long tail
    Goal                  Satisfy a need quickly   Continuous service
    Evaluation metrics    Simple                   Complex

1.2 The working principle and function of the recommendation system

  • How the recommender system works

    • Social recommendation
    • Content-based recommendations
    • Recommendations based on popularity
    • Collaborative filtering based recommendations: Find users with similar historical interests
  • The role of the recommendation system

    • Connect users and items efficiently
    • Increase user stay time and user activity
    • Effectively help products to achieve their commercial value

1.3 Differences between recommendation systems and Web projects

  • Improving metrics through information filtering vs. providing a stable information service

    • Web projects: handle complex business logic and high concurrency, and build a stable information service for users
    • Recommendation systems: pursue growth in metrics such as retention, reading time, Gross Merchandise Volume (GMV) and video views
  • Deterministic vs. probabilistic thinking

    • Web projects: have definite expectations of results
    • Recommendation systems: results are a matter of probability

II. Recommendation system design

2.1 Recommendation system elements

  • UI and UE (user experience)
  • Data (Lambda architecture)
  • Business knowledge
  • Algorithms

2.2 Recommendation system architecture

  • Lambda architecture for Big data

    • Lambda architecture is a real-time big data processing architecture proposed by Nathan Marz, the author of the real-time stream processing framework Storm.

    • Lambda architecture integrates offline (batch) computing and real-time computing to meet the key requirements of real-time big data systems, including high fault tolerance, low latency and scalability.

    • Layered architecture

      • Batch layer
        • Data is immutable, supports arbitrary computation, and scales horizontally
        • High latency, from several minutes to several hours (depending on the computation and the data volume)
        • Log collection: Flume
        • Distributed storage: Hadoop
        • Distributed computing: Hadoop and Spark
        • View storage database
          • NoSQL (HBase/Cassandra)
          • Redis/memcache
          • MySQL
      • Speed layer (real-time processing)
        • Stream processing, continuous computation
        • Stores and analyzes data within a time window (top sales over a period, real-time hot searches, etc.)
        • Real-time data collection: Flume & Kafka
        • Real-time data analysis: Spark Streaming/Storm/Flink
      • Serving layer
        • Supports random reads
        • Must return results in a very short time
        • Reads and merges the results of the batch layer and the speed layer
  • Recommendation algorithm Architecture

    • Recall stage (coarse selection)

      • Recall determines the ceiling of the final recommendation results
      • Commonly used algorithms
        • Collaborative filtering
        • Content-based
    • Ranking stage (fine selection)

      • Recall determines the ceiling of the final recommendation results, while ranking approaches that ceiling and determines the final recommendation quality
      • CTR (click-through rate) estimation, commonly done with logistic regression (LR), estimates whether a user will click an item and requires user click data
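
The two-stage flow above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the function names `recall` and `rank`, the toy `ITEM_SIMILAR` table and the `CTR_SCORE` values (standing in for the output of a trained CTR model) are all hypothetical and do not come from any real system.

```python
# Hypothetical item-to-item similarity table used by the recall stage.
ITEM_SIMILAR = {
    "A": ["C", "D"],
    "B": ["E", "D"],
    "C": ["A", "E"],
}

# Hypothetical stand-in for a trained CTR model: item -> estimated click probability.
CTR_SCORE = {"A": 0.10, "B": 0.05, "C": 0.30, "D": 0.20, "E": 0.25}


def recall(history):
    """Recall stage: collect candidates similar to items in the user's history."""
    candidates = set()
    for item in history:
        candidates.update(ITEM_SIMILAR.get(item, []))
    # Items the user has already interacted with are not recommended again.
    return candidates - set(history)


def rank(candidates, k):
    """Ranking stage: order the recalled candidates by estimated CTR, keep the top k."""
    return sorted(candidates, key=lambda item: CTR_SCORE[item], reverse=True)[:k]


candidates = recall(["A", "B"])
top_items = rank(candidates, k=2)
print(top_items)
```

An item that the recall stage never returns cannot appear in the ranked output, which is why recall sets the ceiling while ranking decides the final order.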

III. Recommendation algorithm

3.1 Recommendation model construction process

Data ->Features ->ML Algorithm ->Prediction Output

  • Data cleaning/data processing

    • Data sources
      • Explicit data
        • Ratings
        • Comments/reviews
      • Implicit data
        • Order history
        • Add-to-cart events
        • Page views
        • Clicks
        • Search records
    • Data volume / whether the data meets the requirements
  • Feature engineering

    • Select features from the data

      • A given item may be purchased by users with similar tastes or needs

      • Use user behavior data to describe goods

    • Represent features with data

      • Combine all user actions together to form a user-item matrix
  • Choose the right algorithm

    • Collaborative filtering
    • Based on the content
  • Generate recommended results

    • Evaluate the recommendation results, and go online after passing the evaluation
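
The "represent features with data" step above can be sketched as follows. This is a minimal illustration, assuming a hypothetical behavior log of `(user, item, action)` tuples; the log contents and the helper name `build_user_item_matrix` are made up for the example.

```python
# Hypothetical behavior log: (user, item, action) tuples.
LOGS = [
    ("User1", "Item A", "click"),
    ("User1", "Item A", "order"),
    ("User1", "Item C", "add_to_cart"),
    ("User2", "Item A", "page_view"),
    ("User2", "Item D", "order"),
]


def build_user_item_matrix(logs):
    """Merge all actions into a 0/1 user-item matrix stored as nested dicts."""
    users = sorted({user for user, _, _ in logs})
    items = sorted({item for _, item, _ in logs})
    matrix = {user: {item: 0 for item in items} for user in users}
    for user, item, _action in logs:
        # Any recorded action counts as an interaction in this simple sketch.
        matrix[user][item] = 1
    return matrix


matrix = build_user_item_matrix(LOGS)
for user, row in matrix.items():
    print(user, row)
```

Treating every action as equal is a simplification; a real system would likely weight orders, clicks and page views differently.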

3.2 The most classical recommendation algorithm: collaborative filtering recommendation algorithm

Collaborative Filtering

Core idea: birds of a feather flock together

The basic collaborative filtering recommendation algorithm is based on the following assumptions:

  • “You are likely to like what people similar to you like”: user-based collaborative filtering (User-based CF)
  • “You are likely to like items similar to the ones you already like”: item-based collaborative filtering (Item-based CF)

There are several steps to implement collaborative filtering recommendations:

  1. Find the most similar users or items (the top-N similar users or items)

    Compute the pairwise similarities and sort by them to find the top-N similar users or items

  2. Generate recommendations based on the similar users or items

    Use the top-N results to generate initial recommendations, then filter out items that the user already has a record of or has explicitly marked as uninteresting

As a simple example, take a data set of users' purchase records, in which a tick indicates that the user has purchased the item

  • For the similarity calculation, start with a simple idea: suppose classmate X's hobbies are soccer, basketball and table tennis, and classmate Y's hobbies are tennis, soccer, basketball and badminton. They share 2 hobbies, so their similarity can be expressed as 2/3 * 2/4 = 1/3 ≈ 0.33.
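
The hobby example can be checked with a short sketch. The function name `overlap_similarity` is hypothetical; it simply multiplies the share of common hobbies in each classmate's list, as in the calculation above.

```python
def overlap_similarity(a, b):
    """Share of common elements in a, multiplied by the share of common elements in b."""
    set_a, set_b = set(a), set(b)
    common = len(set_a & set_b)
    return (common / len(set_a)) * (common / len(set_b))


x = ["soccer", "basketball", "table tennis"]
y = ["tennis", "soccer", "basketball", "badminton"]
print(round(overlap_similarity(x, y), 2))
```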

3.3 Similarity calculation

  • The calculation method of similarity

    • Euclidean distance is a way of measuring distance in Euclidean space. If two objects are represented as two points p and q in the same n-dimensional space, the Euclidean distance measures the distance between these two points. Euclidean distance is not suitable for Boolean vectors

    The Euclidean distance is non-negative and has no upper bound. Since a similarity value is usually expected to lie in [-1,1] or [0,1], the distance is generally transformed before use

    A commonly used transformation is sim(p, q) = 1 / (1 + d(p, q)), where d(p, q) is the Euclidean distance; it maps the distance into (0,1]

    • Cosine similarity
    • It measures the angle between two vectors and uses the cosine of that angle as the similarity
      • If the angle between the two vectors is 0, the cosine is 1; if the angle is 90 degrees, the cosine is 0; and if the angle is 180 degrees, the cosine is -1
      • Cosine similarity is commonly used to measure text similarity, user similarity and item similarity
      • Cosine similarity is independent of vector length, because the calculation normalizes by the length of each vector. As long as two vectors point in the same direction, they are regarded as similar regardless of their magnitude.
  • Pearson correlation coefficient

    • It is essentially cosine similarity applied to centered vectors: subtract the mean of each of the vectors a and b from its components, then compute the cosine similarity
    • The Pearson correlation coefficient lies in [-1,1]; -1 means negative correlation and 1 means positive correlation
    • It measures whether two variables increase and decrease together
    • Because it measures whether the trends of two variables are consistent, it is not suitable for calculating the correlation between Boolean vectors
  • Jaccard similarity

    • The ratio of the number of elements in the intersection of two sets to the number of elements in their union; it is well suited to Boolean vector representations

    • The numerator is the dot product of the two Boolean vectors, which gives the number of elements in the intersection

    • The denominator is the element-wise OR of the two Boolean vectors, summed over all elements

  • How to choose a similarity measure

    • Cosine similarity and the Pearson correlation coefficient suit user rating data (real values)
    • Jaccard similarity suits implicit feedback data (0/1 Boolean values such as bookmark, click, add to cart)
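
The measures above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names and sample vectors are made up for the example, and the `1 / (1 + d)` mapping for Euclidean distance is one common choice that is assumed here. Zero-length vectors and empty unions are not handled.

```python
import math


def euclidean_similarity(p, q):
    """Map Euclidean distance d into (0, 1] via 1 / (1 + d) (assumed transformation)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return 1 / (1 + d)


def cosine_similarity(p, q):
    """Cosine of the angle between p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm


def pearson(p, q):
    """Cosine similarity of the mean-centered vectors."""
    mean_p = sum(p) / len(p)
    mean_q = sum(q) / len(q)
    return cosine_similarity([a - mean_p for a in p], [b - mean_q for b in q])


def jaccard(p, q):
    """Intersection over union of two 0/1 vectors."""
    intersection = sum(1 for a, b in zip(p, q) if a and b)
    union = sum(1 for a, b in zip(p, q) if a or b)
    return intersection / union


# Illustrative real-valued ratings: Pearson (or cosine) is a reasonable fit.
print(round(pearson([5, 3, 4, 4], [3, 1, 2, 3]), 4))
# Illustrative 0/1 purchase vectors: Jaccard is a reasonable fit.
print(jaccard([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```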

3.4 Code implementation of collaborative filtering recommendation algorithm

  • Importing tool Packages

    import pandas as pd
    import numpy as np
  • Building a data set

    users = ["User1", "User2", "User3", "User4", "User5"]
    items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
    # Build the dataset: "buy" marks a purchase, None means no purchase record
    datasets = [
        ["buy", None, "buy", "buy", None],
        ["buy", None, None, "buy", "buy"],
        ["buy", None, "buy", None, None],
        [None, "buy", None, "buy", "buy"],
        ["buy", "buy", "buy", None, "buy"]]
  • Before calculation, the data usually needs to be processed or encoded so that it is easier to work with. In this simple case we use 1 and 0 to indicate whether the user has bought the item, so the data set actually looks like this:

    import pandas as pd

    users = ["User1", "User2", "User3", "User4", "User5"]
    items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
    # User purchase record data set: 1 = purchased, 0 = not purchased
    datasets = [
        [1, 0, 1, 1, 0],
        [1, 0, 0, 1, 1],
        [1, 0, 1, 0, 0],
        [0, 1, 0, 1, 1],
        [1, 1, 1, 0, 1]]

    df = pd.DataFrame(datasets,
                      columns=items,
                      index=users)
    print(df)
  • With the data set in place, we can calculate similarity. There are many similarity measures to choose from, such as cosine similarity, the Pearson correlation coefficient and Jaccard similarity. Here we use the Jaccard similarity coefficient, which lies in [0,1]

    # jaccard_score replaces jaccard_similarity_score, which has been removed from recent scikit-learn releases
    from sklearn.metrics import jaccard_score
    # Directly calculate the Jaccard similarity coefficient of two items
    # Calculate the similarity between Item A and Item B
    print(jaccard_score(df["Item A"], df["Item B"]))

    # Calculate the Jaccard similarity coefficient for all pairs
    from sklearn.metrics.pairwise import pairwise_distances
    # Calculate similarity between users: 1 - Jaccard distance = Jaccard similarity
    # The Jaccard metric expects Boolean data, so convert the 0/1 values explicitly
    user_similar = 1 - pairwise_distances(df.values.astype(bool), metric="jaccard")
    user_similar = pd.DataFrame(user_similar, columns=users, index=users)
    print("Pairwise similarity between users:")
    print(user_similar)

    # Calculate the similarity between items
    item_similar = 1 - pairwise_distances(df.T.values.astype(bool), metric="jaccard")
    item_similar = pd.DataFrame(item_similar, columns=items, index=items)
    print("Pairwise similarity between items:")
    print(item_similar)

With pairwise similarity, you can then filter top-N similarity results and make recommendations

  • User-Based CF

    import pandas as pd
    import numpy as np
    from pprint import pprint
    from sklearn.metrics.pairwise import pairwise_distances

    users = ["User1", "User2", "User3", "User4", "User5"]
    items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
    # User purchase record data set
    datasets = [
        [1, 0, 1, 1, 0],
        [1, 0, 0, 1, 1],
        [1, 0, 1, 0, 0],
        [0, 1, 0, 1, 1],
        [1, 1, 1, 0, 1],
    ]

    df = pd.DataFrame(datasets,
                      columns=items,
                      index=users)

    # Calculate the Jaccard similarity coefficient for all pairs of users
    # 1 - Jaccard distance = Jaccard similarity
    user_similar = 1 - pairwise_distances(df.values.astype(bool), metric="jaccard")
    user_similar = pd.DataFrame(user_similar, columns=users, index=users)
    print("Pairwise similarity between users:")
    print(user_similar)

    topN_users = {}
    # Iterate over each row of the similarity matrix
    for i in user_similar.index:
        # Take the row, drop the user itself, then sort
        _df = user_similar.loc[i].drop([i])
        # Sort by descending similarity
        _df_sorted = _df.sort_values(ascending=False)
        # Slice the first two (the two most similar) from the sorted results
        top2 = list(_df_sorted.index[:2])
        topN_users[i] = top2

    print("Top2 similar users:")
    pprint(topN_users)

    # Prepare an empty dict to store the recommendations
    rs_results = {}
    # Iterate over every user and their most similar users
    for user, sim_users in topN_users.items():
        rs_result = set()    # Recommendation results for this user
        for sim_user in sim_users:
            # Build the initial recommendations from what similar users bought
            # (.loc replaces .ix, which has been removed from pandas)
            rs_result = rs_result.union(set(df.loc[sim_user].replace(0, np.nan).dropna().index))
        # Filter out items that the user has already purchased
        rs_result -= set(df.loc[user].replace(0, np.nan).dropna().index)
        rs_results[user] = rs_result
    print("Final recommendation:")
    pprint(rs_results)
  • Item-Based CF

    import pandas as pd
    import numpy as np
    from pprint import pprint
    from sklearn.metrics.pairwise import pairwise_distances

    users = ["User1", "User2", "User3", "User4", "User5"]
    items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
    # User purchase record data set
    datasets = [
        [1, 0, 1, 1, 0],
        [1, 0, 0, 1, 1],
        [1, 0, 1, 0, 0],
        [0, 1, 0, 1, 1],
        [1, 1, 1, 0, 1],
    ]

    df = pd.DataFrame(datasets,
                      columns=items,
                      index=users)

    # Calculate the Jaccard similarity coefficient for all pairs of items
    item_similar = 1 - pairwise_distances(df.T.values.astype(bool), metric="jaccard")
    item_similar = pd.DataFrame(item_similar, columns=items, index=items)
    print("Pairwise similarity between items:")
    print(item_similar)

    topN_items = {}
    # Iterate over each row of the similarity matrix
    for i in item_similar.index:
        # Take the row, drop the item itself, then sort by descending similarity
        _df = item_similar.loc[i].drop([i])
        _df_sorted = _df.sort_values(ascending=False)

        top2 = list(_df_sorted.index[:2])
        topN_items[i] = top2

    print("Top2 similar items:")
    pprint(topN_items)

    rs_results = {}
    # Build recommendation results
    for user in df.index:    # Iterate over all users
        rs_result = set()
        # Iterate over the items that the user has already purchased
        # (.loc replaces .ix, which has been removed from pandas)
        for item in df.loc[user].replace(0, np.nan).dropna().index:
            # Build the initial recommendations from the top-N items most similar to each purchased item
            rs_result = rs_result.union(topN_items[item])
        # Filter out items that the user has already purchased
        rs_result -= set(df.loc[user].replace(0, np.nan).dropna().index)
        # Add to the results
        rs_results[user] = rs_result

    print("Final recommendation:")
    pprint(rs_results)

3.5 Data set used by collaborative filtering algorithm

In the previous demo we used only purchase records for the items; these could equally be browsing records, listening records, and so on. A prediction built on such data amounts to predicting whether a user is interested in an item, and cannot predict the degree of preference well.

Therefore, collaborative filtering more often uses users' rating data on items for prediction. With a rating data set, we can predict a user's rating for items they have not rated before. The principle and idea are the same; only the data set changes to user-item rating data.

About the user-item rating matrix

The user-item rating matrix calls for different solutions depending on how sparse it is

  • Dense rating matrix

  • Sparse rating matrix

Only the handling of dense rating matrices is introduced here; handling sparse matrices is considerably more complicated.
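
To see where a rating matrix sits between dense and sparse, its density can be measured as the share of filled cells. The sketch below is a minimal illustration with made-up ratings; the helper name `density` is hypothetical, and `None` marks a missing rating.

```python
def density(matrix):
    """Fraction of cells in the rating matrix that hold a rating (None = missing)."""
    total = sum(len(row) for row in matrix)
    rated = sum(1 for row in matrix for value in row if value is not None)
    return rated / total


# Hypothetical 3-user x 4-item rating matrix.
ratings = [
    [5, 3, None, 4],
    [None, None, 2, None],
    [4, None, None, 3],
]
print(density(ratings))
```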

Using collaborative filtering to predict user ratings

  • The data set

    Objective: to predict user 1's rating of item E

  • Build the data set: note that when building the rating data, missing ratings must be left as None; if they were set to 0 they would be treated as a rating of 0

    users = ["User1", "User2", "User3", "User4", "User5"]
    items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
    # User rating data set (None = not rated)
    datasets = [
        [5, 3, 4, 4, None],
        [3, 1, 2, 3, 3],
        [4, 3, 4, 3, 5],
        [3, 3, 1, 5, 4],
        [1, 5, 5, 2, 1]]
  • Calculation of similarity: for rating data we use the Pearson correlation coefficient, which lies in [-1,1]; -1 represents strong negative correlation and +1 represents strong positive correlation

    The corr method in pandas can be used directly to calculate Pearson correlation coefficients

    import pandas as pd

    df = pd.DataFrame(datasets,
                      columns=items,
                      index=users)

    print("Pairwise similarity between users:")
    # Calculate the Pearson correlation coefficient directly
    # corr() works column by column, so transpose first to get the similarity between users
    user_similar = df.T.corr()
    print(user_similar.round(4))

    print("Pairwise similarity between items:")
    item_similar = df.corr()
    print(item_similar.round(4))

    Running results:

    Pairwise similarity between users:
            User1   User2   User3   User4   User5
    User1  1.0000  0.8528  0.7071  0.0000 -0.7921
    User2  0.8528  1.0000  0.4677  0.4900 -0.9001
    User3  0.7071  0.4677  1.0000 -0.1612 -0.4666
    User4  0.0000  0.4900 -0.1612  1.0000 -0.6415
    User5 -0.7921 -0.9001 -0.4666 -0.6415  1.0000
    Pairwise similarity between items:
            Item A  Item B  Item C  Item D  Item E
    Item A  1.0000 -0.4767 -0.1231  0.5322  0.9695
    Item B -0.4767  1.0000  0.6455 -0.3101 -0.4781
    Item C -0.1231  0.6455  1.0000 -0.7206 -0.4276
    Item D  0.5322 -0.3101 -0.7206  1.0000  0.5817
    Item E  0.9695 -0.4781 -0.4276  0.5817  1.0000

    You can see that users 2 and 3 are the most similar to user 1, and that the items most similar to item A are item E and item D, in that order.

    Note: ratings are generally predicted from users or items that are positively correlated with the target. If no positively correlated neighbors exist, the rating cannot be predicted. This happens especially often in sparse rating matrices, where positive correlation coefficients are hard to obtain.

  • Rating prediction

    User-based CF rating prediction: predict from the similarity between users

    There are many schemes for rating prediction. The one below works well in practice: it predicts a rating as the average of the neighboring users' ratings, weighted by their similarity to the target user:


    $$pred(u,i)=\hat{r_{ui}}=\frac{\sum_{v\in U}sim(u,v)*r_{vi}}{\sum_{v\in U}|sim(u,v)|}$$

    To predict user 1's rating of item E, we use user 2 and user 3, the users closest to user 1, and calculate as follows:


    $$pred(u_1,i_5)=\frac{0.85*3+0.71*5}{0.85+0.71}=3.91$$

    The predicted rating of user 1 for item E (item 5) is 3.91

    Item-based CF rating prediction: predict from the similarity between items

    The calculation mirrors the user-based one: the prediction is the average of the user's own ratings of similar items, weighted by the similarity of those items to the target item:


    $$pred(u,i)=\hat{r_{ui}}=\frac{\sum_{j\in I_{rated}}sim(i,j)*r_{uj}}{\sum_{j\in I_{rated}}sim(i,j)}$$

    To predict user 1's rating of item E, we use item A and item D, the items closest to item E, and calculate as follows:


    $$pred(u_1,i_5)=\frac{0.97*5+0.58*4}{0.97+0.58}=4.63$$

    As the comparison shows, user-based CF and item-based CF predict different ratings, because strictly speaking they are two different recommendation algorithms, and each outperforms the other in certain domains and scenarios. Which one is better cannot be decided in advance, so when building a recommendation system both algorithms usually need to be implemented, and their recommendation quality evaluated and compared, in order to select the better scheme.
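
The two worked predictions can be reproduced with a short sketch. The helper name `weighted_average` is hypothetical; the similarities and ratings are the rounded values used in the calculations above.

```python
def weighted_average(neighbors):
    """Similarity-weighted average rating; neighbors is a list of (similarity, rating) pairs."""
    numerator = sum(sim * rating for sim, rating in neighbors)
    denominator = sum(abs(sim) for sim, _ in neighbors)
    return numerator / denominator


# User-based: User2 (similarity 0.85) rated Item E 3, User3 (similarity 0.71) rated it 5.
user_based = weighted_average([(0.85, 3), (0.71, 5)])
# Item-based: User1 rated Item A (similarity 0.97) 5 and Item D (similarity 0.58) 4.
item_based = weighted_average([(0.97, 5), (0.58, 4)])
print(round(user_based, 2), round(item_based, 2))
```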

IV. Case study – Film recommendation based on collaborative filtering

4.1 User-based CF predicts movie ratings

  • Data set download

  • Download address

  • Load ratings.csv, convert it into a user-movie score matrix and calculate the similarity between users

    import os

    import pandas as pd
    import numpy as np

    DATA_PATH = "./datasets/ml-latest-small/ratings.csv"

    dtype = {"userId": np.int32, "movieId": np.int32, "rating": np.float32}
    # Load the data; only the first three columns are used: the user ID, the movie ID, and the user's rating of the movie
    ratings = pd.read_csv(DATA_PATH, dtype=dtype, usecols=range(3))
    # Pivot table: turn the movie IDs into column names to get a user-movie rating matrix
    ratings_matrix = ratings.pivot_table(index=["userId"], columns=["movieId"], values="rating")
    # Calculate the similarity between users
    user_similar = ratings_matrix.T.corr()
  • Predict a user's rating of an item (taking user 1's rating of movie 1 as an example)

    Rating prediction formula:

    $$pred(u,i)=\hat{r_{ui}}=\frac{\sum_{v\in U}sim(u,v)*r_{vi}}{\sum_{v\in U}|sim(u,v)|}$$

    # 1. Find users similar to user 1
    similar_users = user_similar[1].drop([1]).dropna()
    # Filter rule for similar users: keep positively correlated users only
    similar_users = similar_users.where(similar_users > 0).dropna()
    # 2. From user 1's similar users, keep those who have rated movie 1
    ids = set(ratings_matrix[1].dropna().index) & set(similar_users.index)
    # .loc replaces .ix, which has been removed from pandas
    finally_similar_users = similar_users.loc[list(ids)]
    # 3. Predict user 1's rating of movie 1 from the similarity to these nearest neighbors
    numerator = 0    # Numerator of the rating prediction formula
    denominator = 0    # Denominator of the rating prediction formula
    # Series.items() replaces iteritems(), which has been removed from pandas
    for sim_uid, similarity in finally_similar_users.items():
        # Rating data of the neighboring user
        sim_user_rated_movies = ratings_matrix.loc[sim_uid].dropna()
        # The neighboring user's rating of movie 1
        sim_user_rating_for_item = sim_user_rated_movies[1]
        # Accumulate the numerator
        numerator += similarity * sim_user_rating_for_item
        # Accumulate the denominator
        denominator += similarity
    # 4. Calculate the predicted rating
    predict_rating = numerator / denominator
    print("Predicted user <%d> rating of movie <%d> : %0.2f" % (1, 1, predict_rating))
  • Wrap the steps in a function that predicts the rating of any user for any movie

    def predict(uid, iid, ratings_matrix, user_similar):
        """
        Predict the rating of a given user for a given item
        :param uid: user ID
        :param iid: item ID
        :param ratings_matrix: user-item rating matrix
        :param user_similar: pairwise user similarity matrix
        :return: predicted rating
        """
        print("Start predicting user <%d> ratings for movie <%d>..." % (uid, iid))
        # 1. Find users similar to the uid user
        similar_users = user_similar[uid].drop([uid]).dropna()
        # Filter rule for similar users: keep positively correlated users only
        similar_users = similar_users.where(similar_users > 0).dropna()
        if similar_users.empty:
            raise Exception("User <%d> has no similar users" % uid)

        # 2. From the uid user's similar users, keep those who have rated the iid item
        ids = set(ratings_matrix[iid].dropna().index) & set(similar_users.index)
        finally_similar_users = similar_users.loc[list(ids)]

        # 3. Predict the uid user's rating of the iid item from the similarity to these nearest neighbors
        numerator = 0    # Numerator of the rating prediction formula
        denominator = 0    # Denominator of the rating prediction formula
        for sim_uid, similarity in finally_similar_users.items():
            # Rating data of the neighboring user
            sim_user_rated_movies = ratings_matrix.loc[sim_uid].dropna()
            # The neighboring user's rating of the iid item
            sim_user_rating_for_item = sim_user_rated_movies[iid]
            # Accumulate the numerator
            numerator += similarity * sim_user_rating_for_item
            # Accumulate the denominator
            denominator += similarity

        # Calculate the predicted rating and return it
        # (raises ZeroDivisionError when no similar user has rated the item)
        predict_rating = numerator / denominator
        print("Predicted user <%d> rating of movie <%d> : %0.2f" % (uid, iid, predict_rating))
        return round(predict_rating, 2)
  • Predict a user's ratings for all movies

    def predict_all(uid, ratings_matrix, user_similar):
        """
        Predict the ratings of a given user for all items
        :param uid: user ID
        :param ratings_matrix: user-item rating matrix
        :param user_similar: pairwise user similarity matrix
        :return: generator that yields (uid, iid, predicted rating) one by one
        """
        # Prepare the list of item IDs to predict
        item_ids = ratings_matrix.columns
        # Predict one by one
        for iid in item_ids:
            try:
                rating = predict(uid, iid, ratings_matrix, user_similar)
            except Exception as e:
                print(e)
            else:
                yield uid, iid, rating

    if __name__ == '__main__':
        for i in predict_all(1, ratings_matrix, user_similar):
            pass
  • Recommend the top-N movies to a specified user according to the predicted ratings

    def top_k_rs_result(k):
        results = predict_all(1, ratings_matrix, user_similar)
        return sorted(results, key=lambda x: x[2], reverse=True)[:k]

    if __name__ == '__main__':
        from pprint import pprint
        result = top_k_rs_result(20)
        pprint(result)

4.2 Item-based CF predicts movie ratings

  • Load ratings.csv, convert it into a user-movie rating matrix and calculate the similarity between items

    import os

    import pandas as pd
    import numpy as np

    DATA_PATH = "./datasets/ml-latest-small/ratings.csv"

    dtype = {"userId": np.int32, "movieId": np.int32, "rating": np.float32}
    # Load the data; only the first three columns are used: the user ID, the movie ID, and the user's rating of the movie
    ratings = pd.read_csv(DATA_PATH, dtype=dtype, usecols=range(3))
    # Pivot table: turn the movie IDs into column names to get a user-movie rating matrix
    ratings_matrix = ratings.pivot_table(index=["userId"], columns=["movieId"], values="rating")
    # Calculate the similarity between items
    item_similar = ratings_matrix.corr()
  • Predict a user's rating of an item (taking user 1's rating of movie 1 as an example)

    Rating prediction formula:

    $$pred(u,i)=\hat{r_{ui}}=\frac{\sum_{j\in I_{rated}}sim(i,j)*r_{uj}}{\sum_{j\in I_{rated}}sim(i,j)}$$

    # 1. Find items similar to movie 1
    similar_items = item_similar[1].drop([1]).dropna()
    # Filter rule for similar items: keep positively correlated items only
    similar_items = similar_items.where(similar_items > 0).dropna()
    # 2. From movie 1's similar items, keep those that user 1 has rated
    # (.loc replaces .ix, which has been removed from pandas)
    ids = set(ratings_matrix.loc[1].dropna().index) & set(similar_items.index)
    finally_similar_items = similar_items.loc[list(ids)]

    # 3. Predict user 1's rating of movie 1 from the item similarities and user 1's ratings of the similar items
    numerator = 0    # Numerator of the rating prediction formula
    denominator = 0    # Denominator of the rating prediction formula
    # Series.items() replaces iteritems(), which has been removed from pandas
    for sim_iid, similarity in finally_similar_items.items():
        # Rating data of the similar item
        sim_item_rated_movies = ratings_matrix[sim_iid].dropna()
        # User 1's rating of the similar item
        sim_item_rating_from_user = sim_item_rated_movies[1]
        # Accumulate the numerator
        numerator += similarity * sim_item_rating_from_user
        # Accumulate the denominator
        denominator += similarity

    # 4. Calculate the predicted rating
    predict_rating = numerator / denominator
    print("Predicted user <%d> rating of movie <%d> : %0.2f" % (1, 1, predict_rating))
  • Encapsulated into a method to predict the rating of any user on any movie

    def predict(uid, iid, ratings_matrix, user_similar):
        """
        Predict the rating of a given user for a given item.
        :param uid: user ID
        :param iid: item ID
        :param ratings_matrix: user-item rating matrix
        :param user_similar: user-user similarity matrix
        :return: predicted rating
        """
        print("Start predicting user <%d> rating for movie <%d>..." % (uid, iid))
        # 1. Find the users similar to user uid
        similar_users = user_similar[uid].drop([uid]).dropna()
        # Filter rule: keep only positively correlated users
        similar_users = similar_users.where(similar_users > 0).dropna()
        if similar_users.empty:
            raise Exception("User <%d> has no similar users" % uid)

        # 2. Among those neighbors, keep the ones who have rated item iid
        ids = set(ratings_matrix[iid].dropna().index) & set(similar_users.index)
        finally_similar_users = similar_users.loc[list(ids)]
        if finally_similar_users.empty:
            raise Exception("No similar user of <%d> has rated movie <%d>" % (uid, iid))

        # 3. Predict the rating of uid for iid from the neighbors' similarities and ratings
        numerator = 0    # Numerator of the rating prediction formula
        denominator = 0    # Denominator of the rating prediction formula
        for sim_uid, similarity in finally_similar_users.items():
            # All ratings given by this neighbor
            sim_user_rated_movies = ratings_matrix.loc[sim_uid].dropna()
            # This neighbor's rating for item iid
            sim_user_rating_for_item = sim_user_rated_movies[iid]
            # Accumulate the numerator
            numerator += similarity * sim_user_rating_for_item
            # Accumulate the denominator
            denominator += similarity

        # Compute the predicted rating and return it
        predict_rating = numerator/denominator
        print("Predicted user <%d> rating of movie <%d> : %0.2f" % (uid, iid, predict_rating))
        return round(predict_rating, 2)
  • Predict a user's ratings for all movies

    def predict_all(uid, ratings_matrix, item_similar):
        """
        Predict the ratings of a user for all items.
        :param uid: user ID
        :param ratings_matrix: user-item rating matrix
        :param item_similar: similarity matrix, passed through to predict()
        :return: generator yielding (uid, iid, predicted rating)
        """
        # Prepare the list of item IDs to predict
        item_ids = ratings_matrix.columns
        # Predict them one by one
        for iid in item_ids:
            try:
                rating = predict(uid, iid, ratings_matrix, item_similar)
            except Exception as e:
                print(e)
            else:
                yield uid, iid, rating

    if __name__ == '__main__':
        for i in predict_all(1, ratings_matrix, item_similar):
            pass
  • Recommend the top-N movies to a specified user according to the predicted ratings

    def top_k_rs_result(k):
        # Predict every rating for user 1, then keep the k highest
        results = predict_all(1, ratings_matrix, item_similar)
        return sorted(results, key=lambda x: x[2], reverse=True)[:k]

    if __name__ == '__main__':
        from pprint import pprint
        result = top_k_rs_result(20)
        pprint(result)
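To see the weighted-average prediction and the top-N step working end to end, here is a minimal self-contained sketch on a made-up 4-user by 4-movie matrix. The toy data, the use of `DataFrame.corr()` on the transposed matrix as the user similarity, and the helper names `predict_rating` and `top_n` are assumptions for illustration, not the article's exact code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are users, columns are movies, NaN = not rated
ratings = pd.DataFrame(
    {
        "m1": [5.0, 4.0, 1.0, np.nan],
        "m2": [4.0, 5.0, 2.0, 1.0],
        "m3": [1.0, 2.0, 5.0, 4.0],
        "m4": [np.nan, 4.0, 2.0, 5.0],
    },
    index=["u1", "u2", "u3", "u4"],
)

# Pearson similarity between users (corr() is column-wise, hence the transpose)
user_similar = ratings.T.corr()


def predict_rating(uid, iid, ratings, user_similar):
    """Similarity-weighted average of the ratings that positively
    correlated neighbors gave to item iid; None if nobody qualifies."""
    sims = user_similar[uid].drop(uid).dropna()
    sims = sims[sims > 0]
    raters = ratings[iid].dropna().index.intersection(sims.index)
    if len(raters) == 0:
        return None
    weights = sims.loc[raters]
    return float((weights * ratings.loc[raters, iid]).sum() / weights.sum())


def top_n(uid, ratings, user_similar, n=2):
    """Rank the movies uid has not rated yet by predicted rating."""
    row = ratings.loc[uid]
    unseen = row[row.isna()].index
    scored = [(iid, predict_rating(uid, iid, ratings, user_similar)) for iid in unseen]
    scored = [(iid, score) for iid, score in scored if score is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


print(top_n("u1", ratings, user_similar))
```

In this toy matrix only u2 is positively correlated with u1, so the single candidate m4 is predicted with u2's own rating of 4.0. Unlike `top_k_rs_result` above, this sketch only ranks movies the user has not rated yet, which is usually what a recommendation list needs.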

V. Recommendation system evaluation

5.1 Evaluation indicators of the recommendation system

  • Evaluation data sources: explicit and implicit feedback

                        Explicit feedback                             Implicit feedback
    Examples            Movie/book ratings; liking a recommendation   Play/click/comment/download/purchase
    Accuracy            High                                          Low
    Volume              Small                                         Large
    Acquisition cost    High                                          Low
  • Common evaluation indicators

    • Accuracy • Trust • Satisfaction • Real-time performance • Coverage • Robustness • Diversity • Scalability • Novelty • Business goals • Surprise • Retention

    • Accuracy (theoretical perspective)

      • Rating prediction
        • RMSE, MAE
      • Top-N recommendation
        • Recall, precision
    • Accuracy (business perspective)

    • Coverage

      • Information entropy of the recommendations: the higher, the better
      • Coverage rate
    • Diversity & novelty & surprise

      • Diversity: the dissimilarity between pairs of items in a recommendation list. (How is similarity measured?)
      • Novelty: a category or author the user has not encountered before; measured by the average popularity of the recommended results
      • Surprise (serendipity): dissimilar to the user's history (surprise) yet satisfying (delight)
      • Accuracy often has to be sacrificed
      • Use historical behavior to predict how much users will like an item
      • The system overemphasizes real time
    • Exploration & Exploitation (EE) problem

      • Exploitation: Choose the best possible solution
      • Exploration: Select options that are uncertain now, but may yield high returns in the future
      • In the process of making the two kinds of decisions, the cognition of the uncertainty of all decisions should be constantly updated to optimize the long-term goals
    • EE in practice

      • Interest expansion: similar topics, complementary-item recommendation
      • Crowd-based algorithms: UserCF, user clustering
      • Balance personalized recommendations and popular recommendations
      • Randomly discard part of the user behavior history
      • Randomly perturb model parameters
    • Possible problems with EE

      • Exploration hurts the user experience and can lead to user churn
      • The long-term benefit of exploration (retention) has a long evaluation cycle, which puts pressure on KPIs
      • How to balance real-time and long-term interests
      • How to balance short-term product experience with long-term ecosystem
      • How to balance popular tastes and niche needs
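The exploration/exploitation trade-off described above can be made concrete with an epsilon-greedy policy: with probability epsilon show a random item (exploration), otherwise show the item with the best click rate observed so far (exploitation), and update the estimate after every impression. This is a standard textbook technique used here for illustration; the article does not prescribe it. The class name, the click-rate reward, and the simulated click probabilities are all assumptions.

```python
import random


class EpsilonGreedyRecommender:
    """Hypothetical epsilon-greedy policy over a fixed set of items."""

    def __init__(self, item_ids, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shows = {iid: 0 for iid in item_ids}    # impressions per item
        self.clicks = {iid: 0 for iid in item_ids}   # clicks per item

    def estimated_ctr(self, iid):
        # Items never shown get 0.0; exploration is what gives them a chance
        return self.clicks[iid] / self.shows[iid] if self.shows[iid] else 0.0

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shows))      # exploration
        return max(self.shows, key=self.estimated_ctr)    # exploitation

    def update(self, iid, clicked):
        # Refine the estimate after every impression
        self.shows[iid] += 1
        self.clicks[iid] += int(clicked)


# Simulated users with made-up click probabilities per item
true_ctr = {"hot": 0.10, "niche": 0.30, "new": 0.05}
policy = EpsilonGreedyRecommender(true_ctr, epsilon=0.2, seed=42)
sim = random.Random(0)
for _ in range(5000):
    item = policy.select()
    policy.update(item, sim.random() < true_ctr[item])

print({iid: round(policy.estimated_ctr(iid), 3) for iid in true_ctr})
print(policy.shows)
```

A larger epsilon learns about uncertain items faster but shows more poor recommendations along the way, which is the user-experience cost listed under the possible problems with EE.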

5.2 Evaluation methods for the recommendation system

  • Evaluation methods
    • Questionnaire survey: high cost
    • Offline evaluation:
      • It can only be evaluated on the candidate set that users have seen, and it is not consistent with the online reality
      • Only a few indicators can be assessed
      • Fast speed, no damage to user experience
    • Online evaluation: grayscale release & A/B testing, e.g. 50% of traffic before the full rollout
    • In practice: combine offline and online evaluation, and run questionnaire surveys regularly
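As a rough sketch of what an offline evaluation computes, the functions below implement RMSE and MAE for rating prediction, precision and recall for a top-N list, and catalog coverage. The function names and the toy held-out data are assumptions for illustration; in practice the test set would be split off from logged user ratings.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error over paired ratings."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error over paired ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def precision_recall(recommended, relevant):
    """Precision and recall of one top-N list against the held-out liked items."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def coverage(recommendation_lists, catalog):
    """Share of the catalog that appears in at least one recommendation list."""
    recommended = set().union(*recommendation_lists)
    return len(recommended & set(catalog)) / len(catalog)


# Made-up held-out ratings and the model's predictions for them
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))

# Made-up top-N list and the items the user actually liked
top_n_list = ["m1", "m2", "m3", "m4"]
liked = ["m2", "m4", "m9"]
print("precision, recall:", precision_recall(top_n_list, liked))

catalog = ["m%d" % i for i in range(1, 11)]
print("coverage:", coverage([top_n_list, ["m1", "m5"]], catalog))
```

On this toy data RMSE is 0.75 and MAE is 0.625; the top-N list hits 2 of its 4 slots (precision 0.5) and 2 of the 3 liked items (recall 2/3), and the two lists together cover half of the 10-item catalog.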

VI. Cold start problem of the recommendation system

6.1 The concept of recommendation cold start

  • User cold start: how to make personalized recommendations for new users
  • Item cold start: how to recommend new items to users (collaborative filtering)
  • System cold start: User cold start + item cold start
  • The essence is that the recommendation system relies on historical data, without which it cannot predict user preferences

6.2 Common approaches to the recommendation cold start problem

  • User cold start
    • Collect user characteristics
      • User registration information: gender, age, region
      • Device information: location, phone model, app list
      • Social information, promotional material, installation sources
    • Guide users to fill in interests
    • Use behavioral data from other sites
    • Use different recommendation strategies for new and existing users
      • During cold start, new users are more attracted to popular leaderboards, while existing users need more long-tail recommendations
      • Adjust the strength of Explore vs. Exploit
      • Use separate features and models for prediction
  • Item cold start
    • Label items
    • Use the content information of the item to first deliver the new item to users who have liked other items with similar content
  • System cold start
    • In the early stage, use content-based recommendation
    • Gradually transition from content-based recommendation to collaborative filtering
    • Combine the results of content-based recommendation and collaborative filtering by a weighted sum to obtain the final recommendation
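The last two points can be sketched as a weighted sum whose collaborative-filtering weight grows as interaction data accumulates. The linear ramp, the `full_cf_at` threshold, and the function names are assumptions chosen for illustration; the article only states that the two results are combined by a weighted sum.

```python
def cf_weight(n_interactions, full_cf_at=50):
    """Hypothetical schedule: 0.0 with no history, rising linearly to 1.0."""
    return min(n_interactions / full_cf_at, 1.0)


def hybrid_scores(content_scores, cf_scores, n_interactions, full_cf_at=50):
    """Blend per-item scores as w * CF + (1 - w) * content-based.

    An item missing from one source contributes 0.0 from that source.
    """
    w = cf_weight(n_interactions, full_cf_at)
    items = set(content_scores) | set(cf_scores)
    return {
        iid: w * cf_scores.get(iid, 0.0) + (1 - w) * content_scores.get(iid, 0.0)
        for iid in items
    }


# Made-up scores: m3 is a new item, so collaborative filtering cannot score it yet
content = {"m1": 0.9, "m2": 0.4, "m3": 0.7}
cf = {"m1": 0.2, "m2": 0.8}

for n in (0, 25, 100):
    scores = hybrid_scores(content, cf, n)
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(n, ranking)
```

With no interactions the ranking follows the content-based scores alone (m1, m3, m2); once enough behavior has been logged, the collaborative-filtering scores take over and m2 moves to the top.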