preface

To achieve collaborative filtering, the following steps are required <1> to collect user preferences <2> to find similar users or items <3> recommendationsCopy the code

1/ Collect user preferences

Discover patterns in users' behavioral preferences and make recommendations based on them. Therefore, collecting user preference information becomes the most basic determinant of system recommendation effect. There are many ways for users to provide their preference information to the system, and different applications may vary greatly. Here are some examples:Copy the code

The user behaviors listed above are generic, and recommendation engine designers can add specific user behaviors based on the characteristics of their applications and use them to represent user preferences. In general applications, we extract more than one user behavior. As for how to combine these different user behaviors, there are the following two ways: <1> Group different behaviors: generally can be divided into "view" and "buy", etc., and then calculate different user/item similarity based on different behaviors. Similar to dangdang or Amazon's "the person who bought the book also bought..." "And" The person who checked the book also checked..." <2> According to the extent to which different behaviors reflect users' preferences, they are weighted to get users' overall preferences for items. Generally speaking, the explicit user feedback has a larger weight than the implicit one, but it is relatively sparse. After all, only a small number of users display feedback. The "buy" behavior is also more likely to reflect user preferences than the "view" behavior, but this also varies from app to app. After collecting user behavior data, we also need to preprocess the data to some extent, among which the core work is: noise reduction and normalization. Noise reduction: The user behavior data is generated by the user in the process of using the application, which may have a lot of noise and user misoperation. We can filter out the noise in the behavior data through the classical data mining algorithm, so that our analysis can be more accurate. Normalization: As mentioned earlier, different behavioral data may need to be weighted when calculating user preferences for items. However, it can be imagined that the data values of different behaviors may vary greatly. For example, the data viewed by users is bound to be much larger than the data purchased. Therefore, how to unify the data of each behavior in a same value range, so as to make the overall preference obtained by weighted sum more accurate, requires normalization processing. The simplest normalization process is to divide each type of data by the maximum value in this class to ensure that the normalized value is in the range [0,1]. After preprocessing, grouping or weighting can be selected according to different applied behavior analysis methods. After that, a two-dimensional matrix of user preferences can be obtained, with one dimension being the list of users and the other dimension being the list of items. The value is the preference of users for items, generally [0,1] or [-1, 1] floating point value.Copy the code

Find similar users or items

After analyzing user behaviors and obtaining user preferences, we can calculate similar users and items according to user preferences, and then make recommendations based on similar users or items. These are the two most typical branches of CF: user-based CF and item-based CF. Both methods need to calculate the similarity. 2.1 Similarity Calculation Several existing basic methods for similarity calculation are based on vectors. In fact, the closer the distance between two vectors is, the greater the similarity is. In the recommended scenario, in the two-dimensional matrix of user-item preference, we can take one user's preference for all items as a vector to calculate the similarity between users, or take all users' preference for a certain item as a vector to calculate the similarity between items. 2.2 Methods: Euclidean distance, cosine distance, Pearson correlation coefficient, etc. 2.3 Calculation of similar Neighbors After introducing the calculation method of similarity, let's look at how to find users' neighbors or items' neighbors according to similarity. The commonly used principles for selecting neighbors can be divided into two categories:Copy the code

A fixed number of neighbor: K - neighborhoods or Fix - size neighborhoodsCopy the code

Take the nearest K neighbors as their neighbors, regardless of their distance. As shown in Figure 1, suppose we want to calculate the 5-neighbor of point 1, then according to the distance between points, we take the five nearest points, namely point 2, point 3, point 4, point 7 and point 5. But clearly we can see, this kind of method for the calculation of isolated point effect is bad, because the neighbors of the fixed number, when it is not enough the similar point nearby, was forced to take some not very similar points as neighbors, thus influence the neighbor similar degree, such as in figure 1, points 1 and 5 are not very similar.

Neighborliness: threshold-based Neighborhoods

And calculation to the principle of different fixed number of neighbours, neighbor calculation based on the similarity threshold is to the maximum limit of neighbor distance, fall in the current point as the center, a distance of K all points as the current point in the area of neighbor, this kind of method to calculate the number of neighbors are not sure, but the similarity of won’t appear larger error. B in Figure 1, starting from point 1, calculates the neighbors with similarity within K, and obtains point 2, point 3, point 4 and point 7. The similarity degree calculated by this method is better than the previous one, especially for the processing of isolated points.

3 / recommend

Neighboring users and neighboring items have been obtained through early calculation. The following is how to recommend users based on these information. The last review article of this series has briefly introduced that the recommendation algorithm based on collaborative filtering can be divided into CF based on users and CF based on items. The following are the calculation methods, application scenarios, advantages and disadvantages of these two methods. 3.1 User-Based CF (User CF)Copy the code

The basic idea of user-based CF is quite simple: find the neighboring users based on the user’s preferences for items, and then recommend what the neighbor users like to the current user. In terms of calculation, a user’s preference for all items is used as a vector to calculate the similarity between users. After finding K neighbors, the unknown items that the current user has no preference are predicted according to the similarity weight of neighbors and their preference for items, and a sorted list of items is calculated as a recommendation. Figure 2 shows an example. For user A, only one neighbor, user C, is calculated according to user A’s historical preference, and then the item D favored by user C is recommended to user A.

3.2 Item-based CF (Item CF)Copy the code

The principle of item-based CF is similar to that of user-based CF, except that the item itself is used in the calculation of neighbors, rather than from the user’s perspective, that is, similar items are found based on the user’s preference for items, and then similar items are recommended to the user according to his historical preference. From the perspective of calculation, the preference of all users for an item is used as a vector to calculate the similarity between items. After obtaining the similar items of the item, the item that the current user has not expressed preference is predicted according to the user’s historical preference, and a sorted item list is calculated as a recommendation. Figure 3 shows an example. For item A, according to the historical preferences of all users, users who like item A all like item C, so it can be concluded that item A and item C are relatively similar, while user C likes item A, so it can be inferred that user C may also like item C.

4/ Advantages and disadvantages and application scenarios of user-CF and item-CF

4.1/ Computational complexity Item CF and User CF are the two most basic algorithms based on collaborative filtering recommendation. User CF was proposed a long time ago. Item CF became popular after Amazon published its papers and patents (around 2001). It is generally believed that Item CF is superior to User CF in terms of performance and complexity. One of the main reasons is that for an online website, the number of users often greatly exceeds the number of items. Meanwhile, the data of items is relatively stable, so the calculation of similarity of items is not only small, but also does not need to be updated frequently. But we often ignored this kind of situation only adapted to provide goods of e-commerce sites, to news, blog or micro content recommendation system, and is often the opposite, the number of items is huge, is also a frequently updated, so the single from the perspective of complexity, these two algorithms have their own advantages in different systems, Designers of recommendation engines need to choose a more suitable algorithm according to the characteristics of their own applications. 4.2 Application Scenarios In non-social network websites, the internal connection of content is an important recommendation principle, which is more effective than the recommendation principle based on similar users. For example, on a book buying website, when you read a book, the recommendation engine will recommend relevant books to you, which is far more important than the comprehensive recommendation for the user on the homepage of the website. It can be seen that in this case, Item CF recommendation has become an important means to guide users to browse. At the same time, Item CF is convenient to explain the recommendation. In a non-social network website, it recommends a book to a user and explains that someone with similar interests and yours also read the book, which is hard to convince the user, because the user may not know the person at all. But if the explanation is that the book is similar to something you've read before, the user may feel justified and adopt the recommendation.Copy the code

On the contrary, among today’s popular social network sites, User CF is a better choice. The combination of User CF and social network information can increase the User’s belief in the recommendation explanation. In general, social networking sites use user-CF, while shopping sites have item-CF