Author: Forward

This article is original; please credit the author and source when reposting.

 

1. Where does the recommendation system come from?

 

In the 1990s, Walmart executives were analyzing sales data when they made a puzzling discovery: under certain circumstances, two seemingly unrelated items, “beer” and “diapers”, often appeared in the same shopping basket.

Conventional thinking says that diapers and beer have nothing to do with each other. With the help of association rule mining, Walmart analyzed a huge volume of transaction data and uncovered this pattern hidden in it. (In the United States, the story goes, in families with a baby the mother usually stays at home to look after the child while the young father goes to the supermarket to buy diapers. When he goes to buy diapers after work, he often buys beer for himself.)

 

Similarly, a recommender system (RS) is a data mining technology used in the commercial field to discover value hidden in data. More precisely, a recommender system suggests useful items to users who have no clearly stated requirements, and guides them in a personalized way toward things they are interested in among numerous choices. For example, if I buy a copy of A Brief History of Time from Amazon and later return to the site, it might suggest that I would also like A Brief History of the Future; two friends who open the Headlines app in the same place at the same time see different stories on the front page. Recommendation systems have become very common in our lives.

 

The Internet age is an age of information overload. How should we deal with such all-encompassing information? Clay Shirky, an American writer who studies the social and economic impact of Internet technologies, argued “It’s Not Information Overload. It’s Filter Failure” in his talk at the Web 2.0 Expo in 2008. Many excellent solutions to the problem of information overload have emerged (classified directories, search engines, recommendation systems, etc.), and the recommendation system is one of the most representative.

 

In addition, as Wired editor Chris Anderson pointed out in his book The Long Tail, the traditional 80/20 principle (80% of sales come from 20% of popular products) has been challenged by the advent of the Internet. Because shelf costs online are extremely low, e-commerce sites can often offer far more goods than traditional retail stores. The vast majority of these items are not popular, but there are so many of them that the total sales of these long tail items add up to a significant number, perhaps even exceeding the sales of the popular items. Popular items often represent the needs of the vast majority of users, while long tail items often represent the personalized needs of a small number of users. Therefore, if we want to increase sales by exploiting the long tail, we must study the interests of users thoroughly, which is the main problem that a personalized recommendation system solves.

 

2. Basic principles of recommendation systems

 

The recommendation algorithm is the foundation of a recommendation system. Many methods are available today, and each has its own advantages, disadvantages, and limitations. In practice, choosing an appropriate algorithm is not easy: it requires understanding how the various algorithms work, followed by experimentation and careful consideration. This section introduces the more commonly used ones: the popularity recommendation algorithm, the content-based recommendation algorithm, collaborative filtering algorithms, and the hybrid recommendation algorithm. (Model-based collaborative filtering algorithms and more advanced, unconventional recommendation algorithms are also widely used, but given their length and complexity, this article does not cover them.)

 

Scenario: there are 6 users (User1 to User6) and the following 6 books (Book1 to Book6).

These users have purchased some of these 6 books and rated them, as shown in Table 2 below. The more a user likes a book, the higher the rating; the lowest score is 1 and the highest is 5.

2.1 Popularity recommendation algorithm

 

The popularity recommendation algorithm makes recommendations based on popularity metrics across the general public (highest ratings, most purchases, most downloads, most views, etc.). Before recommending to a specific user, we need to calculate the popularity index of each item. For any book, we take the average score given by User1, User2, User3, User4, User5 and User6 as its popularity index:

Books can then be recommended to users in order of their popularity index. Taking the recommendation for User1 as an example, remove the books User1 has already purchased (Book1, Book2, Book5) from the list ordered by popularity index to get the actual recommendation list for User1:

That is, the popularity recommendation algorithm recommends Book6, Book3, and Book4 to User1, in that order.
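
The popularity ranking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratings below are made up (they are not the values of Table 2), the function names are hypothetical, and the average is taken over the users who actually rated a book rather than over all six users.

```python
# Hypothetical ratings (user -> {book: score from 1 to 5}).
# These values are made up for illustration; they are NOT the article's Table 2.
ratings = {
    "User1": {"Book1": 4, "Book2": 3, "Book5": 5},
    "User2": {"Book1": 5, "Book2": 3, "Book3": 4, "Book6": 4},
    "User3": {"Book1": 4, "Book3": 5, "Book4": 3, "Book5": 4},
    "User4": {"Book2": 2, "Book4": 4, "Book6": 5},
    "User5": {"Book3": 3, "Book5": 4, "Book6": 4},
    "User6": {"Book1": 3, "Book4": 2, "Book6": 5},
}


def popularity_index(ratings):
    """Average score of each book over the users who rated it (assumption)."""
    totals, counts = {}, {}
    for scores in ratings.values():
        for book, score in scores.items():
            totals[book] = totals.get(book, 0) + score
            counts[book] = counts.get(book, 0) + 1
    return {book: totals[book] / counts[book] for book in totals}


def recommend_popular(ratings, user):
    """Books the user has not bought yet, most popular first."""
    index = popularity_index(ratings)
    unseen = [book for book in index if book not in ratings[user]]
    return sorted(unseen, key=lambda book: index[book], reverse=True)


print(recommend_popular(ratings, "User1"))
```

Every user receives the same ordered list apart from the books they already own, which is why the results are not personalized.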

 

The popularity recommendation algorithm relies only on usage data (some systems may also rely on content data such as a catalogue). It is easy to implement and has no cold start problem for new users. Its disadvantages are the need to standardize usage data across products, the cold start problem for new products, and the lack of personalization in the results.

 

2.2 Content-based recommendation system

 

A content-based recommendation system mainly uses the degree of match between a user’s known preferences and interests and the content attributes of items, and tries to recommend items similar to those the user has liked in the past.

 

First, the content (in this case, the book title) is preprocessed to find a vector representation that describes each book. Since function words are common in book titles, grammatical connectives (to, and, the) are ignored. The 6 books can then be described as a vector matrix as follows (where a 1 indicates that the word for that column appears in the title of the book for that row):
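
As a rough sketch of this preprocessing step, the snippet below turns titles into 0/1 word vectors. The three titles are invented for illustration (they are not the article's books), and `STOP_WORDS`, `tokenize` and `build_vectors` are hypothetical names.

```python
# Connectives to ignore, as listed in the text.
STOP_WORDS = {"to", "and", "the"}

# Hypothetical titles, made up for illustration only.
titles = {
    "Book1": "Introduction to Recommender Systems",
    "Book2": "Recommender Systems and Data Mining",
    "Book3": "The Data Mining Handbook",
}


def tokenize(title):
    """Lower-case the title, split on whitespace, drop the connectives."""
    return [word for word in title.lower().split() if word not in STOP_WORDS]


def build_vectors(titles):
    """Return the sorted vocabulary and one 0/1 vector per book."""
    vocab = sorted({word for title in titles.values() for word in tokenize(title)})
    vectors = {}
    for book, title in titles.items():
        words = set(tokenize(title))
        vectors[book] = [1 if word in words else 0 for word in vocab]
    return vocab, vectors


vocab, vectors = build_vectors(titles)
print(vocab)
for book, vector in vectors.items():
    print(book, vector)
```

Each row of the resulting matrix is one book and each column is one word of the vocabulary, which matches the layout described in the text.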

 

With a vector representation for each book, one question remains: how can similarity be measured? Common similarity measures in recommendation algorithms include Euclidean distance, cosine similarity, the Pearson correlation coefficient, and so on. Cosine similarity is the most commonly used, and the similarity in the following algorithms is expressed by cosine similarity. For any two vectors $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$, their cosine similarity is:

$$\cos(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$

For vectors with non-negative components, such as the ones used here, $\cos(X, Y) \in [0, 1]$. The larger the value, the more similar X and Y are.
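
For reference, cosine similarity can be written as a small Python function. This is a minimal sketch; the name `cosine_similarity` and the convention of returning 0.0 for an all-zero vector are assumptions made here, not something the article specifies.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention: an all-zero vector is similar to nothing
    return dot / (norm_x * norm_y)


print(cosine_similarity([1, 1, 0], [1, 0, 0]))  # vectors sharing one of two words
print(cosine_similarity([1, 0], [0, 1]))        # vectors with nothing in common
```

Because only the angle between the vectors matters, a long title and a short title that share the same key words can still score as similar.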

With a content vector for each book and a similarity measure, we can compare how similar any two books are. Taking Book1 as an example, the cosine similarity formula is used to calculate the similarity between Book1 and each of the other books (Book2, Book3, Book4, Book5, Book6).

In the same way, a similarity matrix can be obtained by calculating the similarity between all pairs of books (the similarity matrix is symmetric; for example, Book1 is as similar to Book2 as Book2 is to Book1).
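
A short sketch of how such a similarity matrix might be built, and of the symmetry property, follows. The three content vectors are hypothetical (they do not come from the article's tables), and `cosine` and `similarity_matrix` are illustrative names.

```python
import math

# Hypothetical 0/1 content vectors for three books (illustration only).
vectors = {
    "Book1": [0, 0, 1, 0, 1, 1],
    "Book2": [1, 0, 0, 1, 1, 1],
    "Book3": [1, 1, 0, 1, 0, 0],
}


def cosine(x, y):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Nested dict: sim[a][b] is the cosine similarity of books a and b."""
    names = list(vectors)
    return {a: {b: cosine(vectors[a], vectors[b]) for b in names} for a in names}


sim = similarity_matrix(vectors)
for a, row in sim.items():
    print(a, {b: round(value, 2) for b, value in row.items()})
```

Since the matrix is symmetric, only the entries above the diagonal need to be computed in a larger system; the full matrix is filled in here for readability.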

 

With a similarity matrix between books, we can make recommendations for a specific user. To do this, select the books that the user has already rated, find the three books most similar to each of them, and recommend those among them that the user has not yet rated.

Taking the recommendation for User1 as an example, select the books (Book1, Book2 and Book5) that User1 has already rated according to the user rating matrix (Table 2), and for each of them select the three most similar books, as shown in the following table:

Then use the following formula to calculate the expected ratings of similar books User1 has not purchased, and recommend the two books with the highest expected ratings to the user.

From Table 5, we can see that the similar books User1 has not purchased are Book3, Book4, and Book6, i.e. i = 3, 4, and 6 in the above formula. Calculation:

User1’s expected score for Book3

User1’s expected score for Book4

User1’s expected score for Book6

Ranking by these expected scores, the content-based recommendation algorithm recommends Book6, Book3, and Book4 to the user, in that order.

A content-based recommendation system takes the content or description of items as input and does not need usage data. It has no cold start problem and no popularity bias, and its content features are easy to explain. The disadvantages are that item content is difficult to standardize into terms, and that the recommendations lack diversity.
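
The expected-score step can be sketched as follows. The sketch assumes a similarity-weighted average of the user's own ratings, which is one common choice and may differ from the exact formula used in the worked example. The ratings and similarity values are made up, and `expected_ratings` is a hypothetical name.

```python
# Hypothetical ratings by one user (illustration only).
user_ratings = {"Book1": 4, "Book2": 3, "Book5": 5}

# Hypothetical "three most similar books" for each rated book, with similarities.
neighbours = {
    "Book1": {"Book6": 0.8, "Book3": 0.5, "Book2": 0.4},
    "Book2": {"Book4": 0.6, "Book1": 0.4, "Book3": 0.2},
    "Book5": {"Book6": 0.5, "Book3": 0.3, "Book4": 0.1},
}


def expected_ratings(user_ratings, neighbours):
    """Similarity-weighted average of the user's ratings (assumed formula)."""
    num, den = {}, {}
    for rated_book, score in user_ratings.items():
        for candidate, sim in neighbours[rated_book].items():
            if candidate in user_ratings:
                continue  # already purchased, nothing to predict
            num[candidate] = num.get(candidate, 0.0) + sim * score
            den[candidate] = den.get(candidate, 0.0) + sim
    return {book: num[book] / den[book] for book in num if den[book] > 0}


scores = expected_ratings(user_ratings, neighbours)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Dividing by the sum of similarities keeps each predicted score on the same 1 to 5 scale as the original ratings.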

 

2.3 Collaborative filtering algorithm

 

As the saying goes, “birds of a feather flock together”, and collaborative filtering makes recommendations to users on exactly this principle. Collaborative filtering algorithms infer users’ preferences from their usage data (ratings, downloads, streams, etc.).

 

Collaborative filtering methods can be roughly divided into two categories: neighborhood-based and model-based. Neighborhood-based methods directly predict a user’s score for a new item from the scores already recorded for existing items; they include the item-based and the user-based collaborative filtering algorithms. Model-based methods use historical rating data to learn a model of user-item interactions, and use the patterns the model captures to predict users’ scores for new items (model-building algorithms include Bayesian networks, clustering, classification, regression, matrix factorization, restricted Boltzmann machines, etc.; this section does not discuss them).

 

2.3.1 Item-based collaborative filtering algorithm

 

As can be seen from the user rating matrix in Table 2, each book can be represented as a vector of the ratings given to it by User1, User2, User3, User4, User5 and User6.

As in the content-based recommendation algorithm, with a vector representation for each book, the similarity between books can be measured by cosine similarity. Taking Book1 as an example, the cosine similarity formula is used to calculate the similarity between Book1 and each of the other books (Book2, Book3, Book4, Book5, Book6).

As above, a similarity matrix between books can be obtained by calculating the similarity of all books.

With a similarity matrix between books, we can make recommendations for a specific user. The approach is to select the books the user has already rated, and then find the three books most similar to each of them. The similarity between books is used as the weight to calculate the user’s expected ratings for books not yet purchased, and recommendations are made in order of expected rating from high to low.

Taking the recommendation for User1 as an example, the three books (Book1, Book2 and Book5) that User1 has already rated are selected according to the user rating matrix (Table 2), and for each of them the three most similar books are selected, as shown in the following table:

Then use the following formula to calculate the expected ratings of similar books User1 has not purchased, and recommend the two books with the highest expected ratings to the user.

From Table 5, we can see that the similar books User1 has not purchased are Book3, Book4, and Book6, i.e. i = 3, 4, and 6 in the above formula. Calculation:

User1 expects Book3 to have a rating of

User1 expects Book4 to have a rating of

User1’s expected rating for Book6 is

Ranking by these expected ratings, the item-based collaborative filtering algorithm recommends Book3, Book4, and Book6 to the user, in that order.
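
The item-based procedure can be sketched end to end as follows. Several things here are assumptions: the ratings are made up (they are not Table 2), a missing rating is encoded as 0 when building a book's vector, the expected rating is a similarity-weighted average, and all function names are hypothetical.

```python
import math

USERS = ["User1", "User2", "User3", "User4", "User5", "User6"]
BOOKS = ["Book1", "Book2", "Book3", "Book4", "Book5", "Book6"]

# Hypothetical ratings, made up for illustration; NOT the article's Table 2.
ratings = {
    "User1": {"Book1": 4, "Book2": 3, "Book5": 5},
    "User2": {"Book1": 5, "Book2": 3, "Book3": 4, "Book6": 4},
    "User3": {"Book1": 4, "Book3": 5, "Book4": 3, "Book5": 4},
    "User4": {"Book2": 2, "Book4": 4, "Book6": 5},
    "User5": {"Book3": 3, "Book5": 4, "Book6": 4},
    "User6": {"Book1": 3, "Book4": 2, "Book6": 5},
}


def item_vector(book):
    """Ratings of one book by every user; 0 stands for 'not rated' (assumption)."""
    return [ratings[user].get(book, 0) for user in USERS]


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def most_similar(book, k=3):
    """The k books whose rating vectors are closest to this book's vector."""
    others = [b for b in BOOKS if b != book]
    others.sort(key=lambda b: cosine(item_vector(book), item_vector(b)), reverse=True)
    return others[:k]


def recommend_item_based(user, k=3):
    """(book, expected rating) pairs for unrated books, highest first."""
    rated = ratings[user]
    num, den = {}, {}
    for rated_book, score in rated.items():
        for candidate in most_similar(rated_book, k):
            if candidate in rated:
                continue
            sim = cosine(item_vector(rated_book), item_vector(candidate))
            num[candidate] = num.get(candidate, 0.0) + sim * score
            den[candidate] = den.get(candidate, 0.0) + sim
    expected = {book: num[book] / den[book] for book in num if den[book] > 0}
    return sorted(expected.items(), key=lambda pair: pair[1], reverse=True)


print(recommend_item_based("User1"))
```

The only difference from the content-based sketch of the same idea is where the book vectors come from: here they are columns of the rating matrix rather than words in the title.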

 

2.3.2 User-based collaborative filtering algorithm

 

As can be seen from the user rating matrix in Table 2, each user can be represented as a vector of his or her ratings of Book1, Book2, Book3, Book4, Book5, and Book6.

 

As in the content-based recommendation algorithm, with a vector representation for each user, the similarity between users can be measured by cosine similarity. Taking User1 as an example, the cosine similarity formula is used to calculate the similarity between User1 and each of the other users (User2, User3, User4, User5, User6).

As above, an inter-user similarity matrix can be obtained by calculating the similarity between all users.

 

With a matrix of similarity between users, we can make recommendations for a specific user. The approach is to select the three users most similar to the user, and then build a list of candidate books from those that the similar users have purchased but the user has not. The similarity between the user and each similar user is then used as the weight to calculate the user’s expected score for each candidate book. Finally, recommendations are made in order of expected score from high to low.

Taking the recommendation for User1 as an example, select the three users who are most similar to User1 (User2 with a similarity of 0.75, User3 with a similarity of 0.63, and User5 with a similarity of 0.30) and list the books that similar users have bought respectively, as shown in the following table:

Then use the following formula to calculate the expected ratings of similar books User1 has not purchased, and recommend the two books with the highest expected ratings to the user.

From Table 8, we can see that the similar books User1 has not yet purchased are Book3, Book4, and Book6, i.e. i = 3, 4, and 6 in the formula above. Calculation:

User1 expects Book3 to have a rating of

User1 expects Book4 to have a rating of

User1’s expected rating for Book6 is

Ranking by these expected ratings, the user-based collaborative filtering algorithm recommends Book3, Book6, and Book4 to the user, in that order.
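
The user-based procedure can be sketched in the same style. As before, this rests on assumptions: the ratings are made up (so the similarity values differ from the 0.75, 0.63 and 0.30 quoted above), unrated books are encoded as 0 in a user's vector, the expected score is a similarity-weighted average over the neighbours who rated the book, and the function names are hypothetical.

```python
import math

BOOKS = ["Book1", "Book2", "Book3", "Book4", "Book5", "Book6"]

# Hypothetical ratings, made up for illustration; NOT the article's Table 2.
ratings = {
    "User1": {"Book1": 4, "Book2": 3, "Book5": 5},
    "User2": {"Book1": 5, "Book2": 3, "Book3": 4, "Book6": 4},
    "User3": {"Book1": 4, "Book3": 5, "Book4": 3, "Book5": 4},
    "User4": {"Book2": 2, "Book4": 4, "Book6": 5},
    "User5": {"Book3": 3, "Book5": 4, "Book6": 4},
    "User6": {"Book1": 3, "Book4": 2, "Book6": 5},
}


def user_vector(user):
    """One user's ratings of every book; 0 stands for 'not rated' (assumption)."""
    return [ratings[user].get(book, 0) for book in BOOKS]


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def nearest_users(user, k=3):
    """The k users whose rating vectors are closest to this user's vector."""
    others = [u for u in ratings if u != user]
    others.sort(key=lambda u: cosine(user_vector(user), user_vector(u)), reverse=True)
    return others[:k]


def recommend_user_based(user, k=3):
    """(book, expected score) pairs for books the user lacks, highest first."""
    num, den = {}, {}
    for neighbour in nearest_users(user, k):
        sim = cosine(user_vector(user), user_vector(neighbour))
        for book, score in ratings[neighbour].items():
            if book in ratings[user]:
                continue
            num[book] = num.get(book, 0.0) + sim * score
            den[book] = den.get(book, 0.0) + sim
    expected = {book: num[book] / den[book] for book in num if den[book] > 0}
    return sorted(expected.items(), key=lambda pair: pair[1], reverse=True)


print(nearest_users("User1"))
print(recommend_user_based("User1"))
```

A book rated by only one neighbour simply inherits that neighbour's score, which is one reason real systems often require a minimum number of supporting neighbours.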

 

Neighborhood-based methods are well known for their simplicity and efficiency, as well as for their ability to produce accurate and personalized recommendations. Collaborative filtering relies only on usage data (ratings, purchases, downloads and other user preference behaviors), does not require attribute information about users or products, has few input requirements, and produces good enough results in most scenarios. Its disadvantages are the cold start problem for new users and new products, and the difficulty of explaining the resulting recommendations.

 

2.4 Hybrid recommendation algorithm

 

A hybrid recommendation algorithm combines multiple recommendation algorithms in a weighted way and tries to achieve a better recommendation effect by “exploiting strengths and avoiding weaknesses”.

For example, for User1 we have applied the content-based recommendation algorithm, the item-based collaborative filtering algorithm and the user-based collaborative filtering algorithm in turn. As the results show, each algorithm has its own characteristics, so the algorithms produce different recommendations for User1. How can a hybrid recommendation algorithm balance these three different results? Let the content-based recommendation algorithm be Algorithm 1, the item-based collaborative filtering algorithm be Algorithm 2, and the user-based collaborative filtering algorithm be Algorithm 3. We have:

Suppose the weights configured in the weighted hybrid recommendation system for the content-based recommendation algorithm, the item-based collaborative filtering algorithm and the user-based collaborative filtering algorithm are w1, w2 and w3 respectively. Given the expected score of the k-th algorithm for the i-th book, User1’s expected ratings for the similar books not yet purchased can be calculated according to the following formula, and the two books with the highest expected ratings can be recommended to the user.

Assuming specific values for the weights, the calculation is:

User1 expects Book3 to have a rating of

User1 expects Book4 to have a rating of

User1’s expected rating for Book6 is

From these results, the hybrid recommendation algorithm recommends Book3, Book6, and Book4 to the user, in that order.
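
A weighted hybrid can be sketched as a weighted sum of the sub-algorithms' expected scores. Everything concrete below is an assumption: the per-algorithm scores and the weights are invented for illustration, the weights are chosen to sum to 1, and `hybrid_scores` is a hypothetical name.

```python
# Hypothetical expected scores from each sub-algorithm (illustration only).
scores = {
    "content": {"Book3": 4.1, "Book4": 3.3, "Book6": 4.4},
    "item_cf": {"Book3": 4.5, "Book4": 4.0, "Book6": 3.5},
    "user_cf": {"Book3": 4.1, "Book4": 3.0, "Book6": 4.0},
}

# Hypothetical weights w1, w2, w3; chosen to sum to 1 so that the result
# stays on the original rating scale.
weights = {"content": 0.2, "item_cf": 0.3, "user_cf": 0.5}


def hybrid_scores(scores, weights):
    """Weighted sum of each algorithm's expected score, per book."""
    combined = {}
    for name, per_book in scores.items():
        for book, score in per_book.items():
            combined[book] = combined.get(book, 0.0) + weights[name] * score
    return combined


combined = hybrid_scores(scores, weights)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking, combined)
```

In practice the weights are usually tuned against held-out data rather than picked by hand, which is the balancing effort the text refers to.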

 

The hybrid recommendation algorithm draws on several sub-algorithms, and its input is determined by all the sub-algorithms it depends on. It can combine the respective advantages of each sub-algorithm. Its disadvantage is that it is difficult to find the right balance for combining the sub-algorithms (weighting, switching, etc.), and it usually takes a lot of effort to balance the different algorithms when integrating them.

 

3. Recommendation system cases

 

With the common traditional recommendation algorithms above, we can already write a working recommendation system, but a good recommendation system is far from that simple. Let’s take Netflix’s movie recommendation system as an example and look at its architectural details, as shown in the following figure. Netflix’s recommendation system is composed of many different kinds of recommendation algorithms. The two most central are the restricted Boltzmann machine and singular value decomposition (given their complexity and space limitations, we did not discuss them in the earlier section on algorithm principles). Both belong to the category of collaborative filtering algorithms, so we can understand their working characteristics by analogy. Netflix’s recommendation system has three working modes: offline, nearline and online. In particular, the algorithm service module in online mode is responsible for the application strategy of the different recommendation algorithms.

 

Netflix recommendation system architecture diagram

 

Online computing uses the latest data to respond to online events in real time. The real-time requirement limits how complex the algorithms used in online computing can be, and the data set cannot be too large. By contrast, offline computing has fewer restrictions on computational complexity and data set size, so it allows a wider choice of algorithms. Nearline computing can be regarded as a combination of online and offline computing: it is similar to online computing, but its results are not returned in real time. Instead, the data is temporarily cached and the response is then loaded asynchronously. In particular, nearline computing lends itself naturally to incremental algorithms.

Part of the computation of a recommendation algorithm (especially a machine learning one) can be completed offline. However, the defining characteristic of a recommendation system is personalization, which requires a rapid response to new data and user behavior so that recommendations are updated in time. To achieve this, in addition to periodically scheduled model training, we also need to consider the time complexity of the different algorithms in specific circumstances, how to get the best recommendations with the lowest latency, how to store, read and write data, how to ensure transactional consistency, and so on. This requires us to analyze the specific application requirements in depth, choose recommendation algorithms and technologies carefully, and weigh and optimize the benefits of different algorithms under different strategies to achieve the best recommendation effect.
