Recommendation systems have become a very popular technology in recent years. Whether in e-commerce software or news apps, the claim is that an accurate recommendation system can push the content you are most interested in to you. “Toutiao”, a phenomenally successful news app, owes much of its strong momentum to this. This article describes some concepts and practical experience related to recommendation systems.

First, it is necessary to clarify the goals of a recommendation system. Generally speaking, they are the following:

  • User satisfaction: The recommendation system exists mainly to meet the needs of users, so accuracy is the most critical indicator for judging its quality.
  • Diversity: Although the recommendation system mainly serves the interests of users, it should also take into account the diversity of content and cover a user’s different interests according to their weights.
  • Novelty: Users see content they have not come across before. A simple way to do this is to remove from the recommendation list any content the user has previously acted on.
  • Surprise: Similar to novelty. Novel content is something the user has never seen but that is related to their past behavior, while surprising content is something the user has never seen, is unrelated to their past behavior, and yet turns out to be something they like.
  • Real-time: The recommendation system should update recommendations in real time according to the user’s context. Users’ interests also change over time, so recommendations need to keep up.
  • Recommendation transparency: Let the user know why each item they see was recommended. For example, “People who bought this book also bought ...” or “You bought xx, which is similar to this product”.
  • Coverage: Mining the long tail is also an important goal for recommendation systems, so the more content the recommendations cover, the better.
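Two of these goals are easy to make concrete. The sketch below is only an illustration, not part of any particular system: `filter_seen` applies the novelty rule of removing content the user has already acted on, and `catalog_coverage` measures coverage as the share of the catalog that appears in at least one recommendation list. Both function names and the data shapes are assumptions made for this example.

```python
from typing import Dict, Iterable, List, Set


def filter_seen(candidates: List[str], seen: Set[str]) -> List[str]:
    """Novelty: drop items the user has already acted on, keeping rank order."""
    return [item for item in candidates if item not in seen]


def catalog_coverage(rec_lists: Dict[str, List[str]], catalog: Iterable[str]) -> float:
    """Coverage: fraction of catalog items recommended to at least one user."""
    catalog_set = set(catalog)
    if not catalog_set:
        return 0.0
    recommended: Set[str] = set()
    for items in rec_lists.values():
        recommended.update(items)
    return len(recommended & catalog_set) / len(catalog_set)
```

For example, `filter_seen(["a", "b", "c"], {"b"})` keeps `"a"` and `"c"`. A coverage value close to 1.0 means the long tail is being reached.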

Based on these objectives, the recommendation system includes four recommendation methods:

  • Popular recommendation: The idea of a popularity leaderboard. This kind of recommendation exists not only in IT systems but everywhere in daily life. It is probably the most effective recommendation method; after all, popular items occupy high-exposure positions.
  • Manual recommendation: Content recommended through manual intervention rather than by popularity or algorithms. Some hot events, such as the World Cup and the NBA Finals, need to be added to the recommendation list manually. Recommending hot news is also very effective.
  • Related recommendation: Somewhat similar to personalized recommendation based on association rules: when you read a piece of content, you are prompted with related content.
  • Personalized recommendation: Content recommended based on the user’s historical behavior. This is the main subject of this article.

The first three have nothing to do with machine learning, yet they are the three most effective methods. Generally speaking, they should account for about 80% of all recommendations, with the remaining 20% being personalized recommendations aimed at the long tail.

Personalized recommendation system

Personalized recommendation is a typical application scenario for machine learning. In essence it addresses the same problem as a search engine: information overload. In a sense, a search engine is also a personalized recommendation system, but its input features can be obtained directly from the search keywords, whereas a general recommendation system has to obtain its input features through machine learning.

A personalized recommendation system generally consists of three parts: the logging system, the recommendation algorithm and the content presentation UI.

  • Logging system: The input source of the recommendation system and the source of all its information.
  • Recommendation algorithm: The core of the recommendation system, where the final recommendation results are computed from the input data.
  • Content presentation UI: How recommendations are presented also involves trade-offs: the UI should both serve the goals of the recommendation system and make it easier to collect user behavior information.

Among them, the most popular recommendation algorithms in personalized recommendation are as follows:

  • Content-based recommendations: Recommendations based on the attributes (feature vectors) of the content itself.
  • Recommendation based on association rules: The “beer and diapers” approach. It is a kind of dynamic recommendation that can react to users’ behavior in real time. It is based on feature correlations between items and in some cases degenerates into collaborative filtering.
  • Collaborative filtering recommendation: In contrast to recommendation based on association rules, it is a static recommendation based on analysis of users’ existing historical behavior. It can be divided into item-based collaborative filtering, user-based collaborative filtering and model-based collaborative filtering. Model-based collaborative filtering can in turn be divided into distance-based collaborative filtering; matrix factorization-based collaborative filtering, namely the Latent Factor Model (SVD); and graph model-based collaborative filtering, also known as the social network graph model.

The typical architecture of personalized recommendation system is shown in the figure below:

[Figure: recommend-sys]

Logs from the online business system are fed into the data highway, which quickly delivers them to the offline data processing platform and the online stream computing platform. The offline data processing platform periodically processes the data of a past period in batch mode to obtain crowd labels and other model parameters, and stores them in a cache for the online business system to use. At the same time, the online stream computing platform processes online log data in real time to supplement and correct the offline results. The online business system combines offline features and online features and applies certain logic to produce the output used by the business, and the logs it generates flow back into the data highway.

Based on this framework, the typical process of personalized recommendation system is shown as follows:

[Figure: recommend]

It can be seen that a recommendation system mainly consists of the following modules:

  • User behavior logs: Stores user behavior logs. This belongs to data statistics, and the logs are stored in Hive, so it is not covered further here.
  • Data ETL-1: Converts user logs into the data format required by the recommendation algorithm.
  • Recommendation algorithm: The most important part of personalized recommendation, which computes related content and recommendation results from user behavior.
  • Data ETL-2: Further processes the results of the recommendation algorithm into the input format of the storage modules.
  • User portrait storage: Stores user preferences and behavioral data, such as preferences for content keywords and which content has been clicked.
  • Recommendation result store: Stores the results generated by the various recommendation algorithms, in two parts: {user: itemList} results, the list of content recommended to a user; and {item: itemList} results, the list of content related to an item.
  • Service invocation module: Integrates the recommendation results and exposes an interface for serving recommendations externally.

Data ETL-1

Clean and process the raw user behavior data, for example selecting fields and attributes and normalizing formats, to produce the input for the recommendation algorithm.
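As a rough illustration of this step, the sketch below turns raw log lines into (user, item, score) triples, the kind of input a collaborative filtering or ALS job typically expects. The tab-separated log layout (`user_id`, `item_id`, `action`) and the action weights are hypothetical and chosen only for the example; real logs and weights will differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical implicit-feedback weights; tune these per business.
ACTION_WEIGHTS = {"view": 1.0, "like": 3.0, "share": 5.0}


def logs_to_ratings(lines: Iterable[str]) -> List[Tuple[str, str, float]]:
    """Aggregate raw log lines into sorted (user, item, score) triples."""
    scores: Dict[Tuple[str, str], float] = defaultdict(float)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # cleaning: skip malformed lines
        user, item, action = (p.strip() for p in parts)
        weight = ACTION_WEIGHTS.get(action)
        if not user or not item or weight is None:
            continue  # cleaning: skip empty fields and unknown actions
        scores[(user, item)] += weight
    return sorted((u, i, s) for (u, i), s in scores.items())
```

Malformed lines and unknown actions are dropped rather than raising an error, since raw behavior logs are rarely clean.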

Recommendation algorithm

For a personalized recommendation system, the recommendation algorithm is the core part. There are many popular algorithms, such as:

  • Recommendation based on content and user portraits: For this algorithm, see an earlier article: www.rowkey.me/blog/2016/0… .
  • Recommendation based on matrix factorization: Content is recommended to users based on the SVD/ALS algorithms. Compared with SVD, ALS is better suited to sparse matrices. Spark MLlib already integrates the ALS algorithm; you need to convert the data to the format ALS requires in ETL-1 and tune the ALS parameters. A more specific article describing how to use Spark for ALS-based recommendation: colobu.com/2015/11/30/… .
  • Collaborative filtering based on users and items: Includes user-based CF and item-based CF. Which of the two to use depends on the business. When there are a large number of users, user-based collaborative filtering is generally not recommended because of the cost of maintaining the user matrix; likewise, when there are a large number of candidate items, item-based collaborative filtering is not recommended.

The output of a recommendation algorithm is usually a list of items per user or a list of items per item. The main consideration here is the time complexity of the algorithm. Whatever the algorithm, once the number of users or content items reaches the millions, distributed computing such as MapReduce or Spark is needed.
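To make the two output shapes concrete, here is a minimal single-machine sketch of item-based CF. It assumes binary interactions (a user either touched an item or did not) and uses cosine similarity; both are choices made for this example, and a production job would run on something like Spark as noted above. `item_similarities` yields the {item: itemList} side and `recommend` yields the {user: itemList} side.

```python
import math
from collections import defaultdict
from typing import Dict, List, Set


def item_similarities(user_items: Dict[str, Set[str]]) -> Dict[str, Dict[str, float]]:
    """Cosine similarity between items over binary user-item interactions."""
    item_count: Dict[str, int] = defaultdict(int)
    co_count: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for i in items:
            item_count[i] += 1
            for j in items:
                if i != j:
                    co_count[i][j] += 1  # users who touched both i and j
    return {
        i: {j: c / math.sqrt(item_count[i] * item_count[j]) for j, c in related.items()}
        for i, related in co_count.items()
    }


def recommend(user: str, user_items: Dict[str, Set[str]],
              sims: Dict[str, Dict[str, float]], top_n: int = 10) -> List[str]:
    """Score unseen items by summed similarity to the items the user has touched."""
    seen = user_items.get(user, set())
    scores: Dict[str, float] = defaultdict(float)
    for i in seen:
        for j, s in sims.get(i, {}).items():
            if j not in seen:
                scores[j] += s
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]
```

The pairwise loop is quadratic in the number of items per user, which is one reason the cost grows quickly with data size.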

The basic process of the recommendation algorithm is shown in the figure below:

Data ETL-2

Clean and format the results produced by the recommendation algorithm so that they can serve as the input of the storage modules.

User portrait storage

Stores information such as user preferences and behavioral data. Preferences are quantified as tags, each with a value that decays over time. User portraits are written in batches and read in real time, so storage should prioritize read performance. A Redis cluster can be used to maximize read performance. However, Redis is expensive and does not support automatic indexing. Alternatively, you can use HBase as the storage and build a secondary index with Elasticsearch to meet the need to aggregate users along multiple dimensions (for example, finding all users under a tag).
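The text only says that tag values decay over time; it does not say how. One common choice, assumed here purely for illustration, is exponential decay with a half-life. In the sketch below the profile is a plain in-memory dict mapping a tag to a (weight, last update day) pair, and the 7-day half-life is a made-up default.

```python
from typing import Dict, Tuple

HALF_LIFE_DAYS = 7.0  # hypothetical: a tag weight halves every 7 days


def decayed_weight(weight: float, age_days: float,
                   half_life_days: float = HALF_LIFE_DAYS) -> float:
    """Exponentially decay a tag weight by its age in days."""
    return weight * 0.5 ** (age_days / half_life_days)


def update_tag(profile: Dict[str, Tuple[float, float]], tag: str,
               increment: float, now_days: float) -> None:
    """Decay the stored weight up to now, then add the new behavior signal.

    profile maps tag -> (weight, day of last update).
    """
    weight, last = profile.get(tag, (0.0, now_days))
    profile[tag] = (decayed_weight(weight, now_days - last) + increment, now_days)
```

Storing the last update time next to the weight means decay can be applied lazily at read or write time, so no periodic job has to touch every user.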

Recommendation result store

Stores the recommendation results computed by the various recommendation algorithms. The data is large and its format complex, so this module places high demands on storage capacity and read/write performance. A Redis cluster can be used as the storage solution for this part.

Service invocation

Integrates user portraits and recommendation results and provides the recommendation interfaces. The main overhead here is database I/O. The interfaces are:

  1. Get a list of recommended items by user ID.
  2. Get a list of related items by item.
  3. Get a user portrait by user ID.

This module needs a strategy for aggregating the results of the various recommendation algorithms, and it faces the business directly. The strategies must be configurable because they vary from business to business. The module also exposes an interface for user portraits so that the business side can use them for its own processing. An RPC mechanism can be used to expose the service interfaces.
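One possible shape for such a configurable aggregation strategy is sketched below. It is an assumption for illustration, not a prescribed design: each algorithm contributes a ranked list, the policy is a dict of per-algorithm weights, and an item’s score is the sum of weight / rank over the lists it appears in. The source names (`cf`, `portrait`, `hot`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate(results: Dict[str, List[str]], policy: Dict[str, float],
              top_n: int) -> List[str]:
    """Merge ranked lists from several algorithms using per-source weights."""
    scores: Dict[str, float] = defaultdict(float)
    for source, items in results.items():
        weight = policy.get(source, 0.0)
        if weight <= 0:
            continue  # sources missing from the policy are ignored
        for rank, item in enumerate(items):
            scores[item] += weight / (rank + 1)  # earlier positions count more
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]
```

Because the policy is plain data, each business can ship its own weights without code changes, for example `{"cf": 0.6, "portrait": 0.4}`.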

Questions to consider

For a recommendation system, there are also some issues that need to be considered in conjunction with its goals.

Real-time problem

Because computing the user and item matrices or performing matrix factorization has to be done offline and is time-consuming, it is difficult for collaborative filtering algorithms to work in real time. The real-time part of recommendation relies mainly on recommendation based on user portraits. The final recommendation list is the result of aggregating the two parts according to some strategy.

Time-sensitive content problem

Time-sensitive content is content that is strongly tied to time, such as news and current events. If a news item saying that xx player won the championship 10 days ago is recommended now, users will be confused or very disappointed. Therefore, time-sensitive content needs to be distinguished from ordinary content: it should either be recommended separately or be excluded from personalized recommendation.
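A simple way to keep stale time-sensitive items out of a candidate list is an age cutoff, sketched below. The `Item` fields (`published_at`, `time_sensitive`) and the cutoff parameter are assumptions for this example; how content is flagged as time-sensitive is left to the business.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    item_id: str
    published_at: float   # Unix timestamp in seconds
    time_sensitive: bool  # hypothetical flag set when the content is ingested


def drop_stale(items: List[Item], now: float, max_age_seconds: float) -> List[Item]:
    """Keep ordinary content; keep time-sensitive content only while it is fresh."""
    return [
        it for it in items
        if not it.time_sensitive or now - it.published_at <= max_age_seconds
    ]
```

Ordinary content passes through untouched, so the filter can sit in front of any of the recommendation methods.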

Cold start problem

Whichever recommendation algorithm is used, you will face the cold start problem: how do you recommend content to a new user? And when content is new, how do you recommend it to users?

  • For new users, one strategy is to use popular or manual recommendations to show the content that most people care about.
  • For new content, you can maintain two pools: a new content pool and a to-be-recommended content pool. Newly generated content first enters the new content pool. For each recommendation, one candidate is taken from the new content pool and its spread degree is incremented by 1. When the spread degree of a piece of content exceeds a threshold, it is moved to the to-be-recommended content pool. This not only solves the cold start problem for new content but also guarantees new content a certain amount of exposure.
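The two-pool scheme for new content can be sketched as follows. This is a minimal in-memory illustration: the class name, the random choice of candidate and the injectable random generator are assumptions, since the text does not say how the candidate is picked from the new content pool.

```python
import random
from typing import Dict, List, Optional


class ContentPools:
    """New content earns exposure before it joins the regular candidate pool."""

    def __init__(self, threshold: int, rng: Optional[random.Random] = None):
        self.threshold = threshold
        self.new_pool: Dict[str, int] = {}   # item -> spread degree
        self.recommend_pool: List[str] = []  # content to be recommended
        self.rng = rng or random.Random()

    def add_new(self, item: str) -> None:
        self.new_pool[item] = 0

    def pick_new(self) -> Optional[str]:
        """Pick one new item for this recommendation and bump its spread degree."""
        if not self.new_pool:
            return None
        item = self.rng.choice(sorted(self.new_pool))
        self.new_pool[item] += 1
        if self.new_pool[item] > self.threshold:
            # Enough exposure: move it to the to-be-recommended pool.
            del self.new_pool[item]
            self.recommend_pool.append(item)
        return item
```

With `threshold=2`, an item is shown three times from the new pool before it graduates.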

Diversity problem

In the user portrait-based recommendation algorithm, several of the user’s tags are taken, and a different amount of content is drawn from each tag according to its relevance. In this way the user’s various interests are taken into account, which addresses the diversity problem to some extent.

For example, if the user has tags A, B, C and D with relevance weights wA, wB, wC and wD, and Total recommendation is the number of items to recommend, then

RecommendList(u) = A[Total recommendation * wA] + B[Total recommendation * wB] + C[Total recommendation * wC] + D[Total recommendation * wD]

where A[n] denotes n items taken from the content under tag A.
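In practice Total recommendation * w is rarely a whole number, so the quotas have to be rounded. The sketch below is one way to do that, not the article’s own implementation: it normalizes the weights, takes the integer part of each quota and hands leftover slots to the tags with the largest remainders so that the quotas still add up to the total. The function names are made up for the example.

```python
from typing import Dict, List


def allocate_slots(tag_weights: Dict[str, float], total: int) -> Dict[str, int]:
    """Split `total` slots across tags in proportion to their weights."""
    weight_sum = sum(tag_weights.values())
    if weight_sum <= 0 or total <= 0:
        return {tag: 0 for tag in tag_weights}
    exact = {t: total * w / weight_sum for t, w in tag_weights.items()}
    slots = {t: int(x) for t, x in exact.items()}  # integer part of each quota
    leftover = max(total - sum(slots.values()), 0)
    # Largest fractional remainder first; ties are broken by tag name.
    by_remainder = sorted(exact, key=lambda t: (slots[t] - exact[t], t))
    for t in by_remainder[:leftover]:
        slots[t] += 1
    return slots


def recommend_list(tag_items: Dict[str, List[str]],
                   tag_weights: Dict[str, float], total: int) -> List[str]:
    """Take each tag's quota of items from the front of that tag's content list."""
    slots = allocate_slots(tag_weights, total)
    result: List[str] = []
    for tag, n in slots.items():
        result.extend(tag_items.get(tag, [])[:n])
    return result
```

If a tag has fewer items than its quota, the list simply comes out shorter; backfilling from other tags is left out to keep the sketch small.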

Content quality

Whether it’s popular recommendations, manual recommendations, or the list of content under a tag, the same question arises: how do you rank the content?

When a user has different degrees of interest in different content, the content can be ranked by degree of interest. But when interest cannot be distinguished (for example, the user is new, the content is new, or the user is equally interested in all content under a particular tag), content quality can be used for ranking. Clicks/PV is one way to judge content quality. In addition, algorithms related to convolutional neural networks can be used to build content quality models.
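A raw Clicks/PV ratio is noisy for content with very few page views: one click on one view scores 100%. The sketch below therefore adds a small prior to both counts. This smoothing is an addition made for the example, not something the approach above requires, and the prior values are arbitrary.

```python
from typing import Dict, List, Tuple


def quality_score(clicks: int, page_views: int,
                  prior_clicks: float = 1.0, prior_views: float = 20.0) -> float:
    """Clicks/PV with additive smoothing (the priors are illustrative values)."""
    return (clicks + prior_clicks) / (page_views + prior_views)


def rank_by_quality(stats: Dict[str, Tuple[int, int]]) -> List[str]:
    """stats maps item -> (clicks, page_views); the highest score comes first."""
    return sorted(stats, key=lambda item: (-quality_score(*stats[item]), item))
```

Setting both priors to 0 gives plain Clicks/PV, as long as every item has at least one page view.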

Surprise problem

The surprise goal has always been a hard problem for recommendation systems; it is known as the Exploit & Explore (EE) problem. Bandit algorithms are one school of approaches to this problem: estimate a confidence interval and recommend according to its upper bound, as represented by UCB and LinUCB. Put simply, the system recommends high-quality content to you regardless of whether you are known to like it, and then adjusts the recommendations according to your behavior feedback. For details, see this article: The Pros and cons of recommendation systems.
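As a concrete instance of the upper-confidence-bound idea, here is a minimal sketch of UCB1, the simplest member of that family. Each arm stands for a content category or item, and the reward is assumed to lie in [0, 1] (for example, 1 for a click and 0 otherwise). LinUCB, which also uses context features, is not shown.

```python
import math
from typing import Dict, List


class UCB1:
    """Pick the arm with the highest mean reward plus exploration bonus."""

    def __init__(self, arms: List[str]):
        self.counts: Dict[str, int] = {a: 0 for a in arms}
        self.rewards: Dict[str, float] = {a: 0.0 for a in arms}

    def select(self) -> str:
        # Try every arm once before trusting any estimate.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        total = sum(self.counts.values())

        def upper_bound(arm: str) -> float:
            n = self.counts[arm]
            return self.rewards[arm] / n + math.sqrt(2 * math.log(total) / n)

        return max(self.counts, key=upper_bound)

    def update(self, arm: str, reward: float) -> None:
        """Record user feedback for the arm that was shown."""
        self.counts[arm] += 1
        self.rewards[arm] += reward
```

Arms that have been shown rarely get a large bonus and are explored; as feedback accumulates the bonus shrinks and the better arms are shown more often.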

Conclusion

To conclude, here are a few remarks borrowed from another article about recommendation systems:

  • Truly capable algorithm engineers are always writing code. Such engineers can build models or rule bases for the actual problem and can really solve it. Experienced people, even those with a research background, often pay more attention to engineering, because a suitable, well-designed engineering architecture can often deliver far better results than optimizing the model or algorithm.
  • Academically minded [algorithm engineers] tend to pursue algorithms for algorithms’ sake rather than look for the algorithm best suited to the recommendation problem at hand. That is why big companies often hire algorithm engineers with Ph.D.s who, instead of working on algorithms, spend all day looking at reports. Why? Because it turns out there is little left to research in the algorithms, so all they can do is look at reports to find patterns.
  • Almost all so-called smart recommendation algorithms are more flash than substance.
  • When a recommendation system team starts to pay attention to the so-called dirty work of [data cleaning, data labeling, effect evaluation, data statistics, data analysis], there is hope for that recommendation system.

The above is some experience from practice with recommendation systems.
