
I. Overall architecture of the recommendation algorithm

(1) Significance of the recommendation algorithm

With the vigorous development of the Internet over the past decade, both user scale and content scale have grown rapidly. Products with more than 100 million daily active users are no longer unusual, and with the popularity of UGC production, platforms with content libraries of billions of items are also common. How to help a massive number of users find the content they like among a massive amount of content, and how to get that content accurately consumed by the right users, has always been a core problem for every company. Against this background, search systems and recommendation systems came into being. A search system mainly helps users actively look for content they are interested in; a recommendation system mainly pushes content to the right users, a more passive form of consumption. One pulls on the user side and the other on the content side, and together they are the medium that matches users with content. The recommendation system occupies a very core position in every company, and its significance mainly includes:

  • On the user side, it can push personalized content of interest to users in a timely and accurate way, continuously discover and cultivate users' potential interests, satisfy consumption needs, and improve user experience, thereby improving user activity and retention.

  • On the content side, as a traffic distribution platform, it provides positive feedback and incentives to producers (such as UGC authors and e-commerce sellers). By supporting promising small and medium producers, it promotes the prosperity and development of the overall content ecosystem.

  • On the platform side, the recommendation system is critical to both traffic and the efficiency of content distribution. Improving user experience improves retention and therefore DAU. Improving user conversion and traffic efficiency lifts core metrics such as order volume on e-commerce platforms and per-capita time spent on content platforms. Increasing the depth of user consumption grows overall platform traffic, lays the foundation for commercialization goals such as advertising, and improves core metrics such as ARPU (average revenue per user). The recommendation system is thus closely tied to many of a company's core metrics and has a strong pulling and promoting effect on them, which makes it highly significant.

(2) Basic modules of the recommendation algorithm

Given current constraints on computing power and storage, it is not yet feasible to implement recommendation as a single end-to-end model. Generally speaking, the recommendation system is divided into the following main modules:

  • Recommendation pool: generally, based on certain rules, items are selected from the overall material database (which may contain billions or even tens of billions of items) into the recommendation pool, which is then updated periodically as the rules change. For example, an e-commerce platform can build pools based on trading volume in the last 30 days and the price band of goods within their category, and a short-video platform can build pools based on release time and the number of plays in the last 7 days. Recommendation pools are usually built offline on a regular schedule.

  • Recall: selects thousands of items from the recommendation pool and sends them to the subsequent ranking modules. Because the candidate set facing recall is very large and results generally must be produced online, the recall module must be lightweight, fast, and low-latency. With the subsequent ranking modules acting as a safety net, recall does not need to be highly accurate, but it must not miss relevant items (this is especially true of the recall module in search systems). Currently the multi-channel recall paradigm is basically standard; channels are divided into non-personalized recall and personalized recall, and personalized recall further includes content-based, behavior-based, feature-based, and other methods.

  • Coarse ranking (pre-ranking): takes the results of the recall module, selects on the order of a thousand items from them, and sends these to the fine-ranking module. It can be understood as a round of filtering before fine ranking that reduces the load on the fine-ranking module. Sitting between recall and fine ranking, coarse ranking must balance precision and low latency, so in general its model should not be too complex.

  • Fine ranking: takes the results of the coarse-ranking module and scores and sorts the candidate set. As the most crucial module in the whole system, fine ranking needs to guarantee scoring accuracy within a tight latency budget, and it is also the most complex and most heavily studied module. Building a fine-ranking system generally involves three parts: samples, features, and the model.

  • Re-ranking: takes the fine-ranking results and makes a further adjustment based on operational strategy, diversity, context, and so on. Examples include boosting cosmetics-category items around a March 8th promotion, and breaking up runs of items from the same category, with the same cover image, or from the same seller, all to protect the user experience. Re-ranking is rule-heavy, but there are also many model-based schemes for improving its effect.

  • Blending (mixed ranking): multiple business lines that all want exposure in the feed need their results blended together, for example inserting advertisements into the recommendation stream, or inserting text posts and banners into a video stream. This can be done with rule-based strategies (e.g., ad targeting) and with reinforcement learning.
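Putting the modules together, the whole system is a funnel. The sketch below is only a schematic under assumed candidate counts; every stage here is a trivial stub standing in for a real service or model, and none of the names come from the article.

```python
# Schematic of the funnel described above. All stages are placeholder stubs;
# the candidate counts are order-of-magnitude illustrations only.
def multi_channel_recall(user, pool):      # millions of items -> a few thousand
    return sorted(pool)[:5000]

def coarse_rank(user, candidates):         # lightweight model, keep ~1000
    return candidates[:1000]

def fine_rank(user, candidates):           # heavy model, precise scores
    return candidates

def rerank(user, ranked):                  # diversity / operational rules
    return ranked

def blend(ranked, ads):                    # mix in ads, banners, other feeds
    return ads[:1] + ranked

def recommend(user, recommendation_pool):
    candidates = multi_channel_recall(user, recommendation_pool)
    candidates = coarse_rank(user, candidates)
    ranked = fine_rank(user, candidates)
    return blend(rerank(user, ranked), ads=["ad_1"])

print(recommend("user_42", {f"item_{i}" for i in range(10)})[:5])
```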

A recommendation system contains many modules and the related papers are endless, so it is, relatively speaking, quite complex. The most important thing in mastering recommendation algorithms is to sort out the overall algorithm architecture and the big picture: know how each module is done, what limitations and unsolved problems it has, and what means are available for optimization. The architecture diagram connects all the modules, so that you do not sink into one detail and lose your way. When reading a paper, first understand what problem it solves and what solutions already existed, and then understand how it solves the problem and what improvements and optimizations it offers over other solutions. This article mainly introduces the architecture of the recommendation algorithm to help readers grasp the overall picture.

II. Recall

(1) Multi-channel recall

The recall module faces a recommendation pool of millions of items, so the candidate set is very large. With the subsequent ranking modules acting as a safety net, recall does not need to be very accurate, but it must guarantee no omissions and low latency. It is currently realized mainly through multi-channel recall: on the one hand, the channels can be computed in parallel; on the other hand, they complement each other. Recall channels mainly fall into two categories: non-personalized and personalized.

  • Non-personalized recall

Non-personalized recall is independent of the user and can be built offline in advance. It mainly includes:

  • Popular recall: for example, short videos with relatively high play counts (VV) in the last 7 days. The score can be smoothed by combining CTR with time decay (a simple scoring sketch follows this list), and suspected click-fraud items with very low average watch time can be filtered out. Items with many likes or positive reviews can also be selected. This part is mainly rule-based. Because popular items easily produce a Matthew effect, if popular recall accounts for too large a share of the overall channels, its proportion may need to be suppressed.

  • High-efficiency recall: for example, short videos with high CTR, high completion rate, and high per-capita watch time are highly efficient, but they may have been on the shelves for only a short time, have few historical plays (VV), and need time to accumulate likes and reviews, so they may not make it into popular recall.

  • Operational-strategy recall: for example, rankings and lists of various categories built by operations, newly shelved items, and so on.
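As a rough illustration of the hot-item scoring idea above, the snippet below combines a Bayesian-smoothed CTR with an exponential time decay. The formula, field names, and hyperparameters are assumptions for illustration, not taken from any particular system.

```python
import math

def popularity_score(clicks, impressions, days_since_publish,
                     prior_ctr=0.05, prior_weight=100, half_life_days=7):
    """Hot-item score: Bayesian-smoothed CTR times an exponential time decay.
    All hyperparameters here are illustrative."""
    smoothed_ctr = (clicks + prior_ctr * prior_weight) / (impressions + prior_weight)
    time_decay = math.exp(-math.log(2) * days_since_publish / half_life_days)
    return smoothed_ctr * time_decay

# Example: a 3-day-old video with 500 clicks out of 8,000 impressions.
print(popularity_score(clicks=500, impressions=8000, days_since_publish=3))
```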

  • Personalized recall

Personalized recall is user-dependent, delivering different results to different users. By construction method, it mainly includes:

  • Content-based: items similar in content to what the user likes are selected, using tags such as the favorite directors, actors, and categories filled in at registration, or using items from the user's historical behavior as triggers. It mainly includes:

  • Tag recall: such as actor, director, item tag, category, etc.

  • Knowledge graph.

  • Multimodal: for example, items with similar title semantics, similar cover images, or similar video-understanding embeddings.

Generally, an inverted index is first constructed offline. Online, the user's tags or historical behavior items serve as triggers, and the corresponding candidates can be retrieved from the index. Building an inverted index from content does not require the item to have rich behavior, so it is friendly to cold items.
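A minimal sketch of the offline inverted index and the online trigger lookup just described; the data shapes and function names are hypothetical.

```python
from collections import defaultdict

def build_tag_inverted_index(items):
    """Offline: tag -> list of item_ids. `items` maps item_id -> list of tags."""
    index = defaultdict(list)
    for item_id, tags in items.items():
        for tag in tags:
            index[tag].append(item_id)
    return index

def content_recall(user_tags, index, limit=500):
    """Online: the user's tags (or tags of recently consumed items) act as triggers."""
    candidates, seen = [], set()
    for tag in user_tags:
        for item_id in index.get(tag, []):
            if item_id not in seen:
                seen.add(item_id)
                candidates.append(item_id)
    return candidates[:limit]

index = build_tag_inverted_index({"v1": ["comedy", "cat"], "v2": ["comedy"], "v3": ["sports"]})
print(content_recall(["comedy"], index))   # -> ['v1', 'v2']
```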

  • Behavior-based: mainly userCF and itemCF, both of which measure similarity through behavior and therefore require the user or the item to have rich behavior. UserCF first finds users whose behavior is similar to the current user's, then selects items from their behavior sequences as candidates. ItemCF finds, for each item, other items with similar behavior and builds an inverted index from them (a minimal itemCF sketch follows the comparison below).

What is the difference between the two? In my view, mainly the following:

  • UserCF requires rich user behavior, while itemCF requires rich item behavior. Therefore, for scenarios like news, where item freshness matters and there are many cold items, userCF can be considered.

  • Generally speaking, the number of users is far larger than the number of items in the recommendation pool, i.e., there are far more user vectors than item vectors, so both the storage pressure and the vector-retrieval pressure of userCF are greater than those of itemCF. Meanwhile, the user vector is far sparser than the item vector, so its similarity computation is less accurate than itemCF's.
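A minimal itemCF sketch matching the description above: offline, item-item similarity is computed from co-occurrence in user behavior and normalized by item popularity to build an i2i index; online, the user's recent items act as triggers. The normalization and top-k choices are illustrative, not a reference implementation.

```python
from collections import defaultdict
from itertools import combinations
import math

def item_cf_index(user_histories, top_k=10):
    """Offline itemCF: co-occurrence counts normalized by item popularity,
    kept as an i2i inverted index of the top-k most similar items."""
    co_counts = defaultdict(float)
    item_counts = defaultdict(int)
    for history in user_histories:            # history: items one user interacted with
        for item in set(history):
            item_counts[item] += 1
        for a, b in combinations(set(history), 2):
            co_counts[(a, b)] += 1.0
            co_counts[(b, a)] += 1.0
    sims = defaultdict(dict)
    for (a, b), c in co_counts.items():
        sims[a][b] = c / math.sqrt(item_counts[a] * item_counts[b])   # cosine-style normalization
    return {a: sorted(nbrs.items(), key=lambda kv: -kv[1])[:top_k] for a, nbrs in sims.items()}

def item_cf_recall(user_history, i2i, limit=200):
    """Online: the user's recent items are triggers into the i2i index."""
    scores = defaultdict(float)
    for trigger in user_history:
        for item, sim in i2i.get(trigger, []):
            if item not in user_history:
                scores[item] += sim
    return [i for i, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:limit]
```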

What are the disadvantages of collaborative filtering?

  • Since most users interact with only a few items, the user-item behavior matrix is very sparse, and some users have no behavior at all, which hurts the accuracy of vector similarity computation.

  • The numbers of users and items are both very large, so storing the behavior matrix is very costly.

  • The sparse matrix also brings another problem: popular head items tend to look similar to most items, producing the "Harry Potter problem" and an extremely severe Matthew effect.

To solve these problems, matrix factorization (MF) emerged. It decomposes the user-item behavior matrix into a user matrix and an item matrix, turning an M×N matrix into an M×K matrix and a K×N matrix. Each row of the user matrix is a K-dimensional user vector, and each column of the item matrix is a K-dimensional item vector. Unlike the vectors in CF, which are generated directly from behavior and have relatively clear meanings, the vectors in MF are called user latent vectors and item latent vectors. MF alleviates the sparsity problem of CF vectors; meanwhile, since K is far smaller than M and N, the high-dimensional sparse vectors are compressed into low-dimensional dense ones, greatly reducing storage pressure.

How can MF be implemented? It can be computed with SVD or with gradient descent. Because SVD has certain restrictions, gradient descent is generally used in practice, which is why MF is also called model-based CF.
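A minimal sketch of MF trained with plain SGD, as assumed from the description above (not the exact algorithm of any particular system); the hyperparameters are illustrative.

```python
import numpy as np

def train_mf(interactions, n_users, n_items, k=32, lr=0.05, reg=0.01, epochs=10, seed=0):
    """Factorize the implicit M x N behavior matrix into an M x k user matrix and a
    k x N item matrix with plain SGD. `interactions` is a list of (user, item, rating)
    triples; all hyperparameters are illustrative."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in interactions:
            err = r - U[u] @ V[i]
            u_vec = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_vec - reg * V[i])
    return U, V

# Recall with MF: score every item for a user and take the nearest item vectors.
U, V = train_mf([(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0)], n_users=2, n_items=3)
print(np.argsort(-(V @ U[0]))[:2])   # top-2 items for user 0
```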

In essence, MF is still built only on user behavior: the many features of users and items, such as user gender and age, are not exploited, so a large amount of information is lost. Hence LR and FM were born.

  • Feature-based: built on features such as the user's age, gender, device, geographic location, and behavior sequence, as well as the item's shelf time, video duration, and historical statistics. Feature-based recall makes full use of the available information, generally performs better, and is friendly to cold start; it has become the focus of research in recent years. It is mainly divided into:

  • Linear models: such as FM, FFM, etc.; not expanded here.

  • Deep models: DNN-based two-tower DSSM models (EBR, Mobius), YouTube DNN (a.k.a. DeepMatch); user-sequence-based MIND, SDM, CMDM, BERT4Rec; GNN-based Node2Vec, EGES, PinSage, etc. This is a big topic in its own right.

When used online, there are two ways:

  • Vector retrieval: the generated user embedding is used to search for similar item embeddings and thus find concrete items. Retrieval methods include hash bucketing, HNSW, and so on (a minimal two-tower and retrieval sketch follows this list).

  • i2i inverted index: other items similar to a given item are found via item embeddings, and an i2i index is built offline. Online, items in the user's behavior history serve as triggers, and candidates are looked up from the inverted index.
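A minimal two-tower sketch (assuming PyTorch): each tower encodes its own side, and user and item meet only in the final dot product, so item embeddings can be pre-computed offline and served through vector retrieval. The architecture, dimensions, and brute-force retrieval below are illustrative; production systems use ANN indexes such as HNSW or hashing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    """DSSM-style two-tower sketch: user and item are encoded separately and interact
    only through the final dot product."""
    def __init__(self, n_users, n_items, dim=64):
        super().__init__()
        self.user_tower = nn.Sequential(nn.Embedding(n_users, 128), nn.ReLU(), nn.Linear(128, dim))
        self.item_tower = nn.Sequential(nn.Embedding(n_items, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, user_ids, item_ids):
        u = F.normalize(self.user_tower(user_ids), dim=-1)
        v = F.normalize(self.item_tower(item_ids), dim=-1)
        return (u * v).sum(-1)    # cosine-style score, trained e.g. with sampled softmax

# Online serving sketch: pre-compute item embeddings, then retrieve nearest neighbours.
model = TwoTower(n_users=1000, n_items=5000)
with torch.no_grad():
    item_emb = F.normalize(model.item_tower(torch.arange(5000)), dim=-1)   # (5000, 64)
    user_emb = F.normalize(model.user_tower(torch.tensor([42])), dim=-1)   # (1, 64)
    top_items = torch.topk(item_emb @ user_emb.squeeze(0), k=10).indices   # brute-force top-k
print(top_items)
```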

  • Social network: find other people on the social graph through friends' likes, follow relationships, contacts, and so on, and recall items through them. The assumption is that if your friends like an item, you will probably like it too; birds of a feather flock together.

(2) Recall optimization

The above are the main channels of multi-channel recall. What are the main problems in recall? Personally, I think they are the following:

  • Negative sample construction: it is said that recall is the art of samples and ranking is the art of features. Positive samples for recall can be exposed-and-clicked items, but how should negative samples be chosen? Exposed-but-unclicked samples? No:

  • An exposed-but-unclicked item has already won the competition in the existing recall, coarse-ranking, and fine-ranking modules, which indicates that its quality and relevance are still decent, so it is certainly not suitable as a negative sample for recall.

  • SSB problem: recall faces the entire recommendation pool, but the items that get exposed are only a small subset of it. Constructing negative samples only from that subset leads to a very serious SSB (sample selection bias) problem and makes the model deviate badly from reality.

Given this, we can randomly sample items from the recommendation pool as negative samples, but that creates another problem: compared with positive samples, randomly sampled items are generally too easy to distinguish, so hard negative samples are needed to push the recall model further. Constructing hard negatives is a common direction in current recall research, mainly including:

  1. With the help of a model: for example, Facebook's EBR takes items in the middle range of the previous version's recall scores, e.g. items ranked roughly 101-500. They are not near the top, yet are related to the positive samples to some extent, so they are hard to distinguish and can serve as hard negatives. EBR, Mobius, PinSage and others share similar ideas. The difficulty with this method is defining exactly which items are "somewhat similar but not too similar", and it may require several attempts.

  2. Business rules: for example, selecting items in the same category or price range, following the practice in the Airbnb paper.

  3. In-batch: the positive samples of other users in the same batch serve as negative samples for the current user.

  4. Active learning: recall results are manually reviewed, and bad cases are used as negative samples.

Generally, hard negatives and easy negatives are mixed in a certain ratio, such as 1:100, as the negative samples for recall, with easy negatives making up the overwhelming majority.
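A minimal sketch of such mixing, assuming a hypothetical sample_negatives helper: easy negatives are drawn at random from the recommendation pool, and a small share of hard negatives is drawn from, e.g., mid-ranked items of the previous recall model. The ratio, sources, and signature are all assumptions.

```python
import random

def sample_negatives(positive_item, recommendation_pool, hard_candidates,
                     n_total=100, hard_ratio=0.01, rng=random):
    """Mix easy negatives (random items from the pool) with a small share of hard
    negatives (e.g. items ranked 101-500 by the previous recall model).
    The 1:100-style ratio is illustrative, as in the text above."""
    n_hard = max(1, int(n_total * hard_ratio))
    hard = rng.sample(hard_candidates, min(n_hard, len(hard_candidates)))
    easy_pool = [i for i in recommendation_pool if i != positive_item and i not in hard]
    easy = rng.sample(easy_pool, min(n_total - len(hard), len(easy_pool)))
    return easy + hard
```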

  • SSB problem: recall is oriented to the whole recommendation pool and the number of items is huge, so negative sampling is unavoidable, which introduces a relatively large sample selection bias. The sampled negatives therefore need to represent the whole recommendation pool as well as possible in order to improve the model's generalization ability. The crux is still negative sampling, especially hard negatives. Alibaba's ESAM uses transfer learning, with a loss regularization term that lets a model trained on exposed samples also apply to unexposed items, thereby mitigating SSB. Other approaches start from negative sampling itself, combining easy and hard negatives; EBR, Airbnb Embedding, Mobius, PinSage and others all contain ideas for hard-negative optimization.

  • Target inconsistency problem: the current goal of recall is still to find similar items, whether based on content, behavior, or features, whereas fine ranking and the final business metrics focus on conversion, and similarity does not necessarily yield good conversion. In the extreme case, recall could return only short videos similar to what the user recently watched. Baidu's Mobius introduces CPM in the recall stage and uses the business objective to truncate results after vector retrieval, which helps down-weight items with high relevance but low conversion. Alibaba's TDM rebuilds the ANN recall retrieval process with a max-heap tree, greatly reducing retrieval computation so that complex models such as DIN can be used, structurally aligning recall with ranking (though the samples still differ a lot); this can also be seen as an optimization for this kind of problem.

  • Competition problem: the results of all recall channels are eventually merged and de-duplicated, and if the overlap between channels is too high, the extra channels are meaningless. A new recall channel in particular needs to provide complementary gains over the historical channels, yet some overlap and competition among channels is inevitable. At the same time, candidates from some recall channels may not be competitive in fine ranking: items from channels with little recall history have few exposure samples and score low in fine ranking, so they may never surface. This love-hate relationship between recall and fine ranking also needs to be alleviated through full-link optimization.

About the author

Xie Yangyi

Applied algorithm researcher at Tencent.