Introduction

The essence of a recommendation system is information filtering: a series of information funnels gradually narrows the pool down to the content users are most interested in, as shown in Figure 1 (from "Coarse-Ranking Model Optimization for iQiyi Short Video Recommendation"). As the first funnel, the recall stage filters candidates that users may be interested in out of a massive video library along multiple dimensions and hands them to the subsequent ranking stages, so it directly determines the upper bound of the final recommendation quality. This article introduces the development of multi-interest recall technology by the iQiyi Suike (Instant) recommendation team. Compared with other recall techniques, multi-interest recall can mine several of a user's latent interests at once, upgrading personalized recommendation from the traditional "a thousand people, a thousand faces" to "one person, a thousand faces".

Figure 1 Main process of video recommendation system [1]

01

Technical background: how to recall "good seedlings" and break out of the information cocoon

An excellent video recommendation system accurately distributes videos to users with matching interests. The process can be likened to selecting the elite athletes who eventually win world championships, with the recall stage corresponding to the very first city-team tryouts an athlete faces in childhood.

A good national-team coach matters, but without talented young athletes it is hard to produce world champions. Likewise, ranking can lift performance with rich features and clever network designs, but if every recalled video is of poor quality, the ceiling of the ranking stage is fixed in advance. Just as the national coach needs athletic talent drawn from many provinces and cities as a selection pool, the ranking stage needs multiple recall sources to supply candidates to rank.

When recall technology comes up, anyone familiar with recommendation can list many strategies and algorithms. Strategies include Apriori frequent-itemset mining based on content association, itemCF recall based on user-content co-occurrence, and SVD recall based on collaborative filtering. Algorithms include Item2vec and Node2vec, which embed content and then search for neighbors, CDML recall built on content understanding, and the GNN-based recall that has emerged in recent years.

Figure 2 Main process of multi-interest recall [2]

As shown in Figure 2, multi-interest recall, like other recall techniques, relies on the user's past behavior history; the difference is that it learns multiple interest representations per user, upgrading personalized recommendation from "a thousand people, a thousand faces" to "one person, a thousand faces". Each interest representation retrieves its own videos via nearest-neighbor search and becomes a recall source. On the one hand, multi-interest recall matches the reality that most users have several distinct hobbies, making recommendations both accurate and diverse and preventing the fatigue caused by homogeneous content. On the other hand, beyond mining users' existing interests, it keeps uncovering latent interests users themselves have not yet discovered, counteracting the "information cocoon" produced by traditional recommendation algorithms and surfacing iQiyi's vast content library to users.

At the same time, thanks to iQiyi's rich product matrix, one user often uses the main iQiyi app, iQiyi Suike, Kiwi TV, and other products simultaneously. By mixing multi-product user behavior in training, the model can extract both the interests a user shows on each product and the interests shared across products. These interests often help users find their favorite communities and circles, driving cross-product penetration and the compound ecosystem of the iQiyi product matrix. The multi-interest recall techniques currently in use include clustering multi-interest recall, MOE multi-interest recall, and single-activation multi-interest recall; this article introduces them in turn.

02

Clustering multi-interest recall

The main advantage of clustering multi-interest recall is that multiple interest vectors can be formed directly from embeddings already trained by other deep-learning models running online (such as the mature node2vec and Item2vec video embedding spaces) without training a complex neural network, at low cost in both time and space. Its main theoretical basis is PinnerSage, an interest-clustering method published at KDD 2020 [2]. (Despite the similar name, PinnerSage has nothing to do with the graph neural network PinSage.)

PinnerSage clustering multi-interest recall is a new strategy that adds clustering on top of traditional i2i recall. Traditional i2i recall usually takes one of two approaches: 1. Run a separate ANN search for each video in the user's short-term history; this is costly in time and makes the retrieved videos severely homogeneous. 2. Pool all video embeddings in the user's short-term history into one user embedding and run a single ANN search; this reduces time and space cost but easily loses information, because the pooled embedding, as shown in Figure 3, can land far away from any of the user's actual interests.

Figure 3

PinnerSage takes the best of both: it clusters the videos in a user's history into groups and pools each group into an interest vector. Clustering avoids the pressure of running many ANN searches while also limiting information loss. PinnerSage clustering multi-interest recall proceeds in two steps:

A. Clustering. As shown in Figure 4, PinnerSage clusters all the videos a user has watched with hierarchical clustering, which, unlike K-means, needs no preset number of clusters. Each video starts as its own cluster; at every step the two clusters whose merge increases within-group variance the least are combined, until any further merge would push the variance increase past a threshold.

Figure 4.
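The merge rule above is exactly what Ward linkage implements, so the clustering step can be sketched with SciPy; the threshold and the toy 2-D "video embeddings" below are illustrative, not iQiyi's production settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_history(embeddings, max_increase=2.0):
    """Group a user's watched-video embeddings hierarchically: start with
    each video as its own cluster, repeatedly merge the pair whose merge
    least increases within-cluster variance (Ward's criterion), and stop
    cutting once the merge cost exceeds the threshold."""
    if len(embeddings) == 1:
        return np.array([1])
    Z = linkage(embeddings, method="ward")   # Ward = min variance increase per merge
    return fcluster(Z, t=max_increase, criterion="distance")

# Toy example: two clearly separated groups of 2-D "video embeddings".
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = cluster_history(emb, max_increase=2.0)
```

With the threshold at 2.0, the cheap merges inside each pair happen but the expensive cross-group merge is cut off, leaving two interest clusters.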

B. Representative embedding. PinnerSage does not take the average embedding of the videos in a cluster; instead it selects one video's embedding as the representative of the cluster (interest cluster): the one that minimizes the sum of distances to all other video embeddings in the cluster. ANN retrieval is then applied to each representative embedding, which stands for one of the user's interests.
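The representative-selection rule is a medoid computation, which can be sketched as follows; the embeddings and the Euclidean metric are illustrative.

```python
import numpy as np

def cluster_medoid(cluster_embeddings):
    """Pick the single video embedding that minimizes the sum of distances
    to every other embedding in the cluster (the medoid), rather than
    averaging them into a point that may correspond to no actual video."""
    diffs = cluster_embeddings[:, None, :] - cluster_embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)          # pairwise Euclidean distances
    return cluster_embeddings[np.argmin(dists.sum(axis=1))]

# The outlier at (10, 0) pulls a mean toward it, but the medoid stays
# on an actual watched video near the dense region.
emb = np.array([[0.0, 0.0], [1.0, 0.0], [0.9, 0.1], [10.0, 0.0]])
rep = cluster_medoid(emb)
```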

Clustering multi-interest recall can derive multiple user interests through a simple, low-cost strategy. However, because embedding spaces produced by other algorithms are easily biased, the retrieved content tends to skew toward popular items and falls short of true personalization. The team therefore moved on to multi-interest networks in deep learning.

03

MOE multi-interest recall

The two-tower model is the mainstream recall model in the industry, but its effect is limited in practical scenarios. The team therefore modified the user-side tower of the two-tower model, introducing a structure similar to MOE [3], and extracted multiple vectors to represent a user's latent interests, achieving a large improvement. MOE is a classical structure widely used in multi-objective learning: input data is routed through multiple separately trained expert networks. Here, the outputs of the multiple experts are taken as the user's interest vectors; their inner products with the vector extracted on the video side are computed, and the most similar user vector participates in the loss.

Figure 5

The MOE multi-tower structure is shown in Figure 5: on the left is the multi-tower user side, on the right the single-tower video side. Implementation details include:

A. User input mainly consists of user preference sequences: the preferred video-ID sequence, uploader-ID sequence, and content-tag sequence. After embedding and average pooling, each sequence feature yields a vector; together these form the MOE multi-tower input. The MOE towers then produce multiple vectors representing the user's latent interests.

B. The video side is a single tower. Its input is the video-ID, uploader-ID, and content-tag features of the video the user interacted with.

C. For the loss, recall must find the few hundred videos a user may like out of a library of tens of millions, so the negative-sample space is enormous. To sharpen the model's ability to screen negatives and make negative sampling efficient, the model uses in-batch negative sampling: the other samples within the same batch serve as negatives for the current sample. A focal loss is also used to improve the model's handling of hard samples.
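A minimal NumPy sketch of in-batch negative sampling combined with a focal weighting; the dot-product scoring, temperature, and exact focal formulation are assumptions, not the production loss.

```python
import numpy as np

def in_batch_focal_loss(user_vecs, item_vecs, gamma=2.0, temperature=0.1):
    """For the i-th (user, item) positive pair, every other item in the
    batch serves as a negative. A focal weight (1 - p)^gamma down-weights
    easy samples so training focuses on hard ones."""
    logits = user_vecs @ item_vecs.T / temperature       # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    pos = np.diag(probs)                                 # positives sit on the diagonal
    return float(np.mean(-((1.0 - pos) ** gamma) * np.log(pos + 1e-12)))

# Aligned user/item pairs give a small loss; shuffling the items so each
# user's positive no longer matches makes the loss larger.
users = np.eye(3)
loss_aligned = in_batch_focal_loss(users, users)
loss_shuffled = in_batch_focal_loss(users, users[[1, 2, 0]])
```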

After the modified MOE multi-tower model went online, the click-through rate and per-capita viewing time of this single recall source improved markedly (overall CTR across all recall sources rose 0.64%; the CTR of videos distributed by this recall source was 28% higher than the all-source average, and their average play time per impression was 45% higher).

However, the MOE towers share the same bottom-layer input and rely on simple DNNs to extract the different vectors, so the towers are poorly differentiated and the resulting multi-vectors carry redundancy that is hard to optimize away. In addition, the positional information contained in user sequence features matters for modeling users but is hard for the current model to exploit, so the team hoped to capture it with other network structures.

04

Single-activation multi-interest recall

Single-activation multi-interest recall has been used in the industry since 2019. The most representative example is MIND [5], proposed by Alibaba. Its approach of capturing multiple interests by dynamically routing user sequences through a capsule network achieved striking results on test sets and sparked industry-wide enthusiasm for exploring multi-interest networks. Our recommendation team has been exploring it as well.

4.1 First version of single-activation multi-interest recall

Inspired by MIND and related networks, the team made an initial exploration of a single-activation multi-interest network; the structure is shown in Figure 6. MIND uses a capsule network to capture user interests, which models the correlation between sequence information and videos well, but its structure is complex and computationally expensive. Moreover, watch order has only a single dimension, so the network need not be highly sensitive to positional information. The team therefore chose a transformer structure instead to keep training fast.

Figure 6.

The general process is as follows:

A. The user's watched video-ID sequence {v1, …, vN} is taken as the sample, with the (N+1)-th video fed into the network as the target. Embedding the sequence gives E = {e1, e2, …, eN}.

B. E passes through the transformer interest-extraction layer to produce multiple interest vectors M. The interest vector Mi with the largest inner product with the target video embedding is selected and fed, together with the target, into a sampled-softmax loss with negative sampling, so each training step actually activates only one interest channel.

C. At inference, after the model is trained, all of the user's interest vectors are taken out and ANN retrieval is run for each one to obtain the recall results.
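The single-activation selection can be sketched as follows: score every interest vector against the target video embedding and train only the winning channel. The dimensions and vectors are illustrative.

```python
import numpy as np

def select_activated_interest(interest_vecs, target_emb):
    """Single activation: compute the inner product of each interest
    vector with the target video embedding and return the index of the
    highest-scoring one; only this channel receives gradient updates."""
    scores = interest_vecs @ target_emb
    return int(np.argmax(scores))

M = np.array([[1.0, 0.0],    # interest channel 0
              [0.0, 1.0],    # interest channel 1
              [0.7, 0.7]])   # interest channel 2
target = np.array([0.1, 0.9])  # target video embedding
k = select_activated_interest(M, target)
```

Here the target aligns best with channel 1, so only that channel would enter the sampled-softmax loss for this step.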

Although the first version has a simple structure, it performed well after going online, substantially improving consumption metrics, video coverage, and diversity. However, it also suffered from heavy overlap among the recall results of different interest vectors, few features, and poor freshness, which drove the evolution of several further versions.

4.2 Regularized multi-interest recall

In the first version (4.1), there is no constraint among the interest vectors, so they easily become too similar; regularization terms therefore need to be added to the loss function. Since the transformer is the core of the initial multi-interest recall, the team explored three regularization functions without changing the network structure [6].

Figure 7.

As shown in Figure 7, regularization constraints are applied to the learned video embeddings (Formula 1), the attention weights (Formula 2), and the interest vectors (Formula 3), respectively. In the actual production environment, regularizing the interest vectors directly gave the best results.
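A sketch of regularizing the interest vectors directly, the option that worked best in production; the exact regularizer is not specified in the text, so mean pairwise cosine similarity is an assumption here.

```python
import numpy as np

def interest_similarity_penalty(interest_vecs):
    """Formula-3-style regularizer (a sketch): penalize the mean pairwise
    cosine similarity of the interest vectors, pushing the channels apart
    instead of letting them collapse onto one interest."""
    norms = np.linalg.norm(interest_vecs, axis=1, keepdims=True)
    unit = interest_vecs / (norms + 1e-12)
    sim = unit @ unit.T                          # (K, K) cosine similarities
    mask = ~np.eye(sim.shape[0], dtype=bool)     # ignore self-similarity
    return float(sim[mask].mean())

# Collapsed channels are penalized heavily; orthogonal channels are not.
collapsed = interest_similarity_penalty(np.array([[1.0, 0.0], [1.0, 0.0]]))
spread = interest_similarity_penalty(np.array([[1.0, 0.0], [0.0, 1.0]]))
```

In training this penalty would be added to the recall loss with a small weight, trading a little fit for better-separated interests.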

4.3 Dynamic-capacity multi-interest recall

Different users show different degrees of interest divergence, so the number of interest vectors should be elastic rather than a fixed hyperparameter. Building on 4.1 and 4.2, as shown in Figure 8, an interest activation record table is introduced into the network structure.

Figure 8.

Whenever an interest vector is activated during training, the record table logs the activation. At inference, the table is consulted, and interest vectors that were never or rarely activated are removed. This yields a dynamic number of interests per user, matching the reality that different users' interests diverge to different degrees.
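The activation record table can be sketched as a per-channel counter; the `min_count` threshold and channel count are illustrative.

```python
import numpy as np

class InterestActivationTable:
    """Counts how often each interest channel wins the single-activation
    step during training; at inference, channels activated fewer than
    `min_count` times are dropped, so the number of live interests adapts
    per user instead of being a fixed hyperparameter."""

    def __init__(self, num_interests):
        self.counts = np.zeros(num_interests, dtype=int)

    def record(self, activated_index):
        self.counts[activated_index] += 1

    def live_interests(self, min_count=1):
        return np.flatnonzero(self.counts >= min_count)

table = InterestActivationTable(4)
for idx in [0, 0, 2, 0, 2]:   # simulated winning channels over training steps
    table.record(idx)
active = table.live_interests(min_count=2)   # channels 1 and 3 never fired
```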

4.4 Multimodal-feature multi-interest recall

In 4.1-4.3, multi-interest recall uses only the video-ID feature, which limits what can be learned. In subsequent versions, integrating uploader and content-tag features into training therefore became the main direction. The network structure is shown in Figure 9.

The transformer part is basically the same as in 4.1-4.3; the difference is that the added features are embedded and pooled before joining the training samples fed into the transformer. Two points are worth noting:

  1. In the loss, negative sampling is applied only to the video-ID embedding (unlike MIND and similar structures), so all video-ID embeddings can enter negative sampling rather than relying on in-batch negatives alone. This makes the video-ID embeddings, which the final inference stage mainly depends on, more accurate (the ANN step at inference does not use the tag or uploader features).

  2. A video usually has multiple tags, so after each tag is embedded, the tag embeddings must be pooled into one vector.
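The tag pooling in point 2 can be sketched as average pooling; the actual pooling operator is not specified in the text, so mean pooling is an assumption.

```python
import numpy as np

def pool_tag_embeddings(tag_embs):
    """A video carries several content tags; average-pool their embeddings
    into one vector so the video contributes a single tag feature to the
    transformer input."""
    return np.mean(tag_embs, axis=0)

# Two illustrative 2-D tag embeddings collapse into one pooled feature.
tags = np.array([[1.0, 0.0], [0.0, 1.0]])
pooled = pool_tag_embeddings(tags)
```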

Figure 9.

4.5 Summary

As shown in 4.1-4.4, the single-activation multi-interest network went through several rounds of evolution, and the improvements brought notable gains online: overall CTR rose 2%, overall watch time 1.5%, and per-capita plays 1.5%. Video diversity in particular increased directly by more than 4%.

At the same time, as a content platform for all ages, iQiyi is often used by whole families, with users of different ages sharing one account. The history under a single account therefore often mixes behaviors from all age groups, and this complexity poses problems for recommendation. However, the interest vectors of the single-activation multi-interest network are learned with randomness in sampling and are close to orthogonal mathematically, so their search ranges can recall large numbers of videos favored by different age groups.

Single-activation multi-interest networks are also a current academic hot spot, and we hope more researchers will contribute new ideas so that recommendation technology keeps shining.

05

Summary and outlook

This article has traced the development of multi-interest recall within the recall technology of iQiyi's short-video recommendation. Its biggest highlight is that it extracts multiple interests per user, expanding the former "a thousand people, a thousand faces" portrait into a "one person, a thousand faces" high-dimensional space. This improves the accuracy and richness of recommendations at the same time, while also probing for new interests so users do not sink into an information cocoon. The technology has also been explored for building the compound ecosystem of iQiyi's product matrix and for handling the problem of complex user histories.

We also believe multi-interest recall can still be optimized in several directions:

  1. In behavior-sequence selection, most multi-interest strategies and networks still consider only viewing history. If an event knowledge graph could bring users' search and subscription behaviors on the platform into the training data, more of users' interests and tendencies could be captured.

  2. Multi-interest recall does not yet handle negative feedback. Many in-video behaviors, such as clicks away, negative comments, "dislike", and unfollowing, have not been integrated into multi-interest recall. This information is also crucial for guiding the interest network, and this direction will be a key piece of future work.

  3. There is also ample room to integrate users' static information and preference features. Combining these features can align recall well with the ranking objective, improving the quality of the recall source and raising the ceiling of the ranking effect.

References

[1] iQiyi Technology Blog, 2021-02-26. How to Improve Link Objective Consistency? The Coarse-Ranking Model Optimization Process of iQiyi Short Video Recommendation.

[2] Aditya Pal, et al. PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest. KDD 2020.

[3] Jiaqi Ma, et al. Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts. KDD 2018.

[4] Yukuo Cen, et al. Controllable Multi-Interest Framework for Recommendation. KDD 2020.

[5] Chao Li, et al. Multi-Interest Network with Dynamic Routing for Recommendation at Tmall. CIKM 2019.

[6] Jian Li, et al. Multi-Head Attention with Disagreement Regularization. EMNLP 2018.
