Recently, I systematically read the papers related to YouTube's recommendation system and found them very rewarding, so I have organized my notes for readers. As the saying goes, "stones from other hills may serve to polish jade": I hope this gives you some help and inspiration.


Author | Lai Bo


YouTube is the world's largest video-sharing platform, with more than one billion users and millions of UGC and PGC videos uploaded every day. So the question is: how does it let users quickly find videos they are interested in among so many? You might think of search, and search is indeed an indispensable tool, but it comes with a precondition: the user must know the right keywords to find the corresponding videos. Much of the time, users do not know exactly what they want and visit YouTube purely to pass the time. To help users quickly find videos they may be interested in, a recommendation system is an excellent complement to search.


This article mainly introduces how the algorithms and strategies of YouTube's recommendation system have changed over roughly ten years.


The author found three papers about the YouTube recommendation system online:


  • The first, "Video Suggestion and Discovery for YouTube: Taking Random Walks Through the View Graph", was published in 2008.

  • The second, "The YouTube Video Recommendation System", was published in 2010.

  • The third, "Deep Neural Networks for YouTube Recommendations", was published in 2016.


These three papers describe the YouTube recommendation system and its internal algorithm architecture. Through them we can observe how the system evolved with the resources and technologies of each period, which is a valuable reference for anyone who wants to apply recommendation technology in their own scenario.


The 2008 paper is technically quite different from the two that followed. The authors frame the recommendation problem on a user-video graph. For a user u, they define the conditions a potentially interesting video v should satisfy:


  1. The path from u to v should be short

  2. There should be as many distinct paths from u to v as possible

  3. Paths from u to v should avoid popular videos (popularity can be judged by vertex degree)


According to the above three criteria, suitable videos can be recommended to each user over the graph. How is this achieved? The paper proposes a learning framework called Adsorption, which tackles the problem of using a small set of labeled nodes to infer labels for a large set of unlabeled nodes.


On this algorithmic framework the authors develop several equivalent formulations, such as Adsorption via Averaging, Adsorption via Random Walks, and Adsorption via Linear Systems, along with many definitions and algorithm descriptions; the experimental results are also very interesting. Random walks are a natural fit for graph models. The logic is to push the label of each vertex to its neighbors and normalize each vertex's label distribution at the end of every pass. Mapped to video recommendation, we build a graph from users' watch relationships (or other behaviors, such as comments and likes), treat the videos a user has watched as that user's labels, and then run the propagation to spread those labels to other videos.
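To make the mechanics concrete, here is a minimal sketch of this kind of label propagation in the spirit of Adsorption's averaging view. The graph, the seed-label scheme, and the iteration count are all illustrative choices, not the paper's production algorithm.

```python
# Minimal label-propagation sketch on a bipartite user-video co-view graph.
# Each vertex carries a label distribution (video -> weight); on every pass
# a vertex combines its injected seed labels with its neighbors' current
# distributions and normalizes. All data here is illustrative.
from collections import defaultdict

edges = {
    "user_a": ["video_1", "video_2"],
    "user_b": ["video_2", "video_3"],
    "video_1": ["user_a"],
    "video_2": ["user_a", "user_b"],
    "video_3": ["user_b"],
}

# Watched videos act as injected (seed) labels on user vertices.
seed_labels = {u: {v: 1.0 for v in vs} for u, vs in edges.items() if u.startswith("user")}

def propagate(edges, seed_labels, iterations=10):
    labels = {n: dict(d) for n, d in seed_labels.items()}
    for _ in range(iterations):
        new_labels = {}
        for node, neighbors in edges.items():
            dist = defaultdict(float)
            # Retain injected labels each pass, as Adsorption's shadow vertices do.
            for label, w in seed_labels.get(node, {}).items():
                dist[label] += w
            for nb in neighbors:
                for label, w in labels.get(nb, {}).items():
                    dist[label] += w
            total = sum(dist.values()) or 1.0
            new_labels[node] = {l: w / total for l, w in dist.items()}  # normalize
        labels = new_labels
    return labels

result = propagate(edges, seed_labels)
# Videos with high weight in a user's final distribution, minus those already
# watched, become recommendation candidates.
print(sorted(result["user_b"].items(), key=lambda kv: -kv[1]))
```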


The experimental results of this algorithm are very good, but the paper is stronger on results than on implementation details. My understanding is that running such an iterative algorithm over data at this scale is computationally very expensive, so applying it in a real production setting places heavy demands on systems engineering.


Compared with the 2008 paper, the RecSys paper published in 2010 changed significantly in both style and algorithmic architecture. The 2008 paper was academic and long, yet said little about how to apply the method to a real business.


The 2010 paper is very short, only four pages, and it introduces every aspect of the YouTube recommendation system simply and clearly. It also covers many tricks needed in a real business, such as how to counter the narrowing of interests caused by related-video recommendation and how to introduce a minimum score threshold to filter out irrelevant videos, which makes it a valuable reference in practice. The recommendation scenario in this paper is the YouTube home page, which aims to serve users personalized content to improve the site's interactivity and entertainment value. The core algorithm is essentially an item-based algorithm that generates a personalized video list for each user according to their historical behavior on the site.

The authors also mention that they frame this as a top-N recommendation problem rather than a click-prediction problem, giving full consideration to content freshness, accuracy, diversity, and recent user behavior. In addition, attentive readers will notice a series of annotations under each video's cover image in the recommendation list, including the title, publication time, view count, and a "Because you watched XXX" line. The recommendation reason matters a great deal: it lets the user know why a video was recommended to them. Moreover, the "Because you watched" seeds of the first four items in the list are all different, suggesting the strategy has been tuned for diversity.


Back to the core algorithm of the paper: related videos are recommended according to the user's historical behavior, so the central problem is computing the relatedness between videos. The author describes this as association rule mining, but in practice co-views are used to compute the similarity between two videos:

r(vi, vj) = cij / f(vi, vj)

In the formula, the numerator cij is the number of times videos vi and vj were co-viewed within a time window (one day in the paper), and the denominator f(vi, vj) is a normalization function that dampens the impact of popular videos. The paper gives a simple candidate, f(vi, vj) = ci · cj, where ci and cj are each video's total view count; in a real business this normalizer can be customized with domain knowledge. With f(vi, vj) = ci · cj, r(vi, vj) is essentially the confidence formula from association rule mining. For a given seed video vi, ci is constant across candidates and so does not affect the ranking, while dividing by cj directly suppresses popular videos, which improves diversity to some extent and gives low-exposure videos more support.
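As a concrete illustration, here is a minimal sketch of computing this relatedness score from session data, using the simple normalizer f(vi, vj) = ci · cj. The sessions and the time-window handling are simplified assumptions.

```python
# Minimal co-view relatedness sketch: r(vi, vj) = cij / (ci * cj).
# Each inner list holds the videos one user watched within the time window.
from collections import Counter
from itertools import combinations

sessions = [
    ["v1", "v2", "v3"],
    ["v1", "v2"],
    ["v2", "v3"],
    ["v1", "v4"],
]

view_count = Counter()    # ci: sessions in which each video appears
coview_count = Counter()  # cij: co-views per unordered pair

for videos in sessions:
    view_count.update(set(videos))
    for a, b in combinations(sorted(set(videos)), 2):
        coview_count[(a, b)] += 1

def relatedness(vi, vj):
    cij = coview_count[tuple(sorted((vi, vj)))]
    return cij / (view_count[vi] * view_count[vj])

# For a seed video, rank the other videos by relatedness.
seed = "v1"
scores = {v: relatedness(seed, v) for v in view_count if v != seed}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```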


The relatedness formula above is only a simple abstraction; the paper mentions that many other factors are considered in the actual business, such as play timestamps, play order, and video metadata. Once the relatedness between videos is available, each video can be treated as a node and each relatedness score r(vi, vj) as the weight of a directed edge, yielding a video graph. From this graph a candidate set is generated for each user u, and the candidates are then passed into the ranking layer to produce the final recommendation list.





Seed videos are drawn from the user's historical positive behaviors, such as favorites, likes, playlist additions, and ratings. With the video graph and the seed videos, a personalized candidate pool can be generated for each user via the item-based algorithm. However, the authors argue that this traditional approach narrows users' interests, so they expand the search beyond the nearest neighbors to neighbors of multiple orders. Writing S for the seed set and Ri for the set of videos related to vi, the expansion is roughly:

C1(S) = ∪ Ri  for vi ∈ S

Cn(S) = ∪ Ri  for vi ∈ Cn−1(S)

The final candidate set is the union of the Cn over a small number of steps, excluding the seed videos themselves.
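A minimal sketch of this expansion, assuming a precomputed related-videos map; the data and hop count are illustrative.

```python
# Multi-hop candidate expansion over a related-videos graph: start from the
# seed set S, repeatedly take the union of related videos, and collect every
# level, excluding the seeds themselves. The map below is illustrative.
related = {
    "seed_1": ["v1", "v2"],
    "v1": ["v3"],
    "v2": ["v3", "v4"],
    "v3": ["v5"],
}

def expand_candidates(seeds, related, max_hops=3):
    frontier = set(seeds)
    candidates = set()
    for _ in range(max_hops):
        # C_n(S): union of the related sets over the previous level.
        frontier = set().union(*(related.get(v, []) for v in frontier)) - set(seeds)
        candidates |= frontier
        if not frontier:
            break
    return candidates

print(expand_candidates({"seed_1"}, related))  # {'v1', 'v2', 'v3', 'v4', 'v5'}
```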
After generating candidates, the next step is to select from hundreds of videos the few to a dozen that will actually be shown to the user, which requires a ranking algorithm. The paper mentions three types of factors for the final score:


  • Video quality

  • User specificity (how well the video suits this user)

  • Diversity


Video quality is mainly judged from user feedback, such as view count, total watch time, ratings, comments, likes, and shares, along with information such as the video's upload time.


User fit is determined by how much the user likes the seed video and how similar the candidate is to it, with recent behavior weighted more heavily. Diversity is measured mainly by the number of distinct categories in the list. The final score of each candidate is then a linear weighting of these three factors (note: a tree model or an FM model could achieve better results), and the recommendation list is sorted by this score.
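As a rough illustration, a linear blend of the three factors might look like the following sketch; the feature values and weights are hypothetical placeholders, not the paper's.

```python
# Minimal linear-weighted ranking sketch over the three factor types.
def score(candidate, weights=(0.4, 0.4, 0.2)):
    w_quality, w_fit, w_diversity = weights
    return (w_quality * candidate["quality"]          # feedback-based quality
            + w_fit * candidate["user_fit"]           # affinity to the user's seeds
            + w_diversity * candidate["diversity"])   # category-level novelty

candidates = [
    {"id": "v1", "quality": 0.9, "user_fit": 0.3, "diversity": 0.8},
    {"id": "v2", "quality": 0.6, "user_fit": 0.9, "diversity": 0.2},
]
ranked = sorted(candidates, key=score, reverse=True)
print([c["id"] for c in ranked])
```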


The implementation of the recommendation system is divided into three parts:

  1. Data collection;

  2. Recommendation candidate generation;

  3. Online recommendation serving.

The system is built mainly on Bigtable and MapReduce.


The technology in the 2016 paper changed greatly from 2010. The data sources are no longer limited to users' explicit behavior, and the main framework is still the two-stage candidate-generation-plus-ranking architecture, but both stages now use deep models (four layers, so in fact not very deep), as shown in the architecture diagram in the 2016 paper.



In the main architecture, the candidate generation module first narrows the corpus of millions of videos (the full recommendation candidate pool) down to a few hundred that are relevant to the user. The ranking module then picks a dozen or so videos from that candidate list and presents them to the user. What is groundbreaking in this paper is that deep learning is used in both modules to integrate different features and data sources sensibly, with very good results. Let's look at the candidate generation model first.



This paper recasts recommendation as an extreme multiclass classification problem, with the following formula:

P(wt = i | U, C) = e^(vi · u) / Σj∈V e^(vj · u)

This is the probability that, at time t, user U (with context information C) watches video i out of the corpus V; each individual video is treated as a class, and i is one such class. The formula is the familiar softmax multi-classifier, where u and the vi are embedding vectors of the user and the videos respectively. Where do they come from? The video vectors are learned in the manner of word embeddings; as the paper puts it, "Inspired by continuous bag of words language models, we learn high dimensional embeddings for each video in a fixed vocabulary and feed these embeddings into a feedforward neural network." The user vector u is obtained by feeding user and context features through the model architecture above.


The whole model is a DNN with three hidden layers. The input layer is assembled as follows: the embedding vectors of the user's watch history and of their search history are each averaged, and then concatenated with the user's basic profile features (age, gender, etc.) and other features such as video quality and video age into a single input vector. As the paper's figure shows, the output is split into a serving path and a training path. The training output layer is a softmax, producing the probability given by the formula above; the serving path uses the u and v vectors to retrieve the top-N videos relevant to the user. (Serving refers to online inference, training to offline model training; the same applies below.)
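For intuition, here is a minimal PyTorch sketch of such a candidate-generation tower: averaged watch and search embeddings concatenated with dense features, three hidden layers, and a softmax output whose weight rows serve as the video vectors. Layer sizes and feature choices are illustrative, not the paper's exact configuration.

```python
# Minimal candidate-generation tower sketch (illustrative sizes, not the paper's).
import torch
import torch.nn as nn

class CandidateGenerator(nn.Module):
    def __init__(self, num_videos, num_tokens, embed_dim=64, dense_dim=4):
        super().__init__()
        self.video_embed = nn.Embedding(num_videos, embed_dim)
        self.token_embed = nn.Embedding(num_tokens, embed_dim)
        self.mlp = nn.Sequential(                       # three hidden layers
            nn.Linear(2 * embed_dim + dense_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, embed_dim), nn.ReLU(),
        )
        # The output layer's weight rows double as the video vectors v.
        self.output = nn.Linear(embed_dim, num_videos, bias=False)

    def user_vector(self, watch_ids, search_ids, dense):
        watched = self.video_embed(watch_ids).mean(dim=1)    # average watch history
        searched = self.token_embed(search_ids).mean(dim=1)  # average search tokens
        return self.mlp(torch.cat([watched, searched, dense], dim=-1))  # u

    def forward(self, watch_ids, search_ids, dense):
        return self.output(self.user_vector(watch_ids, search_ids, dense))  # logits

model = CandidateGenerator(num_videos=1000, num_tokens=500)
logits = model(torch.randint(0, 1000, (2, 10)),  # batch of watch histories
               torch.randint(0, 500, (2, 5)),    # batch of search tokens
               torch.rand(2, 4))                 # dense profile features
loss = nn.functional.cross_entropy(logits, torch.randint(0, 1000, (2,)))
# At serving time the softmax is skipped: top-N retrieval is a nearest-neighbor
# search between the user vector u and the rows of model.output.weight.
```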


Next, the ranking model framework.



In terms of architecture, the ranking layer is basically the same as the candidate-generation layer; the difference is the final output layer, which during training is a weighted logistic regression and at serving time uses e^x as the activation function. The ranking model is built around how long a video is watched rather than pure click-through rate; the paper points out that ranking purely by CTR tends to promote "clickbait". Positive samples are therefore weighted by watch time: the weight of a positive sample is its watch time Ti, the weight of a negative sample is 1, and the last layer is a weighted logistic regression. The odds the LR learns are:

odds = (Σi Ti) / (N − k)

where N is the total number of samples, k is the number of positive samples, and Ti is the watch time of the i-th positive sample. Since k is much smaller than N, the odds can be rewritten as E[T] / (1 − P), where P is the click-through rate; P is generally very small, so the odds are close to E[T], the expected watch time. Hence, at serving time, e^x is used as the activation function, giving an approximate estimate of expected watch time.
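To see the trick end to end, here is a minimal sketch using scikit-learn's logistic regression with watch time as the positive-sample weight; the features and data are synthetic, and the paper's actual model is the DNN described above with a weighted-logistic final layer.

```python
# Weighted logistic regression sketch: positives weighted by watch time T_i,
# negatives weighted by 1; e^logit at serving approximates expected watch time.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                 # impression features (synthetic)
y = rng.binomial(1, 0.1, size=1000)            # 1 = clicked (positive sample)
watch_time = rng.exponential(120, size=1000)   # seconds; used only for positives

weights = np.where(y == 1, watch_time, 1.0)    # T_i for positives, 1 for negatives
model = LogisticRegression().fit(X, y, sample_weight=weights)

# Serving: exp of the learned logit approximates the odds, i.e. E[watch time].
logits = model.decision_function(X[:5])
expected_watch_time = np.exp(logits)
print(expected_watch_time)
```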


In addition, the paper devotes considerable space to feature engineering (which sits somewhat oddly beside deep learning's promise of automatic feature extraction, haha). The authors say that although deep learning can relieve the burden of hand-crafting features, raw data cannot be fed directly to a feedforward neural network, so feature engineering remains very important.


Architecturally, the whole recommendation system is built on Google Brain, with models implemented in TensorFlow.


Through these papers from different periods, we can see the changes in YouTube's technology and the decade-long accumulation of its recommendation expertise. There is much here worth learning from.


References:

[1] Baluja, S., Seth, R., Sivakumar, D., Jing, Y. Video Suggestion and Discovery for YouTube: Taking Random Walks Through the View Graph. Proceedings of the 17th International Conference on World Wide Web, 2008.

[2] Davidson, J., Liebald, B., Liu, J. The YouTube Video Recommendation System. Proceedings of the Fourth ACM Conference on Recommender Systems, 2010.

[3] Covington, P., Adams, J., Sargin, E. Deep Neural Networks for YouTube Recommendations. Proceedings of the 10th ACM Conference on Recommender Systems, 2016: 191-198.


This article is reproduced from the Tencent Cloud technology community (Teng Yun Ge); the original address is

https://www.qcloud.com/community/article/989677