Well, I still want to talk about this paper, even though a lot has already been said about it. After all, as a symbol of the large-scale application of deep learning models to industrial recommendation systems, it is unavoidable. Deep Neural Networks for YouTube Recommendations is a paper from YouTube published at RecSys 2016.

The structure of the paper is still classic: it is composed of a deep candidate generation model (the recall stage) and a deep ranking model (the ranking stage).

Overview

There are three main challenges to YouTube’s recommendation system:

  • Scale: Some algorithms perform well on small-scale problems but are difficult to apply to large-scale ones. YouTube runs one of the largest recommendation systems in the world, so scale is the first question for any algorithm it applies.
  • Novelty (what we usually call cold-start content): YouTube is, in a sense, a short-video company; unlike Netflix and Hulu, new content constantly arrives on the site, and how to promote that new content is a matter worth considering.
  • Noise: There is a lot of noise in both the implicit feedback and the content features, and the system needs to make good use of this noisy data.

YouTube’s model is trained on Google Brain, whose open-source version is known as TensorFlow. The model has on the order of one billion parameters and is trained on hundreds of billions of examples. The structure diagram of the system is as follows:


Candidate Generation

In the recall phase, YouTube selects hundreds of candidates from a massive corpus. Before neural networks, YouTube used a matrix factorization (MF) algorithm. With deep learning, YouTube models the retrieval stage as an extreme multiclass classification problem: for a specific user $U$ and context $C$, find the video $i$ from the corpus $V$ most likely to be watched at moment $t$:

$$P(w_t = i \mid U, C) = \frac{e^{v_i^\top u}}{\sum_{j \in V} e^{v_j^\top u}}$$

where $u \in \mathbb{R}^N$ is an embedding that represents the user and context, and $v_j \in \mathbb{R}^N$ is the embedding of each candidate video. The task of the deep network is to learn the user embedding $u$ from the user's watch history and the current context, and then use the softmax to select the videos most likely to be watched. The model is trained on implicit feedback, where a video the user watched to completion is taken as a positive example.
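To make the formula concrete, here is a minimal numpy sketch of that softmax over the corpus (the names `user_emb` and `video_embs` are my own, not from the paper):

```python
import numpy as np

def watch_probabilities(user_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """P(w_t = i | U, C) = exp(v_i . u) / sum_j exp(v_j . u).

    user_emb:   (d,)   the user/context embedding u
    video_embs: (V, d) one embedding v_j per video in the corpus
    """
    logits = video_embs @ user_emb   # one dot product per video
    logits -= logits.max()           # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()
```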

Instead of computing the full softmax when actually serving, YouTube uses an approximate nearest-neighbor search to select the N most likely candidates.
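Because the softmax is monotone in the dot product $v_j^\top u$, the top-N videos under the full softmax are exactly the top-N by inner product, so serving only needs a (typically approximate) nearest-neighbor lookup. A brute-force sketch of the same idea (an exact scan rather than a real ANN index; the names are my own):

```python
import numpy as np

def top_n_candidates(user_emb: np.ndarray, video_embs: np.ndarray, n: int) -> np.ndarray:
    """Indices of the n videos with the largest dot product v_j . u.

    A production system would swap this O(V) scan for an approximate
    nearest-neighbor index; the ranking it approximates is identical.
    """
    scores = video_embs @ user_emb
    top = np.argpartition(scores, -n)[-n:]       # unordered top-n
    return top[np.argsort(scores[top])[::-1]]    # order those n by score
```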

Model Architecture

Specifically, the structure of the recall model is as follows:

Ranking

The structure of the ranking model is similar to that of the recall model, except that the objective is watch_minutes_per_impression rather than CTR. This is mainly to avoid the click-bait problem: videos that deliberately attract users' attention with an enticing title but boring content, so that users click and then quit soon after.

I will not go into the feature engineering here; interested readers can consult the original text. We will mainly talk about the loss function. The goal of the model is to predict the watch time caused by an impression, whether that impression turned out positive or negative. The raw label of a positive example is the time the user spent watching the video. The paper designs a weighted logistic regression to handle this: essentially, they train the model as a logistic regression, but assign every negative sample a unit weight and every positive sample its watch time as the weight. As a result, the odds learned by the model approximate the expected watch time per impression, as derived in the last section below.
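As one concrete reading of that weighting scheme, here is a minimal sketch of a per-example weighted cross-entropy loss, with positives weighted by watch time and negatives by 1 (numpy; the function name and shapes are my own assumptions):

```python
import numpy as np

def weighted_logistic_loss(logits, labels, watch_time):
    """Cross-entropy where each positive sample is weighted by its watch time.

    logits:     (B,) raw model outputs Wx + b
    labels:     (B,) 1 for a watched impression, 0 otherwise
    watch_time: (B,) watch time per impression (ignored for negatives)
    """
    p = 1.0 / (1.0 + np.exp(-logits))                 # sigmoid
    weights = np.where(labels == 1, watch_time, 1.0)  # T_i for positives, 1 for negatives
    ce = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return np.mean(weights * ce)
```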

The structure of the whole Ranking model is as follows:

Modeling Expected Watch Time

Well, this is the most difficult part of the paper to understand. Here is my own understanding, which is not necessarily correct; criticism is welcome:

watch_minutes_per_impression is adopted as the prediction target, but this target is hard to predict directly, so the authors make some modifications to the model:

  • The weight of every positive sample equals its watch time
  • The weight of every negative sample equals one
  • Weighted logistic regression is used to handle this setup; for the underlying idea, see the article Weighted Logistic Regression Model. It is essentially a method related to resampling.
  • In standard logistic regression, the prediction $y$ is a number in $[0,1]$ given by $y = \frac{1}{1 + e^{-(Wx+b)}}$, and the odds are $\text{odds} = \frac{y}{1-y} = e^{Wx+b}$, where $y$ is the probability of a positive sample and $1-y$ is the probability of a negative sample.
  • In weighted logistic regression, with positives weighted by watch time, the learned odds become $\text{odds} = \frac{\sum_i T_i}{N - k}$: the sum of positive-sample watch times divided by the number of negative samples (where $T_i$ is the watch time of each positive sample, $N$ is the total number of samples, and $k$ is the number of positive samples).
  • Meanwhile, the expected watch time per impression is $E[T] = \frac{\sum_i T_i}{N}$, i.e., $\text{odds} = \frac{\sum_i T_i}{N - k} = \frac{E[T]}{1 - P} \approx E[T](1 + P) \approx E[T]$, where $P = k/N$ is the CTR. Because $P$ is relatively small, the last two expressions are approximately equal, so we can use the odds to estimate the expected watch minutes.
  • So at serving time, $e^{Wx+b}$ is the estimate for watch_minutes_per_impression (a numerical sanity check follows below).
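A quick numerical sanity check of the approximation in the last two bullets, on assumed synthetic data: with $k$ positives out of $N$ impressions, the odds learned by the weighted model are $\sum_i T_i / (N - k)$, which should sit close to the per-impression expectation $\sum_i T_i / N$ whenever $P = k/N$ is small:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1_000_000                           # total impressions
P = 0.02                                # small positive (click) rate
k = int(N * P)                          # number of positives
T = rng.exponential(scale=5.0, size=k)  # assumed watch times, in minutes

odds = T.sum() / (N - k)                # what weighted logistic regression learns
e_t  = T.sum() / N                      # true watch_minutes_per_impression

print(f"odds = {odds:.4f}")             # ~ E[T] / (1 - P)
print(f"E[T] = {e_t:.4f}")              # relative gap ~ P = 2%
```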