Today we’re going to examine a classic paper: Practical Lessons from Predicting Clicks on Facebook. As the title suggests, it comes from Facebook’s advertising team. It describes a method that combines GBDT with an LR model to predict ad click-through rate. Although several years have passed, the method is still not completely outdated, and some smaller companies still use it.

This paper is a true classic. It is a must-read in the recommendation and advertising field, and its ideas are now common knowledge in the industry. The quality is very high and the content is relatively basic, which makes it a great first paper for beginners.

This article is a bit long.

Introduction

The beginning of the paper briefly describes the state of online advertising at the time and the scale Facebook had reached: 750 million daily active users and more than one million active advertisers. Selecting appropriate and effective ads to show to users was therefore critically important for Facebook. On this basis, the paper introduces Facebook’s approach of combining GBDT with a logistic regression model, which delivered an improvement of more than 3% on real data.

In 2007, Google and Yahoo introduced online bidding for ads, but Facebook differs from the search-engine setting, where users have a clear search intent. A search engine filters ads based on that intent, so the candidate set of ads is not very large. Facebook has no such strong intent signal, so the number of candidate ads is much larger, and the pressure on and demands of the system are correspondingly higher.

However, the paper does not discuss the system side; it focuses only on the final ranking model. Although the paper does not say so explicitly, the ads on search engines like Google and Yahoo are search ads, while the ads on Facebook are recommendation ads. The biggest difference between the two is the retrieval (recall) logic, much like the difference between a recommender system and a search system.

The core issue is user intent. Although users show no strong intent when they log in to Facebook, we can extract weak intent signals from their past browsing behavior and habits: which items they spend the most time on, what content they click most often, and techniques such as collaborative filtering that abstract user behavior into vectors. There is a lot of valuable work to be done here, which is why Facebook had plenty of material to write about.

The specific approach

With the preamble out of the way, let’s look at the specific approach, which many readers may already know as GBDT+LR. It sounds like a single sentence, but there are a lot of details behind it. Why GBDT? Why does GBDT work? What is the mechanism at play here? What is written in the paper is only the surface; thinking through and analyzing these details is the key, because the method in the paper is not universal, but the reasoning behind it often is.

First of all, the paper proposes two new model-evaluation metrics: Normalized Entropy and Calibration. Let’s start with model evaluation.

Normalized Entropy

This metric is quite common in practice and often appears in code and papers. The literal meaning is normalized entropy, but it is better understood as normalized cross entropy: the ratio of the average cross entropy per sample to the cross entropy of the background CTR.

Background CTR refers to the empirical CTR of the training set, which can be understood as the average click-through rate. Note that it is not simply the positive-to-negative ratio in the training data: we usually sample before training the model, for example keeping positives and negatives at a 1:3 ratio, so the background CTR should be the rate before sampling. If we denote it by p, the NE formula can be written as:


$$NE = \frac{-\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1+y_i}{2}\log p_i + \frac{1-y_i}{2}\log(1-p_i)\right)}{-\left(p\log p + (1-p)\log(1-p)\right)}$$

Here $y_i \in \{-1, +1\}$ is the label, so a click is +1 and no click is -1, and $p_i$ is the model’s predicted click probability. This is the formula as given in the paper; in practice we usually write a click as 1 and an impression without a click as 0.
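To make the definition concrete, here is a minimal sketch in Python (my own, not from the paper) that computes NE from 0/1 labels and predicted probabilities; the background CTR is passed in explicitly so it can reflect the rate before any negative down-sampling:

```python
import numpy as np

def normalized_entropy(labels, preds, background_ctr, eps=1e-12):
    """Average log loss of the predictions divided by the entropy of the background CTR.

    labels: array of 0/1 clicks; preds: predicted click probabilities;
    background_ctr: empirical CTR before any down-sampling.
    """
    preds = np.clip(preds, eps, 1 - eps)
    # average cross entropy of the model's predictions
    ce = -np.mean(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))
    # cross entropy of always predicting the background CTR
    p = background_ctr
    base = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return ce / base

# toy usage
labels = np.array([1, 0, 0, 1, 0])
preds = np.array([0.8, 0.2, 0.1, 0.6, 0.3])
print(normalized_entropy(labels, preds, background_ctr=0.4))
```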

Calibration

Calibration literally means calibration, but here I think it is best understood as the deviation from the baseline. This metric is the ratio of the average CTR predicted by the model to the background CTR. The closer the ratio is to 1, the smaller the model’s deviation from the baseline and the closer it is to reality.

This formula can be written as:
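$$\text{Calibration} = \frac{\text{average estimated CTR}}{\text{empirical CTR}}$$

In code this is essentially a one-liner; a minimal sketch (mine, not the paper’s), where `background_ctr` is again the empirical CTR before any down-sampling:

```python
import numpy as np

def calibration(preds, background_ctr):
    # ratio of the average predicted CTR to the empirical (background) CTR;
    # values close to 1 mean the model's predictions are well calibrated
    return float(np.mean(preds)) / background_ctr
```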


AUC

AUC is an old standby and the most commonly used metric in our industry. As we introduced in previous articles, AUC stands for area under the ROC curve. The ROC curve is traced out by the true positive rate (TPR) and the false positive rate (FPR). The larger the area under this curve, the stronger the model’s ability to distinguish positive from negative samples. In the CTR ranking scenario, the absolute accuracy of the predictions is not the most important thing; what matters most is the ability to rank positive samples ahead of negative ones, which is why AUC is so important.


However, AUC is not without its drawbacks, and the paper lists one of them. For example, if the model overpredicts every CTR by a factor of 2 and we then apply a global multiplier of 0.5 to calibrate the predictions, the AUC remains unchanged, but the NE improves.
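A small sketch (my own toy data, using scikit-learn) illustrates this: halving all predictions is a monotonic transformation, so AUC is untouched, while NE changes.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

labels = np.array([1, 0, 0, 0, 0, 1, 0, 0, 0, 0])
# a model that overpredicts: scores are roughly twice the true click probabilities
preds = np.array([0.8, 0.5, 0.4, 0.3, 0.6, 0.7, 0.4, 0.5, 0.3, 0.5])
calibrated = 0.5 * preds  # global multiplier of 0.5 to fix the calibration

p = labels.mean()  # toy background CTR
base_entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))

for name, scores in [("raw", preds), ("calibrated", calibrated)]:
    ne = log_loss(labels, scores) / base_entropy
    print(name, "AUC:", roc_auc_score(labels, scores), "NE:", ne)
# AUC is identical in both cases because the 0.5x scaling is monotonic,
# while NE is lower for the calibrated predictions, which sit closer
# to the true click rates.
```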

The combined model

Finally, the highlight. Although the paper covers many other aspects, we all know that GBDT+LR is its centerpiece. GBDT and LR are both familiar to us, but how are they combined?

In fact, the question itself is slightly off. The so-called GBDT+LR is not a combination of two models but a transformation of features. That is, we need to think about it in terms of features rather than models.

The paper describes two common feature-processing methods. The first is called binning, that is, dividing a feature into buckets. Income, for example, is a continuous feature. If we feed it into the model directly, what the model learns is a single weight for it, which means it acts linearly. In the real world, however, its effect may not be linear at all; rich people may prefer completely different brands from poor people, and we want the model to capture such non-linear effects. A better approach is to split the feature into buckets by hand, for example annual income below $30,000, between $30,000 and $100,000, between $100,000 and $500,000, and above $500,000. Whichever bucket the value falls into is marked as 1, and the others are marked as 0.
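As a minimal sketch (the bucket boundaries are the hypothetical ones above, not from the paper), the bucketing can be written as:

```python
import numpy as np

# hypothetical income buckets (in dollars): [0, 30k), [30k, 100k), [100k, 500k), [500k, inf)
BOUNDARIES = [30_000, 100_000, 500_000]

def bucketize(income):
    """One-hot encode a continuous income value into its bucket."""
    one_hot = np.zeros(len(BOUNDARIES) + 1)
    one_hot[np.searchsorted(BOUNDARIES, income, side="right")] = 1
    return one_hot

print(bucketize(45_000))   # [0. 1. 0. 0.]
print(bucketize(750_000))  # [0. 0. 0. 1.]
```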

The second approach is called feature combination, or feature crossing, and it is also easy to understand. For example, gender is one categorical feature and whether the user is high-income is another. Crossing them gives four categories: male low income, male high income, female low income, and female high income. For continuous features, data structures such as k-d trees can be used for discretization first. Crossing categorical features pairwise gives us many more features, but they are not always useful: some add nothing, and some are useful but too sparse. So although this produces a lot of features, it requires manual filtering, and many of the crossed features turn out to be ineffective.
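A minimal sketch of such a pairwise cross (the feature names and values are hypothetical):

```python
from itertools import product

genders = ["male", "female"]
income_levels = ["low_income", "high_income"]

# enumerate all crossed categories, e.g. "male_x_high_income"
crossed_vocab = [f"{g}_x_{i}" for g, i in product(genders, income_levels)]

def cross_features(gender, income_level):
    """One-hot encode the crossed (gender, income) category."""
    return [1 if c == f"{gender}_x_{income_level}" else 0 for c in crossed_vocab]

print(crossed_vocab)
print(cross_features("female", "high_income"))  # [0, 0, 0, 1]
```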

Filtering features by hand was a lot of work for limited benefit, so engineers began to wonder: is there a way to filter and cross features automatically? Today we all know that neural networks can do automatic feature crossing and selection, but deep models were not yet in common use for this, so another route was needed. To solve the problem, the engineers of the time turned to GBDT.

That’s why we have GBDT+LR. Let’s look at this picture:


Let’s briefly review the GBDT model. GBDT is an ensemble of multiple trees, and during prediction each sample falls into exactly one leaf of every subtree. This is what lets us use GBDT to map features. Let’s look at an example:


In the example above, the GBDT has 3 subtrees in total. The first subtree has 3 leaf nodes and our sample falls into the first one, so its one-hot encoding for the first subtree is [1, 0, 0]. The second subtree also has three leaves and the sample falls into the second one, giving [0, 1, 0]. Similarly, the third subtree has four leaves and the sample falls into the third, giving [0, 0, 1, 0].

Finally, we concatenate the vectors from these trees into a new vector, [1, 0, 0, 0, 1, 0, 0, 0, 1, 0], which becomes the input to our LR model.

Let’s analyze the details. First, it is clear that GBDT is used only for feature transformation and processing; its own predictions are not what matters. With GBDT we get both of the operations mentioned above at once: continuous features are converted into discrete ones, and feature crossing happens automatically. Each subtree is essentially a CART decision tree, so the path from the root to a leaf represents a potential rule. We can therefore roughly view GBDT as a replacement for manual rule extraction and feature engineering.
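Here is a minimal, runnable sketch of the idea using scikit-learn (my own illustration, not Facebook’s implementation): train a GBDT, use its `apply` method to get the leaf index of each sample in every tree, one-hot encode those indices, and feed the result into a logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. train the GBDT; its job here is feature transformation, not prediction
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbdt.fit(X_train, y_train)

# 2. map each sample to the leaf it lands in for every tree, then one-hot encode
leaves_train = gbdt.apply(X_train)[:, :, 0]  # shape: (n_samples, n_trees)
leaves_test = gbdt.apply(X_test)[:, :, 0]
encoder = OneHotEncoder(handle_unknown="ignore")
Z_train = encoder.fit_transform(leaves_train)
Z_test = encoder.transform(leaves_test)

# 3. train LR on the one-hot leaf features
lr = LogisticRegression(max_iter=1000)
lr.fit(Z_train, y_train)

print("GBDT+LR AUC:", roc_auc_score(y_test, lr.predict_proba(Z_test)[:, 1]))
```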

The results

Finally, let’s look at the results. The original paper runs three experiments for comparison: LR only, GBDT only (trees only), and GBDT+LR. The metric is NE. Taking trees-only, which has the largest NE, as the reference, the NE of the GBDT+LR group is 3.4% lower.

We pasted the results from the paper:


This is a somewhat clever way to present the result: the comparison is in NE, which means the cross entropy goes down, but we do not know what happened to AUC, which the paper does not report. Also, the decline is measured against the largest NE as the reference, which makes it look a little exaggerated. Of course, this is a common trick in papers, and we should read it with that in mind.

Data freshness

In addition to innovations in the model and features, the paper also discusses the role of data freshness.

To verify the relationship between data freshness and model performance, the paper takes one day of data, trains trees-only and GBDT+LR models on it, and then uses these two models to predict on data from 1 to 6 days later, plotting the overall results in a chart:


In this figure, the horizontal axis is the number of days between the prediction data and the training data, and the vertical axis is the model’s NE; lower NE means a better model. We can see that the NE on day six is about 1% worse than on day zero, which means we can gain roughly 1% simply by keeping the data fresh. This is also why major companies do online learning.

Online learning

The paper then introduces some details of its online learning setup. Some of them are out of date or not universally applicable, so I have picked a few typical ones to discuss.

Time window

When collecting training data, a click is relatively clear-cut because it is a concrete action, but an impression (exposure) is not: since not clicking is not an action, there is no way to tell whether the user does not want to click or simply has not clicked yet. The common approach is to maintain a time window: if a user sees an ad and does not click within the specified time, the impression is treated as a non-click event.

But it is easy to see the problem: a user may simply be slow to respond and click after the window has closed. In that case the impression has already been recorded as a negative sample, and the positive sample produced by the click can no longer find its corresponding impression. We can compute the proportion of clicks that can be matched to an impression, and this proportion is called click coverage.

Is a longer window always better? No, because a long window can cause something that is not really a click on that impression to be counted as one. For example, a user might not be interested the first time an item is shown and not click, then change their mind when seeing it again later and click. Logically, the first impression should be a negative sample and the second a positive one, but with a very long window the first impression would also be labeled as a click. So how long should the window be? It is a parameter that needs to be tuned, not decided offhand.

Architecture

Streaming is a data-processing method commonly used in industry; you can simply think of it as a queue. But when a click occurs, we need to find the corresponding impression sample and relabel it as positive, so we also need lookup capability. That is, we additionally need a HashMap that indexes the nodes in the queue.
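To make the mechanism concrete, here is a minimal sketch of such a joiner (my own illustration; the paper gives no code). Impressions enter an ordered structure keyed by a request id; a click within the window flips the label, and impressions whose window has expired are emitted as negative samples:

```python
import time
from collections import OrderedDict

class OnlineJoiner:
    def __init__(self, window_seconds=600):
        self.window = window_seconds
        # request_id -> [timestamp, features, label]; insertion order acts as the queue,
        # while the dict itself provides the HashMap-style lookup
        self.pending = OrderedDict()

    def on_impression(self, request_id, features, ts=None):
        self.pending[request_id] = [ts or time.time(), features, 0]

    def on_click(self, request_id):
        # relabel the pending impression as a positive sample, if it is still in the window
        if request_id in self.pending:
            self.pending[request_id][2] = 1

    def flush(self, now=None):
        """Emit samples whose window has expired, ready to be sent to the trainer."""
        now = now or time.time()
        ready = []
        while self.pending:
            request_id, (ts, features, label) = next(iter(self.pending.items()))
            if now - ts < self.window:
                break  # the queue is time ordered, so the rest are still fresh
            self.pending.popitem(last=False)
            ready.append((features, label))
        return ready
```

In a real system this logic sits on top of a streaming infrastructure, and clicks that arrive after the window has closed are exactly the click-coverage loss discussed above.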

When enough data has been collected, or a specified amount of time has passed, the data in the window is used to train the model. Once training finishes, the model is pushed to the ranking system, and the model file used for ranking is updated in near real time, achieving online training. Let’s look at the architecture:


The Ranker is the ranking system: it ranks candidate ads, presents them to users, and pushes the feature data of these ads to the Online Joiner, the feature-processing system. When a user clicks an ad, the click data is also sent to the Joiner. The Joiner associates the user’s click data with the data sent by the Ranker and then forwards the joined data to the Trainer for model training.

One more detail: how do we do the association in the Joiner? A user may see the same ad multiple times, so neither the user ID nor the ad ID alone will work; we need an ID that is tied to time. This ID is the request id, which is regenerated every time the user refreshes the page, so even if the user refreshes, we can still associate the data correctly.

Feature analysis

Finally, the paper also includes some analysis of the features. We do not know exactly which features were used, but the analysis is instructive.

Behavioral features vs. contextual features

In CTR prediction, features can be divided into two main types: contextual features and user historical-behavior features. Contextual features are a broad concept: information about the ad being displayed, information about the user, the time of day, the page content, the user’s location, and so on can all be regarded as context. Historical-behavior features are also easy to understand: they describe what the user has previously done on the platform.

The paper focuses on the importance of user-behavior features: it sorts the features by importance and then computes the proportion of historical-behavior features among the top-K most important ones. In the LR model, a feature’s importance corresponds to the weight at its position. The results are as follows:


As the figure above shows, user historical-behavior features carry a very large share of the total weight; among the top 20 features, only two are contextual. This is consistent with our intuition that the quality of the content we deliver matters far less than what people actually like, and historical behavior data is a very good indicator of user preferences.

Importance analysis

In addition to analyzing feature importance, the paper also trains models using only one type of feature at a time and compares their performance. The experimental results are as follows:


In the figure above, the red bars show the results when training with only contextual features, and the dark blue bars show the results with only user-behavior features. We can clearly see that the model trained only on behavioral features performs at least 3.5 points better than the one trained only on contextual features, which is a very large gap. So we can again conclude that user-behavior features are more useful than contextual features.

In addition, the paper discusses the influence of negative down-sampling and uniform subsampling on model performance. The subsampling experiments show that model quality grows with the amount of training data: the more training data, the better the model. Negative down-sampling also helps improve the model, and it remains a standard technique in today’s advertising and recommendation scenarios.

That’s all for today’s article. I sincerely wish you all a fruitful day. If you liked today’s content, please give it a like, a share, and a follow.
