Kaggle match xgBoost

 

Note: If some pictures can not be displayed normally and affect reading, please refer to the article here: XgBoost test library version. Date: March 25, 2019.

 

0 foreword

Xgboost has long been rumored to be a magic weapon in competition circles, such as a kaggle/ Tianchi race where someone used it to win the championship.

And we’ll talk about XGBoost in our machine learning course, as Han says: “RF and GBDT is the industry’s favorite model, Xgboost is a big killer package, Kaggle’s various Top leaderboard once presented Xgboost unified the lake situation, another didi game first improvement is not Xgboost credit.”

In addition, the company has been organizing students to participate in various competitions since the first half of 2016. (After all, it is impossible to do AI without actual practice, and participating in the competition after data processing, feature selection, model tuning, code tuning, is an excellent real combat opportunity. Great for improving your skills and finding/changing jobs).

Under the AI tide, this year especially many friends from the traditional IT industry transition AI, many friends are consulting how to change AI, I generally will focus on learning AI or find/change AI four king Kong: course + question bank + OJ + Kaggle/Tianchi. The graduation examination, including the training camp, will be integrated with kaggle or Tianchi competitions.

Given the importance of Kaggle/Tianchi competitions for math and science, here’s a quick introduction to XGBoost and how to do well in the competition.

Finally, XGBoost wasn’t invented by Me July, but I will make sure that this introduction to it is as accessible as possible (and reviewed by Dr. Chen, head of July online AI Lab). In addition, thanks for all the references listed at the end of this article.

1 the decision tree

For example, there are more than 100 trainees in one session of the intensive training camp. Suppose you are given a task to count the number of boys and girls. When one student stands up in front of you, how will you distinguish who is male and who is female?

Soon, you took into account that boys’ hair is generally short and girls’ hair is generally long, so you divided all the students in this class into two waves by the length of their hair, with long hair as “women” and short hair as “men”.

It’s like you’re dividing the entire class by a single metric, “hair length,” and you’re creating a simple decision tree based on hair length. At this time, some people may have different opinions: why to use “hair length” division ah, I can use “shoes are high heels”, “have Adam’s apple” and so on these to divide, the answer is certainly yes.

But which indicator is better? It’s pretty straightforward to decide which category works better and which one takes precedence. Therefore, an evaluation criterion is needed to quantify the classification effect.

How to judge “hair length” or “whether there is an Adam’s apple” is the best way to divide, how to quantify the effect? Intuitively, the more pure the group, the better, if you divide it into two groups, the “women” group is all women, and the “men” group is all men, the better. But sometimes the actual classification situation is not so ideal, so the closer we get to it, the better we think.

There are many ways to quantify the classification effect, such as information gain (ID3), information gain rate (C4.5), Gini coefficient (CART) and so on.

The metric of information gain: entropy

The core idea of ID3 algorithm is to measure attribute selection by information gain, and select the attribute with the largest information gain after splitting for splitting.

What is information gain? To accurately define the information gain, we first define a metric widely used in information theory, called entropy, which characterizes the purity of any sample set. Given the sample set S containing positive and negative examples about a certain objective concept, then the entropy of S relative to this Boolean type classification is:

In the above formula, P + represents the positive example, as in the second example at the beginning of this article, P + means to play badminton, while P – represents the negative example, not to play badminton (in all entropy calculations we define 0log0 as 0).

For example, suppose S is a set of 14 examples of a Boolean concept, including 9 positive examples and 5 negative examples (we use the notation [9+, 5-] to summarize such data examples), then the entropy of S with respect to this Boolean example is:

Entropy ([9+, 5-]) =- (9/14) log2 (9/14) – (5/14) log2 (5/14) =0.940.

So, according to the above formula, we can get:

  • If all members of S belong to the same class, then Entropy(S)=0;
  • If the number of positive and negative samples of S is equal, then Entropy(S)=1;
  • If the number of positive and negative examples of S is different, then the entropy is between 0 and 1

As shown in the figure below:

See, with the Entropy value, you can evaluate how well the current classification tree works.

For more details on pruning, overfitting, advantages and disadvantages, please refer to this article “Decision Tree Learning”.

So, now that the soul of the decision tree is there, the tree is split by some metric for classification/regression purposes, always hoping for as much purity as possible.

 

2. Regression tree and integrated learning

If you define XGBoost in one sentence, it’s simple: XGBoost is an integration of many CART trees. But what is a CART tree?

There are two main types of decision trees used in data mining or machine learning:

  1. Classification tree analysis is the prediction of the category of the data (e.g., to see a movie or not to see a movie).
  2. Regression tree analysis is where the predicted results can be considered real numbers (e.g., the price of a house, or the length of a patient’s hospital stay)

The term CART, Classification And Regression Tree (CART) analysis is a general term used to refer to the above two trees, which was first proposed by Breiman et al.

2.1 Regression Tree In fact, classification and regression are two very close problems. The goal of classification is to determine which known sample class a new sample belongs to according to some characteristics of known samples, and its result is a discrete value. And the regression results are continuous values. Of course, the essence is the same, a mapping between feature and result/label.

Once you know what classification and regression are, it’s not hard to understand classification trees and regression trees.

The sample output (or response value) of the classification tree is in the form of a class, such as deciding whether the life-saving medicine is real or not, or whether to go to the movie “The Curse of the Wind” this weekend or not. And the sample output of the regression tree is in the form of a number, so for example, the amount of mortgage that you give somebody is a number, and it could be anywhere from zero to $3 million.

So, for regression trees, instead of using the classification tree’s set of information gain, information gain rate, and Gini coefficient to determine if the tree’s nodes are split, you need new ways to evaluate the effects, including prediction errors (commonly mean square error, logarithmic error, etc.). And nodes are no longer categories, but values (predicted values), so how do you determine? Some are in-node sample means, some are optimally calculated like Xgboost.

CART regression tree is assumed to be a binary tree, and features are continuously divided. For example, the current tree node is split based on the JTH eigenvalue, and the sample with the eigenvalue less than s is divided into the left subtree, and the sample with the eigenvalue greater than s is divided into the right subtree.

In essence, CART regression tree is to divide the sample space in this feature dimension, and the optimization of this space division is a NP-hard problem. Therefore, heuristic methods are used to solve the problem in the decision tree model. The objective function generated by a typical CART regression tree is:

Therefore, in order to solve the optimal sharding feature J and the optimal sharding point S, it is transformed into solving such an objective function:

So we just go through all of the sharding points for all of the features, and we find the best sharding features and sharding points. You end up with a regression tree.

2.2 Boosting Integrated Learning

The so-called ensemble learning refers to the construction of multiple classifiers (weak classifiers) to predict the data set, and then use a certain strategy to integrate the predicted results of multiple classifiers as the final prediction results. The popular analogy is that “three heads are better than all,” or that directors vote on a corporate board, which requires a certain amount of “accuracy” for each weak classifier and a “difference” between classifiers.

Ensemble learning can be divided into two schools, Boosting and Bagging, according to whether there is a dependency between each weak classifier:

  1. Boosting school, each classifier has a dependency relationship, must be serial, such as Adaboost, GBDT(Gradient Boosting Decision Tree), Xgboost
  2. Bagging genre: there is no dependency between classifiers and they can be parallel, such as Random Forest.

And the famous Adaboost as the most representative method in Boosting genre, this blog has introduced it in detail.

AdaBoost, the abbreviation of “Adaptive Boosting” (Adaptive Enhancement), was proposed by Yoav Freund and Robert Schapire in 1995. Its adaptation lies in that the samples incorrectly classified by the previous basic classifier will be strengthened, and the weighted samples will be used to train the next basic classifier. At the same time, a new weak classifier is added in each round until a predetermined small enough error rate or a predetermined maximum number of iterations is reached.

Specifically, the entire Adaboost iterative algorithm consists of three steps:

  1. Initialize the weight distribution of the training data. If there are N samples, each training sample is initially given the same weight: 1/N.
  2. Training weak classifiers. In the specific training process, if a sample point has been accurately classified, its weight will be reduced in the construction of the next training set. Conversely, if a sample point is not classified accurately, its weight is increased. Then, the sample set with updated weights is used to train the next classifier, and the whole training process continues iteratively.
  3. The trained weak classifiers are combined into strong classifiers. After the training of each weak classifier, the weight of the weak classifier with a small error rate is increased to make it play a larger decisive role in the final classification function, while the weight of the weak classifier with a large error rate is reduced to make it play a smaller decisive role in the final classification function. In other words, a weak classifier with a low error rate will have a larger weight in the final classifier, otherwise it will have a smaller weight.

Another Boosting method, GBDT (Gradient Boost Decision Tree), is different from AdaBoost in that each calculation of GBDT is to reduce the last residual, and then a new model is built in the direction of residual reduction (negative Gradient).

Boosting integrated learning makes joint decision by multiple decision trees that are related to each other. For example

  1. A sample [data -> tag] is: [(2,4,5)-> 4]
  2. The prediction of the first decision tree trained with this sample is 3.3
  3. Then the input for the training of the second decision tree, the sample becomes: [(2,4,5)-> 0.7]
  4. In other words, the input sample of the next decision tree will be related to the training and prediction of the previous decision tree

You’ll soon realize why Xgboost is also an ensemble learning for Boosting.

The key to the formation of a regression tree is:

  • What is the dividing point based on (such as the minimum mean square error (LOSS) mentioned above);
  • What is the predicted value of the node after classification (as mentioned above, there is one way to take the actual mean value of each sample under the leaf node as the prediction error of the leaf node, or the calculated result)

As for another kind of integrated learning method, such as the Random Forest algorithm, each decision tree is independent. Each decision tree randomly selects a batch of samples from the sample heap and selects a batch of features for independent training. There is no relationship between each decision tree. This article will not elaborate.

 

3.GBDT

When it comes to Xgboost, we have to start with GBDT(Gradient Boosting Decision Tree). Because xgBoost is essentially a GBDT, but strives to maximize speed and efficiency, it is called X (Extreme) Gvhall. Boosting, both of which are boosting methods.

The principle of GBDT is simple, that is, the sum of the results of all weak classifiers equals the predicted value, and then the next weak classifier fits the gradient/residual of the error function to the predicted value (the gradient/residual is the error between the predicted value and the true value). Of course, the weak classifier in it is represented by the trees. As shown in the figure, Y = Y1 + Y2 + Y3.

For a very simple example, let’s say I’m 30 years old, but the computer or model GBDT doesn’t know how old I am. What about GBDT?

  1. It will fit the first weak classifier (or the first tree) with a random age like 20 years, and find that the error is 10 years;
  2. In the second tree, six years were used to fit the remaining loss, and the gap was four years;
  3. Then in the third tree, the remaining gap was fitted by 3 years, and the gap was only 1 year;
  4. Finally, in lesson 4, the tree is fitted with the remaining residual 1 year old, perfect.

In the end, the four tree conclusions add up to the true age of 30. In practical engineering, GBDT is to calculate the negative gradient and use the negative gradient to approximate the residual.

Notice why GBDT can be approximated with a negative gradient?

Under the regression task, GBDT will have a predicted value for each sample in each round of iteration, and the loss function is the mean square error loss function.

So the negative gradient is calculated like this

Therefore, when the mean square loss function is chosen as the loss function, the value of each fitting is (the true value – the value predicted by the current model), namely the residual. The variable here is theta, i.e., “the value of the current prediction model”, i.e., take the negative gradient of it.

In addition, here have to take a long-winded again, the above prediction age “casual” two characters in the first step in seemingly casually, actually deeply think about is not casually, you will find that most of the prediction model, basic it is such a regular routine, casually with a value to predict first, and then compare the predicted values and the real value of the gap, constantly adjust in the end Close the gap. So there’s a set of objective functions: determine the target, and a loss function: reduce the error.

If you think about it further, you will find that this is completely consistent with the common sense and common practice of human making predictions. When you do not know much about a thing, you will try and explore it based on experience at the beginning until it is close to or completely consistent with some sense.

Again, age prediction.

For simplicity, assume that the training set consists of only four people: A,B,C, and D, and that their ages are 14,16,24, and 26, respectively. A and B are students of grade one and grade three respectively. C) fresh graduates D) employees who have worked for two years

So, the question now is how do we predict the ages of these four people? It’s easy, just use a random age like 20 to fit them, and then adjust accordingly.

If a traditional regression decision tree is used for training, the results are as follows:



Now we use GBDT to do this, because there is too little data, we limit the number of leaf nodes to two, that is, each tree has only one branch, and limit the learning of only two trees.

We get the result as shown in the figure below:



The first branch of the tree is the same as figure 1. Since A and B are of similar ages and C and D are of similar ages, they are divided into two groups, and the average age of each group is used as the predicted value.

  • At this point, the residual is calculated (residual means: the actual value of A – the predicted value of A = the residual of A), so the residual of A is the actual value of 14 – the predicted value of 15 = the residual value of -1.
  • And notice, the prediction of A is the sum of all the trees in front of it, so there’s only one tree in front of it so it’s just going to be 15, and if there’s still trees, you have to add them all up as A prediction of A.

In mathematical statistics, residual is the difference between an actual observed value and an estimated value (fitted value). “Residials” contain important information about the basic assumptions of the model. If the regression model is correct, we can think of the residual as an observation of the error.

Thus, the residual of A,B,C, and D is -1,1, -1,1, respectively.

Then take their residials -1, 1, -1 and 1 instead of the original values of A, B, C and D, learn to the second tree, the second tree has only two values 1 and -1, which are directly divided into two nodes, that is, A and C are divided into the left side, and B and D are divided into the right side. After calculation (for example, A, the actual value -1 – predicted value -1 = residual 0, for example, C, Actual value -1 – predicted value -1 = 0), everyone’s residual is 0.

If the resiticals are all zero, which means that the predicted values for the second tree are equal to their actual values, then you just add the results from the second tree to the first tree and you get the true age, which means everyone gets the true predicted values.

In other words, the predicted values of A,B,C, and D are now consistent with their real ages. Perfect! A: A 14-year-old senior high school student, shopping less, often asked the upperclassmen questions, predicted age A = 15-1 = 14 B: A 16-year-old senior high school student, shopping less, often asked the upperclassmen questions, predicted age B = 15 + 1 = 16

D: A 26-year-old employee who has worked for two years, buys a lot of things, and is often asked questions by his or her younger brothers. The predicted age is D = 25 + 1 = 26

Therefore, GBDT needs to accumulate the scores of multiple trees to get the final predicted score. In each iteration, a tree is added on the basis of the existing tree to fit the residual between the predicted results of the previous tree and the true value.

 

4.Xgboost

4.1 Definition of xGBoost Tree

The schematic diagram in this section is basically quoted from the powerpoint of XGBoost original author Chen Tianqi.

For example, we want to predict a family on the degree of be fond of computer games, considering the compared to young and old, the young are more likely to love video games, as well as compared to men and women, men are more like video games, so the first according to the age to distinguish between children and adults, and then distinguish by gender is male or female, one by one for each grade in the video game preferences degree, See the figure below.

In this way, two trees tree1 and Tree2 are trained. Similar to the principle of GBDT before, the sum of the conclusions of the two trees is the final conclusion, so the child’s prediction score is the sum of the scores of the nodes the child falls on in the two trees: 2 + 0.9 = 2.9. Grandpa’s predictive score was the same: -1 + (-0.9) = -1.9. The details are shown in the following figure

Well, you might want to stand up and exclaim, isn’t that the same thing as GBDT?

In fact, the big difference between XGBoost and GBDT is the definition of the objective function, regardless of the differences in engineering implementation and problem solving. The target function of XGBoost is shown below:

Among them

  • The L pointed by the red arrow is the loss function (for example, the squared loss function:, or logistic loss function:)
  • The red box encloses the regular terms (L1 regular, L2 regular)
  • The red circle is the constant term
  • forXgboost uses Taylor to expand three terms to make an approximation

We can clearly see that the final objective function depends only on the first and second derivatives of each data point on the error function.

Well, when you suddenly lose such a big formula, many people may be confused instantly. Okay, so let’s break down the target function, and break down what each formula, each symbol, each subscript means.

4.2 XgBoost target function

The idea behind XGBoost’s core algorithm isn’t hard

  1. You keep adding trees, you keep splitting features to grow a tree, one tree at a time, you’re actually learning a new function to fit the resiutions that you predicted last time.

    Note: w_q(x) is the fraction of leaf node Q,Represents the set of all K regression trees, and f(x) is one of the regression trees.

  2. When we get k trees after training, we need to predict the score of a sample. In fact, according to the characteristics of this sample, it will fall to a corresponding leaf node in each tree, and each leaf node corresponds to a score

  3. You just add up the scores for each tree and you get the predicted value for that sample.

Obviously, our goal is to get the predicted value of the tree groupAs close as possible to the true value, and have the ability to generalize as much as possible.

Therefore, from a mathematical point of view, this is a functional optimization problem, so the objective function is simplified as follows:

As you can see, the objective function is divided into two parts: the loss function and the regularization term. Furthermore, the loss function reveals the training error (i.e., the difference between the predicted score and the real score) and regularizes the complexity.

For the above equation,Is the output of the whole accumulative model, the regularization termIs, the function represents the complexity of the tree. The smaller the value is, the lower the complexity will be and the stronger the generalization ability will be

T is the number of leaves and w is the fraction of leaves. Intuitively, the goal requires the prediction error to be as small as possible, the number of leaf nodes T to be as small as possible (γ controls the number of leaf nodes), and the node value W to be as less extreme as possible (λ controls the fraction of leaf nodes not to be too large) to prevent overfitting.

By the way, the general target function consists of the following two terms

Among them, the error/loss function encourages our model to try to fit the training data, so that the final model will have less bias. Regularization terms encourage simpler models. Because when the model is simple, the randomness of the results of finite data fitting is less, and it is not easy to overfit, which makes the prediction of the final model more stable.

4.2.1 Model learning and training errors

Specifically, in the first part of the target functionAccording to the firstA sample, ( − The first saidWe want to make sure that the prediction error is as small as possible.

Similar to the previous GBDT formula, XGBoost also needs to accumulate scores of multiple trees to get the final predicted score (in each iteration, a tree is added to the existing tree to fit the residual between the predicted results of the previous tree and the true value).

But how do we choose what to put in each round? The answer is pretty straightforward, pick oneTo make our target function as low as possible.

And again, given thatThe model predicted value of round T = model prediction of the first T-1 round  +  , so the error function is denoted as:   ( . ), the latter term is the regularization term.

For this expression of the error function, at step t,Is the true value, that is, known, **** can be obtained from step T-1 in the previous stepaddIt’s calculated, and in a sense it’s known, so what the model learns is.

The formula for Obj up here may be a little too abstract, but let’s think about whenIs the squared error case), our goal can be written as a quadratic function like this (the circled part of the figure represents the residual between the predicted value and the true value) :

More generally, what if the loss function is not quadratic? Using Taylor expansion, not quadratic try to approximate quadratic (as you can see, defines the first derivativeAnd the second derivative).

Yeah, yeah, pay attention! Many people could be stuck here, also there are few articles to explain online, online AI lab in July and Dr Chan discussion, find it the most critical is the second order Taylor expansion the corresponding relation of the objective function and xgboost clear, as we can do it using the second order Taylor expansion to target function approximation.

First of all, this is Taylor’s second order expansion

Corresponds to the target function of xGBoost

Ignore loss functionIn the(Don’t forget “On step T,Is the true value, that is, known “, does not affect the subsequent pairs of objective functions), and make a one-to-one correspondence:

  • The x in Taylor’s second order expansion F corresponds to the x in the target function.
  • In the fOf the corresponding objective function.
  • So the derivative of f with respect to x corresponds to a pair of objective functionsStrives for the partial derivatives

Get:

Among them,.

Alas! But where does the key to this transformation, the Taylor quadratic expansion, come from?

In mathematics, Taylor’s Formula is a Formula that uses information about a function at a certain point to describe values nearby. This formula from Taylor theorem of differential and integral calculus (Taylor ‘s unseen), Taylor theorem describes a differentiable function, if the function is smooth enough, in a known function at some point, the derivative value of Taylor formula can do with this derivative value coefficient of a polynomial is constructed to approximate function value in this field, This polynomial is called the Taylor polynomial.

It tells us that we can approximate the function by using some degree of the Taylor polynomial.

Taylor’s theorem: let n be a positive integer. If a function f defined on an interval consisting of a is differentiable n+1 at a, then for any x on that interval, there is:

Where the polynomial is called the Taylor expansion of the function at a, and the restIs the remainder of Taylor’s formula, isHigher order infinitesimal of theta.

Next, considering that our t-th regression tree is derived from the residual of the previous t-1 regression tree, it is equivalent to the value of t-1 tree **** is known. In other words,It has no effect on the optimization of the objective function, and can be directly removed, and the constant term can also be removed, so as to obtain the following relatively uniform objective function.

In this case, the objective function depends only on the first derivative of each data point on the error functionAnd the second derivative(You can already see the difference between XGBoost and GBDT. The target function retains the quadratic term of Taylor expansion.)

The general guiding principle, as Google reader Crab6789 puts it:

The essence is that the assignment of the sample to the leaf node corresponds to an OBJ, and the optimization process is obJ optimization. That is, splitting nodes into different combinations of leaves, different combinations correspond to different OBj’s, and all optimization revolves around this idea.

So far we have discussed the first part of the objective function: the training error. Now we discuss the second part of the objective function: the regularization terms, which is how to define the complexity of the tree.

4.2.2 Regular terms: tree complexity

First, a few rules

  • Represented by the set of leaf nodes and the score of leaf nodes
  • Each sample falls on a leaf node
  • Q (x) indicates that sample X is on a leaf node, and wq(x) is the score of this node, that is, the model predicted value of this sample

So when we divide the tree into the structure q and the leaf weight w, the structure function Q maps the input to the leaf index, and w gives us the leaf fraction for each index.

In addition, xGBoost has two parts for tree complexity as shown below:

  1. One is the number of leaves in the tree, T
  2. One is the l2-modulo square of the score w of the leaf nodes on the tree (L2 regularization of W is equivalent to adding L2 smoothing to the score of each leaf node in order to avoid overfitting).

With this new definition, we can transform the previous target function as follows (also, don’t forget:)

Among themIs defined as the set of sample subscripts above each leaf node JG is the first derivative and h is the second derivative. This step is due to the addition of two regular terms in the second part of the XGBoost target function, one for the number of leaf nodes (T) and one for the fraction of leaf nodes (w).

So there are two kinds of summations in the target function with the regular term added

  • One is the– > n (sample size)
  • One is the-> T (number of leaves)

This goal consists of T independent quadratic functions of one variable.

What’s the key to understanding this derivation? After discussion with Dr. Chen at AI Lab, it’s really about understanding this definition:Is defined as the set of sample subscripts above each leaf node JQ (xi) in this definition is saying that each sample value XI can be mapped to a leaf node on the tree by q(xi), and so the two summations are unified by this definition.

And then we can define

The final formula can be simplified to

Based on theThe derivative is equal to 0, and you get

Then put theThe optimal solution can be substituted into:

4.3 Scoring function calculation

Obj is the maximum reduction we can make on the target when we specify the structure of a tree. We could call that a structure score.

4.3.1 Splitting a Node

One of the interesting things about xGBoost is that we’ve seen all the way through how it’s optimized and how it’s calculated, but we’ve never seen what the tree actually looks like. Obviously, a tree is created by a node splitting into two, and then splitting into the whole tree. So how the tree splits is the key to what we’re going to talk about.

How do the xGBoost authors split a leaf node in their original paper

(1) Enumerate the greedy method of all different tree structures

Now what happens is that if you know the structure of the tree, you get the best score for that structure, so how do you determine the structure of the tree?

One way to think about it is to enumerate the structures of different trees, use the scoring function to find the optimal structure, add it to the model, and repeat the process. On second thought, you realize that there are almost an infinite number of states to enumerate, so what does that do?

We try greedy method, from the tree depth 0, each node traverse all of the features, such as age, gender, etc., and then for a characteristic, the value in the first according to the characteristics of the sorting, then linear sweep the characteristics, in turn, determine the best split point, finally, after all the characteristics of the segmentation, we choose to Gain the highest Gain the characteristics of the so-called, How does Gain count?

Remember the calculation we got at the end of Section 4.2?

In other words, the G/(H+λ) part of the objective function represents the contribution degree of each leaf node to the loss of the current model. After fusion, the calculation expression of Gain is obtained as follows:

The first thing to note is that “for a feature, order by the values in that feature.” Here’s an example.

Such as setting up a value to a, then enumerates all x < a, b < x such conditions (x represents some characteristics such as age, the age, the age smallest: assume that increases from left to right in turn, more than a little on the left, is bigger than a on the right side), for a particular division a, we need to calculate the derivative and a left and right.

For example, there are five people in total. After ranking by age, at the beginning, we have four categories as follows:

  1. Separate the first person from the next four
  2. Separate the first two from the last three
  3. Separate the first three from the last two
  4. Divide the front four from the back one

Next, the Gain of the above four partitioning methods was calculated separately to see which partitioning method had the largest Gain value. After calculation, it was found that the Gain value of the second partitioning method “dividing the first two people and the next three people” was the largest. It means that this is the most appropriate partition on the first layer, which is divided into two.

In other words, for all the features X, we can enumerate all the segmtioned gradients and GL and GR by doing one scan from left to right. Then use the formula for calculating Gain to calculate the score of each partition scheme.

Then the second, third, fourth, and NTH layers are divided in the same way.

The second thing to note is that introducing segmentation doesn’t necessarily make things better, so we have a penalty term for introducing new leaves. This optimization goal corresponds to tree pruning, and the introduced segmentation is ignored when the gain is less than a threshold γ.

In other words, when a partition is introduced, the result is that the score obtained after segmentation – the score obtained without segmentation is too small (for example, less than our minimum expected threshold γ), but the complexity is too high, then the gain is not worth the loss. If the benefits of doing a certain action are much greater than the disadvantages, it is better not to do it in order to avoid the complicated attitude of doing more than one thing is better than doing less.

It is equivalent to the case that two leaf nodes exist in the same tree after we find that “dividing” is worse than “dividing” (the gain obtained is too small to be less than the threshold γ).

Here’s the algorithm from the paper

(2) Approximate algorithm

The main reason is that the data is too large to calculate directly

Crab6789, a reader at Google, commented:

When you assign a sample from the root to the leaf, it’s a permutation. Different combinations have different cosine of t. To find the best combination, you have to try. It is impossible to exhaust yourself, so the greedy method comes out. Instead of going from start to finish, we’re looking at what’s the best way to allocate the nodes at the moment. This led to the exact Greddy method, and later to the accelerated method of Histogram.

4.4 Summary: Vaughan Tree Algorithm

To sum up, it looks like this

Let’s go over the whole process again.

If the value of a sample label is 4, then the first regression tree predicts 3 and the second one predicts 1; Another set of regression trees, one prediction 2, one prediction 2, so the latter one, why? In the former case, the first tree is too much and too close to 4, which means that there is a greater risk of overfitting.

OK, that sounds nice, but how do we do that? How does this objective function relate to the actual parameters, remember we said, the parameters of the regression tree:

  1. Which feature is selected to split the node
  2. The predicted value of nodes (not by taking an average of such a rude and unreasonable way, but at least a higher level)

The final strategy: greed + optimization (yes, quadratic optimization).

Popular interpretation of the greedy strategy: is the decision-making moment in accordance with the current goal of the optimal decision, in short, is the maximum interests of the decision, “shortsighted” strategy.

How to use the greedy strategy here? At the beginning, you have a group of samples and put them in the first node. At this time, T=1, what is w

  • If the l(w−yi) error here is denoted by the squared error, then the above function is a quadratic function of w to be minimized, and the minimum value is the predicted value of the node, and the minimum value is the minimum loss function.
  • In essence, this is a quadratic optimization problem! But what if the loss function isn’t quadratic? Taylor expansion, not quadratic try to approximate quadratic.

Next, a feature should be selected to split into two nodes and become a weak sapling. Then:

  1. Determine the feature used for splitting, how? The simplest is the rough enumeration/bruising (well, rough enough), then choose the one that works best with the loss function;
  2. How to establish the w of the node and the minimum loss function, loudly tell me how to do it? Yes, the quadratic function of the maximum value (the calculation of the quadratic maximum value generally has a fixed set of steps, that is, the derivative is equal to zero point). Therefore, select a feature to split and calculate the minimum value of the loss function, and then select another feature to split and get the minimum value of the loss function. After enumeration, find the one with the best effect and split the tree to get the saplings.

When splitting, you can notice that every time a node is split, only the sample of this node will be affected by loss function. Therefore, only the sample of the node to be split should be paid attention to when calculating the gain of the split (reduction of loss function).

All in all, XGBoost uses the same idea as the CART regression tree, using a greedy algorithm to traverse all feature partition points for all features, but using a different target function. Specifically, the objective function value after splitting is the gain of the objective function value of the single leaf node, and in order to limit the tree growth too deep, a threshold value is added. Only when the gain is greater than the threshold value, the splitting will be carried out.

The thresholds are set below

So it continues to split, forming a tree, forming a tree, each time on the basis of the last prediction to take the best split/tree, is it a greedy strategy?

There must be a stop condition for this kind of iteration. When should it stop? In short, set the maximum depth of the tree and stop growing when the sample weights and values are below the set threshold to prevent overfitting. Specifically, then

  1. When the gain brought by the introduced split is less than the set threshold, we can ignore the split, so the loss function will not be increased for every split, which means pre-pruning. The threshold parameter is(i.e., the coefficient of the number of leaf nodes T in the regular term);
  2. When the tree reaches the maximum depth, the decision tree establishment is stopped, and a super parameter max_depth is set to avoid learning local samples because the tree is too deep, thus overfitting.
  3. When the sample weight sum is less than the set threshold, the tree building stops. What does that mean? It involves a super parameter — the minimum sample weight and min_child_weight, which is similar to, but not exactly the same as, the MIN_child_leaf parameter of GBM. The general idea is that if a leaf node sample is too small, it will also be terminated to prevent overfitting;
  4. Seems to have seen the largest number of…

 

6 References and recommended reading

  1. Xgboost original paper: arxiv.org/pdf/1603.02…
  2. Xgboost PPT: homes.cs.washington.edu/~tqchen/pdf…
  3. XGBoost and Boosted the Tree
  4. Xgboost principle: blog.csdn.net/a819825294/…
  5. Xgboost introduction and actual combat (principle) : blog.csdn.net/sb19931201/…
  6. The CART Wikipedia
  7. Ensemble Learning
  8. Ensemble learning: Boosting and Random Forest
  9. From decision tree learning to Bayesian classification algorithm
  10. Decision tree (three) — complete summary (ID3, C4.5, CART, pruning, replacement) and source analysis
  11. Read the machine learning machine XGBoost principle
  12. Why GBDT and Random Forest work so well in actual Kaggle matches?
  13. Write an article about the principle of Xgboost for discussion and reference
  14. Decision trees, Random forests, GBDT, XgBoost
  15. Some of xGBoost’s core parameters
  16. How to understand Taylor series popularly? : www.zhihu.com/question/21…
  17. Xgboost why need the second order Taylor expansion: www.julyedu.com/question/bi…
  18. The most popular gradient lifting tree (GBDT) principle
  19. ID3, C4.5, CART, Random Forest, Bagging, Boosting, Adaboost, GBDT, XGBoost algorithm summary
  20. Derivation of XGBoost second order Taylor expansion formula
  21. Taylor & Company Wikipedia
  22. Xgboost Notes written by Super Mario: This article takes you to understand the principle of XGBoost
  23. Online editing LaTeX formula: private.codecogs.com/latex/eqned…

 

Afterword.

When you’re learning something and you don’t feel like you understand it, chances are it’s not because you’re not smart enough. Chances are, it’s because the material you’re reading isn’t accessible enough. (If that still doesn’t work, ask someone.)

Hopefully, if you’re reading this, you feel the same way.

The following improvement process of this paper is for reference for understanding:

  1. 8.4 The first edition in the morning introduces GBDT through an easy-to-understand example of age prediction, because GBDT is the basis for understanding XGBoost;
  2. 8.4 afternoon second edition, xGBoost derivation of a lot of formulas, beginners are easy to get into, after by seizing the core of XGBoost: target function, comb clear xGBoost framework;
  3. 8.5am the third edition optimized the introduction of decision tree, such as the introduction of information gain;
  4. 8.5pm version 4: optimized the display of most formulas, such as LaTeX image display instead of plain text display;
  5. 8.6am version 5, the introduction of booting integration learning has been optimized to make the full text more progressive;
  6. 8.6 evening sixth edition, specification xGBoost objective function formula representation, and comb through the full text of many details, formula;
  7. 9.1am version 7, improved the way the XGBoost tree is split in section 4.3.1 to be clearer;
  8. 19, 1.9, 8th edition, improved the description of split nodes in section 4.3.1 to make the logic clearer and the writing more popular;
  9. In part 3 of 1.10 9th edition, an example of predicting age was added to explain GBDT more populatively;
  10. 19, 1.14 10th edition, Taylor second-order expansion and xgBoost target function correspondence write clear, so as to understand more smoothly.

After reading the whole article, you will find that there are three key points to understand XGBoost: ① The one-to-one correspondence between the terms of the XgBoost objective function and the binomial of Taylor’s expansion in Section 4.2.1, ② a mathematical technique used to calculate the score w in section 4.2.2, ③ the tree splitting algorithm shown in Section 4.3.1. With these three points in mind, there is no barrier to understanding XGBoost.

The evening of August 6, 2018 ~ January 14, 2019.