1. Explain the process of the GBDT algorithm

GBDT (Gradient Boosting Decision Tree) is a boosted decision tree method: it is built on the Boosting idea.

1.1 The Boosting idea

The Boosting method trains base classifiers serially, so each base classifier depends on the ones trained before it. Its basic idea is to stack base classifiers layer by layer: during training, each layer gives higher weight to the samples misclassified by the previous layer's base classifier. At test time, the final result is obtained by a weighted combination of the outputs of the classifiers from all layers.

Bagging differs from Boosting in its training mode: in Bagging there is no strong dependence between the base classifiers during training, so they can be trained in parallel.
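As a rough illustration of this contrast (a minimal sketch assuming scikit-learn is available; the synthetic dataset and all parameters are invented for the example), the two training modes can be compared with the library's standard bagging and AdaBoost-style boosting estimators:

```python
# Bagging trains independent trees on bootstrap samples (parallelizable),
# while Boosting trains trees one after another, reweighting misclassified samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging: each tree sees an independent bootstrap sample; n_jobs=-1 trains them in parallel.
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                            n_estimators=50, n_jobs=-1, random_state=0)

# Boosting (AdaBoost-style): trees are fitted sequentially, each focusing on the samples
# the previous trees got wrong, so the training itself cannot be parallelized.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=50, random_state=0)

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```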

1.2 How GBDT works

The principle of GBDT is simple: the outputs of all the weak learners are added up to give the predicted value, and the next weak learner is then fitted to the residual of the current prediction (the residual being the gap between the predicted value and the true value). The weak learners used here are individual decision trees.

Take a very simple example: suppose I am 30 years old this year, but the computer, or the GBDT model, does not know my age. What does GBDT do?

  • In the first weak learner (the first tree) it fits an initial age, say 20 years, and finds that the error is 10 years;
  • Then the second tree fits 6 years to the remaining residual, leaving a gap of 4 years;
  • Then the third tree fits 3 years to the remaining gap, leaving a gap of only 1 year;
  • Finally, the fourth tree fits the remaining residual with 1 year. Perfect.
  • Adding up the conclusions of the four trees, 20 + 6 + 3 + 1, gives the true age of 30 (in practical engineering, GBDT computes the negative gradient and uses the negative gradient to approximate the residual).
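The same toy walk-through can be written as a few lines of code. This is purely illustrative: the four "trees" are just the constants 20, 6, 3 and 1 from the example above, not models fitted to data.

```python
# Illustrative only: each "tree" is a fixed number taken from the example above.
target = 30                      # the true age
tree_outputs = [20, 6, 3, 1]     # what the four trees contribute in turn

prediction = 0
for i, out in enumerate(tree_outputs, start=1):
    prediction += out            # add this tree's contribution
    print(f"after tree {i}: prediction={prediction}, residual={target - prediction}")

print("final prediction:", prediction)   # 20 + 6 + 3 + 1 = 30
```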

Why can GBDT approximate residuals with negative gradients?

Under a regression task, in each round of iteration GBDT has a current predicted value $\hat{y}_i$ for every sample, and the loss function used here is the mean squared error:

$$l(y_i, \hat{y}_i) = \frac{1}{2}\,(y_i - \hat{y}_i)^2$$

So the negative gradient with respect to the current prediction is computed as

$$-\frac{\partial\, l(y_i, \hat{y}_i)}{\partial\, \hat{y}_i} = y_i - \hat{y}_i$$

Therefore, when the squared loss is chosen, the value fitted in each round is (true value − value predicted by the current model), which is exactly the residual. The variable here is the current model's predicted value $\hat{y}_i$, and what the next tree is fitted to is this negative gradient.
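As a quick sanity check of this claim (toy numbers, not from the article), the finite-difference gradient of the squared loss at the current prediction can be compared with the residual:

```python
# For squared loss l = 0.5 * (y - pred)^2, the negative gradient w.r.t. pred equals y - pred.
y_true, y_pred = 30.0, 20.0

loss = lambda pred: 0.5 * (y_true - pred) ** 2

eps = 1e-6
grad = (loss(y_pred + eps) - loss(y_pred - eps)) / (2 * eps)  # numerical d(loss)/d(pred)

print("negative gradient:", -grad)            # ~10.0
print("residual         :", y_true - y_pred)  # 10.0
```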

1.3 The training process

For simplicity, suppose there are only four people in the training set: A, B, C, and D, whose ages are 14, 16, 24, and 26 respectively. A and B are a first-year and a third-year high-school student; C and D are a fresh graduate and an employee who has worked for two years. If a traditional regression decision tree is used for training, we get a single tree that keeps splitting on the features until each person's age is predicted directly.

Now we use GBDT to do the same thing. Because the data are so few, we limit each tree to two leaf nodes, i.e. each tree makes only one split, and we learn only two trees. We get the following:

In the first tree, since A and B are close in age and C and D are close in age, the samples are split into a left group {A, B} and a right group {C, D}, and the average age of each group (15 and 25) is used as the predicted value.

  • Next compute the residuals (residual = actual value − predicted value). A's residual is its actual value 14 minus its predicted value 15, i.e. −1.
  • Note that A's predicted value is the sum of the outputs of all the trees in front of it. There is only one tree in front, so it is simply 15; if there were other trees, their outputs would all be added up to give A's predicted value.

We then replace the original values of A, B, C and D with their residuals −1, 1, −1 and 1 and let the second tree learn from them. The second tree sees only the two values 1 and −1, so it splits them directly into two nodes: A and C on the left, B and D on the right. After this tree, everyone's residual is 0 (for example A: actual value −1 minus predicted value −1 = 0; likewise for C). Residuals of 0 mean the second tree's predictions exactly match its targets, so adding the second tree's conclusions to the first tree's gives the true age for every sample, i.e. everyone now gets the correct predicted value.

In other words, the predicted values of A, B, C and D now match their real ages. Perfect!

  • A: a 14-year-old first-year student who shops little and often asks the older students questions; predicted age A = 15 − 1 = 14
  • B: a 16-year-old third-year student who shops little and is often asked questions by the younger students; predicted age B = 15 + 1 = 16
  • C: a 24-year-old fresh graduate who shops a lot and often asks the senior colleagues questions; predicted age C = 25 − 1 = 24
  • D: a 26-year-old employee who has worked for two years, shops a lot and is often asked questions by the juniors; predicted age D = 25 + 1 = 26

Therefore, GBDT accumulates the scores of multiple trees to obtain the final prediction, and in each iteration it adds one more tree, on top of the existing ones, to fit the residual between the prediction of the preceding trees and the true value.
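A minimal sketch of this toy training process, assuming scikit-learn; the two made-up features ("shopping amount" and "asks vs. is asked questions") and the one-split trees (stumps) are chosen only so that the splits described in the text can be reproduced:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0, 0.0],    # A: shops little, asks questions
              [1.0, 1.0],    # B: shops little, is asked questions
              [2.0, 0.0],    # C: shops a lot, asks questions
              [2.0, 1.0]])   # D: shops a lot, is asked questions
y = np.array([14.0, 16.0, 24.0, 26.0])

prediction = np.zeros_like(y)
for round_no in range(2):                            # two trees, as in the example
    residual = y - prediction                        # what is still unexplained
    stump = DecisionTreeRegressor(max_leaf_nodes=2)  # one split, two leaves
    stump.fit(X, residual)
    prediction += stump.predict(X)                   # add this tree's contribution
    print(f"after tree {round_no + 1}: {prediction}")

# after tree 1: [15. 15. 25. 25.]   (splits {A, B} from {C, D})
# after tree 2: [14. 16. 24. 26.]   (splits {A, C} from {B, D}, fitting the ±1 residuals)
```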

2. What are the differences and connections between gradient boosting and gradient descent?

The comparison below contrasts the gradient boosting algorithm with the gradient descent algorithm. In each iteration, both update the current model using the information in the negative gradient of the loss function with respect to that model. The difference is that in gradient descent the model is expressed in parameterized form, so updating the model is equivalent to updating its parameters, whereas in gradient boosting the model does not need to be parameterized at all: it is defined directly in function space, which greatly expands the class of models that can be used.

  • Gradient descent: optimization happens in parameter space; each iteration moves the parameters one step along the negative gradient of the loss with respect to the parameters.
  • Gradient boosting: optimization happens in function space; each iteration moves the model one step along the negative gradient of the loss with respect to the current model, by adding a weak learner fitted to that negative gradient.
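The parallel between the two update rules can be sketched in code (a toy example with invented data and hyperparameters): gradient descent nudges a parameter w along the negative gradient, while gradient boosting nudges the whole prediction function F along the negative gradient by adding a small regression tree each round.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# --- Gradient descent in parameter space: fit y ≈ w * x by updating w ---
w, lr = 0.0, 0.05
for _ in range(100):
    grad_w = -2 * np.mean((y - w * X[:, 0]) * X[:, 0])  # d(MSE)/dw
    w -= lr * grad_w                                     # move the parameter

# --- Gradient boosting in function space: update the prediction F(x) directly ---
F, lr_f = np.zeros_like(y), 0.1
for _ in range(100):
    neg_grad = y - F                       # negative gradient of squared loss w.r.t. F
    tree = DecisionTreeRegressor(max_depth=2).fit(X, neg_grad)
    F += lr_f * tree.predict(X)            # move the function along the negative gradient

print("parameter-space MSE:", np.mean((y - w * X[:, 0]) ** 2))
print("function-space  MSE:", np.mean((y - F) ** 2))
```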

3. What are the advantages and limitations of GBDT?

3.1 Advantages

  1. In the prediction stage, computation is fast, and the computations for different trees can be carried out in parallel.
  2. On dense data sets, GBDT has good generalization and expressive ability, which is why it ranks at the top of many Kaggle competitions.
  3. Using decision trees as weak learners gives GBDT good interpretability and robustness; it can automatically discover high-order interactions between features and needs no special data preprocessing such as normalization.

3.2 Limitations

  1. On high-dimensional sparse data sets, GBDT performs worse than support vector machines or neural networks.
  2. GBDT's advantage over other models is less pronounced when dealing with text and other categorical features than when dealing with numerical features.
  3. Training must proceed serially; only limited local parallelism inside each decision tree can be used to speed it up.

4. Differences and connections between RF (Random Forest) and GBDT

Similarities:

Both are composed of multiple trees, and the final result is determined jointly by those trees.

Differences:

  • The trees in a random forest can be classification trees or regression trees, while GBDT is composed only of regression trees
  • The trees in a random forest can be generated in parallel, while the trees in GBDT are generated serially
  • A random forest's result is a majority vote (or an average for regression), while GBDT's result is the sum of the outputs of multiple trees
  • Random forests are insensitive to outliers, while GBDT is sensitive to them
  • Random forests mainly reduce the variance of the model, while GBDT mainly reduces its bias
  • Random forests do not need feature normalization, and in practice neither does GBDT, since both are built from decision trees
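As a rough illustration (a sketch assuming scikit-learn; the synthetic data and hyperparameters are invented), the two ensembles can be trained side by side: the random forest averages independently grown trees, while the GBDT sums sequentially grown trees.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random forest: trees trained independently on bootstrap samples, predictions averaged.
rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0).fit(X_tr, y_tr)

# GBDT: trees trained one after another, each fitting the negative gradient (≈ residual),
# and their (shrunken) outputs summed to form the prediction.
gbdt = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                 random_state=0).fit(X_tr, y_tr)

print("RF   test MSE:", mean_squared_error(y_te, rf.predict(X_te)))
print("GBDT test MSE:", mean_squared_error(y_te, gbdt.predict(X_te)))
```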

5. Code implementation

GitHub:github.com/NLP-LOVE/ML…


Author: @mantchs


Welcome to join the discussion and help improve this project! QQ group number: 541954936