This article is from the WeChat official account: [Alchemy of Machine Learning]

[TOC]

1 Author’s Preface

It's a little out of date to be sorting out XGB's algorithm in 2020; I'm doing it mainly to broaden my knowledge and to prepare for interviews. In today's big data competitions, XGB has largely been replaced by LGB models, so take this as a review of the Boost family. The AdaBoost and gradient boosting algorithms have been introduced in other posts; this article is about XGBoost.

2 Overview of the Tree Model

XGB stands for eXtreme Gradient Boosting. XGB is simply an ensemble of classification and regression trees (CART), similar in spirit to GBDT and AdaBoost. 【CART = Classification And Regression Trees】

Here, for a decision tree, how to split and how to choose the best split point is actually a search process: we search for the split that minimizes the objective function. The objective function is:

$Obj = Loss + \Omega$

$Obj$ is the objective we want to minimize, $Loss$ is the loss between this CART model's predictions and the true values, and $\Omega$ is the complexity of the CART model, similar to the regularization term in a neural network. The formula above is an abstract statement; what we should take away is that the CART model's predictions should be as accurate as possible, while the tree itself should not be too complex.

For regression problems, we can use the mean squared error as the Loss:

$Loss = \sum_i (y_i - \hat{y_i})^2$

For classification problems, cross entropy is very common. Taking binary cross entropy as an example:

$Loss = -\sum_i \left( y_i \log(\hat{y_i}) + (1 - y_i) \log(1 - \hat{y_i}) \right)$

In short, the Loss is a measure of the prediction accuracy of the model.


Let's see how to calculate the model complexity $\Omega$:

$\Omega = \gamma T + \frac{1}{2} \lambda \sum_j^T w_j^2$

$T$ represents the number of leaf nodes, and $w_j$ represents the weight on each leaf node.

A common point of confusion: $w_j$ is not the number of samples in a leaf, nor the proportion of samples. The value of $w_j$ is obtained by taking the derivative of the whole objective function and solving for the weight of each leaf.
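For reference, here is that closed-form result as stated in the XGBoost paper (it is not derived in this article; $g_i$ and $h_i$ are the first and second derivatives of the loss, formally defined in Section 3.2 below, and $I_j$ is the set of samples falling into leaf $j$). For a fixed tree structure, grouping the second-order objective by leaf gives:

$Obj = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T, \quad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i$

and setting $\partial Obj / \partial w_j = G_j + (H_j + \lambda) w_j = 0$ yields the optimal leaf weight:

$w_j^* = -\frac{G_j}{H_j + \lambda}$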

3 XGB vs GBDT

Having said all this, it seems there is not much difference between XGB and GBDT. That's because we haven't even started talking about XGB! So far we've only covered the general concepts of tree models. Below I explain XGB, tidying up some views found online and adding my own understanding. If there is any mistake, please point it out in the comments, thank you!

3.1 Difference 1: Built-in regularization term

In GBDT, each new weak classifier is fitted to the negative gradient, but how many trees are enough? We don't know. XGB's objective function includes a complexity term $\Omega$. This complexity is not that of one particular CART, but the total complexity of all the CARTs in XGB. As you can imagine, each additional CART increases the complexity penalty, and XGB stops when the decrease in loss is smaller than the increase in complexity.
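As a minimal sketch of where this regularization surfaces in the xgboost Python package (the data and parameter values here are arbitrary, chosen only for illustration): `gamma` corresponds to the $\gamma$ in the $\Omega$ formula above, and `reg_lambda` to $\lambda$, the L2 penalty on leaf weights.

```python
import numpy as np
from xgboost import XGBRegressor

# Toy data, just to make the sketch runnable
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=500)

model = XGBRegressor(
    n_estimators=100,
    max_depth=3,
    gamma=0.1,       # gamma * T: penalty per leaf node
    reg_lambda=1.0,  # (lambda/2) * sum(w_j^2): L2 penalty on leaf weights
)
model.fit(X, y)
```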

3.2 Difference 2: Second-derivative information

In GBDT, the new CART fits the negative gradient, i.e., only the first derivative. In XGB, we also use second-derivative information.

Here’s a simple derivation of how XGB uses the second derivative information:

  1. Previously, we obtained the XGB optimization function:


$Obj = Loss + \Omega$

  2. Then we make $Loss$ and $\Omega$ a little more specific:

$Obj = \sum_i^n Loss(y_i, \hat{y}_i^t) + \sum_j^t \Omega(cart_j)$

- $\hat{y}_i^t$: the prediction for sample $i$ given by the ensemble of $t$ CART weak classifiers;
- $y_i$: the true value of the $i$-th sample;
- $\Omega(cart_j)$: the complexity of the $j$-th CART model.

  3. Now we want the optimization function for the $t$-th CART model; at this point, the first $t-1$ models are already known. So we have:

$\hat{y}_i^t = \hat{y}_i^{t-1} + f_t(x_i)$

The prediction of $t$ CART models equals the prediction of the previous $t-1$ models plus the prediction of the $t$-th model.

  4. So we can get:

$\sum_i^n Loss(y_i, \hat{y}_i^t) = \sum_i^n Loss(y_i, \hat{y}_i^{t-1} + f_t(x_i))$

Recall the second-order Taylor expansion:

$f(x + \Delta x) \approx f(x) + f'(x) \Delta x + \frac{1}{2} f''(x) \Delta x^2$

  5. So how do we plug in Taylor's formula?

In $Loss(y_i, \hat{y}_i^t)$, $y_i$ is a constant, so we can write it as $Loss(\hat{y}_i^t)$, which equals $Loss(\hat{y}_i^{t-1} + f_t(x_i))$.

  6. Insert Taylor's formula, viewing $\hat{y}_i^{t-1}$ as $x$ and $f_t(x_i)$ as $\Delta x$:

$Loss(\hat{y}_i^{t-1} + f_t(x_i)) \approx Loss(\hat{y}_i^{t-1}) + Loss'(\hat{y}_i^{t-1}) f_t(x_i) + \frac{1}{2} Loss''(\hat{y}_i^{t-1}) (f_t(x_i))^2$

In many articles, $g_i = Loss'(\hat{y}_i^{t-1})$ and $h_i = Loss''(\hat{y}_i^{t-1})$ are used to denote the first and second derivatives of the loss.

  7. Bringing the Taylor expansion back into the original optimization function, and removing the constant term $Loss(\hat{y}_i^{t-1})$ (it does not depend on the $t$-th model) as well as the complexity of the previous $t-1$ models, we obtain the optimization function of the $t$-th CART:


$Obj^t \approx \sum_i^n \left[ g_i f_t(x_i) + \frac{1}{2} h_i (f_t(x_i))^2 \right] + \Omega(cart_t)$

So XGB uses second-derivative information, while GBDT only uses the first-order gradient.
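To make $g_i$ and $h_i$ concrete, here is a minimal sketch of a custom objective for the xgboost package, using the squared error $Loss = (y_i - \hat{y}_i)^2$, whose derivatives are $g_i = 2(\hat{y}_i - y_i)$ and $h_i = 2$ (the data and parameter values are arbitrary):

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(predt, dtrain):
    """Return (g_i, h_i) for Loss = (y - yhat)^2:
    g = dLoss/dyhat = 2*(yhat - y), h = d^2Loss/dyhat^2 = 2."""
    y = dtrain.get_label()
    grad = 2.0 * (predt - y)
    hess = np.full_like(predt, 2.0)
    return grad, hess

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3}, dtrain, num_boost_round=20,
                    obj=squared_error_obj)
```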

3.3 Difference 3: Column sampling

Taking a page from random forest, XGB supports not only sample (row) subsampling but also feature subsampling (column sampling), which both reduces overfitting and reduces computation. (But I personally doubt this point; it feels like a feature provided by the implementation rather than an advantage of the XGB algorithm itself over GBDT, because both XGB and GBDT could use column sampling. Anyway, the key differences are the second derivative and the built-in regularization term.)
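For reference, a minimal sketch of where these sampling knobs live in the xgboost Python package (the values here are arbitrary):

```python
from xgboost import XGBRegressor

model = XGBRegressor(
    subsample=0.8,          # row sampling: fraction of samples used per tree
    colsample_bytree=0.8,   # column sampling: fraction of features used per tree
    colsample_bylevel=0.8,  # optionally re-sample columns at each depth level
)
```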

4 Why XGB uses the second derivative

This is an advanced interview question about XGB. When I first saw this question, I was stunned.

【First, my own summary of the answer】

  1. The second-derivative information accelerates convergence.
  2. It reduces the amount of computation.

4.1 Why the amount of computation is reduced

This one is easier to explain, so let's start with it. In GBDT, what takes the most time is finding the split point: choosing which feature to split on, and at which value, so as to obtain the minimum loss. Assuming there are 5 features and each feature has 100 candidate split points, evaluating the splits requires computing the loss 500 times.

Without the expansion, each of those 500 evaluations would compute $loss(y, \hat{y}^t) = loss(y, \hat{y}^{t-1} + f_t(x))$ from scratch. But using the Taylor expansion, $loss(\hat{y}^{t-1}) + g \cdot f_t(x) + \frac{1}{2} h (f_t(x))^2$, the terms $loss(\hat{y}^{t-1})$, $g$, and $h$ depend only on the decision trees that have already been trained, so they are constants: they can be computed once and shared across all 500 calculations.
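A minimal sketch of this sharing, using the standard split-gain formula from the XGBoost paper (the function name `best_split` is my own, and `lam` is the L2 weight $\lambda$): `g` and `h` are computed once per boosting round, and each candidate threshold only re-aggregates them with prefix sums.

```python
import numpy as np

def best_split(x, g, h, lam=1.0):
    """Exact greedy split search on one feature.
    g, h: per-sample first/second derivatives, computed ONCE per round
    and reused for every candidate threshold."""
    order = np.argsort(x)
    g_sorted, h_sorted = g[order], h[order]
    G, H = g.sum(), h.sum()            # totals for the whole node
    GL = np.cumsum(g_sorted)[:-1]      # left-child sums for each cut point
    HL = np.cumsum(h_sorted)[:-1]
    GR, HR = G - GL, H - HL
    # Gain per the XGBoost paper (the constant 1/2 factor and the gamma
    # penalty are omitted here; they do not change which cut is best):
    gain = GL**2 / (HL + lam) + GR**2 / (HR + lam) - G**2 / (H + lam)
    best = np.argmax(gain)
    threshold = (x[order][best] + x[order][best + 1]) / 2
    return threshold, gain[best]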

4.2 Why is the Convergence Speed Accelerated

Let's go back to the Taylor expansion:

$f(x + \Delta x) \approx f(x) + g(x) \Delta x + \frac{1}{2} h(x) (\Delta x)^2$

This formula can actually be regarded as a function $F(\Delta x)$, because $x$ can be seen as a constant. We want $F(\Delta x)$ to be minimal (that is, the loss is minimal), so we take the derivative with respect to $\Delta x$:

$F'(\Delta x) = g(x) + h(x) \Delta x = 0$

Setting the derivative to zero (assuming a convex function), we get $\Delta x = -\frac{g(x)}{h(x)}$. That is, the update step is the first derivative divided by the second derivative.

Those of you who know optimization algorithms will realize that this is actually equivalent to Newton's method: each time XGB trains a new base model, it is in effect using Newton's method to step toward the minimum of the loss function.

Therefore, I personally think the reason XGB (using second-order information) converges faster than GBDT (using first-order information) can be explained by Newton's method converging faster than gradient descent.

To be honest, my explanation of this part is not entirely rigorous, because my grasp of optimization theory is not fine-grained (which suddenly feels like a reason I can't find a job, 2333). What I can give is a more general explanation: in essence, Newton's method has second-order convergence while gradient descent has first-order convergence, so Newton's method is faster. More informally: suppose you want to find the shortest path to the bottom of a basin. At each step, gradient descent picks the direction with the steepest slope from your current position; Newton's method, when choosing a direction, considers not only whether the slope is steep now, but also whether the slope will get steeper after you take the step.
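A toy sketch of the $\Delta x = -g/h$ step, on a simple one-dimensional convex function $f(x) = 3(x-2)^2 + 1$ that I picked for illustration: on a quadratic, Newton's method lands on the minimum in a single update, while gradient descent takes many small ones.

```python
def f_grad(x):   # g(x) = f'(x) for f(x) = 3*(x-2)^2 + 1
    return 6.0 * (x - 2.0)

def f_hess(x):   # h(x) = f''(x), constant for a quadratic
    return 6.0

# Newton step: x <- x - g/h, reaches the minimum x = 2 in one update
x_newton = 10.0
x_newton -= f_grad(x_newton) / f_hess(x_newton)
print(x_newton)        # 2.0

# Gradient descent: x <- x - lr*g, needs many updates
x_gd, lr = 10.0, 0.05
for _ in range(50):
    x_gd -= lr * f_grad(x_gd)
print(round(x_gd, 4))  # close to 2.0, but only after 50 steps
```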

5 Newton’s method

Here's a brief introduction to what Newton's method is; after all, some friends may not have learned it, or, like me, learned it and then forgot.

The purpose of Newton's method is to find the roots of a function, i.e., the points at which the function intersects the X-axis.

Suppose we have a cubic curve. We start at point A and draw the tangent line at A; this tangent intersects the X-axis. From that intersection point we draw a line parallel to the Y-axis, which meets the curve at point B; then we draw the tangent at B, which again intersects the X-axis, and so on.

The iteration then moves on to point C.

Step by step, we approach the point where the cubic curve crosses the X-axis, which is exactly the root where the cubic function equals 0.


【Mathematical formula】

The tangent line at point $x_n$ intersects the X-axis where:

$f(x_n) + f'(x_n)(x - x_n) = 0$

So it is very easy to get:

$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$
Why is only first-order information used here? Because the goal here is to find the root of the function, the point where the function equals 0. In an optimization problem, we solve for the minimum of a function, which means finding the root of its derivative, $f'(x) = 0$; applying Newton's method to $f'$ is what brings the second derivative into the optimization.
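A minimal sketch of the iteration above (the helper name `newton_root`, the cubic, and the starting point are my own arbitrary choices):

```python
def newton_root(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Newton's method for f(x) = 0: x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:   # stop once the update is negligible
            break
    return x

# Example: a root of the cubic x^3 - 2x - 5
root = newton_root(lambda x: x**3 - 2*x - 5,
                   lambda x: 3*x**2 - 2,
                   x0=3.0)
print(root)  # ~2.0946
```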

4,000 words. I'm so tired. You're welcome to add me as a friend to chat.