1. What is XGBoost

XGBoost is an open-source machine learning project developed by Tianqi Chen et al. It is an efficient implementation of the GBDT algorithm with many algorithmic and engineering improvements. XGBoost has been widely used in Kaggle and many other machine learning competitions and has achieved good results.

When talking about XGBoost, we have to mention GBDT (Gradient Boosting Decision Tree), because XGBoost is essentially a GBDT that strives for maximum speed and efficiency, hence the X for eXtreme in its name (eXtreme Gradient Boosting). Both are boosting methods.

GBDT itself is not covered here; see my previous introduction to it.

1.1 Definition of XGBoost tree

First, an example. Suppose we want to predict how much each member of a family likes computer games. Intuitively, compared with the old, the young are more likely to love video games, and compared with women, men are more likely to love video games. So we first split by age to distinguish children from adults, then split by gender into male and female, and assign each resulting group a score for its degree of preference for video games, as shown in the figure below.

In this way, two trees, tree1 and tree2, are trained, similar in principle to the GBDT introduced before. The sum of the conclusions of the two trees is the final conclusion, so the predicted score of the child is the sum of the scores of the leaf nodes the child falls into in the two trees: 2 + 0.9 = 2.9. Grandpa’s predicted score is obtained the same way: -1 + (-0.9) = -1.9. The details are shown in the figure below:

At this point you may want to stand up and exclaim: isn’t this exactly the same as the GBDT introduced in the earlier article?

In fact, leaving aside some differences in engineering implementation and problem solving, one of the biggest differences between XGBoost and GBDT is the definition of the objective function. The objective function of XGBoost is shown below:
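The original post shows the formula as an annotated figure; written out (following the standard form in the XGBoost paper), the objective at boosting round t is roughly:

Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \mathrm{constant}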

Among them:

  • The red arrow points to l, the loss function (e.g., the squared loss l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2).
  • The red boxes mark the regularization terms (L1 regularization and L2 regularization).
  • The red circle marks the constant term.
  • For f(x), XGBoost uses a Taylor expansion up to second-order terms as an approximation; each f(x) is a regression tree.

If you are starting to feel dizzy from all the formulas, note that I only give a brief explanation here; for the detailed algorithm and formula derivations, please check the very thorough blog post listed in the references: "Xgboost is a kaggle match killer".

The core algorithmic idea of XGBoost is not difficult. Basically:

  1. Keep adding trees, growing each tree by continually splitting on features, one tree at a time. Each time a tree is added, a new function f(x) is learned to fit the residual of the previous prediction.
  2. When training is complete and we have K trees, to predict the score of a sample we look at its features: in each tree the sample falls into one leaf node, and each leaf node corresponds to a score.
  3. Finally, add up the scores from all the trees to get the predicted value of the sample (see the formula after this list).
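In symbols, this is the standard additive model from the XGBoost paper: the prediction for sample x_i over K trees is

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}

where \mathcal{F} is the space of regression trees (CART).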

Obviously, our goal is to make the predicted value of the tree ensemble, \hat{y}_i, as close as possible to the true value y_i, while having the strongest possible generalization ability. Similar to the GBDT routine seen before, XGBoost also accumulates the scores of multiple trees to get the final predicted score: in each iteration, a new tree is added to the existing ensemble to fit the residual between the previous prediction and the true value.
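Written iteratively, this accumulation reads:

\hat{y}_i^{(0)} = 0, \qquad \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)

that is, the prediction at round t is the prediction at round t-1 plus the newly added tree f_t.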

So how do we choose which f to add in each round? The answer is straightforward: pick the f that reduces our objective function as much as possible. Here the objective can be approximated using a Taylor expansion, as shown next.
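Concretely (following the second-order expansion in the XGBoost paper), define the first and second derivatives of the loss with respect to the previous prediction,

g_i = \partial_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)}), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)}),

and the objective is approximated as

Obj^{(t)} \approx \sum_{i=1}^{n} \Big[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t).

Since l(y_i, \hat{y}_i^{(t-1)}) is a constant at round t, only the g_i and h_i terms matter when choosing f_t.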

In essence, each way of assigning samples to leaves corresponds to a value of the objective (OBJ), and the optimization process is the optimization of this OBJ: splitting nodes produces different combinations of leaves, and different combinations correspond to different OBJ values. All optimization revolves around this idea. So far we have discussed the first part of the objective function, the training error. Next we discuss the second part of the objective function: the regularization term, which defines the tree’s complexity.

1.2 Regularization term: tree complexity

XGBoost’s tree complexity consists of two parts:

  • One is the number of leaves in the tree, T.
  • The other is the squared L2 norm of the leaf scores w (applying L2 regularization to w is equivalent to adding L2 smoothing to the score of each leaf node, which helps avoid overfitting); see the formula after this list.
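Put together (following the definition in the XGBoost paper), the complexity of a tree f with T leaves and leaf scores w is:

\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

where γ penalizes the number of leaves and λ controls the L2 smoothing of the leaf scores.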

Let’s look again at XGBoost’s objective function (the loss function captures the training error, the regularization term defines the model complexity):


The regularization term is the second half of the objective function. In the equation above, \hat{y}_i is the output of the whole additive model, and the regularization term \sum_{k} \Omega(f_k) represents the total complexity of the trees: the smaller its value, the lower the complexity and the stronger the generalization ability.

1.3 How should trees grow

It is interesting to see how XGBoost is optimized and computed from start to finish, but we have not yet seen what the tree actually looks like. Obviously, a tree is created by splitting one node into two, and then splitting recursively until the whole tree is grown, so how the tree splits is the key question. For how a leaf node splits, the XGBoost authors give a method in the original paper: a greedy method that enumerates the different candidate tree structures.

Enumerate the possible tree structures, use the scoring function to find the optimal structure, add it to the model, and repeat the process. This search uses a greedy algorithm: select a feature to split on and compute the minimum value of the loss function, then select another feature to split on and compute its minimum loss, and after enumerating all candidates, choose the one with the best effect and perform that split to grow the tree.

In summary, XGBoost uses the same idea as the CART regression tree: a greedy algorithm traverses all candidate split points of all features, but with a different objective function. Concretely, the criterion is the gain of the objective, i.e., the objective value of the single leaf node minus the objective value after splitting it into two. At the same time, to keep the tree from growing too deep, a threshold is added: splitting is only carried out when the gain is greater than this threshold. The node keeps splitting to form a tree, then another tree is built, each time taking the current best prediction as the starting point for further splitting/building.
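To make the enumeration concrete, here is a minimal Python sketch of the split search for a single feature. The gain formula follows the XGBoost paper; the function names (split_gain, best_split) and the loop structure are my own illustration, not XGBoost's internal implementation.

```python
import numpy as np

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain from splitting one leaf into left/right children, given the
    sums of gradients (G) and Hessians (H) falling on each side."""
    def score(G, H):
        return G * G / (H + lam)
    # structure score of the two children minus the unsplit leaf,
    # minus the complexity cost gamma of adding one more leaf
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy search over the split points of one feature x;
    returns (threshold, gain) of the best candidate."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_thr, best_gain = None, -np.inf
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[i + 1]:        # cannot split between equal values
            continue
        gain = split_gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
        if gain > best_gain:
            best_thr, best_gain = (x[i] + x[i + 1]) / 2, gain
    return best_thr, best_gain
```

A node is only worth splitting if the best gain is positive, which is exactly the pre-pruning threshold discussed in the next subsection.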

1.4 How to stop the tree generation loop

This kind of loop iteration must have a stop condition. When should it stop? In short: set a maximum tree depth, and stop growing when the sum of sample weights falls below a set threshold, to prevent overfitting. Specifically (the corresponding xgboost hyperparameters are sketched after this list):

  1. When the gain brought by a candidate split is less than a set threshold, we ignore that split. Hence not every split decreases the loss function overall, which is a form of pre-pruning. The threshold is the parameter γ, the coefficient of the number of leaf nodes T in the regularization term.
  2. When the tree reaches the maximum depth, tree construction stops. The hyperparameter max_depth avoids the overfitting that comes from learning local samples with an overly deep tree.
  3. When the sum of sample weights in a node is less than a set threshold, tree construction also stops. This minimum sample-weight sum is min_child_weight, which is similar to, but not identical to, GBM’s min_child_leaf parameter. The general idea is that a leaf node with too few samples is also terminated, to prevent overfitting.
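These stopping rules map onto hyperparameters exposed by the xgboost Python package; a minimal sketch (the particular values below are arbitrary examples, not recommendations):

```python
import xgboost as xgb

# gamma            -> minimum loss reduction (gain) required to make a split  (rule 1)
# max_depth        -> maximum tree depth                                      (rule 2)
# min_child_weight -> minimum sum of instance weights (Hessians) in a child   (rule 3)
model = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=6,
    gamma=0.1,
    min_child_weight=1,
    reg_lambda=1.0,   # L2 regularization on the leaf scores w
)
```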

2. What is the difference between XGBoost and GBDT

In addition to some algorithmic differences from traditional GBDT, XGBoost also has a lot of engineering optimizations. In general, the differences and connections between the two can be summarized in the following aspects.

  1. GBDT is a machine learning algorithm and XGBoost is an engineering implementation of this algorithm.
  2. When using CART as a base classifier, XGBoost explicitly adds regularization terms to control the complexity of the model, which helps prevent overfitting and thus improves the generalization of the model.
  3. GBDT only uses first-derivative information of the cost function during training, whereas XGBoost performs a second-order Taylor expansion of the cost function and can use both the first and second derivatives.
  4. Whereas traditional GBDT uses CART as a base classifier, XGBoost supports many types of base classifiers, such as linear classifiers.
  5. Whereas traditional GBDT uses all the data in each iteration, XGBoost adopts a strategy similar to random forest and supports data sampling.
  6. Whereas traditional GBDT is not designed to handle missing values, XGBoost can automatically learn the missing value processing strategy.

3. Why does XGBoost use a Taylor expansion, and what are the advantages?

XGBoost uses both the first and second derivatives, and the second-order information allows faster and more accurate convergence. By using the Taylor expansion to express the objective purely in terms of these derivative values, leaf-split optimization can be carried out relying only on values computed from the input data, without committing to a specific form of the loss function. In essence, the choice of loss function is separated from the model/algorithm optimization and parameter selection. This decoupling increases XGBoost’s applicability, allowing the loss function to be chosen on demand, so it can be used for classification as well as regression.
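This decoupling shows up directly in the library API: a custom loss can be plugged in by supplying its gradient and Hessian. Below is a minimal sketch using the xgboost Python package's custom-objective hook; the hand-written squared-error objective and the toy data are purely illustrative (in practice you would just use the built-in objective):

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(predt, dtrain):
    """Custom objective: gradient and Hessian of 0.5 * (pred - y)^2."""
    y = dtrain.get_label()
    grad = predt - y             # first derivative  g_i
    hess = np.ones_like(predt)   # second derivative h_i
    return grad, hess

# Toy data, assumed purely for illustration.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=20, obj=squared_error_obj)
```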

4. Code implementation

GitHub: the Machine Learning project repository (see the link in the references below).
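Since only a link is given here, the following is a minimal, self-contained usage sketch with the xgboost and scikit-learn packages; it is not the repository's code, just a quick starting point:

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Small built-in regression dataset; any tabular data would work.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("test MSE:", mean_squared_error(y_test, pred))
```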

5. References

Xgboost is a kaggle match killer

Author: @mantchs

GitHub:github.com/NLP-LOVE/ML…

Welcome to join the discussion! Work together to improve this project! Group Number: [541954936]