XGBoost is no stranger to data science competitions, but it’s also widely used in industry. This article shares a collection of XGBoost high-frequency interview questions that have been collected over the years in the hope of deepening your understanding of XGBoost and, more importantly, helping you find opportunities.

1. A brief introduction to XGBoost

First, we need to talk about GBDT, an additive model built with a boosting strategy. Training uses the forward stagewise algorithm for greedy learning: each iteration learns a CART tree to fit the residual between the predictions of the previous t-1 trees and the true values of the training samples.

XGBoost applies a series of optimizations to GBDT, such as a second-order Taylor expansion of the loss function, a regularization term added to the objective function, support for parallelism, and default handling of missing values. These greatly improve scalability and training speed, but the core idea is largely unchanged.

2. What is the difference between XGBoost and GBDT

  • Base classifiers: XGBoost's base learners are not limited to CART decision trees; linear classifiers are also supported, in which case XGBoost is equivalent to logistic regression with L1 and L2 regularization (classification) or regularized linear regression (regression). A sketch follows this list.
  • Derivative information: XGBoost performs a second-order Taylor expansion of the loss function, whereas GBDT only uses first-order derivative information. XGBoost also supports custom loss functions, as long as they are first- and second-order differentiable.
  • Regularization term: XGBoost's objective function adds a regularization term, which acts like pre-pruning and makes the learned model harder to overfit.
  • Column sampling: XGBoost supports column (feature) sampling, similar to random forest, to prevent overfitting.
  • Missing value handling: for each non-leaf node in the tree, XGBoost can automatically learn a default split direction. If a sample is missing that feature value, it is sent down the default branch.
  • Parallelization: not parallelism across trees, but parallelism across features. XGBoost pre-sorts each feature by its values and stores the result in a block structure. When splitting a node, multiple threads can search for the best split point of each feature in parallel, which greatly improves training speed.
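
As a small illustration of the first point, here is a minimal sketch (synthetic data, illustrative parameter values) showing that switching from tree base learners to a linear base learner is just a change of the booster parameter:

```python
import numpy as np
import xgboost as xgb

# Toy regression data, purely for illustration.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.randn(200)
dtrain = xgb.DMatrix(X, label=y)

# Default: gradient-boosted CART trees.
tree_params = {"booster": "gbtree", "max_depth": 3, "eta": 0.1}

# Linear base learner: boosted linear model with L1/L2 regularization,
# which for a regression objective behaves like regularized linear regression.
linear_params = {"booster": "gblinear", "alpha": 0.01, "lambda": 1.0}

bst_tree = xgb.train(tree_params, dtrain, num_boost_round=50)
bst_linear = xgb.train(linear_params, dtrain, num_boost_round=50)
```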

3. Why does XGBoost use Taylor second-order expansion

  • Accuracy: compared with GBDT's first-order expansion, XGBoost's second-order Taylor expansion approximates the true loss function more accurately.
  • Extensibility: the loss function can be customized; the only requirement is that the new loss be second-order differentiable (see the sketch below).
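
To make the extensibility point concrete, here is a minimal sketch (Python API, synthetic data assumed) of a custom objective: XGBoost only asks for the first-order gradient and the second-order derivative (hessian) of the loss with respect to the current predictions.

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    """Custom objective for 1/2 * (pred - label)^2."""
    labels = dtrain.get_label()
    grad = preds - labels        # first-order derivative of the loss
    hess = np.ones_like(preds)   # second-order derivative of the loss
    return grad, hess

rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = rng.randn(100)
dtrain = xgb.DMatrix(X, label=y)

bst = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                num_boost_round=20, obj=squared_error_obj)
```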

4. Why can XGBoost train in parallel

  • XGBoost's parallelism does not mean that the trees are trained in parallel. XGBoost is still boosting in essence: each tree must wait for the previous tree to finish training before it can start.
  • XGBoost's parallelism is parallelism in the feature dimension: before training, each feature is pre-sorted by its values and stored as a block structure, which is reused in later searches for split points. Because features are stored as blocks, multiple threads can work on the blocks in parallel when searching for the optimal split point of each feature.

5. Why is XGBoost fast

  • Block parallelism: before training, each feature is sorted by its values and stored as a block structure, which is reused in later searches for split points; searching the split points of different features in parallel is also supported.
  • Candidate split points: each feature uses only a constant number of quantiles as candidate split points.
  • CPU cache-hit optimization: cache prefetching is used, and each thread is allocated a contiguous buffer into which the gradient statistics of the samples in each block are read.
  • Block processing optimizations: blocks are prefetched into memory, compressed and decompressed by column, and sharded across different disks to improve throughput.

6. XGBoost’s approach to preventing overfitting

XGBoost was designed with a number of optimizations to prevent overfitting, as follows:

  • A regularization term in the objective function: the number of leaf nodes plus the L2 regularization of leaf-node weights.
  • Column sampling: training uses only a subset of the features.
  • Sub-sampling: each boosting round uses only a fraction of the samples, making the algorithm more conservative.
  • Shrinkage: also called the learning rate or step size; it leaves more room for later trees to learn. A parameter sketch follows this list.
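
A minimal parameter sketch mapping these knobs to the Python API (the values are illustrative, not recommendations):

```python
import xgboost as xgb

params = {
    "gamma": 0.1,             # complexity cost per additional leaf
    "lambda": 1.0,            # L2 regularization on leaf weights
    "colsample_bytree": 0.8,  # column sampling, as in random forest
    "subsample": 0.8,         # row sub-sampling per boosting round
    "eta": 0.1,               # shrinkage / learning rate
    "max_depth": 5,
}
# bst = xgb.train(params, dtrain, num_boost_round=200)  # dtrain: an xgb.DMatrix
```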

7. How does XGBoost handle missing values

One advantage of the XGBoost model is that it allows features to have missing values. Missing values are handled as follows:

  • When finding the best split on feature k, samples whose value of feature k is missing are not traversed; only the non-missing feature values are enumerated as candidate split points. This trick reduces the time needed to find split points for sparse features.
  • To remain complete, the samples with missing values are then tentatively assigned to the left child and to the right child, the gain is computed for both cases, and the direction with the larger gain is recorded as the default branch for samples with a missing value at prediction time.
  • If there are no missing values during training but some appear at prediction time, missing values are sent to the right child by default.

(The pseudocode for missing-value handling in find_split is omitted here.)
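
A minimal sketch (synthetic data) showing that XGBoost accepts NaN as the missing-value marker and learns a default direction for each split:

```python
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.randn(500, 3)
X[rng.rand(500, 3) < 0.2] = np.nan   # inject ~20% missing values
y = (rng.rand(500) > 0.5).astype(int)

# NaN is treated as missing by default; another sentinel can be declared
# through the `missing` argument of DMatrix.
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
bst = xgb.train({"objective": "binary:logistic", "max_depth": 3},
                dtrain, num_boost_round=10)
```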

8. How to calculate the weight of leaf nodes in XGBoost

The final form of the XGBoost objective function, for a fixed tree structure with $T$ leaves, is

$$Obj = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j+\lambda} + \gamma T$$

where $G_j=\sum_{i\in I_j} g_i$ and $H_j=\sum_{i\in I_j} h_i$ are the sums of the first- and second-order gradients of the samples falling into leaf $j$ (the set $I_j$).

The objective reaches its minimum $Obj^*$ when each leaf node takes the weight

$$w_j^* = -\frac{G_j}{H_j+\lambda}$$

9. Stop growing conditions for a tree in XGBoost

  • When the gain of a newly introduced split is negative, the split is abandoned; this is the trade-off between training loss and model complexity (the gain formula is given after this list).
  • When the tree reaches its maximum depth, it stops growing, because a tree that is too deep tends to overfit; this requires the hyperparameter max_depth.
  • When a split is introduced, the sum of sample weights (second-order gradients) in each of the new left and right leaves is computed; if either sum falls below a threshold, the split is abandoned. This involves the hyperparameter min_child_weight (minimum sum of sample weights): a leaf containing too few samples is not split further, which prevents the tree from becoming too fine-grained.
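
For reference, the split gain used in the first condition is the standard XGBoost gain, in the same notation as question 8 ($G_L, H_L$ and $G_R, H_R$ are the gradient sums of the candidate left and right children):

$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

A split is kept only when this gain is positive; $\gamma$ is exactly the complexity cost of adding one more leaf.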

10. Difference between RF and GBDT

Similarities:

  • Both are ensembles of multiple trees, and the final result is determined jointly by those trees.

Difference:

  • Ensemble paradigm: RF follows the bagging idea, while GBDT follows the boosting idea
  • Bias-variance trade-off: RF keeps reducing the variance of the model, while GBDT keeps reducing the bias
  • Training samples: in each iteration RF draws a bootstrap sample (sampling with replacement) from the training set, while GBDT uses all samples every time
  • Parallelism: RF trees can be built in parallel, while GBDT trees can only be built sequentially (each tree must wait for the previous one to finish)
  • Final result: RF takes a majority vote of the trees (the average for regression), while GBDT is a weighted sum of the trees' outputs
  • Sensitivity to data: RF is not sensitive to outliers, while GBDT is sensitive to outliers
  • Generalization ability: RF is not prone to overfitting, while GBDT overfits easily

11. How does XGBoost handle unbalanced data

Imbalanced data sets, such as user purchase behavior, can be extremely skewed, which significantly affects XGBoost's training. XGBoost has two built-in ways to handle this:

First, if you care about AUC and use AUC to evaluate model performance, you can balance the weights of the positive and negative samples by setting scale_pos_weight. For example, when the ratio of positive to negative samples is 1:10, scale_pos_weight can be set to 10.

Second, if you care about probabilities (the calibration of the predicted scores), you cannot re-balance the data set (that would distort its true distribution); instead, set max_delta_step to a finite number (say 1) to help convergence (this mainly helps when the base learner is logistic regression).

The exact wording from the documentation is:

For common cases such as ads clickthrough log, the dataset is extremely imbalanced. This can affect the training of xgboost model, 
and there are two ways to improve it.
  If you care only about the ranking order (AUC) of your prediction
      Balance the positive and negative weights, via scale_pos_weight
      Use AUC for evaluation
  If you care about predicting the right probability
      In such a case, you cannot re-balance the dataset
      In such a case, set parameter max_delta_step to a finite number (say 1) will help convergence

So how exactly does scale_pos_weight balance the samples in the source code: by re-weighting or by oversampling? Look at the source code:

if (info.labels[i] == 1.0f) w *= param_.scale_pos_weight;

As you can see, it increases the weight of the minority (positive) samples.

In addition, you can also address positive/negative imbalance through over-sampling, under-sampling, SMOTE, or a custom cost function.
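
A minimal usage sketch of the first approach (synthetic imbalanced data, illustrative values):

```python
import numpy as np
import xgboost as xgb

# Toy imbalanced data: roughly 10 negatives per positive.
rng = np.random.RandomState(0)
X = rng.randn(2000, 10)
y = (rng.rand(2000) < 0.09).astype(int)
neg, pos = int(np.sum(y == 0)), int(np.sum(y == 1))

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": neg / pos,  # re-weight the positive (minority) class
    # If calibrated probabilities matter more than ranking, leave
    # scale_pos_weight at 1 and set max_delta_step (e.g. 1) instead.
}
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train(params, dtrain, num_boost_round=100)
```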

12. Compare LR and GBDT, and tell us when GBDT is inferior to LR

First, the differences between LR and GBDT:

  • LR is a linear model with strong interpretability and is easy to parallelize, but its capacity is limited and it requires heavy manual feature engineering
  • GBDT is a nonlinear model that naturally captures feature interactions and has strong representational power, but the trees cannot be trained in parallel and tree models are prone to overfitting

In the scenario of high dimensional sparse features, LR’s effect is generally better than GBDT. Here’s why:

Let’s start with an example:

Assume a binary classification problem with labels 0 and 1 and 100-dimensional features. Suppose there are 10,000 samples but only 10 positives, and feature f1 equals 1 for all 10 positive samples while f1 is 0 for the remaining 9,990 samples (a very common situation with high-dimensional sparse data).

In this case the tree model will readily pick f1 as the split at an important node, because that single split separates the training data almost perfectly. At test time, however, the results are very poor, because f1 only happens to coincide with y in the training set; this is exactly what we call overfitting.

Would LR overfit in the same way? After all, the model y = W1*f1 + Wi*fi + ... could also make W1 very large to fit those 10 samples. So why is the tree model more prone to overfitting here?

Thinking it through, the reason is regularization. For a linear model such as LR, the regularization term penalizes the weights: as soon as W1 grows too large, the penalty grows too, so W1 is pushed back down. The tree model is different: its penalty terms are typically the number of leaves, the depth, and so on. In the case above, the tree needs only one node to separate the 9,990 and 10 samples perfectly, so the resulting penalty is tiny.

This is why, with high-dimensional sparse features, linear models can beat nonlinear ones: a linear model with regularization is less likely to overfit sparse features.

13. How to prune a tree in XGBoost

  • A regularization term is added to the objective function: the number of leaf nodes and the squared L2 norm of leaf-node weights control the complexity of the tree.
  • When a node is split, a threshold is applied: if the gain in the objective after splitting is below this threshold, the node is not split.
  • When a split is introduced, the sum of sample weights (second-order gradients) of the new left and right leaves is computed; if either sum falls below the minimum-sum-of-weights threshold, the split is abandoned.
  • XGBoost first grows the tree top-down until the maximum depth, then checks bottom-up for nodes whose splits do not satisfy the conditions and prunes them.

14. How does XGBoost select the best split point?

Before training, XGBoost pre-sorts each feature by its values and stores the result as a block structure, which is reused when splitting nodes later.

Therefore, feature-level parallelism can be used: multiple threads compute the optimal split point of each feature, and the feature value with the maximum gain across all features is finally chosen as the best split point.

If every sample is traversed when computing the optimal split of each feature, the cost is very high; such a global scan does not suit big-data scenarios. XGBoost therefore also provides an approximate algorithm: after the features are sorted, only a constant number of quantiles are taken as candidate split points, which greatly improves the efficiency of node splitting. A small configuration sketch follows.
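
A minimal configuration sketch contrasting the exact greedy algorithm with the approximate, histogram-based one (parameter values are illustrative; max_bin applies to the hist method):

```python
import xgboost as xgb

# Exact greedy: every distinct feature value is a candidate split point.
exact_params = {"tree_method": "exact", "max_depth": 6}

# Approximate / histogram-based: only a limited number of candidate split
# points per feature, here bounded by max_bin.
hist_params = {"tree_method": "hist", "max_bin": 256, "max_depth": 6}

# bst = xgb.train(hist_params, dtrain, num_boost_round=100)  # dtrain: xgb.DMatrix
```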

15. How Scalable is XGBoost

  • Scalability of base classifiers: the weak learner can be a CART decision tree or a linear model (LR / linear regression).
  • Scalability of the objective function: custom loss functions are supported, requiring only first- and second-order differentiability. This property comes from the second-order Taylor expansion, which gives the objective a general form.
  • Scalability of the learning system: the block structure supports parallelization and out-of-core computation.

16. How does XGBoost rate the importance of features

XGBoost offers three methods to evaluate feature importance:

From the official documentation:
(1) weight: the number of times a feature is used to split the data across all trees.
(2) gain: the average gain of the feature when it is used in trees.
(3) cover: the average coverage of the feature when it is used in trees.
  • Weight: The total number of times this feature was used in all trees to split the sample.
  • Gain: The average gain generated by this feature in all trees in which it occurs.
  • Cover: The average coverage of this feature across all trees where it appears.

Note: Coverage here refers to the number of samples affected by a feature after it is used as a segmentation point, that is, how many samples are divided into two sub-nodes through the feature.
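
A minimal sketch of reading the three importance types from a trained booster (bst is assumed to be a trained xgb.Booster):

```python
# Each call returns a dict mapping feature name -> importance score.
weight_imp = bst.get_score(importance_type="weight")  # number of splits
gain_imp   = bst.get_score(importance_type="gain")    # average gain
cover_imp  = bst.get_score(importance_type="cover")   # average cover

# Top 10 features by average gain.
print(sorted(gain_imp.items(), key=lambda kv: kv[1], reverse=True)[:10])
```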

17. General steps for tuning XGBoost parameters

First you need to initialize some basic variables, such as:

  • max_depth = 5
  • min_child_weight = 1
  • gamma = 0
  • subsample = 0.8, colsample_bytree = 0.8
  • scale_pos_weight = 1

(1) Determine the learning rate and the number of estimators

Fix the learning rate at 0.1 first, then use cross-validation to find the optimal number of estimators (n_estimators).

(2) Tune max_depth and min_child_weight

These two parameters are adjusted next because they have a big impact on the result. Start with a coarse, wide search range and then iteratively narrow it down.

max_depth, the maximum depth of each tree: search over range(3, 10, 2).

min_child_weight, the minimum sum of sample weights (second-order gradients) in a child node: search over range(1, 6, 2). A node is split only if each resulting child's weight sum stays above this threshold.

(3) Tune gamma

gamma, also called min_split_loss, is the minimum reduction in the loss function required to make a further split on a leaf node; search values from 0.1 to 0.5.

  • If the loss reduction is greater than this threshold, the leaf node is worth splitting further
  • If it is smaller than this threshold, the leaf node is not split further

(4) Tune subsample and colsample_bytree

subsample is the fraction of training samples used in each round.

colsample_bytree is the fraction of features sampled for each tree.

Search both from 0.6 to 0.9.

(5) Tune the regularization parameters

alpha is the L1 regularization coefficient; try 1e-5, 1e-2, 0.1, 1, 100.

lambda is the L2 regularization coefficient.

(6) Lower the learning rate

Finally, decrease the learning rate while increasing the number of trees; a final learning rate of 0.01 to 0.1 is typical.
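
A hedged sketch of steps (1) and (2) using scikit-learn's grid search (the grids mirror the ranges above; X and y are an assumed training set):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Step (1): fix a moderately high learning rate and a starting n_estimators
# (xgb.cv with early stopping is another common way to pick n_estimators).
base = XGBClassifier(learning_rate=0.1, n_estimators=200,
                     subsample=0.8, colsample_bytree=0.8,
                     scale_pos_weight=1)

# Step (2): coarse grid over max_depth and min_child_weight.
param_grid = {
    "max_depth": list(range(3, 10, 2)),
    "min_child_weight": list(range(1, 6, 2)),
}
search = GridSearchCV(base, param_grid, scoring="roc_auc", cv=5)
# search.fit(X, y)
# print(search.best_params_)
```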

18. How to solve the problem if the XGBoost model overfits

When overfitting occurs, two groups of parameters can mitigate it:

The first group directly controls model complexity: max_depth, min_child_weight, and gamma.

The second group adds randomness so that the model is less sensitive to noise during training: subsample and colsample_bytree.

Another option is to lower the learning rate directly while increasing n_estimators at the same time.

19. Why is XGBoost less sensitive to missing values than some models

For features with missing values, the general solution is:

  • Discrete variables: fill with the most frequent feature value;
  • Continuous variables: fill with the median or the mean;

Some models, such as SVM and KNN, rely on distances between samples; if missing values are not handled properly, their predictions degrade badly.

Tree models, however, are far less sensitive to missing values and can usually be applied even when data is missing. The reason is that each node split searches for the best split point (feature value) of some feature, and samples that are missing that feature can simply be ignored; their absence has little effect on finding the best split point.

XGBoost's specific treatment of missing values is described in question 7 above.

Therefore, for data containing missing values, after handling them:

  • When the amount of data is very small, naive Bayes is preferred
  • Data volume is moderate or large, use tree model, preferentially XGBoost
  • Large amount of data, neural network can also be used
  • Avoid using distance metric dependent models such as KNN and SVM

20. Differences between XGBoost and LightGBM

(1) Tree growth strategy: XGBoost uses a level-wise split strategy, while LightGBM uses leaf-wise. XGBoost splits all nodes of a level indiscriminately, but some nodes may have tiny gain and little effect on the result, which wastes computation. Leaf-wise growth always splits the leaf with the largest gain among all current leaves, but it overfits more easily, so the maximum depth must be limited.

(2) Split-point search algorithm: XGBoost uses a feature pre-sorting algorithm, while LightGBM uses a histogram-based split-point algorithm, which has the following advantages:

  • Lower memory usage: with, say, 256 bins, an 8-bit integer is enough to record which bin a sample's feature value falls into (the bin is effectively the discretized feature). Compared with the pre-sorted exact greedy algorithm (an int32 index plus a float32 feature value per entry), this saves about 7/8 of the memory.
  • Higher computational efficiency: the pre-sorted exact greedy algorithm must scan the data and compute the gain for every feature, with complexity O(#data × #features); once the histograms are built, the histogram algorithm only traverses the bins of each feature, with complexity O(#bins × #features).
  • LightGBM can also use histogram subtraction for extra speed: a node's histogram can be obtained by subtracting its sibling's histogram from its parent's, which speeds up the computation.

In fact, XGBoost's approximate algorithm is quite similar to LightGBM's histogram algorithm. So why is XGBoost's approximate algorithm still much slower than LightGBM's?

Because XGBoost rebuilds its histograms dynamically at every level: its histogram is not fixed per feature, since the candidate split points are quantiles weighted by each sample's second-order gradient and therefore change as the tree grows, so the histograms must be reconstructed level by level. LightGBM keeps one histogram per feature, so building the histograms once is enough.

(3) Support for categorical variables: XGBoost cannot take categorical variables directly, so they must be encoded in advance (for example with one-hot encoding), whereas LightGBM can handle categorical variables natively (a small sketch appears at the end of this question).

(4) Cache hit rate: a drawback of XGBoost's block structure is that gradients are fetched by sample index while samples are visited in feature-value order, so memory access is non-contiguous and the CPU cache hit rate can be low, which hurts efficiency. LightGBM splits on histograms, and the gradient statistics are stored bin by bin, so access is contiguous and the cache hit rate is high.

(5) LightGBM and XGBoost use different parallelization strategies:

  • Feature parallelism: in LightGBM, feature parallelism assumes every worker holds the full data set, but each worker searches for the best split only on its subset of features; workers then communicate to compare losses and determine the globally best split, its location is broadcast, and every worker performs the split locally. In XGBoost, each worker holds only part of the columns (vertical partitioning); each worker finds its local best split, workers communicate, the node is split on the worker that owns the best split, and only after that worker broadcasts the indices of the samples going to the left and right children can the other workers proceed. As a result, LightGBM's communication cost between workers is much lower: only the split point needs to be communicated, whereas XGBoost must broadcast sample indices.
  • Data parallelism: used when the data is large but the number of features is small. LightGBM partitions the data horizontally; each worker builds local histograms on its share, which are merged into global histograms. With histogram subtraction, the histogram of the child with fewer samples is computed first and the other child's histogram is obtained by subtraction, roughly halving the communication cost because only the smaller child needs to be communicated. XGBoost's data parallelism also partitions the data horizontally, builds local histograms per worker, and merges them into a global one; the difference is that when nodes are split according to the global histogram, each worker computes the sample indices of both children separately, so it is slower and the communication volume between workers is larger.
  • Voting parallelism (LightGBM): chosen when both the data volume and the dimensionality are large. It is an improvement on data parallelism, where merging histograms is expensive, especially when the feature dimension is high. The idea is that each worker first picks its locally best features, a global vote is held, only the top-voted features have their histograms merged, and the globally optimal split point is then found among them.
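
As a small, hedged sketch of point (3) (toy data; LightGBM's categorical_feature argument and pandas one-hot encoding are used for illustration):

```python
import pandas as pd
import xgboost as xgb
import lightgbm as lgb

df = pd.DataFrame({
    "city": pd.Categorical(["beijing", "shanghai", "beijing", "shenzhen"]),
    "age": [23, 35, 41, 29],
})
y = [0, 1, 0, 1]

# LightGBM: categorical columns can be passed directly.
lgb_train = lgb.Dataset(df, label=y, categorical_feature=["city"])

# XGBoost traditionally needs an explicit encoding such as one-hot
# (recent versions add experimental categorical support, but encoding
# is still the common approach).
X_encoded = pd.get_dummies(df, columns=["city"]).astype(float)
dtrain = xgb.DMatrix(X_encoded, label=y)
```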

