XGBoost is already popular in machine learning circles, and I'm sure many of you have used it. To really understand XGBoost, you need to understand its internal model: once you can map each parameter onto a piece of that internal model, you understand what the parameter means and can tune it as needed. The goal of this article is to make the internals as easy to understand as possible. The main reference is Tianqi Chen's introduction to XGBoost, which in my opinion is one of the best introductions out there. If you are comfortable with English, I suggest reading the original directly and coming back to this article for anything that is unclear.

1. A few things you need to know in advance

1. Supervised learning

Supervised learning is learning from labeled training data. Say we have 100,000 records, each with 100 features and a label. What the label holds depends on the problem being learned. If each record contains the results of the various tests a patient took for a cancer diagnosis, the label is whether the patient has cancer: 1 if they do, 0 if they don't.

Supervised learning, then, means learning from these 100,000 records how to diagnose cancer from the test results. Because the learning is confined to those 100,000 records, whatever knowledge is learned must be extracted from them. Figuratively speaking, the learning happens under the "supervision" of 100,000 labeled records, hence the name supervised learning.

2. The result of supervised learning

How is the knowledge learned by supervised learning represented, and how do we humans use it? In short, the learned knowledge is represented by a model, and it is through the model that we use the knowledge. So what is a model? A model is a mathematical expression. The simplest is the linear model, which looks like this: ŷ_i = ∑_j θ_j x_ij. In our example, x_i is the i-th of our 100,000 records, and x_ij is the j-th test result in that record. ŷ_i is the model's prediction for the record: the larger it is, the more likely the patient has cancer. Usually we also squash ŷ_i into a value between 0 and 1, to make clear that it is a probability prediction, typically with the sigmoid function; refer to other sources if you are not familiar with it. θ_j is the "contribution" of the j-th test to whether a patient gets cancer; the θ_j are the parameters of our model, the thing we learn from the 100,000 records.
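Here is a minimal NumPy sketch of that linear model plus sigmoid squashing; the shapes and variable names are illustrative assumptions of mine, not from any library:

```python
import numpy as np

def sigmoid(z):
    # Squash a raw score into (0, 1) so it can be read as a probability.
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 100,000 records, 100 test results each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 100))   # x_ij: j-th test result of the i-th patient
theta = rng.normal(size=100)          # theta_j: "contribution" of the j-th test

y_hat = X @ theta                     # raw scores: sum_j theta_j * x_ij
p = sigmoid(y_hat)                    # probability-like prediction per patient
```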

So supervised learning comes down to two steps: first, choose a model with parameters; second, find the best parameter values from the training data. "Best", from the application's point of view, means absorbing as much knowledge as possible from the 100,000 training records; from the point of view of the search process itself it has a different flavor, which will be explained in detail below. Once the best parameters are found, we have a model with known parameters, the knowledge is in hand, and we can use it freely.

3. How to find the optimal parameters

Take the linear model above: the patient has 100 test results, so there are 100 parameters θ_j (j from 1 to 100). Each parameter can take any real value, so there are obviously infinitely many combinations of the 100 parameters. How do we decide whether one set of parameters is the best? At this point we need another function to help us judge, and that is the objective function.

How do we determine the objective function? In our example we want to determine whether a patient has cancer, so suppose we squash the linear model's output ŷ_i to between 0 and 1. In our 100,000 training records, the patients who developed cancer are labeled 1 and the patients who didn't are labeled 0. Obviously, the best parameters would be the ones that predict 1 for every patient with cancer and 0 for every patient without: practically perfect parameters! So we can set the objective function to be the MSE:

$$obj = \sum_i \left(\mathrm{sigmoid}\Big(\sum_j \theta_j x_{ij}\Big) - y_i\right)^2$$

This function says: for the i-th record, squash the model's prediction to between 0 and 1, take the difference from the record's true label (0 or 1), then square it. The larger the square, the worse the prediction; it is the model's prediction error on that record. Finally, we sum the prediction error over all 100,000 records, which gives a measure of how well a specific set of parameters predicts the training data.
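In code, this objective is a few lines; a minimal sketch reusing `sigmoid`, `X`, and the shapes from the snippet above (`y` is the 0/1 label vector, an assumed variable):

```python
def mse_objective(theta, X, y):
    # Sum of squared prediction errors over all training records.
    p = sigmoid(X @ theta)
    return np.sum((p - y) ** 2)
```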

Is that perfect? Not quite. The objective function above only evaluates how good the parameters are on the training data; it says nothing about how the parameters will perform when we use the model for prediction. In other words, what is good on the training data is not necessarily good for prediction. Why? For one, the 100,000 records contain errors; for another, they may not cover every kind of sample. As an extreme example, if all 100,000 records are the test results of people over 60, the model we learn may be useless for predicting a 10-year-old child.

So how do we measure whether a set of parameters is good for prediction? The answer is: test it!

You may think that's a throwaway answer.

But it's the truth: real prediction is the ultimate judge. Still, there is one thing we can do in advance, and that is regularization.

Regularization means putting some control on the parameters to keep them from going to extremes. For example, suppose that in the 100,000 records all the patients with cancer are over 60, all the patients without cancer are under 30, and one of the test results is bone density. In general, the old have low bone density and the young have high bone density. The model we learn would then probably give the bone density parameter θ_j a very large weight (in absolute value) and keep the other parameters very small; essentially, the model would rely on this single test to decide whether a patient has cancer, because that minimizes the objective function.

Obviously, such parameters are bad for prediction.

Regularization can help us avoid such problems.

A common choice is L2 regularization: the sum of squares of all parameters. We want this sum to be as small as possible while the model still fits the training data as well as possible.

Finally, adding the L2 regularization term to the original objective function gives the final objective function:

$$obj = \sum_i \left(\mathrm{sigmoid}\Big(\sum_j \theta_j x_{ij}\Big) - y_i\right)^2 + \sum_j \theta_j^2$$

The set of parameters that minimizes this function is the best set we are looking for. This obj consists of two terms, called the loss function and the regularization term. The regularization term is essentially there to control the complexity of the model.
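To make that concrete, a minimal sketch continuing the NumPy example above (the function name is mine; `X`, `y`, and `sigmoid` are as defined earlier):

```python
def objective_with_l2(theta, X, y):
    loss = np.sum((sigmoid(X @ theta) - y) ** 2)  # fit to the training data
    reg = np.sum(theta ** 2)                      # L2 penalty on the parameters
    return loss + reg
```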

1. To keep things as simple as possible, we intentionally glossed over some details above. For example, our example is a classification problem, yet the loss function we used is MSE, which is usually not done. For regression problems the commonly used loss function is indeed MSE, namely:





$$l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$$

For classification problems, the commonly used loss function is the logarithmic (logistic) loss:







$$l(y_i, \hat{y}_i) = y_i \ln\!\left(1 + e^{-\hat{y}_i}\right) + (1 - y_i)\ln\!\left(1 + e^{\hat{y}_i}\right)$$


At first glance this loss function looks strange, and it is fair to ask why this particular function judges whether a set of parameters is good or bad on the training data.

Using our example: for a sample labeled 1, i.e. y_i = 1, only the left-hand term of the loss survives, and since y_i = 1 the final form is:





$$l(y_i, \hat{y}_i) = \ln\!\left(1 + e^{-\hat{y}_i}\right)$$

Here ŷ_i (the y with a little hat on its head) is the model's predicted value, and clearly the larger this value, the closer the function above gets to zero; the loss tends to 0 as ŷ_i goes to infinity. That is exactly what we want.

You can run a similar analysis for a sample with y_i = 0.

As for how this loss function is derived, there are two routes: one via logistic regression, the other via maximum entropy. See other resources for the detailed derivations.
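A quick numerical check of the behavior described above; `np.logaddexp(0, z)` computes ln(1 + e^z) stably (a small sketch of mine, not library code):

```python
import numpy as np

def log_loss(y, y_hat):
    # y * ln(1 + e^{-y_hat}) + (1 - y) * ln(1 + e^{y_hat})
    return y * np.logaddexp(0.0, -y_hat) + (1 - y) * np.logaddexp(0.0, y_hat)

# For a positive sample (y = 1) the loss falls toward 0 as y_hat grows:
for y_hat in [-2.0, 0.0, 2.0, 10.0]:
    print(y_hat, log_loss(1, y_hat))
```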

2. XGBoost

Since XGBoost is a supervised model, our first question is: what model does XGBoost correspond to? The answer: a pile of CART trees. You may now ask, what is a CART tree? Please refer to other sources for that question; my blog has related articles. Next, how does a pile of trees make a prediction? Very simply: the predicted values of the individual trees are added together to give the final prediction. Straightforward enough.

Here’s an example of a CART tree and a bunch of CART trees to determine if a person would like a computer game:





(Figure: predict1.PNG — a single CART tree scoring each family member on liking computer games)

(Figure: predict2.PNG — an ensemble of two CART trees; each member's final score is the sum of the scores from both trees)

The bottom of the second figure illustrates how to make predictions with a bunch of CART trees by simply adding up the predicted scores of each tree.

Why does XGBoost use CART trees rather than ordinary decision trees? In short: even for classification problems, the leaf node of a CART tree holds an actual score rather than a fixed category, and this makes efficient optimization algorithms possible. XGBoost is famous for being both accurate and fast, and the use of CART trees is part of the reason.
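In code, the prediction rule really is just a sum. A toy sketch (the two lambda "trees" are stand-ins I made up to loosely echo the figures, not real trained trees):

```python
def ensemble_predict(trees, x):
    # The final prediction is the sum of every tree's leaf score.
    return sum(tree(x) for tree in trees)

# Toy stand-ins for two trained CART trees (illustrative scores only):
tree1 = lambda x: 2.0 if x["age"] < 15 else -1.0
tree2 = lambda x: 0.9 if x["uses_computer_daily"] else -0.9

print(ensemble_predict([tree1, tree2], {"age": 10, "uses_computer_daily": True}))  # 2.9
```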

Knowing what XGBoost's model is, we need to express it precisely in mathematics, as follows:





$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}$$

Here K is the number of trees, 𝓕 is the space of all possible CART trees, and f_k is one specific CART tree. The model consists of K CART trees. Now that we have the model, the natural questions are: first, what are this model's parameters? After all, we know the "knowledge" lives in the parameters. Second, what objective function do we use to optimize those parameters?

Let’s first look at the second problem, the objective function of the model, as follows:





$$obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

This objective function also has two parts: the first is the loss function, and the second is the regularization term, here the sum of the regularization terms of the K trees. You may wonder what the regularization term of a tree even is; hold that curiosity for now, the answer is coming. Everything is abstract at this point, but don't worry, we will make it concrete piece by piece.

3. Train XGBoost

Above, we have obtained the XGBoost model and its objective function, so the task of training is to find the best parameter set by minimizing the objective function.

The question is where are the parameters?

It is natural to think that since the XGBoost model consists of CART trees, the parameters must live inside each CART tree. So for a single CART tree, what are its parameters? From the description of CART trees above, we know a CART tree is determined by two parts. The first is the structure of the tree, which maps a sample to one of the leaf nodes; it is essentially a function. The second is the score on each leaf node.

Now we seem to be in trouble. The leaf scores can reasonably be called parameters, but what about the structure of the tree? And we don't have just one tree, we have K of them!

Imagine for a moment that the structures of all K trees were already determined. Then all that would remain of the model is the values of the leaves of the K trees, the model's regularization term could be set to, say, the sum of squares of the leaf values, and the whole objective function would just be a function of those leaf values, which we could optimize with gradient descent or stochastic gradient descent. But the structures are not determined, so this approach doesn't work, and another must be found.

4. Additive training

So-called additive training is essentially a meta-algorithm that applies to any additive model; it is a heuristic. My other article on GBDT introduces it in detail, so I won't repeat it here; take a look if you're unfamiliar with it. With additive training, our goal is no longer to optimize the whole objective function directly, which we just argued is infeasible. Instead, we optimize it step by step: first the first tree, then the second, until all K trees are done. The whole process looks like this:





$$\begin{aligned} \hat{y}_i^{(0)} &= 0 \\ \hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i) \\ \hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i) \\ &\cdots \\ \hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \end{aligned}$$
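Read as code, this process is a loop that keeps a running prediction and adds one tree per step; a conceptual sketch (`fit_one_tree` is a hypothetical helper standing in for the optimization derived below):

```python
def additive_training(X, y, K):
    y_hat = np.zeros(len(y))            # y_hat^(0) = 0
    trees = []
    for t in range(K):
        # Hypothetical helper: returns the tree minimizing the step-t objective
        # given the current predictions.
        f_t = fit_one_tree(X, y, y_hat)
        trees.append(f_t)
        y_hat += f_t.predict(X)         # y_hat^(t) = y_hat^(t-1) + f_t(x)
    return trees
```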

At step t, we add an optimal CART tree f_t. How do we get this optimal tree? Very simple: it is the CART tree that, on top of the existing t-1 trees, makes the objective function smallest, as shown below:





$$obj^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t)}\big) + \sum_{k=1}^{t} \Omega(f_k) = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \mathrm{constant}$$

If we were to use the loss function MSE, the above expression would look like this:





$$obj^{(t)} = \sum_{i=1}^{n} \big(y_i - (\hat{y}_i^{(t-1)} + f_t(x_i))\big)^2 + \Omega(f_t) + \mathrm{constant} = \sum_{i=1}^{n} \Big[2\big(\hat{y}_i^{(t-1)} - y_i\big) f_t(x_i) + f_t(x_i)^2\Big] + \Omega(f_t) + \mathrm{constant}$$

This is a beautiful expression, because it contains both a first-order and a second-order term in f_t(x_i), and the coefficient of the first-order term is the residual. You might wonder why first- and second-order terms are beautiful: because they make the later optimization much easier, as you will see. Note: what is f_t(x_i)? It is simply the value of some leaf of f_t, and as mentioned earlier, leaf values can serve as model parameters.

However, for other loss functions we may not get such a beautiful form. So for a general loss function we take its second-order Taylor expansion, as shown below:





$$obj^{(t)} \approx \sum_{i=1}^{n} \Big[l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\Big] + \Omega(f_t) + \mathrm{constant}$$

where:





$$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big)$$

It's important to be clear about what g_i and h_i mean. How should we understand g_i? We already have t-1 trees, right? That t-1-tree model gives a prediction ŷ_i^(t-1) for the i-th training sample, right? That prediction generally differs from the sample's true label y_i, right? The difference can be measured by the loss function l(y_i, ŷ_i^(t-1)), right? Then g_i and h_i are simply the first and second derivatives of that loss with respect to the prediction ŷ_i^(t-1). Now you see what they mean, don't you?

A quick question: when optimizing the t-th tree, how many g_i and h_i do we have to compute? Right: N of each, where N is the number of training samples. With 100,000 samples, optimizing the t-th tree means computing 100,000 g_i and 100,000 h_i. Sounds like a lot of trouble, doesn't it? But think again: is there any dependence between those 100,000 g_i? Could they be computed in parallel? You're smart; you can already feel one reason why XGBoost is so hot!
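For a concrete loss, the g_i and h_i are simple elementwise formulas, and the vectorized form makes the "compute all N in parallel" point obvious; a sketch for the squared-error loss (names are mine):

```python
def grad_hess_squared_error(y, y_hat):
    # For l(y, y_hat) = (y - y_hat)^2:
    g = 2.0 * (y_hat - y)             # first derivative w.r.t. y_hat
    h = 2.0 * np.ones_like(y_hat)     # second derivative w.r.t. y_hat
    return g, h
```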

All right, now let's sort out what is constant and what is variable. The constant term at the end is, as you can guess, the regularization term of the first t-1 trees. l(y_i, ŷ_i^(t-1)) is also a constant term. That leaves three variable pieces: the first-order term of the t-th CART tree, its second-order term, and the regularization term of the whole tree. And again, what we call the first- and second-order terms "of a tree" are really terms in the value of some leaf node.

Our goal is to minimize this objective function, and the constant terms are obviously useless, so we can remove them and get something like this:





$$\sum_{i=1}^{n} \Big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\Big] + \Omega(f_t)$$

OK, now we can answer why first- and second-order terms are so beautiful. Their coefficients are g_i and h_i, the g_i and h_i can be computed in parallel, and they do not depend on the form of the loss function, as long as the loss is twice differentiable. What's the benefit? The benefit is that XGBoost supports custom loss functions: any twice-differentiable one you like. Pretty powerful, isn't it?
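This is exactly the hook XGBoost's Python package exposes: a custom objective is just a function that returns the per-sample gradients and Hessians. A sketch using the logistic loss from section 1 (the `obj=` callback shown here is the classic `xgb.train` signature; check the docs for your version):

```python
import numpy as np
import xgboost as xgb

def logistic_obj(preds, dtrain):
    # g_i and h_i of the logistic loss, one per training sample.
    y = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))
    grad = p - y            # first derivative w.r.t. the prediction
    hess = p * (1.0 - p)    # second derivative w.r.t. the prediction
    return grad, hess

# Usage (dtrain is an xgb.DMatrix you have built from your data):
# booster = xgb.train({"max_depth": 3}, dtrain, num_boost_round=10, obj=logistic_obj)
```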

5. The regularization term of the model

So far so good, but Ω(f_t) is still a bit hazy. Let's now define how to measure the regularization term of a tree. There is no single objective standard here. To proceed, we first describe a CART tree in a different way, as follows:







$$f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \qquad q: \mathbb{R}^{d} \to \{1, 2, \ldots, T\}$$

That is: a tree has T leaf nodes, and the values of these T leaves form a T-dimensional vector w. q(x) is a mapping that sends a sample to a number from 1 to T, i.e. to a leaf node; q(x) really represents the structure of the CART tree. And w_q(x) is, of course, the tree's prediction for sample x.

With this definition, XGBoost uses the following regularization term:





$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

Note the γ and λ: these are hyperparameters defined by XGBoost itself, and you can set their values when using it. Clearly, the larger γ is, the more a simple tree is preferred, because trees with more leaves are penalized more heavily. The larger λ is, the more the leaf values are pushed toward zero, which likewise favors simpler trees.
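Both knobs are ordinary XGBoost parameters; an illustrative configuration (values are arbitrary, `dtrain` assumed to be an `xgb.DMatrix` you have built):

```python
params = {
    "gamma": 1.0,                   # the γ above: penalty per leaf node
    "lambda": 1.0,                  # the λ above: L2 penalty on leaf values w_j
    "objective": "binary:logistic",
    "max_depth": 3,
}
# booster = xgb.train(params, dtrain, num_boost_round=100)
```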

Why does XGBoost choose this particular regularization term? Simple: because it works! Good results are the ultimate justification.

6. A moment of wonder

At this point, the optimization objective for the t-th tree is clear. Let's apply the following transformation to it. Keep your eyes open and concentrate:





$$obj^{(t)} \approx \sum_{i=1}^{n} \Big[g_i w_{q(x_i)} + \tfrac{1}{2} h_i w_{q(x_i)}^2\Big] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 = \sum_{j=1}^{T} \Big[\Big(\sum_{i \in I_j} g_i\Big) w_j + \tfrac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2\Big] + \gamma T$$

Stop here and really think it through. What does I_j stand for? It is a set whose elements are the indices of training samples, namely all the samples that the t-th CART tree assigns to the j-th leaf node. With that understood, the transformation above is nothing more than a change in the order of summation: instead of summing over samples, we sum over leaves, and within each leaf over the samples it contains. If it still feels difficult, feel free to leave a comment.

Further, we can simplify as follows:





$$obj^{(t)} = \sum_{j=1}^{T} \Big[G_j w_j + \tfrac{1}{2}(H_j + \lambda) w_j^2\Big] + \gamma T, \qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i$$

The meaning of G_j and H_j should be self-evident from the previous step.

For a fixed structure of the t-th CART tree (denoted q(x)), all the G_j and H_j are determined. Moreover, the leaf values w_j in the formula above are independent of each other, each appearing in a simple one-variable quadratic. So it is easy to compute the optimal value of each leaf, and the value of the objective function at that optimum, as follows:





$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad obj^{*} = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

What does obj* stand for? It measures how good the structure of the tree is: the smaller the value, the better the structure! In other words, it is a scoring criterion for the structure of the t-th CART tree. Note that it measures only the structure, independent of the leaf values. Why? Look closely at the derivation: obj* depends only on G_j, H_j, and T, and these in turn depend only on the tree's structure q(x), not on the leaf values. As shown below:





(Figure: 23.PNG — structure score calculation: the samples falling into each leaf determine that leaf's G_j and H_j, from which obj* is computed.)
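As a sketch of the two formulas in code (names are mine; `G` and `H` are arrays of the per-leaf sums defined above):

```python
def leaf_values_and_score(G, H, lam, gamma):
    # G, H: arrays holding G_j and H_j for the T leaves of a fixed structure q(x).
    w_star = -G / (H + lam)                                    # optimal leaf values
    obj_star = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * len(G)
    return w_star, obj_star
```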

7. Find the optimal tree structure

OK, now that we have a criterion for how good a tree's structure is, we can go looking for the optimal structure; once it is found, the optimal leaf values are already given by the formula above.

Here's the problem: there are essentially infinitely many tree structures, so scoring them one by one and picking the best is not practical. We still need a bit of strategy: learn the best tree structure gradually. This is the same idea as decomposing the K-tree model into learning one tree at a time, except now we go from one tree at a time to one level of nodes at a time. If that sounds confusing, don't worry; let's walk through the learning process. Take the earlier example of deciding whether someone likes computer games. The simplest structure is a single-node tree, and we can compute how good that single-node tree is via its obj*. Suppose we now want to split the single node by age. We need to know: 1. Is splitting on age effective at all, i.e. does it reduce the value of obj*? 2. If so, at which age value should we split?

To answer these two questions, we can sort the five family members by age, as shown below:





(Figure: 29.PNG — the five family members sorted by age; every gap between adjacent members is a candidate split point.)

By scanning this arrangement from left to right, we enumerate all candidate split points. For each candidate split, we measure its quality with the following criterion:





$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$

The Gain is essentially the obj* of the single node minus the combined obj* of the two nodes after the split. If the Gain is positive, and the larger it is, the more the post-split obj* is smaller than the single node's obj*, and the more worthwhile the split. Conversely, if the bracketed part of the Gain is smaller than γ, the Gain is negative, meaning obj* gets larger after the split. γ acts as a threshold here: the larger it is, the more reduction in obj* we demand before allowing a split. This value, too, can be set in XGBoost.
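The left-to-right scan maps neatly onto prefix sums over the samples sorted by the feature; a minimal sketch of scoring every candidate split with the Gain formula above (function and variable names are mine):

```python
def best_split(g, h, lam, gamma):
    # g, h: per-sample g_i and h_i, sorted by the candidate feature (e.g. age).
    G, H = g.sum(), h.sum()
    G_L, H_L = np.cumsum(g)[:-1], np.cumsum(h)[:-1]   # left-side totals at each split
    G_R, H_R = G - G_L, H - H_L                       # right-side totals
    gain = 0.5 * (G_L**2 / (H_L + lam)
                  + G_R**2 / (H_R + lam)
                  - G**2 / (H + lam)) - gamma
    return gain.argmax(), gain.max()                  # best split index and its Gain
```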

After the scan, we can decide whether to split. If we do split, we recursively apply the same splitting process to the two resulting nodes, and end up with a reasonably good tree structure.

Note: XGBoost's split is not quite the same as an ordinary decision tree's split. An ordinary decision tree ignores tree complexity while splitting and relies on subsequent pruning to control it. XGBoost accounts for tree complexity during the split itself, via that γ parameter, so it needs no separate pruning step.

You’re done

Once the optimal tree structure is found, the optimal leaf values follow easily. We have successfully found the t-th tree! Confetti!!