GBDT classification should not feel unfamiliar to most readers. This article explains the basic principles of GBDT classification and then walks you through implementing the algorithm, step by step and hand in hand.

For the complete implementation code, please refer to my GitHub:

https://github.com/tushushu/imylu/blob/master/imylu/ensemble/gbdt_base.py
https://github.com/tushushu/imylu/blob/master/imylu/ensemble/gbdt_classifier.py
https://github.com/tushushu/imylu/blob/master/examples/gbdt_classifier_example.py

I. Principles

Let's talk through GBDT classification in plain words rather than big mathematical formulas.

1.1 Review the old and learn the new

GBDT classification is only a small modification of GBDT regression, and GBDT regression in turn is built on top of the regression tree. I previously wrote an article on GBDT regression; the link is below:

GBDT regression principle and Python implementation

I also wrote an earlier article on regression trees, linked below:

Regression tree principle and Python implementation

1.2 The Sigmoid function

If you are familiar with logistic regression or neural networks, the Sigmoid function will be no stranger to you. Its expression is:

Sigmoid(x) = 1 / (1 + e^(-x))

It is not difficult to show that, writing y = Sigmoid(x), the derivative is

dy/dx = e^(-x) / (1 + e^(-x))^2 = y * (1 - y)

So the range of the Sigmoid function is (0, 1), and its derivative is y * (1 - y).
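
To make this concrete, here is a minimal Python version of the function (the project itself imports sigmoid from its utility module; the snippet below is only an illustration of the formula above, not that file):

from math import exp

def sigmoid(x):
    # Squashes any real number into the interval (0, 1).
    return 1 / (1 + exp(-x))

y = sigmoid(0.693)
print(round(y, 3))            # 0.667 -- this value reappears in the examples below
print(round(y * (1 - y), 3))  # 0.222 -- the derivative y * (1 - y) at that point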

1.3 Transformation of GBDT regression

Recall from the GBDT regression article: suppose we make m rounds of prediction, the overall prediction function is Fm, the initial constant and the per-round regression trees are f1, f2, ..., fm (f1 being the constant), and the input variable is X, so that Fm(X) = f1 + f2(X) + ... + fm(X).

Since this is a regression setup, the range of Fm is (-∞, +∞), while a binary classification problem requires the predicted value to lie in (0, 1). We can therefore use the Sigmoid function to squash the final prediction into (0, 1):

p = Sigmoid(Fm(X)) = 1 / (1 + e^(-Fm(X)))

1.4 Predicting whether to meet

For example, suppose we want to predict whether a blind date results in a meeting, using 1 for meeting and 0 for not meeting. From the regression tree article we know that if you have to use a single constant to predict someone's age, the mean is one of the best choices. So for a binary classification problem, how do we choose this constant? Let's reason it out briefly: with a constant raw score z, the predicted probability is Sigmoid(z), and minimizing the log loss over the training labels y gives Sigmoid(z) = mean(y), i.e. z = log(sum(y) / sum(1 - y)).

In conclusion, if you want to use a single constant to predict y, log(sum(y) / sum(1 - y)) is the best choice.
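
A quick numeric check of this conclusion, using the three toy labels that appear in the next section:

from math import log

y = [1, 0, 1]                                # whether each blind date met
z = log(sum(y) / sum(1 - yi for yi in y))    # log(2 / 1)
print(round(z, 3))                           # 0.693, the initial constant used below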

1.5 Residuals of whether to meet

Suppose the outcomes of the three blind dates are [1, 0, 1]. The initial value is z = log((1+0+1)/(0+1+0)) = 0.693, so we use 0.693 as the raw prediction for every blind date. Then Sigmoid([0.693, 0.693, 0.693]) = [0.667, 0.667, 0.667]. The residual for each blind date = actual outcome – predicted value = [1, 0, 1] – [0.667, 0.667, 0.667], so the residuals are [0.333, -0.667, 0.333].

1.6 Prediction of meeting residuals

To make the model more accurate, one idea is to make the residuals smaller. How do we reduce the residuals? We can fit a regression tree to the residuals and try to predict them. Suppose the tree's predicted residuals are [1, -0.5, 1]; we add them to the previous round's raw predictions and then take the Sigmoid: Sigmoid([0.693, 0.693, 0.693] + [1, -0.5, 1]) = [0.845, 0.548, 0.845], which is clearly closer to the true values [1, 0, 1]. The residuals then become [0.155, -0.548, 0.155], and the prediction accuracy has improved.

1.7 GBDT

Let's restate the whole process. The true values of whether the three blind dates meet are [1, 0, 1].

Round 1 prediction: Sigmoid([0.693, 0.693, 0.693]) = [0.667, 0.667, 0.667]

Round 1 residuals: [0.333, -0.667, 0.333]

Round 2 prediction: Sigmoid([0.693, 0.693, 0.693] + [1, -0.5, 1]) (adding the first regression tree) = Sigmoid([1.693, 0.193, 1.693]) = [0.845, 0.548, 0.845]

Round 2 residuals: [0.155, -0.548, 0.155]

Round 3 prediction: Sigmoid([0.693, 0.693, 0.693] + [1, -0.5, 1] + [2, -1, 2]) (adding the second regression tree) = Sigmoid([3.693, -0.807, 3.693]) = [0.976, 0.309, 0.976]

Round 3 residuals: [0.024, -0.309, 0.024]

It looks like the residuals are getting smaller and smaller, and that’s the GBDT algorithm.
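
The numbers above are easy to reproduce. The sketch below hard-codes the two toy regression trees as constant per-sample corrections (they are illustrative values from this example, not the output of a real fitting procedure, and the learning rate is omitted, as in the text):

from math import exp, log

def sigmoid(x):
    return 1 / (1 + exp(-x))

y = [1, 0, 1]                                         # true labels: meet or not
scores = [log(sum(y) / (len(y) - sum(y)))] * len(y)   # raw log-odds, all 0.693

# Round 1 uses only the initial constant; rounds 2 and 3 add the toy "trees".
corrections = [[0, 0, 0], [1, -0.5, 1], [2, -1, 2]]

for round_no, corr in enumerate(corrections, start=1):
    scores = [s + c for s, c in zip(scores, corr)]
    probs = [round(sigmoid(s), 3) for s in scores]
    residuals = [round(yi - p, 3) for yi, p in zip(y, probs)]
    print(round_no, probs, residuals)

# 1 [0.667, 0.667, 0.667] [0.333, -0.667, 0.333]
# 2 [0.845, 0.548, 0.845] [0.155, -0.548, 0.155]
# 3 [0.976, 0.309, 0.976] [0.024, -0.309, 0.024]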

1.8 Derivation of formula

By this point, I believe you have an intuitive feel for GBDT. But what is the theoretical basis — why do the residuals keep getting smaller? Warning: mathematical formulas ahead.

Therefore, we need to obtain the function fm from the predictions and residuals of round m-1, and use it to improve Fm. A regression tree predicts with the mean value of each optimally partitioned region; unlike GBDT regression, here the mean in each leaf is replaced by sum(residuals) / sum(y_hat * (1 - y_hat)), where y_hat is the previous round's predicted probability for the samples in that leaf (this is exactly what section 2.7 implements). So fm can use a regression tree as its base model: adding the initial value and the predictions of all the fitted regression trees, then taking the Sigmoid, gives the prediction of y.
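
The original equation images did not survive the conversion to text, so here is a compact plain-text reconstruction of the quantities involved; it summarizes the steps above under the standard log-loss setup (binary labels 0/1) rather than reproducing the missing images verbatim:

F0 = log(sum(y) / sum(1 - y))                                        (initial constant, section 1.4)
residual_i = y_i - Sigmoid(F_{m-1}(x_i))                             (what the m-th regression tree is fit to)
leaf value of f_m = sum(residual_i) / sum(y_hat_i * (1 - y_hat_i))   (y_hat_i: previous predicted probability)
F_m(x) = F_{m-1}(x) + lr * f_m(x)                                    (additive update, scaled by the learning rate lr)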

II. Implementation

I implemented the GBDT classification algorithm in Python, the simplest programming language in the universe, without relying on any third-party libraries, so it is easy to learn and use. Below is a brief walkthrough of the implementation; for more detailed comments, please refer to my GitHub code.

2.1 Import regression tree class

RegressionTree is a class I wrote previously and introduced in detail in an earlier article. The code is here:

https://github.com/tushushu/imylu/blob/master/imylu/tree/regression_tree.py

from ..tree.regression_tree import RegressionTree

2.2 Create GradientBoostingBase class

Initialization: store the regression trees, the learning rate, the initial predicted value, and the transformation function.

class GradientBoostingBase(object):
    def __init__(self):
        self.trees = None      # list of fitted regression trees
        self.lr = None         # learning rate
        self.init_val = None   # initial constant prediction (log-odds)
        self.fn = lambda x: sigmoid(x)  # transformation applied to raw scores

2.3 Calculate the initial predicted value

The initial predicted value is the constant derived in section 1.4: log(sum(y) / sum(1 - y)).

def _get_init_val(self, y):
    # Initial constant: log(sum(y) / sum(1 - y)), i.e. the log-odds of the positive class.
    n = len(y)
    y_sum = sum(y)
    return log(y_sum / (n - y_sum))

2.4 Matching leaf nodes

Find which leaf node of the regression tree a training sample falls into.

def _match_node(self, row, tree):
    nd = tree.root
    while nd.left and nd.right:
        if row[nd.feature] < nd.split:
            nd = nd.left
        else:
            nd = nd.right
    return nd

2.5 Obtaining leaf nodes

Gets all the leaves of a regression tree.

def _get_leaves(self, tree):
    nodes = []
    que = [tree.root]
    while que:
        node = que.pop(0)
        if node.left is None or node.right is None:
            nodes.append(node)
            continue
        left_node = node.left
        right_node = node.right
        que.append(left_node)
        que.append(right_node)
    return nodes

2.6 Region Division

Store, in a dictionary, the indices of the training samples that fall into each leaf node of the regression tree.

def _divide_regions(self, tree, nodes, X):
    regions = {node: [] for node in nodes}
    for i, row in enumerate(X):
        node = self._match_node(row, tree)
        regions[node].append(i)
    return regions

2.7 Calculate the predicted value

The value of each leaf is sum(residuals) / sum(y_hat * (1 - y_hat)) over the samples in that leaf, as discussed in section 1.8.

def _get_score(self, idxs, y_hat, residuals):
    numerator = denominator = 0
    for idx in idxs:
        numerator += residuals[idx]
        denominator += y_hat[idx] * (1 - y_hat[idx])
    return numerator / denominator

2.8 Update forecast value

Update the predicted value of each leaf node of the regression tree.

def _update_score(self, tree, X, y_hat, residuals):
    nodes = self._get_leaves(tree)
    regions = self._divide_regions(tree, nodes, X)
    for node, idxs in regions.items():
        node.score = self._get_score(idxs, y_hat, residuals)
    tree._get_rules()

2.9 Calculating residuals

def _get_residuals(self, y, y_hat):
    return [yi - self.fn(y_hat_i) for yi, y_hat_i in zip(y, y_hat)]

2.10 Training the model

The following points need to be noted when training the model:

  1. Control the maximum depth of each tree with max_depth;
  2. Control the minimum number of samples required to split a node with min_samples_split;
  3. Multiply each regression tree's prediction by a learning rate lr to prevent the model from overfitting;
  4. When subsampling the training data, sample with replacement.
def fit(self, X, y, n_estimators, lr, max_depth, min_samples_split, subsample=None):
    # Initial constant prediction (the log-odds of the positive class).
    self.init_val = self._get_init_val(y)

    n = len(y)
    y_hat = [self.init_val] * n
    residuals = self._get_residuals(y, y_hat)

    self.trees = []
    self.lr = lr
    for _ in range(n_estimators):
        # Optionally subsample the training set with replacement (random.choices).
        idx = range(n)
        if subsample is not None:
            k = int(subsample * n)
            idx = choices(population=idx, k=k)
        X_sub = [X[i] for i in idx]
        residuals_sub = [residuals[i] for i in idx]
        y_hat_sub = [y_hat[i] for i in idx]

        # Fit a regression tree to the current residuals.
        tree = RegressionTree()
        tree.fit(X_sub, residuals_sub, max_depth, min_samples_split)

        # Replace each leaf's plain mean with the leaf value from section 2.7.
        self._update_score(tree, X_sub, y_hat_sub, residuals_sub)

        # Update the raw predictions and residuals for the next round.
        y_hat = [y_hat_i + lr * res_hat_i for y_hat_i,
                    res_hat_i in zip(y_hat, tree.predict(X))]

        residuals = self._get_residuals(y, y_hat)
        self.trees.append(tree)

2.11 Predicting a single sample

def _predict(self, Xi):
    return self.fn(self.init_val + sum(self.lr * tree._predict(Xi) for tree in self.trees))

2.12 Predicting multiple samples

def predict(self, X, threshold=0.5):
    # Classify a sample as positive if its predicted probability reaches the threshold.
    return [int(self._predict(Xi) >= threshold) for Xi in X]
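
The article never shows the GradientBoostingClassifier used in the next section. In the structure above, all of the classification-specific behaviour (Sigmoid transform, log-odds initial value, thresholded predict) already lives in the base class, so a minimal subclass could be as simple as the sketch below — an illustrative assumption, not the repo's actual gbdt_classifier.py:

class GradientBoostingClassifier(GradientBoostingBase):
    # Minimal illustrative subclass: reuse everything from the base class.
    pass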

III. Effect evaluation

3.1 The main function

Use the well-known breast cancer dataset, split it into training and test sets at a 7:3 ratio, train the model, and compute the accuracy.

@run_time
def main():
    print("Testing the accuracy of GBDT classifier...")
    X, y = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20)
    clf = GradientBoostingClassifier()
    clf.fit(X_train, y_train, n_estimators=2, lr=0.8, max_depth=3, min_samples_split=2)
    get_acc(clf, X_test, y_test)

3.2 Effect Display

The final accuracy is 93.082% and the running time is 14.9 seconds. Not a bad result ~

3.3 Tool Functions

I wrote some custom utility functions, which can be viewed on GitHub:

https://github.com/tushushu/imylu/blob/master/imylu/utils.py
  1. run_time – measures a function's running time
  2. load_breast_cancer – loads the breast cancer dataset
  3. train_test_split – splits data into training and test sets
  4. get_acc – computes the accuracy
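
To give a feel for what two of these helpers do, here are minimal stand-in versions (hypothetical sketches written for this article; the real implementations live in utils.py at the link above and may differ):

import time

def run_time(fn):
    # Decorator that prints how long the wrapped function takes to run.
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        print("Total run time is %.2f s" % (time.time() - start))
        return result
    return wrapper

def get_acc(clf, X, y):
    # Fraction of test samples whose predicted label matches the true label.
    y_pred = clf.predict(X)
    acc = sum(int(p == t) for p, t in zip(y_pred, y)) / len(y)
    print("Test accuracy is %.3f%%" % (acc * 100))
    return acc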

Conclusion

GBDT classification principle: GBDT regression plus Sigmoid

Implementation of GBDT classification: a long story




About the author

Li Xiaowen: I have worked on data analysis and data mining, developing mainly in Python, and I am now an algorithm engineer at a small Internet company. GitHub: github.com/tushushu
