
**By** Wang Yixiong

This article uses easy-to-understand language and examples to explain three common decision tree algorithms, their advantages and disadvantages, and what a random forest is; it should help beginners genuinely understand the topic.

Decision trees

Introduction

The decision tree is a very common classification method in machine learning; it is arguably the most intuitive and easiest to understand of all the algorithms. Here is the simplest example:

A: Are you going to dinner?

B: I’ll go if you go.

“You go, I go” is the classic decision tree idea.

Here’s another example:

Someone asks to borrow money from me. To lend or not to lend? I will base my answer on three features: whether I have money, whether I need the money myself, and whether the other person has good credit.

Let's take a more general view: given some feature data, if we have such a decision tree, we can easily predict the conclusion for a new sample. So the question becomes how to find an appropriate decision tree, in other words, how to order these features.

Before ordering the features, note that when we split on a feature we naturally want the resulting groups of samples to be as pure as possible, that is, the samples in each branch node should belong to the same category as far as possible.

Therefore, when selecting the root node, we should choose the feature whose branch nodes have the highest purity. After the root node has been processed, we apply the same idea recursively to its branch nodes, and a tree takes shape. This is essentially the basic idea of a greedy algorithm. So how do we quantify purity? Entropy is, of course, our most common measure of purity. Its mathematical expression is as follows:
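$$H(X) = -\sum_{k=1}^{N} p_k \log_2 p_k$$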

where N is the number of possible values of the conclusion and $p_k$ is the probability of the k-th value, which for a sample set is simply the count of that value divided by the total count.

The lower the entropy, the purer the sample.

Let's plot the entropy of a two-point distribution of sample X (X = 0 or 1), with the probability P(X = 1) on the horizontal axis and the entropy on the vertical axis.

When P(X = 1) = 0, i.e. all the samples are 0, the entropy is 0.

When P(X = 1) = 1, i.e. all the samples are 1, the entropy is also 0.

When P(X = 1) = 0.5, i.e. the samples are half 0 and half 1, the entropy reaches its maximum.
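This curve is easy to reproduce; here is a minimal sketch using numpy and matplotlib (the same libraries as the code example at the end of this article):

```python
import numpy as np
import matplotlib.pyplot as plt

# Entropy of a two-point distribution as a function of p = P(X = 1)
p = np.linspace(0.001, 0.999, 200)
h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

plt.plot(p, h)
plt.xlabel('P(X = 1)')
plt.ylabel('Entropy')
plt.show()
```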

More generally, sample X can take n possible values (x1, …, xn). When P(xi) = 1/n for every i, i.e. the sample is perfectly uniform, the entropy reaches its maximum. When one P(xi) is 1 and all the others are 0, i.e. every sample takes the value xi, the entropy reaches its minimum of 0.

Decision tree algorithms

ID3

Suppose that in sample set X, a feature A can take the values (a1, a2, …, an). If we partition X by feature A, we get n branch nodes. As mentioned above, we want the samples in each branch node to be as pure as possible, that is, the total entropy of the branch nodes to be as low as possible.

Since each branch node contains a different number of samples, we should weight them when computing this "total entropy". Let the number of samples in the i-th branch node be W(ai); its weight among all samples is then W(ai)/W(X). This gives the total entropy:
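$$\sum_{i=1}^{n} \frac{W(a_i)}{W(X)}\, H(a_i)$$

where $H(a_i)$ is the entropy of the samples that fall into the $i$-th branch node.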

This formula says one thing: **the weighted sum of the entropy of each branch node.** The smaller this value, the higher the purity.

At this point we introduce a term called information gain, G(X, A), which measures how much information feature A brings to the sample set. The formula is:
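$$G(X, A) = H(X) - \sum_{i=1}^{n} \frac{W(a_i)}{W(X)}\, H(a_i)$$

i.e. the entropy of the whole sample set minus the weighted total entropy after splitting on A.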

We pick the feature with the maximum information gain as the splitting node, and build the tree recursively step by step.

Here is a simple example to illustrate the calculation of information gain.

Suppose we have 17 samples whose conclusion is either "yes" or "no", and we want the information gain of feature 1, which takes the values A, B and C.

First, the entropy H(X) of the whole sample set: there are 9 "yes" and 8 "no" samples, so H(X) = -(9/17)log2(9/17) - (8/17)log2(8/17) ≈ 0.998.

Next, feature 1 has 3 branch nodes A, B and C, containing 6, 6 and 5 samples respectively.

So the weight of A is 6/(6+6+5), the weight of B is 6/(6+6+5), and the weight of C is 5/(6+6+5).

Since we want the purity of the split nodes to be as high as possible, we calculate the entropy of nodes A, B and C separately:

Feature 1 = A: 3 yes and 3 no, so its entropy is -(1/2)log2(1/2) - (1/2)log2(1/2) = 1.0

Feature 1 = B: 2 yes and 4 no, so its entropy is -(1/3)log2(1/3) - (2/3)log2(2/3) ≈ 0.918

Feature 1 = C: 4 yes and 1 no, so its entropy is -(4/5)log2(4/5) - (1/5)log2(1/5) ≈ 0.722

So the total entropy of the branch nodes is (6/17)(1.0) + (6/17)(0.918) + (5/17)(0.722) ≈ 0.889.

The information gain of feature 1 is therefore 0.998 - 0.889 = 0.109.

Similarly, we can also calculate the information gain of other features, and finally take the feature with the maximum information gain as the root node.
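The numbers above can be checked with a few lines of Python (a minimal sketch that uses only the class counts from this example):

```python
from math import log2

def entropy(counts):
    """Entropy of a node, given the class counts (here: yes, no) in it."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class counts (yes, no) in each branch node of feature 1
branches = {'A': (3, 3), 'B': (2, 4), 'C': (4, 1)}

# Entropy of the whole sample set: 9 yes and 8 no in total
h_x = entropy((9, 8))

# Weighted total entropy of the branch nodes
n = sum(sum(c) for c in branches.values())
h_total = sum(sum(c) / n * entropy(c) for c in branches.values())

print(round(h_x, 3), round(h_total, 3), round(h_x - h_total, 3))
# 0.998 0.889 0.108  (0.109 if you subtract the rounded values)
```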

In practice this can also be derived via conditional entropy: G(X, A) = H(X) − H(X | A); readers who are interested can look into that formulation.

C4.5

The ID3 algorithm has an obvious problem.

If the sample set contains a (unique) feature such as an id or a name, things go wrong. Think about it: with n samples, the id feature splits the set into n parts, so there are n branch nodes, each holding only one sample, and the entropy of each node is 0. That means the total entropy of the branch nodes is 0, so this feature is guaranteed to have the maximum information gain. With ID3 as the decision tree algorithm, the root node would therefore always be the id feature, which is obviously unreasonable…

Of course, the above is an extreme case. In general, if a feature splits the samples too finely, that is also undesirable (in other words, information gain is biased towards features with many values). To solve this problem, the C4.5 algorithm uses the information gain ratio as the criterion for feature selection.

The so-called information gain ratio is simply the information gain divided by a quantity called the split information, which penalizes attributes with many values.

The split information is in fact the entropy H(A) of the feature's own values:
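$$H(A) = -\sum_{i=1}^{n} \frac{W(a_i)}{W(X)} \log_2 \frac{W(a_i)}{W(X)}, \qquad \text{GainRatio}(X, A) = \frac{G(X, A)}{H(A)}$$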

Why does this act as a penalty? Take the id example above. If id splits n samples into n parts, then each value of the id feature occurs with probability 1/n. As stated in the introduction of this article, entropy is maximal when the values are perfectly uniform.

So in this case, although the id feature has the maximum information gain, its penalty term, the split information, is also maximal, which pulls its gain ratio back down. This is the idea behind C4.5.

CART

Ultimately, the goal of a decision tree is to find a quantitative criterion for the purity of a sample set. The CART decision tree uses the Gini index as its measure. Intuitively, the Gini coefficient is the probability that two samples drawn at random from the set belong to different classes; the purer the set, the smaller this probability.

Suppose the samples fall into K categories, and feature A of the samples takes n values. Within one node, the probability of picking a sample of category k and then a sample of a different category is:
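$$p_k (1 - p_k)$$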

Summing this probability over the K categories gives the Gini coefficient:
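$$\text{Gini}(X) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$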

The Gini index of feature A, in turn, is the weighted sum of the Gini coefficients of its n branch nodes:
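$$\text{Gini\_index}(X, A) = \sum_{i=1}^{n} \frac{W(a_i)}{W(X)}\, \text{Gini}(a_i)$$

where $\text{Gini}(a_i)$ is the Gini coefficient of the samples in the $i$-th branch node.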

After computing these, we choose the feature with the lowest Gini index as the optimal splitting feature.

Pruning

Pruning is the most important means by which a decision tree guards against overfitting. In order to classify the training samples as well as possible, a decision tree keeps growing; sometimes it learns the training samples so well that idiosyncrasies of those samples are treated as general properties. At that point we need to actively remove some branches to reduce the risk of overfitting.

There are two kinds of pruning: pre-pruning and post-pruning.

Pre-pruning

By default, a tree keeps growing until every node is 100% pure. Since that tends to overfit, we don't have to let it grow all the way; instead we set termination conditions in advance so that growth stops early. This is called pre-pruning, and it happens before the decision tree has been fully generated.

Common pre-pruning measures are:

1. Limit the depth of the tree

2. Stop splitting a node when the number of samples in it falls below a threshold

3. Set a threshold on the entropy of a node

And so on. A minimal sketch of how such conditions look in practice follows below.
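For example, scikit-learn's DecisionTreeClassifier exposes pre-pruning conditions of exactly this kind as constructor parameters (a minimal sketch; the iris dataset is just a placeholder):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # placeholder data set

# Each parameter is a pre-pruning condition: the tree stops splitting
# as soon as a condition would be violated.
clf = DecisionTreeClassifier(
    max_depth=4,                 # 1. limit the depth of the tree
    min_samples_split=10,        # 2. don't split nodes with fewer than 10 samples
    min_impurity_decrease=0.01,  # 3. require a minimum drop in impurity (entropy/Gini)
)
clf.fit(X, y)
```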

Post-pruning

As the name implies, this kind of pruning happens after the decision tree has been built. There are many post-pruning algorithms, some of them quite involved; here we only look at the idea of a simple one.

Reduced-Error Pruning (REP)

This method treats every node in the tree as a pruning candidate, and certain conditions decide whether it actually gets pruned. The usual steps are:

1. Delete the node's subtree, turning the node into a leaf.

2. Assign to this leaf the class most associated with it (the majority class of its training samples).

3. Check the accuracy on a validation set and compare it with the accuracy before the change.

If the accuracy does not get worse, the subtree really is deleted. This is repeated from the bottom up, and the process removes the "bad" branches. A minimal sketch of the idea is shown below.
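Here is a minimal sketch of reduced-error pruning on a toy tree structure; the Node class, the predict helper and the (sample, label) validation-data format are all hypothetical, just for illustration:

```python
class Node:
    """A toy decision-tree node: internal nodes split on a feature, leaves carry a label."""
    def __init__(self, feature=None, children=None, label=None):
        self.feature = feature          # feature this node splits on (None for a leaf)
        self.children = children or {}  # feature value -> child Node
        self.label = label              # majority class of the training samples at this node

def predict(node, sample):
    while node.feature is not None:
        child = node.children.get(sample[node.feature])
        if child is None:               # unseen feature value: fall back to the majority class
            return node.label
        node = child
    return node.label

def accuracy(root, val_data):
    return sum(predict(root, x) == y for x, y in val_data) / len(val_data)

def rep_prune(node, root, val_data):
    """Bottom-up reduced-error pruning: turn a node into a leaf (its majority
    class) and keep the change only if validation accuracy does not drop."""
    if node.feature is None:            # already a leaf
        return
    for child in node.children.values():
        rep_prune(child, root, val_data)
    before = accuracy(root, val_data)
    saved = (node.feature, node.children)
    node.feature, node.children = None, {}      # tentatively prune
    if accuracy(root, val_data) < before:       # pruning hurt accuracy: undo it
        node.feature, node.children = saved
```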

Random forests

Strictly speaking, the theory of random forests is not tied to the decision tree itself; the decision tree merely serves as the base algorithm inside it.

Why introduce random forests at all? From the same data set we can only build one decision tree, which is rather limited on its own. What about combining several learners?

This is where the concept of ensemble learning comes in.

In ensemble learning, each individual learner (weak learner) runs its own algorithm, and these algorithms may be the same or different. If they are all the same, we call the ensemble homogeneous; if not, it is heterogeneous.

A random forest is a special case of ensemble learning built on the bagging strategy.

In bagging, each individual learner's training set is obtained by random sampling. Drawing n random sample sets, we can independently train n individual learners and then combine their outputs through a combination strategy. These n individual learners are independent of one another and can be trained in parallel.

Boosting is another kind of ensemble learning, in which the individual learners are strongly dependent on one another.

The sampling method used by random forests is generally bootstrap sampling: from the original sample set we repeatedly draw one sample at random, add it to the new set, and then put it back, so the same sample may be drawn again. After a certain number of draws we obtain one sample set. Because the sampling is random, each sample set differs from the original set and from the other sample sets, so the resulting individual learners differ as well.
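A minimal numpy sketch of bootstrap sampling (sampling with replacement), assuming X and y are arrays holding the original samples and labels:

```python
import numpy as np

def bootstrap_sample(X, y, rng=None):
    """Draw len(X) samples with replacement from (X, y)."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(X), size=len(X))  # indices may repeat
    return X[idx], y[idx]

# e.g. build 5 different training sets for 5 individual learners:
# sample_sets = [bootstrap_sample(X, y) for _ in range(5)]
```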

The main remaining question for a random forest is how to combine the n results. There are a few main strategies:

Weighted average method:

Averaging is usually used for regression. Each learner hi is given a preset weight wi,

Then the final output is:
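$$H(x) = \sum_{i=1}^{n} w_i\, h_i(x), \qquad w_i \ge 0,\ \sum_{i=1}^{n} w_i = 1$$

where $h_i(x)$ is the output of the $i$-th individual learner.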

When every learner's weight is 1/n, the averaging method is called simple averaging.

Voting:

Voting works just like voting in everyday life, with each learner carrying the same weight.

There is absolute majority voting, where a class must receive more than half of the votes, and relative majority (plurality) voting, where the class with the most votes simply wins, i.e. the minority defers to the majority.

If weights are used, the principle is still majority rule, but each vote is counted with its learner's weight.

Example

Finally, let's fit a simple quadratic function to see how a decision tree is used in code.

The training data are 100 random real numbers and their squares; trees of different depths yield different fitted curves.

The test inputs are evenly spaced points on the same interval, and tree models of different depths produce different predictions, as the figure drawn by the code below shows.

The code for this image is as follows:

You will need the numpy, matplotlib and sklearn libraries installed.

#! /usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor


if __name__ == "__main__":
    
    # Prepare training data
    N = 100
    x = np.random.rand(N) * 6 - 3
    x.sort()
    y = x*x
    x = x.reshape(-1, 1)

    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False

    # Decision tree depth and its curve color
    depth = [2, 4, 6, 8, 10]
    clr = 'rgbmy'
    
    # the actual value
    plt.figure(facecolor='w')
    plt.plot(x, y, 'ro', ms=5, mec='k', label='Actual value')
    
    # Prepare test data
    x_test = np.linspace(-3, 3, 50).reshape(-1, 1)
    
    # Build a decision tree
    dtr = DecisionTreeRegressor()
    # Loop the model of decision tree at different depths and test the output of data
    for d, c in zip(depth, clr):
        # Set maximum depth (pre-pruning)
        dtr.set_params(max_depth=d)
        # Training decision tree
        dtr.fit(x, y)
        # Predict the test data with the model trained above
        y_hat = dtr.predict(x_test)
        # Draw the curve of the model
        plt.plot(x_test, y_hat, '-', color=c, linewidth=2, markeredgecolor='k', label='Depth=%d' % d)
    # Some basic parameters for drawing
    plt.legend(loc='upper center', fontsize=12)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.grid(True, ls=':', color='#606060')
    plt.title('Quadratic decision tree', fontsize=15)
    plt.tight_layout(pad=2)
    plt.show()



This article was published on the Tencent Cloud technology community with the author's permission. Please credit the source if you reproduce it.

Original link: https://www.qcloud.com/community/article/160232