A decision tree is a supervised learning method that builds a tree from features and labels, learning decision rules over the features in order to solve classification and regression problems.

Making a decision with a decision tree means starting at the root node, testing the corresponding feature attribute of the item to be classified, choosing the outgoing branch according to its value, and repeating this until a leaf node is reached; the category stored in that leaf node is the decision result.

The decision tree consists of the following three elements:

  • Root node: contains the complete set of samples
  • Internal node: tests corresponding feature attributes
  • Leaf node: decision result (label)

How does a decision tree distinguish good melons from bad ones?

(Figure: a hand-drawn whiteboard version of the watermelon decision tree from Zhou Zhihua's "watermelon book".)

Take the watermelon tree above as an example: how can we tell whether a melon is a good one? Suppose a melon has these feature values: clear texture, slightly curled root, and green color. With the decision tree already built, we can immediately decide whether this is a good melon or a bad one.

The procedure is as follows:

  1. At the root node, test the texture. Since the texture is clear, follow the corresponding branch down to the next layer.
  2. At the second layer, test the root. It is slightly curled, so follow that branch down.
  3. At the third layer, test the color. It is green, so we reach a leaf node and conclude that this is a good melon.

The example above shows intuitively how the category of an instance is determined. Once the value of each feature is given, the decision process is equivalent to traversing the tree from the root node to a leaf node, and the branch taken at each step is determined by the instance's value for the feature tested at that node.
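A minimal Python sketch of this traversal, using a hand-built tree that mirrors the watermelon example; the dict structure, feature names, and branch values are made up for illustration and are not the original figure's exact tree:

```python
# A hand-built tree: internal nodes test a feature, leaves store the label.
watermelon_tree = {
    "feature": "texture",
    "branches": {
        "clear": {
            "feature": "root",
            "branches": {
                "slightly curled": {
                    "feature": "color",
                    "branches": {
                        "green": {"label": "good melon"},
                        "pale": {"label": "bad melon"},
                    },
                },
                "stiff": {"label": "bad melon"},
            },
        },
        "blurry": {"label": "bad melon"},
    },
}

def classify(tree, sample):
    """Walk from the root to a leaf, following the branch chosen by the sample's feature value."""
    node = tree
    while "label" not in node:              # internal node: keep testing features
        value = sample[node["feature"]]     # the sample's value for the tested feature
        node = node["branches"][value]      # follow the matching branch
    return node["label"]                    # leaf node: return the stored decision

sample = {"texture": "clear", "root": "slightly curled", "color": "green"}
print(classify(watermelon_tree, sample))    # -> good melon
```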

So, looking at the tree above, a natural question arises: why is texture the root node, rather than root or some other feature?

What criterion does a decision tree use to select features, and how is the tree actually built?

3 steps of decision tree learning

To answer these questions and build a high-quality decision tree, decision tree learning consists of 3 main steps:

Feature selection, decision tree generation, and decision tree pruning.

Feature selection

Feature selection means choosing the optimal splitting attribute: from the features of the current data, one feature is selected as the splitting criterion for the current node. As the splitting proceeds, we want the samples contained in each branch node to belong to the same class as far as possible, that is, the "purity" of the nodes should keep increasing. Different criteria for choosing the optimal splitting feature lead to different decision tree algorithms. The common criteria are:

  • Information entropy
  • Information gain
  • Information gain ratio
  • Gini index

A decision tree makes decisions based on a tree structure and can be used for classification or regression; common algorithms include ID3, C4.5, and CART. The tree is usually built by recursively selecting the optimal feature and splitting the training data according to that feature, so that each sub-dataset gets the best possible decision process, moving toward higher-purity subsets as quickly as possible.

Decision tree generation

  Generation algorithm    Splitting criterion
  ID3                     Information gain
  C4.5                    Information gain ratio
  CART                    Gini index

ID3

ID3 selects the feature with the maximum information gain to split the current node, and builds the decision tree recursively until all features have been used for splitting.

Information gain measures how much the uncertainty about the class Y is reduced once the value of feature X is known. It is the difference between the entropy before the split and the conditional entropy after the split: the larger the difference, the larger the gain. Since the entropy before the split is fixed, the smaller the remaining entropy after the split, the larger the resulting information gain.
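Written out explicitly (standard notation, not taken from the original text): let $D$ be the data set with $K$ classes, $|C_k|$ the number of samples of class $k$, and let feature $A$ split $D$ into subsets $D_1, \dots, D_n$. Then

$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}, \qquad H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i), \qquad g(D, A) = H(D) - H(D \mid A)$$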

When making a split, we always want to move toward the subsets with the highest purity. Why does higher purity matter? The purer the subsets, the better the split separates the classes, and the more useful the chosen feature is for classification.

The specific procedure

  1. Starting from the root node, compute the information gain of every candidate feature for that node, and select the feature with the maximum information gain as the node's splitting feature (a minimal sketch of this computation follows the list).
  2. Create a child node for each distinct value of the selected feature.
  3. Recursively apply the same procedure to the child nodes to build the decision tree.
  4. Stop when the information gain of every remaining feature is small or there are no features left to select; this gives the final decision tree.
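A minimal Python sketch of step 1, the information gain computation used for feature selection; the toy data and feature names are made up for illustration, and this is not the original author's code:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(D): entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(samples, labels, feature):
    """g(D, A) = H(D) - H(D | A) for one feature, where each sample is a dict."""
    before = entropy(labels)
    groups = {}                                  # group labels by the feature's value
    for sample, label in zip(samples, labels):
        groups.setdefault(sample[feature], []).append(label)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return before - after

# Toy data, just to show the selection step.
samples = [
    {"texture": "clear",  "root": "curled"},
    {"texture": "clear",  "root": "stiff"},
    {"texture": "blurry", "root": "curled"},
    {"texture": "blurry", "root": "stiff"},
]
labels = ["good", "good", "bad", "bad"]

# ID3 picks the feature with the largest information gain as the split.
best = max(samples[0], key=lambda f: information_gain(samples, labels, f))
print(best)  # -> texture
```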

Information gain is biased toward features with more values

Information gain works well for discrete features with few values, but performs worse for features with many values.

As for why information gain is biased toward features with many values, I think the reason is as follows: when a feature has many values, the subsets obtained by splitting on it are more likely to be pure (compared with features that have few values), so the entropy after the split is lower. Since the entropy before the split is fixed, the information gain is larger, and the criterion therefore favors features with more values.

An extreme example makes this easier to see: if feature A assigns every sample to its own node (for example, an ID or numbering feature), the conditional entropy term becomes 0, which means the information gain reaches its maximum. The ID3 algorithm would then certainly choose feature A, even though we know that such a feature is obviously not a good choice.
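Worked out in the notation above (a brief check using the standard formulas, not taken from the original text): if feature $A$ puts each of the $|D|$ samples into its own subset $D_i$ with $|D_i| = 1$, then every $H(D_i) = 0$, so

$$H(D \mid A) = \sum_{i=1}^{|D|} \frac{1}{|D|} H(D_i) = 0 \quad\Longrightarrow\quad g(D, A) = H(D) - H(D \mid A) = H(D),$$

which is the largest value the gain can take.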

So why does the information gain ratio improve this situation? First, let's look at how it is calculated:

$$g_R(D, A) = \frac{g(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

Here $H_A(D)$ is also called the intrinsic (split) information of feature A: it measures the uncertainty of splitting data set D by the different values of feature A. The more values feature A has, the greater this uncertainty usually is, so $H_A(D)$ will be larger. Dividing by it is equivalent to multiplying the information gain by a penalty factor $\frac{1}{H_A(D)}$, which counteracts the bias toward features with many values.

The characteristics of ID3

  • Selects the feature with the largest information gain to build the tree; information gain is biased toward features with many values (this is also why C4.5 adopts the information gain ratio).
  • Only covers tree generation, so it is prone to overfitting.
  • Can only handle discrete features.
  • Cannot handle missing values.

C4.5

C4.5 improves on ID3: it uses the information gain ratio for feature selection and constructs the decision tree recursively.

C4.5 characteristics

  • Improves on ID3 by using the information gain ratio for feature selection
  • Can handle both continuous and discrete features
  • Can handle missing values
  • Prunes the constructed decision tree

CART

CART (Classification And Regression Tree) can be used for both classification and regression. In this part, we first focus on generating the classification tree.

CART classification tree

Different from ID3 and C4.5, CART assumes the decision tree is a binary tree: the value of an internal node's feature test is either "yes" or "no", the left branch is the "yes" branch, and the right branch is the "no" branch.

Such a decision tree is equivalent to recursively dichotomizing each feature, dividing the input space (that is, the feature space) into a finite number of units.

The CART classification tree uses the Gini index to select the optimal feature and its optimal split point (the Gini index itself is defined after the procedure below). The specific process is as follows:

  1. Starting from the root node, compute the Gini index of each existing feature for that node. For each feature A and each of its possible values a, split the sample points into two parts according to whether the test A = a gives "yes" or "no", and compute the Gini index of the split.
  2. Among all possible features A and all their possible values a, select the feature and value with the smallest Gini index as the optimal feature and the optimal split point. Split the data set into two parts accordingly and generate two child nodes.
  3. Recursively apply the above steps to the two child nodes until the number of samples in a node falls below a threshold, or the Gini index of the sample set falls below a threshold, or no more features remain.
  4. The result is the CART classification tree.
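For reference, the Gini index used in step 1 has the standard definition (with $p_k$ the proportion of class $k$ in $D$, and $D_1, D_2$ the two parts produced by the test $A = a$; this is the usual CART formulation, not copied from the original text):

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2, \qquad \mathrm{Gini}(D, A = a) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$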

CART regression tree

A regression tree is a decision tree model that can be used for regression. It corresponds to a partition of the input space (that is, the feature space) together with an output value on each partition unit.

Different from the classification tree, the regression tree uses a heuristic method to divide the input space. It traverses all input variables and finds the optimal splitting variable j and the optimal split point s; in other words, it selects the j-th feature x^(j) and a value s of that feature to divide the input space into two parts, and then repeats the operation.

How do we find the optimal j and s? By comparing the errors of different splits. The error of a split on an input-space unit is measured by the squared error between the true values and the predicted value of that unit, i.e.

$$\sum_{x_i \in R_m} (y_i - f(x_i))^2$$

where $f(x_i)$ is the predicted value of the unit $R_m$, namely the mean of the $y_i$ of the sample points in that unit.

Then j and s can be solved from the following formula:

$$\min_{j,\,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\right]$$

where $R_1(j, s) = \{x \mid x^{(j)} \le s\}$ and $R_2(j, s) = \{x \mid x^{(j)} > s\}$ are the two regions produced by the split, and $c_1, c_2$ are their respective output (mean) values.

Generation procedure

  1. Select the optimal splitting variable j and the optimal split point s: traverse all features and, for each fixed feature, scan all of its values, finding the pair (j, s) that minimizes the formula above (a brute-force sketch of this search follows the list).
  2. Split the region with the selected (j, s) and determine the predicted value (the mean) of each resulting region.
  3. Repeat the above steps on both sub-regions until the stopping condition is met.
  4. The result is the regression tree.
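A minimal numpy sketch of step 1, an exhaustive search for the best (j, s); the toy data and names are made up for illustration, and this is not the author's code:

```python
import numpy as np

def best_split(X, y):
    """Brute-force search for the (j, s) minimizing the squared error of the two regions."""
    n_samples, n_features = X.shape
    best = (None, None, np.inf)                  # (j, s, error)
    for j in range(n_features):                  # traverse all features
        for s in np.unique(X[:, j]):             # scan all values of the fixed feature
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # Each region predicts its mean; the error is the sum of squared residuals.
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best

# Toy one-dimensional data: the best split should separate the two level groups.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
print(best_split(X, y))   # -> (0, 3.0, small residual error)
```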

Decision tree pruning

Pruning is used to avoid overfitting of the decision tree model. During decision tree learning, nodes are split again and again in order to classify the training samples as correctly as possible, which can produce a tree with too many branches and therefore overfit.

There are two types of decision tree pruning: pre-pruning and post-pruning

Pre-pruning

Pre-pruning evaluates each node before it is split during tree construction. If splitting the current node cannot improve the generalization performance of the decision tree, the node is not split and is marked as a leaf node.

Post-pruning

Post-pruning first constructs the full decision tree and then examines the non-leaf nodes from the bottom up. If replacing the subtree rooted at a node with a leaf node improves generalization performance, that subtree is replaced with a leaf node.
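As a practical sketch (this uses scikit-learn, which implements cost-complexity pruning as its form of post-pruning, and stopping constraints such as max_depth as pre-pruning-style controls; it is not necessarily the exact procedure described above, and the data set is just an example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning style: stop splitting early via constraints such as max_depth.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning style: grow the full tree, then apply cost-complexity pruning (ccp_alpha).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=max(alpha, 0.0))
    tree.fit(X_train, y_train)
    score = tree.score(X_test, y_test)    # a separate validation set would normally be used here
    if score >= best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```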