[size=0.8]1. Introduction to decision trees


[size=0.8] A decision tree is a basic classification and regression method. Its learning usually consists of three steps: feature selection, decision tree generation, and decision tree pruning.


[size=0.8] The essence of decision tree learning is to induce a set of classification rules from the training data set, which can equivalently be viewed as estimating a conditional probability model from that data. The loss function is usually a regularized maximum likelihood function, and the learning strategy is to minimize this loss function.


[size=0.8] The decision tree learning algorithm recursively selects the optimal feature and splits the data according to that feature. This process corresponds to the construction of the decision tree and the partition of the feature space. If each subset after a split can be classified essentially correctly, a leaf node is built; otherwise the recursive partitioning continues.


[size=0.8] The resulting decision tree may be overfitted, so pruning is needed: proceeding from the bottom up, overly subdivided nodes are cut away so that their samples fall back to the parent node.


[size=0.8] Root node: the starting node of the tree, containing all the training samples


[size=0.8] Non-leaf node (decision point): represents a test condition, i.e. a test on one of the data attributes


[size=0.8] Leaf node: represents the class label assigned after classification


[size=0.8] Branch: represents an outcome of the test










[size=0.8] First build the decision tree (training stage), then use the decision tree (classification stage).


[size=0.8]2. Decision tree construction


[size=0.8] Criteria for feature selection: information gain or information gain ratio. We choose the partition with the greatest information gain.


[size=0.8] A. Entropy and Gini coefficient


[size=0.8] Entropy: the degree of disorder or uncertainty in a set. The greater the entropy, the more mixed the set; conversely, the lower the entropy, the higher the internal purity of the set.






[size=0.8] According to the entropy formula H(X) = -Σ p_i·log2(p_i), when the probability p_i of a class is small, the absolute value of log2(p_i) (a negative number) is large. It follows that when a set contains many categories (i.e. it is highly mixed), each category has a small probability, which results in a high entropy value for the set.




[size=0.8] We want the entropy of the nodes to decrease as quickly as possible as the tree grows deeper. Choosing splits so that the entropy drops as fast as possible (this is what motivates information gain and the information gain ratio) yields a decision tree of the lowest possible height.
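[size=0.8] A minimal Python sketch of the entropy calculation described above (the function name and the yes/no labels are only illustrative):
[code]
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H = -sum(p_i * log2(p_i)) over the class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

# A pure set has entropy 0; an evenly mixed set has the highest entropy.
print(entropy(["yes"] * 4))               # 0.0
print(entropy(["yes", "no"] * 2))         # 1.0
print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.940, the value used in the example below
[/code]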






[size=0.8] (Example: the classic 14-sample weather data set with the attributes outlook, temperature, humidity, windy and the label play, 9 yes / 5 no.)






[size=0.8] First, determine the root node: take outlook, temperature, humidity, and windy in turn as the candidate root node and calculate the corresponding entropy after splitting on each of them;


[size=0.8] Then, the information gain (or information gain ratio) of each candidate is obtained by comparing with the original entropy, and the feature with the largest gain is chosen as the root node.


[size=0.8] Next, the other nodes of the decision tree are constructed recursively in the same way.


[size=0.8] Finally, after the construction of the decision tree is completed, an evaluation function (objective function) is used to judge the quality of the resulting tree.






[size=0.8] Calculate the entropy when Outlook is assumed to be the root node


[size=0.8] outlook = sunny (2 yes / 3 no): entropy = -(2/5)·log2(2/5) - (3/5)·log2(3/5) ≈ 0.971


[size=0.8] outlook = overcast (4 yes / 0 no): entropy = 0


[size=0.8] outlook = rainy (3 yes / 2 no): entropy = -(3/5)·log2(3/5) - (2/5)·log2(2/5) ≈ 0.971


[size=0.8] According to the data, the probability of outlook being sunny, overcast, or rainy is 5/14, 4/14, and 5/14 respectively.


[size=0.8] Entropy after splitting on outlook = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 ≈ 0.693


[size=0.8] The entropy of the system before the split is 0.940, so the information gain is:


[size=0.8] gain(outlook) = 0.940 - 0.693 = 0.247


[size=0.8] In the same way, the information gains of the other features are:


[size=0.8] gain(temperature) = 0.029


[size=0.8] gain(humidity) = 0.152


[size=0.8] gain(windy) = 0.048


[size=0.8] In summary, gain(outlook) is the largest (that is, outlook reduces the information entropy of the system fastest in the first step), so outlook is chosen as the root node of the decision tree.
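[size=0.8] A rough Python sketch of the information-gain calculation above; the outlook/play lists simply encode the per-value class counts implied by the entropies quoted (2/3, 4/0, 3/2):
[code]
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """gain = H(labels) - sum over feature values v of P(v) * H(labels | v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [lab for fv, lab in zip(feature_values, labels) if fv == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# outlook column and play label with the per-value class counts used above:
# sunny 2 yes / 3 no, overcast 4 yes / 0 no, rainy 3 yes / 2 no (14 samples in total)
outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5
play = ["yes"] * 2 + ["no"] * 3 + ["yes"] * 4 + ["yes"] * 3 + ["no"] * 2

print(round(information_gain(outlook, play), 3))  # 0.247, matching gain(outlook)
[/code]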


[size=0.8]ID3 algorithm: Information gain (traditional algorithm)


[size=0.8]– Usually, judging by information gain alone is unreliable


[size=0.8]– Disadvantage: when a feature has very many distinct values and each value covers only a few samples (e.g. an ID-like attribute), splitting on it makes the entropy decrease fastest, yet such a feature is often irrelevant to the actual problem. This shortcoming is what leads to the information gain ratio.


[size=0.8]C4.5 algorithm: information gain ratio


[size=0.8]– Gain ratio = information gain / split information, where the split information is the entropy of the feature's own value distribution


[size=0.8]– C4.5 is an extension of ID3
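[size=0.8] A small sketch of the gain-ratio idea behind C4.5, using an ID-like feature with one distinct value per sample to show how the split information penalizes it (all names and data here are illustrative):
[code]
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """C4.5-style criterion: information gain divided by the split information
    (the entropy of the feature's own value distribution)."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [lab for fv, lab in zip(feature_values, labels) if fv == v]
        remainder += len(subset) / n * entropy(subset)
    gain = entropy(labels) - remainder
    split_info = entropy(feature_values)  # large when the feature has many sparse values
    return gain / split_info if split_info > 0 else 0.0

# An ID-like feature (one distinct value per sample) has a near-perfect information
# gain of ~0.940, but its split information log2(14) ~ 3.81 shrinks the ratio sharply.
play = ["yes"] * 9 + ["no"] * 5
row_id = [str(i) for i in range(len(play))]
print(round(gain_ratio(row_id, play), 3))  # ~0.247 instead of ~0.940
[/code]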


[size=0.8]CART algorithm: Gini coefficient, Gini(p) = 1 - Σ_k p_k² (0 for a pure node, larger for a more mixed node)


[size=0.8] Evaluation function of a tree: C(T) = Σ_t N_t·H(t), summed over all leaf nodes t


[size=0.8]– The smaller the better, similar to a loss function


[size=0.8]– N_t: the number of samples in leaf node t


[size=0.8]– H(t): the entropy of leaf node t


[size=0.8] After the construction of the decision tree is completed, the evaluation function (objective function) is used to judge the quality of the current decision tree.
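[size=0.8] A rough sketch of the Gini index and of the evaluation function C(T) = Σ N_t·H(t) described above; the two hypothetical trees are only illustrative:
[code]
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum(p_k^2); 0 for a pure node, larger for a mixed node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def evaluate_tree(leaves, impurity=entropy):
    """Evaluation function C(T) = sum over leaves of N_t * H(t); smaller is better."""
    return sum(len(leaf) * impurity(leaf) for leaf in leaves)

# Two hypothetical trees over the same 14 samples (9 yes / 5 no):
tree_a = [["yes"] * 4, ["yes"] * 5 + ["no"], ["no"] * 4]       # fairly pure leaves
tree_b = [["yes"] * 5 + ["no"] * 2, ["yes"] * 4 + ["no"] * 3]  # more mixed leaves

print(evaluate_tree(tree_a), evaluate_tree(tree_b))              # entropy-based C(T)
print(evaluate_tree(tree_a, gini), evaluate_tree(tree_b, gini))  # Gini-based, same ordering
[/code]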


[size=0.8] Decision trees can handle both discrete and continuous attributes. Continuous attributes are first discretized: the candidate split points are compared by their information gain, and the attribute's value range is divided into intervals accordingly.
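[size=0.8] A minimal sketch of how a split point for a continuous attribute can be chosen by comparing the information gain of candidate thresholds (the temperature values and labels are made up for illustration):
[code]
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try the midpoints between consecutive sorted values as candidate split
    points and keep the one with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best[1]:
            best = (t, gain)
    return best  # (threshold, information gain)

temperature = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
play        = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(best_threshold(temperature, play))
[/code]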


[size=0.8] Decision tree pruning


[size=0.8] If the decision tree is grown to fit every training sample, the resulting tree becomes too large. Although such a tree fits the training data 100%, it also learns noise and erroneous points, because it takes all of the data into account and splits it into ever smaller fragments.


[size=0.8] The main cause of overfitting is that the constructed decision tree is too complex.


[size=0.8] Decision tree pruning is achieved by minimizing the loss function or cost function of the whole decision tree.


[size=0.8] Pre-pruning: stop early during the process of building the decision tree, for example by limiting:


[size=0.8]– minimum number of samples (a node containing too few samples is not split further)


[size=0.8]– maximum depth
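[size=0.8] In scikit-learn, for example, these pre-pruning conditions correspond to constructor parameters of DecisionTreeClassifier; the iris data set here is only a stand-in:
[code]
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing the tree early via a maximum depth and a minimum
# number of samples required before a node may be split further.
clf = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=0)
clf.fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())  # the grown tree respects both limits
[/code]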


[size=0.8] Post-pruning: pruning begins after the decision tree has been fully built.


[size=0.8]– Pruning criterion (loss function): Cα(T) = C(T) + α·|Tleaf|


[size=0.8]– The greater the number of leaves, the greater the loss


[size=0.8]– α: set manually; the larger its value, the greater the weight given to the number of leaf nodes, and the smaller its value, the less the number of leaf nodes influences the loss


[size=0.8]– |Tleaf|: the number of leaf nodes of the tree
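[size=0.8] A rough sketch of the pruning criterion Cα(T) = C(T) + α·|Tleaf|; the leaf contents and the α values are illustrative. (scikit-learn exposes a related mechanism through the ccp_alpha parameter for cost-complexity pruning.)
[code]
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def pruning_cost(leaves, alpha):
    """C_alpha(T) = sum(N_t * H(t)) + alpha * |T_leaf|: the fit term plus a
    penalty on the number of leaves, so bigger trees pay a higher price."""
    fit = sum(len(leaf) * entropy(leaf) for leaf in leaves)
    return fit + alpha * len(leaves)

# A fully grown tree with many pure leaves vs. the same 14 samples merged
# into fewer, more mixed leaves after pruning.
deep    = [["yes"] * 2, ["yes"] * 3, ["no"] * 2, ["yes"] * 4, ["no"] * 3]
shallow = [["yes"] * 5 + ["no"] * 2, ["yes"] * 4 + ["no"] * 3]

for alpha in (0.0, 5.0):
    print(alpha, round(pruning_cost(deep, alpha), 2), round(pruning_cost(shallow, alpha), 2))
# With alpha = 0 the deep tree has the lower cost; with alpha = 5 the pruned tree wins.
[/code]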


[size=0.8]5. Random forest


[size=0.8] Sampling mode


[size=0.8]Bootstrapping: random sampling with replacement


[size=0.8]Bagging (bootstrap aggregating): draw n samples with replacement from the training set


[size=0.8] For a test sample, feeding it into the different decision trees of the random forest yields different predictions. For a classification problem, the final class is obtained by taking the mode of these predictions; for a regression problem, the final value is obtained by averaging them.
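[size=0.8] A tiny sketch of the aggregation step described above, with made-up per-tree predictions:
[code]
from statistics import mean, mode

# Hypothetical predictions from the individual trees of a random forest
# for a single test sample.
tree_class_votes = ["yes", "no", "yes", "yes", "no", "yes"]
tree_regression_outputs = [3.1, 2.8, 3.4, 3.0, 2.9]

print(mode(tree_class_votes))         # classification: majority vote -> "yes"
print(mean(tree_regression_outputs))  # regression: average of the outputs (~3.04)
[/code]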


[size=0.8] Double randomness


[size=0.8] Randomness of sample selection: many different decision trees are built from random samples drawn with replacement from the training data set


[size=0.8]– Purpose: so that some of the decision trees will not select the abnormal (noisy) samples


[size=0.8] Randomness of feature selection: each tree of the random forest is built on a random subset of the features, sampled without replacement (a feature is not repeated within a tree)


[size=0.8]– Purpose: so that some of the decision trees will not select the abnormal (irrelevant) features
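[size=0.8] A minimal sketch of the "double randomness": each tree receives a bootstrap sample of the rows (with replacement) and a random subset of the features (without replacement). The data, feature names, and sizes are illustrative:
[code]
import random

def draw_training_view(n_rows, feature_names, n_features, seed):
    """One tree's view of the data: bootstrap the row indices with replacement
    and pick a feature subset without replacement."""
    rng = random.Random(seed)
    row_idx = [rng.randrange(n_rows) for _ in range(n_rows)]  # rows: with replacement
    feats = rng.sample(feature_names, n_features)             # features: no repetition
    return row_idx, feats

features = ["outlook", "temperature", "humidity", "windy"]

for tree_id in range(3):
    rows, feats = draw_training_view(n_rows=14, feature_names=features,
                                     n_features=2, seed=tree_id)
    print(f"tree {tree_id}: rows {sorted(rows)}, features {feats}")
[/code]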