Abstract: Decision tree pruning strategies, pre-pruning and post-pruning, address the overfitting problem.

This article is shared from Huawei Cloud Community “A Brief Analysis of the Growth and Pruning of Decision Trees”, originally written by Chengxiaoli.

A decision tree is a decision-analysis method: given the known probabilities of various scenarios, it builds a tree of decisions to estimate the probability that the expected net present value (NPV) is greater than or equal to zero, evaluate project risk, and judge feasibility. It is an intuitive, graphical way of applying probability analysis, and it is called a decision tree because the diagram branches like a tree. In machine learning, a decision tree is a predictive model that represents a mapping from object attributes to object values; it is a form of supervised learning.

1. Decision tree model

First of all, what is a decision tree? A decision tree is a flowchart-like tree structure: each internal node (branch node) represents a test on a feature or attribute, and each leaf node represents a class.

The main problem during the growth of a decision tree is that choosing branch nodes is otherwise highly subjective. The solution is to replace subjective judgment with information entropy or information gain: compute the entropy or gain of each candidate feature, rank the features, and split on the best one.

Information gain: the change in entropy before and after a data set is partitioned on a feature.

Entropy: in physics, a measure of how uniformly energy is distributed in a system. Information entropy: a measure of the uncertainty of information, H(x) = -sum(p * log(p)). The smaller the information entropy, the smaller the uncertainty, the greater the certainty, and the higher the purity. H(D) is the entropy of data set D, calculated as

H(D) = -sum_k (C_k / N) * log(C_k / N)

Here C_k is the number of samples in D belonging to class k, N is the total number of samples, and K is the number of classes. H(D|A) is the conditional entropy of data set D given feature A; it describes the distribution of the label Y within each subset D_i produced by splitting D on A, and is calculated as

H(D|A) = sum_i (|D_i| / |D|) * H(D_i)

Gain(A) (the information gain of feature A) = H(D) (the overall information entropy) - H(D|A) (the entropy remaining when node A is used to split). Branch-node selection in the decision tree: choose the feature with the largest information gain, because a larger gain means a smaller entropy after the split, less uncertainty, greater certainty, and higher purity. Putting the formulas together, the information gain is

Gain(D, A) = H(D) - H(D|A)
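
The formulas above translate directly into code. The following is a minimal Python sketch (not from the original article) of entropy, conditional entropy, and information gain; the feature and label values in the toy example are purely illustrative, and base-2 logarithms are assumed.

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_k (C_k / N) * log(C_k / N), using log base 2."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """H(D|A) = sum_i (|D_i| / |D|) * H(D_i), where D_i groups samples by the value of A."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    return sum((len(ys) / n) * entropy(ys) for ys in groups.values())

def information_gain(feature_values, labels):
    """Gain(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)

# Toy example: an 'outlook' feature against play / no-play labels (illustrative data).
outlook = ["sunny", "sunny", "overcast", "rain", "rain"]
play    = ["no",    "no",    "yes",      "yes",  "no"]
print(information_gain(outlook, play))
```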

The information gain ratio Gr(D, A) of feature A with respect to training set D is defined as

Gr(D, A) = Gain(D, A) / HA(D), where HA(D) = -sum_i (|D_i| / |D|) * log(|D_i| / |D|)
HA(D), the split information, measures how finely feature A divides the training set D. The gain ratio corrects a shortcoming of information gain, which is biased toward features with many distinct values; splitting on the gain ratio instead of the raw gain penalizes such features.
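
Continuing the previous sketch (and reusing its `information_gain` helper, which is an assumption of this snippet rather than part of the article), the gain ratio can be computed as follows:

```python
import math
from collections import Counter

def split_information(feature_values):
    """HA(D) = -sum_i (|D_i| / |D|) * log(|D_i| / |D|): entropy of the feature's own value distribution."""
    n = len(feature_values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(feature_values).values())

def gain_ratio(feature_values, labels):
    """Gr(D, A) = Gain(D, A) / HA(D); guard against a zero denominator."""
    si = split_information(feature_values)
    return 0.0 if si == 0 else information_gain(feature_values, labels) / si
```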

The algorithms above: the ID3 algorithm splits on information gain, and the C4.5 algorithm splits on the information gain ratio. Decision tree pruning strategies, pre-pruning and post-pruning, address the overfitting problem.

2. ID3 and C4.5 partitioning policies

The splitting idea of the ID3 and C4.5 algorithms: select the branch nodes of the decision tree according to information gain or information gain ratio, and recursively build the tree from the top down.

The basic steps of decision tree construction:

(1) If all attributes have already been used for partitioning, stop directly;

(2) Calculate the information gain or information gain ratio of all remaining features, and select the feature with the largest value (say, node A) to split on;

(3) If node A is used as the partition node and the partition is not yet complete, continue building the tree with the remaining features (excluding A) that have the largest information gain. (Recursive tree construction; a code sketch follows the stopping conditions below.)

Conditions for the decision tree to stop growing:

  • If all attributes have been used for partitioning, stop; any node that is still mixed is labeled by majority vote;
  • If all samples at a node belong to the same class, stop directly;
  • A maximum allowed impurity is defined as a threshold;
  • A maximum number of leaf nodes is defined;
  • A minimum number of samples a branch node must contain is defined.
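
The sketch below combines the construction steps and the first two stopping conditions into a recursive ID3-style builder. It is a minimal illustration, not the article's implementation: it reuses the `information_gain` helper and `Counter` import from the earlier sketches, and represents each sample as a dict mapping feature name to value.

```python
from collections import Counter

def majority_vote(labels):
    """Label a leaf by the most frequent class."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, features):
    """Recursively build an ID3-style tree.

    rows: list of dicts (feature name -> value); labels: class per row;
    features: feature names still available for splitting.
    """
    # Stop: all samples at this node share one class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left, so label the leaf by majority vote.
    if not features:
        return majority_vote(labels)
    # Select the feature with the largest information gain (ID3 criterion).
    best = max(features, key=lambda f: information_gain([r[f] for r in rows], labels))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        remaining = [f for f in features if f != best]
        tree[best][value] = build_tree(sub_rows, sub_labels, remaining)
    return tree
```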

3. Decision tree pruning

A decision tree grown by taking every data point fully into account becomes a complex tree and may overfit; the more complex the tree, the more severe the overfitting. Construction is a recursive, layer-by-layer process, so stopping conditions must be specified, otherwise the process never terminates and the tree keeps growing.

Pre-pruning: ends the growth of the decision tree early. It reduces the risk of overfitting and the time cost of training and testing the tree, but carries a risk of underfitting.

Post-pruning: prunes the tree after it has finished growing, e.g. minimal error pruning (MEP), pessimistic error pruning (PEP), and cost-complexity pruning (CCP). Post-pruned trees tend to generalize better than pre-pruned trees, but their training time is much larger than that of unpruned or pre-pruned trees.
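
As a concrete illustration of both strategies (an assumption of this article's companion sketch, using scikit-learn rather than anything described in the original text), pre-pruning corresponds to growth limits such as `max_depth` and `min_samples_leaf`, while post-pruning corresponds to cost-complexity pruning via `ccp_alpha`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with depth and sample-count limits.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path, then refit with a chosen alpha.
# In practice alpha should be selected on a validation set; [-2] is only for illustration.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0)
post.fit(X_train, y_train)

print("pre-pruned accuracy:", pre.score(X_test, y_test))
print("post-pruned accuracy:", post.score(X_test, y_test))
```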

Conclusion:

The advantages of classification with a decision tree are that it is intuitive, easy to understand, and efficient to execute: the tree is built once and can be used repeatedly. However, it is most effective on small-scale data sets; it handles continuous variables poorly, makes predicting continuous fields difficult, and its error grows quickly as the number of classes increases.

