In machine learning, deep learning is dominated by neural networks, while shallow learning still belongs to the realm of tree models. On the one hand, although deep learning is strong on large-scale problems, its performance on small-scale problems is not satisfactory. On the other hand, ensemble tree models (RF, GBDT, XGBoost, etc.) offer high interpretability, easy parameter tuning, fast training, and almost no need for feature engineering. They clearly beat deep learning on small and medium-sized datasets, and on "heterogeneous data" (such as age, income and city in risk-control scenarios) ensemble tree models outperform deep learning even on large datasets. In practice, Facebook, Alibaba and other large companies use GBDT combined with LR as the technical backbone for CTR estimation, advertising recommendation and other important businesses, and XGBoost has repeatedly shone in Kaggle competitions in recent years.

This article introduces the basic decision tree algorithms ID3, C4.5 and CART. If you want to learn about ensemble tree models (random forest, gradient boosting trees, etc.), you can refer to my other two posts:

  • Ensemble learning: Bagging methods and Random Forest
  • Ensemble learning: Boosting methods, AdaBoost and GBDT



ID3 decision tree

ID3 is a classical early algorithm. Based on information theory, it uses information gain as the criterion for feature selection: at each split it selects the feature with the largest information gain, searching the space of possible decision trees greedily from the top down.

Information gain

  • Information entropy is one of the most commonly used measures of the purity of a sample set. Suppose the proportion of class-k samples in the current sample set D is p_k (k = 1, 2, ..., |Y|). Then the information entropy of D is defined as:

\operatorname{Ent}(D)=-\sum_{k=1}^{|\mathcal{Y}|} p_{k} \log _{2} p_{k}
  • The smaller the value of Ent(D), the higher the certainty and purity of the set D. Suppose the discrete attribute a has V possible values {a^1, a^2, ..., a^V}. If a is used to split the sample set D, it produces V branch nodes, where the v-th branch node contains all samples in D whose value on attribute a is a^v, denoted D^v. We compute the information entropy of each D^v and, since the branch nodes contain different numbers of samples, weight each branch node by |D^v|/|D|, so that branch nodes with more samples have greater influence. The "information gain" obtained by splitting the sample set D on attribute a can then be calculated as:

\operatorname{Gain}(D, a)=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right)
  • Core idea:

    Generally speaking, the greater the information gain, the greater the "improvement in certainty" (or "improvement in purity") obtained by splitting on attribute a. The ID3 decision tree learning algorithm selects the splitting attribute at each branch based on information gain, so branches closer to the root of the decision tree have a stronger ability to improve certainty.
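Below is a minimal Python sketch of these two formulas, assuming the data sits in a pandas DataFrame with a label column named "label" (the column names are illustrative, not from the original text):

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Ent(D) = -sum_k p_k * log2(p_k)."""
    probs = labels.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def information_gain(df: pd.DataFrame, attribute: str, target: str = "label") -> float:
    """Gain(D, a) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v)."""
    base = entropy(df[target])
    weighted = sum(
        len(sub) / len(df) * entropy(sub[target])
        for _, sub in df.groupby(attribute)
    )
    return base - weighted

# ID3's greedy choice: pick the attribute with the largest information gain, e.g.
# best = max(candidate_attributes, key=lambda a: information_gain(df, a))
```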

Growing the tree

  • Taking the watermelon dataset above as an example (classifying watermelons as good or bad by their appearance), there are 17 training samples in total. At the start of decision tree learning, the root node contains all the samples in D, where positive examples account for p1 = 8/17 and negative examples for p2 = 9/17. The information entropy of the root node can therefore be calculated from the formula:

\operatorname{Ent}(D)=-\sum_{k=1}^{2} p_{k} \log _{2} p_{k}=-\left(\frac{8}{17} \log _{2} \frac{8}{17}+\frac{9}{17} \log _{2} \frac{9}{17}\right)=0.998
  • We then calculate the information gain of each attribute in the current attribute set {color, root, knock sound, texture, navel, touch}. Take the attribute "color" as an example; it has three possible values: {green, black, light white}. Splitting D on this attribute gives three subsets: D^1 (color = green), D^2 (color = black) and D^3 (color = light white). Subset D^1 contains the 6 samples numbered {1, 4, 6, 10, 13, 17}, with positive proportion p1 = 3/6 and negative proportion p2 = 3/6; D^2 contains the 6 samples numbered {2, 3, 7, 8, 9, 15}, with p1 = 4/6 and p2 = 2/6; D^3 contains the 5 samples numbered {5, 11, 12, 14, 16}, with p1 = 1/5 and p2 = 4/5. According to the formula, the information entropy of the three branch nodes obtained by splitting on "color" is:

\begin{array}{l} \operatorname{Ent}\left(D^{1}\right)=-\left(\frac{3}{6} \log _{2} \frac{3}{6}+\frac{3}{6} \log _{2} \frac{3}{6}\right)=1.000 \\ \operatorname{Ent}\left(D^{2}\right)=-\left(\frac{4}{6} \log _{2} \frac{4}{6}+\frac{2}{6} \log _{2} \frac{2}{6}\right)=0.918 \\ \operatorname{Ent}\left(D^{3}\right)=-\left(\frac{1}{5} \log _{2} \frac{1}{5}+\frac{4}{5} \log _{2} \frac{4}{5}\right)=0.722 \end{array}
  • Therefore, according to the formula, the information gain of attribute "color" is:

\begin{aligned} \operatorname{Gain}(D, \text{color}) &=\operatorname{Ent}(D)-\sum_{v=1}^{3} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right) \\ &=0.998-\left(\frac{6}{17} \times 1.000+\frac{6}{17} \times 0.918+\frac{5}{17} \times 0.722\right) \\ &=0.109 \end{aligned}
  • Similarly, we can calculate the information gain of other attributes:

\begin{array}{l} \operatorname{Gain}(D, \text{root})=0.143; \quad \operatorname{Gain}(D, \text{knock sound})=0.141 \\ \operatorname{Gain}(D, \text{texture})=0.381; \quad \operatorname{Gain}(D, \text{navel})=0.289 \\ \operatorname{Gain}(D, \text{touch})=0.006 \end{array}
  • Obviously, the attribute "texture" has the greatest information gain, so it is selected as the splitting attribute. The following figure shows the result of splitting the root node on "texture"; the sample set contained in each branch node is shown in the node:

  • Next, the decision tree algorithm will further divide each branch node. Note that “texture” will no longer be used as a candidate partition attribute.

  • Take the first branch node (texture = clear) in the figure above as an example. The sample set D^1 contained in this node consists of the 9 samples numbered {1, 2, 3, 4, 5, 6, 8, 10, 15}, and the available attribute set is {color, root, knock sound, navel, touch}. The information gain of each attribute is calculated on D^1:


    \begin{array}{l} \operatorname{Gain}\left(D^{1}, \text{color}\right)=0.043; \quad \operatorname{Gain}\left(D^{1}, \text{root}\right)=0.458 \\ \operatorname{Gain}\left(D^{1}, \text{knock sound}\right)=0.331; \quad \operatorname{Gain}\left(D^{1}, \text{navel}\right)=0.458 \\ \operatorname{Gain}\left(D^{1}, \text{touch}\right)=0.458 \end{array}
  • The three attributes "root", "navel" and "touch" all achieve the maximum information gain, so any one of them can be selected as the splitting attribute. The same operation is then performed on each branch node, and the final decision tree is obtained as follows:

Termination of splitting

The above describes how an ID3 decision tree is generated. So, as the tree's nodes keep splitting, when does the splitting terminate? Splitting is usually terminated when a node meets one of the following three conditions (a minimal recursive sketch follows the list):

  1. When all the samples contained in this node belong to the same category, the splitting of this node is terminated (even if there are still attributes not divided at this node), it is marked as a leaf node, and its category is set as the category of the samples below it.
  2. When there are no undivided attributes at this node, or all samples contained in this node have the same values in the remaining undivided attributes, the splitting of this node is terminated, it is marked as a leaf node, and its category is set as the category containing the most samples at this node.
  3. When the sample set contained in this node is empty, the node splitting is terminated, marked as a leaf node, and its category is set as the category containing the most samples in its parent node.
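As a rough illustration, here is a minimal recursive ID3 sketch that implements the three stopping conditions above. It reuses the entropy/information gain helpers sketched earlier, and the dict-based node representation is my own choice, not part of the original article:

```python
def id3(df, attributes, target="label", parent_majority=None):
    # Condition 3: empty sample set -> leaf labelled with the parent's majority class.
    # (With groupby below, empty subsets only arise if you iterate over *all*
    # possible attribute values rather than only the observed ones.)
    if len(df) == 0:
        return {"leaf": parent_majority}
    majority = df[target].mode().iloc[0]
    # Condition 1: all samples belong to one class -> leaf with that class.
    if df[target].nunique() == 1:
        return {"leaf": df[target].iloc[0]}
    # Condition 2: no attributes left, or all remaining attribute values identical.
    if not attributes or all(df[a].nunique() == 1 for a in attributes):
        return {"leaf": majority}
    best = max(attributes, key=lambda a: information_gain(df, a, target))
    node = {"attribute": best, "children": {}}
    rest = [a for a in attributes if a != best]
    for value, sub in df.groupby(best):
        node["children"][value] = id3(sub, rest, target, parent_majority=majority)
    return node
```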


Shortcomings of ID3

  • No pruning strategy, easy to overfit;
  • The information gain criterion prefers features with many values; the information gain of a feature like "number" is close to 1;
  • It cannot handle continuous features;
  • It does not consider missing values.






C4.5 decision tree

C4.5 is an improved version of ID3, proposed by the author of the ID3 algorithm. Its main improvements are:

  1. The gain ratio is introduced as the criterion for attribute selection
  2. A pruning strategy is introduced
  3. Continuous features are discretized
  4. Missing values are handled

Gain ratio

  • In the preceding ID3 decision tree example, we deliberately ignored the "number" column of Table 4.1. If "number" were also taken as a candidate splitting attribute, its information gain computed from the formula would be 0.998, far higher than that of any other candidate attribute. This is because splitting on "number" produces 17 branches, each containing only one sample; a single sample of course corresponds to a single class, so the "certainty" or "purity" becomes very high. However, such a decision tree obviously has no generalization ability and cannot effectively predict new samples. In fact, the information gain criterion prefers attributes with a large number of values. To reduce the possible adverse effect of this preference, C4.5 does not use information gain directly, but uses the "gain ratio" to select the optimal splitting attribute:


    \operatorname{Gain}_{\operatorname{ratio}}(D, a)=\frac{\operatorname{Gain}(D, a)}{\operatorname{IV}(a)}
  • where


\operatorname{IV}(a)=-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \log _{2} \frac{\left|D^{v}\right|}{|D|}
  • According to this formula, the more possible values attribute a has (i.e., the larger V is), the larger IV(a) tends to be. For example, for watermelon dataset 2.0 in Table 4.1 above, IV(touch) = 0.874 (V = 2), IV(color) = 1.580 (V = 3), and IV(number) = 4.088 (V = 17).

  • Note that the gain-ratio criterion prefers attributes with a smaller number of values. Therefore, the C4.5 algorithm does not directly select the candidate splitting attribute with the largest gain ratio; instead it uses a heuristic: first find the attributes whose information gain is above average among the candidate splitting attributes, then select the one with the highest gain ratio among them.
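A small sketch of the gain ratio and this heuristic, reusing the entropy/information gain helpers above (again with illustrative column names):

```python
import numpy as np

def intrinsic_value(df, attribute) -> float:
    """IV(a) = -sum_v |D_v|/|D| * log2(|D_v|/|D|)."""
    ratios = df[attribute].value_counts(normalize=True)
    return float(-(ratios * np.log2(ratios)).sum())

def gain_ratio(df, attribute, target="label") -> float:
    iv = intrinsic_value(df, attribute)
    return information_gain(df, attribute, target) / iv if iv > 0 else 0.0

def c45_select(df, attributes, target="label"):
    """C4.5 heuristic: keep attributes whose gain is above average,
    then pick the one with the highest gain ratio among them."""
    gains = {a: information_gain(df, a, target) for a in attributes}
    avg = sum(gains.values()) / len(gains)
    candidates = [a for a, g in gains.items() if g >= avg]
    return max(candidates, key=lambda a: gain_ratio(df, a, target))
```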

Pruning process

  • Pruning is the C4.5 algorithm's way of dealing with overfitting. In ID3, in order to classify the training samples as correctly as possible, node splitting can become overly complex, sometimes producing a decision tree with too many branches. The training samples may be learned "too well", so that some peculiarities of the training set are taken as general properties of all data, resulting in overfitting. Actively removing some branches therefore reduces the risk of overfitting.

  • To enable pruning, the data set should be split before the decision tree is grown: one part becomes the "training set" used to train the decision tree, and the other part becomes the "validation set" used to evaluate the generalization ability of the decision tree. Taking the previous watermelon data set as an example, the split data set is as follows:

  • The unpruned decision tree generated is as follows:

  • Pruning is usually divided into two kinds, "pre-pruning" and "post-pruning":

    • Pre-pruning

      • Pre-pruning happens while the decision tree is being grown: before each node is split, the generalization performance of the node before and after the split is evaluated. If the prediction accuracy of the decision tree on the validation set after the split is lower than before the split, that is, if the split cannot improve the generalization ability of the decision tree, then the split is abandoned and the node is marked as a leaf node, whose class is the class with the largest number of training samples at that node.
      • Here is the decision tree with pre-pruning:

      • Advantages: pre-pruning not only reduces the risk of overfitting but also greatly reduces training time.
      • Disadvantages: pre-pruning is based on a "greedy" strategy and carries the risk of under-fitting.
    • Post-pruning

      • Post-pruning happens after the complete decision tree has been grown: the non-leaf nodes are examined from the bottom up, and if replacing the subtree rooted at a node with a leaf (labelled with the majority class of its training samples) does not reduce the generalization ability of the decision tree, i.e. the accuracy on the validation set does not fall, then the subtree is cut off and the node is marked as a leaf node.

      • The following figure shows the decision tree after the introduction of pruning:

      • Advantages: compared with pre-pruning, post-pruning not only effectively prevents overfitting but also retains more branches of the decision tree, so it is less prone to underfitting.
      • Disadvantages: it greatly increases model training time.

Continuous value processing

  • C4.5 introduces continuous value processing by discretizing continuous features. Suppose a continuous feature a has m distinct values among the training samples. C4.5 sorts them and takes the midpoint of each pair of adjacent values, giving a total of m − 1 candidate split points. The information gain of each of the m − 1 candidate points as a binary split point is computed, and the point with the maximum information gain is selected as the binary split point of the continuous feature (a small code sketch follows the worked example below).

  • Take the following data set for example:

  • For the attribute "density", at the beginning of decision tree learning the 17 training samples contained in the root node take 17 different continuous values on this attribute. According to the method above, this attribute has 16 candidate split points: T_density = {0.244, 0.294, 0.351, 0.381, 0.420, 0.459, 0.518, 0.574, 0.600, 0.621, 0.636, 0.648, 0.661, 0.681, 0.708, 0.746}. The maximum information gain of the attribute "density" is 0.262, attained at the split point 0.381. Similarly, "sugar content" is discretized, and finally the decision tree is obtained as follows:
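Here is the promised sketch of the continuous-value handling: candidate thresholds are the midpoints of adjacent sorted values and the one with the highest information gain is kept. It reuses the entropy helper above; the column name is illustrative:

```python
import numpy as np

def best_threshold(df, attribute, target="label"):
    values = np.sort(df[attribute].unique())
    thresholds = (values[:-1] + values[1:]) / 2          # the m - 1 candidate points
    base = entropy(df[target])
    best_t, best_gain = None, -1.0
    for t in thresholds:
        left, right = df[df[attribute] <= t], df[df[attribute] > t]
        gain = base - (len(left) / len(df) * entropy(left[target])
                       + len(right) / len(df) * entropy(right[target]))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# On the watermelon data above, best_threshold(df, "density") should give roughly
# (0.381, 0.262) if the helpers match the definitions in the text.
```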

Missing value handling

  • C4.5 introduces missing value handling. Incomplete samples are often encountered in real data, and simply discarding them and learning only from samples with no missing values clearly wastes a great deal of information. To handle data sets with missing values, two problems need to be solved:

    1. How to select partition attributes in case of missing attribute values?
    2. For a specific attribute node, if the value of a sample on the attribute is missing, how to divide the sample?
  • For question 1, C4.5's approach is: for a feature with missing values, compute the information gain using the subset of samples that have no missing value on that feature, and multiply the result by the proportion of such samples. For question 2, C4.5's approach is: send the sample into all child nodes with different weights, i.e., assign the sample to each child node with a probability equal to the proportion of non-missing samples that fall into that branch.
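A rough sketch of the first rule (the gain is computed on the non-missing samples and scaled by their proportion); the fractional routing of rule 2 is only indicated in a comment, since it requires carrying a per-sample weight through the tree:

```python
def gain_with_missing(df, attribute, target="label"):
    known = df[df[attribute].notna()]
    rho = len(known) / len(df)                 # proportion of non-missing samples
    return rho * information_gain(known, attribute, target)

# Rule 2 (routing a sample whose value is missing): instead of sending it down one
# branch, send it down every branch with weight |D_v| / |D_known| for that branch.
```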


Shortcomings of C4.5

Although the C4.5 algorithm is a great improvement over ID3, it still has the following shortcomings:

  • Pruning strategies can be optimized
  • It uses multi-way trees, while binary trees are more efficient
  • It can only be used for classification, not regression
  • The entropy model it uses involves a lot of time-consuming logarithmic operations






CART decision tree

  • CART stands for "Classification And Regression Tree". It is a powerful algorithm, and the famous GBDT ensemble algorithm is built on top of it.

  • CART improves on C4.5 in several ways:

    1. Its structure is a binary tree, which is faster than a multi-way tree
    2. It can do both classification and regression
    3. For classification it uses the Gini index as the splitting criterion, which requires far less computation than logarithmic operations and is much faster
    4. It prunes based on cost complexity, with better results

Gini index

  • For classification, CART introduces the Gini index in place of information entropy. The Gini index is approximately a first-order Taylor expansion of the entropy, making it a more efficient splitting criterion. It is computed as follows:

\begin{aligned} \operatorname{Gini}(D) &=\sum_{k=1}^{K} \frac{\left|C_{k}\right|}{|D|}\left(1-\frac{\left|C_{k}\right|}{|D|}\right) \\ &=1-\sum_{k=1}^{K}\left(\frac{\left|C_{k}\right|}{|D|}\right)^{2} \end{aligned}

\operatorname{Gini}(D \mid A)=\sum_{i=1}^{n} \frac{\left|D_{i}\right|}{|D|} \operatorname{Gini}\left(D_{i}\right)
  • The Gini index reflects the probability that two samples randomly drawn from the data set have different class labels, so the smaller the Gini index, the higher the certainty and purity of the data set. Since CART is a binary tree, for a binary split, suppose the set D is divided under attribute A into two parts D_1 and D_2; then the Gini index of attribute A is computed as:


    \operatorname{Gini}(D, A)=\frac{\left|D_{1}\right|}{|D|} \operatorname{Gini}\left(D_{1}\right)+\frac{\left|D_{2}\right|}{|D|} \operatorname{Gini}\left(D_{2}\right)
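A short sketch of these formulas, with the same DataFrame conventions as before:

```python
def gini(labels) -> float:
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2."""
    probs = labels.value_counts(normalize=True)
    return float(1.0 - (probs ** 2).sum())

def gini_split(df, mask, target="label") -> float:
    """Weighted Gini of splitting D into D1 (mask True) and D2 (mask False)."""
    d1, d2 = df[mask], df[~mask]
    return (len(d1) / len(df) * gini(d1[target])
            + len(d2) / len(df) * gini(d2[target]))
```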

Binary recursive splitting

  • The CART algorithm uses binary recursive splitting: during tree growth, the current sample set is always divided into two sub-sample sets, so every non-leaf node of the resulting decision tree has exactly two branches, and the whole decision tree is a simply structured binary tree. The CART algorithm is therefore directly applicable when a sample's feature value is "yes" or "no".

  • For features with multiple discrete values, CART performs a binary cut and selects the cut with the lowest Gini index. If a feature has three values ['young', 'middle', 'old'], then there are three possible binary partitions: [('young'), ('middle', 'old')], [('middle'), ('young', 'old')], [('old'), ('young', 'middle')]. CART computes the Gini index of each of these binary splits and selects the optimal one (see the sketch after this list).

  • Meanwhile, CART does not specify a termination rule for the tree; the tree grows to its maximum size. Pruning is therefore very important, and the pruning strategy is discussed below.
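Here is the sketch referred to above for binary splits of a multi-valued categorical feature: enumerate the ways of sending a subset of values to the left branch and keep the split with the lowest weighted Gini index. It reuses gini_split from the previous sketch:

```python
from itertools import combinations

def best_categorical_split(df, attribute, target="label"):
    values = list(df[attribute].unique())
    best_subset, best_score = None, float("inf")
    # Non-empty subsets of up to half the values cover every binary partition.
    for r in range(1, len(values) // 2 + 1):
        for subset in combinations(values, r):
            mask = df[attribute].isin(subset)
            score = gini_split(df, mask, target)
            if score < best_score:
                best_subset, best_score = set(subset), score
    return best_subset, best_score

# For ['young', 'middle', 'old'] this tries exactly the three partitions listed above.
```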


CART regression tree

  • The essential difference between regression and classification is whether the output is continuous or discrete. CART can be not only a classification tree but also a regression tree. The regression tree suits scenarios where, although the label values are continuously distributed, they can be divided into groups that are similar within a group but different between groups.


  • Loss function of the CART regression tree: sum of squared errors

    • In the decision trees discussed so far we used information entropy or the Gini index as the splitting criterion, but in a regression tree the label is a continuous value, so they no longer apply. The loss function of the CART regression tree is usually the "sum of squared errors":

    \min _{j, s}\left[\min _{c_{1}} \sum_{x_{i} \in R_{1}(j, s)}\left(y_{i}-c_{1}\right)^{2}+\min _{c_{2}} \sum_{x_{i} \in R_{2}(j, s)}\left(y_{i}-c_{2}\right)^{2}\right]
    • In the formula above, y_i is the actual value corresponding to the training sample x_i, c_1 is the mean of y over the samples in R_1(j, s), and c_2 is the mean of y over the samples in R_2(j, s).
  • Generation of the CART regression tree

    For a given training set D

    1. Traverse the values of each variable in turn to find the optimal splitting variable j and split point s by solving:

    \min _{j, s}\left[\min _{c_{1}} \sum_{x_{i} \in R_{1}(j, s)}\left(y_{i}-c_{1}\right)^{2}+\min _{c_{2}} \sum_{x_{i} \in R_{2}(j, s)}\left(y_{i}-c_{2}\right)^{2}\right]
    • This step finds the optimal split point for the current node so that the total squared error of the two child nodes after splitting is minimized.


    2. Then divide the region with the selected pair (j, s) and determine the corresponding output values:

    \begin{array}{c} R_{1}(j, s)=\left\{x \mid x^{(j)} \leq s\right\}, \quad R_{2}(j, s)=\left\{x \mid x^{(j)}>s\right\} \\ \hat{c}_{m}=\frac{1}{N_{m}} \sum_{x_{i} \in R_{m}(j, s)} y_{i}, \quad x \in R_{m}, \quad m=1,2 \end{array}
    • R_1(j, s) and R_2(j, s) denote the left and right child nodes respectively; \hat{c}_m is the output value of region R_m, which under the sum-of-squared-errors loss is the mean of y over the samples in that region (for other loss functions it is not the mean).
    3. Continue to apply steps 1 and 2 to the two sub-regions until a stopping condition is met (for example, the number of samples in a child node is too small, or the decision tree has reached the specified depth).

    4. The input space is thus divided into M regions R_1, R_2, ..., R_M, and the decision tree is generated as:


    f(x)=\sum_{m=1}^{M} \hat{c}_{m} I\left(x \in R_{m}\right)
    • f(x) is the learned CART regression tree. I(x ∈ R_m) is the indicator of the region a sample belongs to: its value is 1 if x falls in region R_m and 0 otherwise.
    • Finally, each leaf of the trained regression tree contains several y values, so what should the predicted value be? The answer is again the value that minimizes the loss function of that leaf node. For the sum of squared errors, setting the derivative to zero shows that the mean minimizes the loss, so the leaf simply outputs the mean of its y values.
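A minimal sketch of one growth step of the regression tree: scan every feature j and threshold s and keep the pair that minimizes the total squared error of the two regions, each predicted by its mean (NumPy arrays are assumed; this is an illustration, not production code):

```python
import numpy as np

def best_regression_split(X, y):
    """X: (n_samples, n_features) array, y: (n_samples,) array."""
    best = (None, None, np.inf)                      # (j, s, loss)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best   # the optimal splitting variable j, split point s, and the loss
```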

Pruning based on cost complexity

  • Since CART does not specify a termination rule and the tree grows to its maximum size, pruning is especially important. CART uses a "cost-complexity based" pruning strategy, which produces a series of trees of different sizes, each obtained by replacing some subtrees of the largest tree with leaves; the smallest tree contains only one leaf. Finally, cross-validation on the validation set is used to evaluate all the trees and select the one with the best classification performance.

  • Specific algorithm steps:

  • First, call the largest tree T_0. We want to reduce the size of the tree to prevent overfitting, but worry about underfitting if too many subtrees are removed, so we define a loss function to balance the two:


    C_{\alpha}(T)=C(T)+\alpha|T|
    • T: any subtree;

    • C(T): the prediction error of the subtree on the training set (Gini index for classification, sum of squared errors for regression), which measures how well the subtree fits the training data;

    • |T|: the number of leaf nodes of the subtree, which measures its complexity;

    • α: the regularization penalty parameter, used to balance the degree of fit against the complexity of the tree.

    • Our ultimate goal is to minimize the loss function C_α(T). From the formula, as the regularization parameter α increases, the complexity |T| of the tree is forced to decrease (pruning); the fit to the training data therefore decreases and the prediction error C(T) gradually increases, so the value that minimizes C_α(T) cannot be determined intuitively. For this reason we introduce the "error increase rate".

  • Error increase rate:


    g(t)=\frac{C(t)-C\left(T_{t}\right)}{\left|T_{t}\right|-1}
    • t: any internal node

    • T_t: the subtree rooted at node t

    • C(t): the prediction error when t is treated as a leaf node (its subtree is cut off)

    • C(T_t): the prediction error of the subtree T_t

    • |T_t|: the number of leaf nodes of the subtree T_t

    • Understanding of the formula:

    1. The error increase rate is used to measure the “benefit” or “cost performance” of pruning.

    2. The numerator is the change in error before and after pruning, which we want to be as small as possible; the denominator represents the size of the subtree that is cut off, which we want to be as large as possible (the pruned model is then much simpler). We therefore want the value of the formula to be as small as possible overall, so that the "benefit" of pruning is the largest and its "cost performance" the highest.
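As a tiny illustration, the error increase rate is just this ratio (the three arguments correspond to C(t), C(T_t) and |T_t|):

```python
def error_increase_rate(c_leaf: float, c_subtree: float, n_leaves: int) -> float:
    """g(t) = (C(t) - C(T_t)) / (|T_t| - 1)."""
    return (c_leaf - c_subtree) / (n_leaves - 1)
```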

  • Next, for the largest tree T_0, the "error increase rate" of each internal node is computed, the node with the lowest value is selected, its branches are cut off, and the pruned tree is marked T_1. The "error increase rate" of each internal node of T_1 is then computed, the node with the lowest value is pruned, and the result is marked T_2. This process is repeated until the last tree T_n has only one leaf, giving a series of trees T_0, T_1, T_2, T_3, ..., T_n, which are the optimal subtrees for different numbers of nodes; at the same time they are the optimal subtrees on each interval [α_i, α_{i+1}) as the regularization parameter α increases from 0 to ∞. Finally, each subtree is evaluated on the validation set (cross-validation score) and the optimal subtree is selected; a scikit-learn sketch follows.
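In practice this pruning is rarely implemented by hand; here is a sketch using scikit-learn, whose DecisionTreeClassifier exposes cost-complexity pruning through ccp_alpha (X and y are assumed to be an already loaded dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def prune_by_cost_complexity(X, y):
    # The pruning path gives the sequence of effective alphas (one per subtree T_0 ... T_n).
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    best_alpha, best_score = 0.0, -np.inf
    for alpha in path.ccp_alphas:
        clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        score = cross_val_score(clf, X, y, cv=5).mean()   # cross-validated evaluation
        if score > best_score:
            best_alpha, best_score = alpha, score
    return DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```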





If you have any questions, please leave a message.

Finally, if you’re interested in Python, data mining, machine learning, and more, please check out my blog.