Decision Trees

Learning notes for Zhou Zhihua's book *Machine Learning*, recording my study process. This post covers Chapter 4.

1 Basic Process

A decision tree consists of one root, several internal nodes and several leaves.

  • Each leaf node corresponds to a decision result; every other node corresponds to an attribute test
  • The sample set contained in a node is split among its child nodes according to the outcome of its attribute test
  • The root contains the complete set of samples.
  • The path from the root to each leaf corresponds to a sequence of decision tests.

The goal of decision tree learning is to produce a tree with strong generalization ability, that is, a tree that handles unseen examples well.

Basic flow: the divide-and-conquer idea

  • Data set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where each $x_i$ is a vector of attribute values (the independent variables) and $y_i$ is its label (a class in a multi-class classification problem)

  • Attribute set $A = \{a_1, a_2, \dots, a_n\}$, the $n$ decision attributes, matching the dimension of $x$

  • TreeGenerate(D, A)

    TreeGenerate(D, A):
        generate node
        if all samples in D belong to the same class C:
            mark node as a class-C leaf; return                            // return (1)
        if A is empty, or all samples in D take identical values on A:
            // e.g. the remaining samples agree on every attribute ("all married", "all 18 years old")
            mark node as a leaf labeled with the majority class of D; return    // return (2)
        select the optimal splitting attribute a* from A
        for each value a*' of a*:
            generate a branch node node* for node; let Dv = samples of D with a* = a*'
            if Dv is empty:
                mark node* as a leaf labeled with the majority class of D; return   // return (3)
            else:
                TreeGenerate(Dv, A \ {a*})

    Note: the three return points mean:

    (1): all samples at the current node belong to the same class, so no further splitting is needed

    (2): the attribute set is empty, or all samples take the same values on the remaining attributes, so the node cannot be split; it is labeled using the posterior distribution of the current node (its majority class)

    (3): the sample set of the current branch is empty and cannot be split; the branch is labeled using the class distribution of the parent node as a prior
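
To make the recursion concrete, here is a minimal Python sketch of the flow above (an illustrative sketch, not the book's code). It assumes the dataset is a list of (attribute-dict, label) pairs, `values` maps each attribute to its possible values, and `choose` is any selection criterion from Section 2; a leaf is represented by its class label, an internal node by a dict with one child per attribute value.

```python
from collections import Counter

def majority_class(D):
    """Most frequent label in D, where D is a list of (attrs, label) pairs."""
    return Counter(label for _, label in D).most_common(1)[0][0]

def tree_generate(D, A, values, choose):
    """D: list of (attrs_dict, label); A: set of attribute names;
    values: dict attr -> possible values; choose(D, A): splitting criterion."""
    labels = {label for _, label in D}
    if len(labels) == 1:                          # return (1): all samples share one class
        return labels.pop()
    if not A or all(x[a] == D[0][0][a] for x, _ in D for a in A):
        return majority_class(D)                  # return (2): nothing left to split on
    a_star = choose(D, A)                         # e.g. information gain, gain ratio, Gini
    node = {"attr": a_star, "children": {}}
    for v in values[a_star]:
        Dv = [(x, y) for x, y in D if x[a_star] == v]
        if not Dv:                                # return (3): empty branch -> parent majority
            node["children"][v] = majority_class(D)
        else:
            node["children"][v] = tree_generate(Dv, A - {a_star}, values, choose)
    return node
```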

2 Split selection

The key step in decision tree learning is selecting the optimal splitting attribute $a_*$. In general, as splitting proceeds we want the samples in each branch node to belong to a single class as far as possible, that is, we want node purity to keep increasing.

2.1 Information gain

Information entropy: measures sample purity. Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k$, and there are $|Y|$ classes in total; the information entropy is then defined as:


$$Ent(D) = -\sum_{k=1}^{|Y|} p_k \log_2 p_k$$

The lower the value of $Ent(D)$, the higher the purity of $D$. My own intuition: information entropy can be read as the degree of disorder in the data; the more disorder, the lower the purity, and as the tree grows we want its nodes to become less and less mixed. For example, the distribution (0.1, 0.9) is purer than (0.2, 0.8), which shows up as a smaller entropy for the former.
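
A quick check of this example, assuming the class distribution is given directly as a list of proportions:

```python
import math

def entropy(p):
    """Ent(D) = -sum_k p_k * log2(p_k); zero-probability classes contribute nothing."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

print(entropy([0.1, 0.9]))  # ~0.469  (purer distribution, lower entropy)
print(entropy([0.2, 0.8]))  # ~0.722
print(entropy([0.5, 0.5]))  # 1.0     (maximally mixed for two classes)
```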

**Information gain:** Suppose a discrete attribute $a$ has $V$ possible values $\{a^1, a^2, \dots, a^V\}$. Splitting the sample set $D$ on $a$ yields $V$ branch nodes; the $v$-th branch contains the subset $D^v$ of samples in $D$ whose value on $a$ is $a^v$, with information entropy $Ent(D^v)$. To account for the different branch sizes, each branch node is weighted by $|D^v|/|D|$, i.e., its share of the total samples. The information gain is:


$$Gain(D,a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$$

Information gain can be understood as the drop in disorder from before the split to after it: the larger the drop, the greater the "purity improvement" obtained by splitting on attribute $a$, and the better the split.

The ID3 decision tree algorithm uses information gain as its criterion for choosing the splitting attribute. The selection rule is:


$$a_* = \mathop{\arg\max}_{a \in A} \; Gain(D,a)$$
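
A minimal sketch of this selection rule, using the same hypothetical (attribute-dict, label) representation as the TreeGenerate sketch above:

```python
import math
from collections import Counter

def ent(D):
    """Ent(D) for D given as a list of (attrs, label) pairs."""
    counts = Counter(label for _, label in D)
    return -sum(c / len(D) * math.log2(c / len(D)) for c in counts.values())

def info_gain(D, a):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)."""
    subsets = {}
    for x, y in D:
        subsets.setdefault(x[a], []).append((x, y))
    remainder = sum(len(Dv) / len(D) * ent(Dv) for Dv in subsets.values())
    return ent(D) - remainder

def id3_choose(D, A):
    """a* = argmax_{a in A} Gain(D, a)."""
    return max(A, key=lambda a: info_gain(D, a))
```

Passing `id3_choose` as the `choose` argument of the earlier `tree_generate` sketch gives an ID3-style tree.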

2.2 Gain ratio

In fact, the information gain criterion is biased toward attributes with many possible values. To reduce the adverse effects of this bias, the C4.5 algorithm uses the "gain ratio" to select the optimal splitting attribute:


$$Gain\_ratio(D,a) = \frac{Gain(D,a)}{IV(a)}, \qquad IV(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$

$IV(a)$ is called the intrinsic value of the attribute: the more possible values $a$ has, the larger $IV(a)$ tends to be. The gain ratio criterion, in turn, is biased toward attributes with fewer values, so C4.5 does not simply pick the candidate attribute with the largest gain ratio. Instead it uses a heuristic: first find the candidate attributes whose information gain is above average, then choose among them the one with the highest gain ratio.
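
Building on the hypothetical `info_gain` helper above, a sketch of the gain ratio and the C4.5-style two-step heuristic:

```python
import math

def intrinsic_value(D, a):
    """IV(a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|)."""
    counts = {}
    for x, _ in D:
        counts[x[a]] = counts.get(x[a], 0) + 1
    return -sum(c / len(D) * math.log2(c / len(D)) for c in counts.values())

def gain_ratio(D, a):
    # uses info_gain from the ID3 sketch above
    iv = intrinsic_value(D, a)
    return info_gain(D, a) / iv if iv > 0 else 0.0   # guard: attribute with a single value

def c45_choose(D, A):
    """Among attributes with above-average information gain, pick the best gain ratio."""
    gains = {a: info_gain(D, a) for a in A}
    avg = sum(gains.values()) / len(gains)
    candidates = [a for a in A if gains[a] >= avg]
    return max(candidates, key=lambda a: gain_ratio(D, a))
```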

2.3 Gini Index

The CART decision tree uses the "Gini index" to measure the purity of data set $D$:


$$Gini(D) = \sum_{k=1}^{|Y|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|Y|} p_k^2$$

The Gini index is the probability that two samples drawn at random from $D$ carry different class labels: the smaller this probability, the smaller the Gini index and the higher the purity of $D$.

The Gini index of attribute $a$ is defined as:


$$Gini\_index(D,a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} Gini(D^v)$$

Selection criteria:


$$a_* = \mathop{\arg\min}_{a \in A} \; Gini\_index(D,a)$$
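
A corresponding sketch of the CART criterion on the same hypothetical dataset format:

```python
from collections import Counter

def gini(D):
    """Gini(D) = 1 - sum_k p_k^2, for D as a list of (attrs, label) pairs."""
    counts = Counter(label for _, label in D)
    return 1.0 - sum((c / len(D)) ** 2 for c in counts.values())

def gini_index(D, a):
    """Gini_index(D, a) = sum_v |D^v|/|D| * Gini(D^v)."""
    subsets = {}
    for x, y in D:
        subsets.setdefault(x[a], []).append((x, y))
    return sum(len(Dv) / len(D) * gini(Dv) for Dv in subsets.values())

def cart_choose(D, A):
    """a* = argmin_{a in A} Gini_index(D, a)."""
    return min(A, key=lambda a: gini_index(D, a))
```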

3 Pruning

Pruning is the main way decision tree learning deals with overfitting. To classify the training samples as accurately as possible, node splitting is repeated over and over, sometimes producing too many branches (the tree learns "too well"), so that peculiarities of the training set are mistaken for general properties of all data, which leads to overfitting.

The basic pruning strategies are pre-pruning and post-pruning.

  • Pre-pruning: during tree generation, each node is evaluated before it is split. If splitting the current node would not improve the generalization performance of the tree, the split is abandoned and the node is marked as a leaf
  • Post-pruning: a complete tree is first grown from the training set, and non-leaf nodes are then examined bottom-up; if replacing a node's subtree with a leaf improves the tree's generalization performance, the subtree is replaced with a leaf.

From these definitions it is clear that the key question in pruning is how to judge whether the generalization performance of the decision tree improves.

3.1 Pre-pruning

Process:

  • The data set is divided into training set and test set.

  • Grow a decision tree from the training set as follows:

  • Take the root node as an example:

    • If the root is not split, it is labeled with the majority class of the training set, i.e., "good melon" (in fact "bad melon" would also do, since the training set contains 5 good melons and 5 bad melons).

      On the test set, samples {4,5,8} are then classified correctly as good melons while the rest are not, giving an accuracy of 42.9%

    • If the root is split (on the attribute "umbilical", as in the book's figure), test samples {4,5,8,11,12} are classified correctly, giving an accuracy of 71.4%, an improvement; so the split on "umbilical" is kept.

  • Take node ② as an example: splitting it on the attribute "color and luster" (chosen by the information gain criterion) turns out to lower the test accuracy, so this split is forbidden. The remaining nodes are handled the same way; a minimal sketch of this accuracy check follows below.
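
A rough sketch of the check behind each pre-pruning decision (an illustrative simplification, not the book's exact procedure): the node is split only if doing so raises accuracy on the held-out test data, with each child labeled by the majority class of its training subset.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def should_split(D_train, D_test, a):
    """Pre-pruning test: split on attribute a only if it raises test accuracy."""
    # Accuracy if the node stays a leaf labeled with the training majority class.
    leaf_label = majority([y for _, y in D_train])
    acc_leaf = sum(y == leaf_label for _, y in D_test) / len(D_test)

    # Accuracy if we split: each child becomes a leaf with its own majority class.
    child_labels = {}
    for x, y in D_train:
        child_labels.setdefault(x[a], []).append(y)
    child_labels = {v: majority(ys) for v, ys in child_labels.items()}
    acc_split = sum(
        y == child_labels.get(x[a], leaf_label)   # unseen value falls back to parent label
        for x, y in D_test
    ) / len(D_test)

    return acc_split > acc_leaf
```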

Evaluation:

  • Advantages: many branches of the tree are never expanded, which reduces the risk of overfitting and also significantly cuts the training and testing time of the decision tree.
  • Disadvantages: although some splits do not improve (or even reduce) generalization performance at the moment they are made, later splits built on top of them may improve performance substantially. Pre-pruning is greedy: it forbids expanding such branches outright, which exposes the pre-pruned tree to a risk of underfitting

3.2 Post-pruning

Process:

  • First, a complete decision tree is grown from the training set; its accuracy on the test set is 42.9%

  • Working bottom-up, node ⑥ is examined first: the subtree rooted at ⑥ is cut off, i.e., ⑥ is replaced by a leaf whose class is the majority class of its training samples {7,15}, which can be taken as "good melon". The accuracy of the tree on the test set rises to 57.1%, so the subtree is pruned.

  • Node ⑤ is considered next: the accuracy is unchanged, so no pruning is performed.

  • Node ② is considered next: the accuracy rises to 71.4%, so the subtree is pruned

  • For nodes ③ and ①, the accuracy does not improve, so their subtrees are kept and no pruning is done

  • Finally, the post-pruned decision tree is obtained.

Evaluation:

  • Advantages: the post-pruned tree keeps more branches than the pre-pruned tree. In general, post-pruning carries a smaller risk of underfitting, and its generalization performance is usually better than that of pre-pruning.
  • Disadvantages: post-pruning takes place only after the complete tree has been grown, and its bottom-up pass must examine every non-leaf node, so its training time cost is much larger than that of an unpruned or a pre-pruned tree.

4 Continuous and missing values

4.1 Continuous value processing

Since a continuous attribute no longer has a finite number of values, continuous attributes are discretized.

The C4.5 decision tree adopts bi-partition (the dichotomy method):

  • Given the sample set $D$ and a continuous attribute $a$, suppose $a$ takes $n$ distinct values on $D$; sort them in increasing order as $\{a^1, a^2, \dots, a^n\}$

  • Based on a split point $t$, $D$ is divided into $D_t^-$ (samples with value $\le t$) and $D_t^+$ (samples with value $> t$). Candidate split points are the midpoints of adjacent values $a^i$ and $a^{i+1}$:


    $$T_a = \left\{ \frac{a^i + a^{i+1}}{2} \,\middle|\, 1 \le i \le n-1 \right\}$$
  • The information gain formula is changed to:


    $$Gain(D,a) = \max_{t \in T_a} Gain(D,a,t) = \max_{t \in T_a} \left( Ent(D) - \sum_{\lambda \in \{-,+\}} \frac{|D_t^\lambda|}{|D|} Ent(D_t^\lambda) \right)$$
  • For each continuous attribute $a$, the information gain and the corresponding split point $t$ are computed; the branch then splits the samples into two groups, $\le t$ and $> t$ (see the sketch below).
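
A sketch of this bi-partition search for a single continuous attribute, assuming the data is given as a list of (value, label) pairs for the attribute under consideration:

```python
import math
from collections import Counter

def ent(labels):
    counts = Counter(labels)
    return -sum(c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

def best_threshold(D):
    """D: list of (value, label). Returns (best_t, best_gain) under bi-partition."""
    values = sorted({v for v, _ in D})
    candidates = [(values[i] + values[i + 1]) / 2 for i in range(len(values) - 1)]
    base = ent([y for _, y in D])
    best_t, best_gain = None, -1.0
    for t in candidates:
        left = [y for v, y in D if v <= t]
        right = [y for v, y in D if v > t]
        gain = base - (len(left) / len(D) * ent(left) + len(right) / len(D) * ent(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```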

4.2 Handling missing values

In practice we often face data sets in which some attribute values are missing (for privacy and other reasons). Simply discarding incomplete samples and learning only from the complete ones would clearly waste a great deal of the information in the data.

To deal with missing values, we need to consider two questions:

(1) How do we select the splitting attribute when some attribute values are missing?

(2) Given the splitting attribute, how do we assign a sample whose value on that attribute is missing?

Solution to problem 1:

  • Given data set $D$ and attribute $a$, let $\widetilde{D}$ denote the subset of samples in $D$ that have no missing value on attribute $a$; attribute selection in question (1) is judged using $\widetilde{D}$ only.

  • Suppose $a$ has $V$ possible values. Let $\widetilde{D}^v$ denote the subset of $\widetilde{D}$ whose value on attribute $a$ is $a^v$, and let $\widetilde{D}_k$ denote the subset of $\widetilde{D}$ belonging to class $k$.

    Clearly $\widetilde{D} = \bigcup_{k=1}^{|Y|} \widetilde{D}_k$ and $\widetilde{D} = \bigcup_{v=1}^{V} \widetilde{D}^v$.

  • Suppose each sample $x$ is given a weight $w_x$, and define


    $$\rho = \frac{\sum_{x \in \widetilde{D}} w_x}{\sum_{x \in D} w_x}, \qquad \widetilde{p}_k = \frac{\sum_{x \in \widetilde{D}_k} w_x}{\sum_{x \in \widetilde{D}} w_x} \;\; (1 \le k \le |Y|), \qquad \widetilde{r}_v = \frac{\sum_{x \in \widetilde{D}^v} w_x}{\sum_{x \in \widetilde{D}} w_x} \;\; (1 \le v \le V)$$

    Here $\rho$ is the (weighted) proportion of samples with no missing value; $\widetilde{p}_k$ is the proportion of class $k$ among the non-missing samples; and $\widetilde{r}_v$ is the proportion of non-missing samples whose value on attribute $a$ is $a^v$.

  • The formula of information gain is extended to:


    $$Gain(D,a) = \rho \times Gain(\widetilde{D},a) = \rho \times \left( Ent(\widetilde{D}) - \sum_{v=1}^{V} \widetilde{r}_v \, Ent(\widetilde{D}^v) \right), \qquad Ent(\widetilde{D}) = -\sum_{k=1}^{|Y|} \widetilde{p}_k \log_2 \widetilde{p}_k$$

Solution to problem 2:

  • If a sample's value on attribute $a$ is known, it goes into the corresponding child node and its weight $w_x$ stays unchanged
  • If a sample's value on attribute $a$ is missing, it goes into every child node, with its weight in the child for value $a^v$ adjusted to $\widetilde{r}_v \times w_x$. Intuitively, the sample is distributed among the different children with different probabilities (see the sketch below).
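
A sketch of the weighted gain and the weight redistribution described above, assuming the samples for a single attribute are given as (value_or_None, label, weight) triples, with None marking a missing value:

```python
import math
from collections import defaultdict

def weighted_ent(pairs):
    """Entropy over (label, weight) pairs."""
    total = sum(w for _, w in pairs)
    by_class = defaultdict(float)
    for y, w in pairs:
        by_class[y] += w
    return -sum(w / total * math.log2(w / total) for w in by_class.values())

def missing_aware_gain(D):
    """D: list of (value_or_None, label, weight). Gain(D, a) = rho * Gain(D~, a)."""
    total_w = sum(w for _, _, w in D)
    known = [(v, y, w) for v, y, w in D if v is not None]
    known_w = sum(w for _, _, w in known)
    rho = known_w / total_w

    by_value = defaultdict(list)
    for v, y, w in known:
        by_value[v].append((y, w))

    ent_known = weighted_ent([(y, w) for _, y, w in known])
    remainder = sum(
        sum(w for _, w in pairs) / known_w * weighted_ent(pairs)
        for pairs in by_value.values()
    )
    return rho * (ent_known - remainder)

def split_weights(D, w_x):
    """Question (2): a missing-value sample enters every child with weight r~_v * w_x."""
    known = [(v, w) for v, _, w in D if v is not None]
    known_w = sum(w for _, w in known)
    r = defaultdict(float)
    for v, w in known:
        r[v] += w / known_w
    return {v: rv * w_x for v, rv in r.items()}
```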

5 Multivariate decision trees

If each attribute is regarded as an axis of a coordinate space, a sample described by $d$ attributes corresponds to a point in $d$-dimensional space, and classifying samples amounts to finding the classification boundary between the different classes in this space. The boundary produced by an ordinary decision tree has an obvious characteristic: it is axis-parallel, i.e., it consists of several segments each parallel to one of the coordinate axes.

A multivariate decision tree is a decision tree that can realize "oblique" or even more complex partitions.

  • Oblique partition: a non-leaf node no longer tests a single attribute but a linear combination of attributes; each non-leaf node is a linear classifier of the form $\sum_{i=1}^{d} w_i a_i = t$, where the weights $w_i$ and the threshold $t$ are learned from the sample set and attribute set at that node. Unlike a traditional decision tree, which looks for one optimal splitting attribute per non-leaf node, this approach tries to build a suitable linear classifier at each node (a toy illustration follows this list).
  • A representative algorithm: the OC1 decision tree
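
A toy illustration of what such a node test looks like, with purely hypothetical weights and threshold: the sample is routed according to which side of the hyperplane $\sum_i w_i a_i = t$ it falls on.

```python
def oblique_test(x, weights, t):
    """Route a sample by the linear test: sum_i w_i * x[a_i] <= t."""
    return sum(w * x[a] for a, w in weights.items()) <= t

# Hypothetical oblique node: -0.8 * density - 0.04 * sugar <= -0.3
node = {"weights": {"density": -0.8, "sugar": -0.04}, "t": -0.3}
sample = {"density": 0.697, "sugar": 0.460}
print(oblique_test(sample, node["weights"], node["t"]))  # True -> take the left branch
```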

6 Incremental learning

Some decision tree algorithms support "incremental learning", e.g., ID4, ID5R, and ITI. Incremental learning means that when new samples arrive, the learned model can be adjusted without re-learning from scratch; the main mechanism is to partially restructure the tree by adjusting the order of splitting attributes along branch paths.