Prerequisite for this article: a little programming knowledge

I. Introduction

In everyday life, whenever it is time for a meal we silently ask ourselves, "What shall I eat?" We first decide that the restaurant should be no more than 200 meters away. Then, looking at the 20 yuan in our wallet, we decide that the meal should cost no more than 20 yuan. Finally, we order Lanzhou noodles. As this example shows, the Lanzhou noodles we eat today are determined by a series of prior decisions.

Figure 1-1

As shown in Figure 1-1, the decision process above can be represented as a binary tree, which is called a decision tree. In machine learning, the decision tree model shown in Figure 1-1 can also be obtained by training on a data set, and the algorithm that does so is called the decision tree learning algorithm [1].

II. Model introduction

Model

A decision tree learning algorithm, first of all, produces a tree structure composed of internal nodes and leaf nodes. An internal node represents a dimension (feature), and a leaf node represents a class. Nodes are connected to one another by conditions, so a decision tree can be regarded as a collection of if...else rules.

Figure 2-1

Figure 2-1 shows a basic decision tree data structure and how decisions are made with it.
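
To make the "collection of if...else rules" view concrete, here is a minimal sketch of the lunch decision from the introduction written as nested conditions; the function name and return strings are made up for illustration only.

def choose_lunch(distance_m, price_yuan):
    """A tiny hypothetical decision tree expressed as if...else rules."""
    if distance_m <= 200:              # internal node: distance feature
        if price_yuan <= 20:           # internal node: price feature
            return "Lanzhou noodles"   # leaf node: a class
        return "something cheaper"     # leaf node
    return "somewhere closer"          # leaf node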

Feature selection

With a decision to make, we need to choose which dimension (feature) to decide on, such as the distance to the restaurant or the amount of change in our wallet. In machine learning, we need a quantitative index to determine which feature is more appropriate, namely the one that yields subsets of higher "purity" after splitting on it. Three indicators are introduced for this: information gain, the Gini index, and the mean squared error (MSE).

Information Gain

Formula 2-1 is an indicator of the purity of a sample set, known as information entropy, where D denotes the sample set, K the number of classes in it, and p_k the proportion of samples belonging to the k-th class. The smaller Ent(D) is, the higher the purity of the sample set.


$$\operatorname{Ent}(D)=-\sum_{k=1}^{K} p_{k} \log_{2} p_{k}$$

Formula 2-1
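
As a quick worked example of Formula 2-1 (the numbers are made up): for a set of 10 samples containing 6 samples of one class and 4 of the other,

$$\operatorname{Ent}(D)=-\left(\tfrac{6}{10}\log_{2}\tfrac{6}{10}+\tfrac{4}{10}\log_{2}\tfrac{4}{10}\right)\approx 0.971$$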

Formula 2-2 measures the effect of a discrete attribute on the sample set and is known as information gain, where D denotes the sample set, a the discrete attribute, V the number of possible values of a, and D^v the subset of samples that take the v-th value of a.


$$\operatorname{Gain}(D, a)=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right)$$

Formula 2-2

When the attribute is continuous, its values are not restricted to a finite set as with a discrete attribute. In this case the values that the continuous attribute takes in the sample set can be sorted, and the midpoint of each pair of adjacent values used as a candidate split point. Rewriting Formula 2-2 then gives Formula 2-3, where T_a denotes the set of candidate split points and D_t^v a subset of samples: v = − denotes the samples whose value is less than the split point t, and v = + those whose value is greater than t. The maximum information gain over all split points is taken as the information gain of the attribute.


$$\begin{aligned} T_{a} &=\left\{\frac{a^{i}+a^{i+1}}{2} \mid 1 \leq i \leq n-1\right\} \\ \operatorname{Gain}(D, a) &=\max_{t \in T_{a}} \operatorname{Gain}(D, a, t) \\ &=\max_{t \in T_{a}} \operatorname{Ent}(D)-\sum_{v \in\{-,+\}} \frac{\left|D_{t}^{v}\right|}{|D|} \operatorname{Ent}\left(D_{t}^{v}\right) \end{aligned}$$

Formula 2-3

The larger Gain(D, a) is, the higher the purity of the sample set after splitting on the attribute. The most appropriate splitting attribute can therefore be chosen as in Formula 2-4:


$$a_{\text{best}}=\underset{a}{\operatorname{argmax}} \operatorname{Gain}(D, a)$$

Formula 2-4
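
The following is a minimal numeric sketch of Formulas 2-3 and 2-4 for a single continuous feature; the data and helper names are made up for illustration and are separate from the full implementation given later.

import numpy as np

def entropy(labels):
    """Ent(D) of a label array (Formula 2-1)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_at(x, y, t):
    """Gain(D, a, t): information gain of splitting feature x at point t."""
    y_lt, y_gt = y[x < t], y[x > t]
    return entropy(y) - (len(y_lt) / len(y) * entropy(y_lt)
                         + len(y_gt) / len(y) * entropy(y_gt))

x = np.array([150.0, 180.0, 210.0, 300.0])            # e.g. distance to the restaurant
y = np.array([1, 1, 0, 0])                            # 1 = go, 0 = do not go
candidates = (np.sort(x)[:-1] + np.sort(x)[1:]) / 2   # midpoints: 165, 195, 255
best_t = max(candidates, key=lambda t: gain_at(x, y, t))
print(best_t, gain_at(x, y, best_t))                  # 195.0 with gain 1.0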

Gini Index

Formula 2-5 is another indicator of the purity of a sample set, known as the Gini value, where D denotes the sample set, K the number of classes in it, and p_k the proportion of samples belonging to the k-th class. The smaller Gini(D) is, the higher the purity of the sample set.


$$\operatorname{Gini}(D)=1-\sum_{k=1}^{K} p_{k}^{2}$$

Formula 2-5

Formula 2-6 measures the effect of a discrete attribute on the sample set and is known as the Gini index, where D denotes the sample set, a the discrete attribute, V the number of possible values of a, and D^v the subset of samples that take the v-th value of a.


$$\operatorname{Gini\_index}(D, a)=\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Gini}\left(D^{v}\right)$$

Formula 2-6

In the same way as Formula 2-3, the midpoints of adjacent values of a continuous attribute are used as candidate split points, and rewriting Formula 2-6 gives Formula 2-7, where T_a denotes the set of candidate split points and D_t^v a subset of samples: v = − denotes the samples whose value is less than the split point t, and v = + those whose value is greater than t. The smallest Gini index over all split points is taken as the Gini index of the attribute.


$$\operatorname{Gini\_index}(D, a)=\min_{t \in T_{a}} \sum_{v \in\{-,+\}} \frac{\left|D_{t}^{v}\right|}{|D|} \operatorname{Gini}\left(D_{t}^{v}\right)$$

Formula 2-7

The smaller Gini_index(D, a) is, the higher the purity of the sample set after splitting on the discrete attribute. The most appropriate splitting attribute can therefore be chosen as in Formula 2-8:


$$a_{\text{best}}=\underset{a}{\operatorname{argmin}} \operatorname{Gini\_index}(D, a)$$

Formula 2-8
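
As a worked example of Formula 2-5 with the same made-up proportions as before (6 samples of one class and 4 of the other out of 10),

$$\operatorname{Gini}(D)=1-\left(\left(\tfrac{6}{10}\right)^{2}+\left(\tfrac{4}{10}\right)^{2}\right)=0.48$$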

Mean squared error (MSE)

The two indicators above allow decision trees to be used for classification problems. When a decision tree is used for regression, a different indicator is needed to choose the splitting feature: the mean squared error (MSE) shown in Formula 2-9, where T_a denotes the set of candidate split points and y_t^v the labels of a subset: v = − denotes the labels of samples whose value is less than the split point t, and v = + those whose value is greater than t; the hatted term is the mean of the labels in the corresponding subset.


$$\operatorname{MSE}(D, a)=\min_{t \in T_{a}} \sum_{v \in\{-,+\}}\left(y_{t}^{v}-\hat{y}_{t}^{v}\right)^{2}$$

Formula 2-9

The smaller MSE(D, a) is, the better the decision tree fits the sample set. The most appropriate splitting attribute can therefore be chosen as in Formula 2-10:


$$a_{\text{best}}=\underset{a}{\operatorname{argmin}} \operatorname{MSE}(D, a)$$

Formula 2-10
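
Below is a minimal numeric sketch of Formula 2-9 for a single continuous feature; the regression data and the helper name are made up for illustration.

import numpy as np

def mse_at(x, y, t):
    """Sum of squared errors of splitting feature x at point t (Formula 2-9)."""
    y_lt, y_gt = y[x < t], y[x > t]
    return np.sum((y_lt - y_lt.mean()) ** 2) + np.sum((y_gt - y_gt.mean()) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 0.9, 3.0, 3.2])
candidates = (np.sort(x)[:-1] + np.sort(x)[1:]) / 2   # midpoints: 1.5, 2.5, 3.5
best_t = min(candidates, key=lambda t: mse_at(x, y, t))
print(best_t)                                         # 2.5 separates the low and high labels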

Now that we know the data structure of the decision tree model and how to choose the best split, it is time to see how a decision tree is generated.

III. Algorithm steps

Since the data structure of a decision tree is a tree, its children are also trees, so a decision tree can be generated recursively. The steps are as follows:

1. Generate a new node.
2. If only one class C remains in the sample set, mark the node as a leaf node of class C and return it.
3. Iterate over all features, computing the information gain, Gini index, or mean squared error of each.
4. Record the best splitting feature in the node.
5. Split on the best feature, recursively call the current method on the left part, and take the result as the node's left child.
6. Split on the best feature, recursively call the current method on the right part, and take the result as the node's right child.
7. Return the node.

IV. Regularization

When a decision tree is generated by full recursion, the model classifies the training data very accurately but performs poorly on unseen data; this is the so-called over-fitting phenomenon. As with the remedies for over-fitting seen earlier for linear regression, the model can be regularized.

Depth of decision tree

Over-fitting can be prevented by limiting the maximum depth of the decision tree. Only one parameter is added to the algorithm, recording the tree depth of the current recursion. When the preset maximum depth is reached, no new child nodes are generated; the current node is marked with the class that accounts for the largest proportion of its samples, and the recursion returns.

The size of the leaves in the decision tree

Another regularization method for decision trees is to limit the minimum number of samples a leaf node may contain, which also prevents over-fitting. When the number of samples at a node falls to this minimum, the node is marked with the class that accounts for the largest proportion of its samples, and the recursion returns.
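
Both limits appear as constructor parameters in the implementations below, and scikit-learn's decision trees expose the same two knobs; a usage sketch (the parameter values here are arbitrary):

from sklearn.tree import DecisionTreeClassifier

# Limit the tree depth and the minimum number of samples per leaf to regularize
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)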

Pruning of decision trees

In addition, a decision tree can be pruned to prevent over-fitting by cutting off redundant subtrees. Pruning methods come in two kinds: pre-pruning and post-pruning.

Pre-pruning

As the name implies, pre-pruning decides whether to generate child nodes while the decision tree is being generated. The judgment uses a validation data set: the accuracy with the child nodes is compared against the accuracy without them, and the children are generated only when they improve accuracy; otherwise they are not generated.

Figure 4-1 From Machine Learning by Zhou Zhihua
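
A minimal sketch of this accuracy comparison on a validation set, assuming each side of the candidate split predicts its own majority class; the helper and argument names are hypothetical and not part of the implementation in the code section.

import numpy as np

def majority(labels):
    """Most frequent label in an array."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

def should_split(y_node, y_left, y_right, y_val, val_goes_left):
    """Pre-pruning check: split only if accuracy on the validation set improves."""
    # Accuracy if the node stays a leaf labelled with its majority class
    acc_leaf = np.mean(y_val == majority(y_node))
    # Accuracy if the node is split and each child predicts its own majority class
    pred = np.where(val_goes_left, majority(y_left), majority(y_right))
    acc_split = np.mean(y_val == pred)
    return acc_split > acc_leaf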

Post-pruning

Post-pruning first grows a complete decision tree and then works upward from the leaf nodes. The judgment is the same as in pre-pruning: child nodes are kept when they improve accuracy on the validation set, and cut off otherwise.

Figure 4-2 From Machine Learning by Zhou Zhihua
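
For reference, scikit-learn's trees offer a related form of post-pruning (minimal cost-complexity pruning) controlled by the ccp_alpha parameter instead of a validation-set comparison; a usage sketch with an arbitrary value:

from sklearn.tree import DecisionTreeClassifier

# Larger ccp_alpha removes more subtrees after the full tree has been grown
clf = DecisionTreeClassifier(ccp_alpha=0.01)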

V. Code implementation

Implementation of decision tree classification based on information gain using Python:

import numpy as np

class GainNode:
    """Node in the classification decision tree based on Information Gain"""
    
    def __init__(self, feature=None, threshold=None, gain=None, left=None, right=None) :
        # Feature subscript of node division
        self.feature = feature
        # The critical value of node division, when the node is leaf node, is the classification value
        self.threshold = threshold
        # Information gain value of node
        self.gain = gain
        # left node
        self.left = left
        # right node
        self.right = right

class GainTree:
    "" Classification decision tree based on Information Gain ""
    
    def __init__(self, max_depth = None, min_samples_leaf = None) :
        # Maximum depth of decision tree
        self.max_depth = max_depth
        # Determine the minimum sample number of leaf nodes
        self.min_samples_leaf = min_samples_leaf
        
    def fit(self, X, y) :
        """Classification decision tree fitting based on Information Gain"""
        y = np.array(y)
        self.root = self.buildNode(X, y, 0)
        return self
    
    def buildNode(self, X, y, depth) :
        "" Constructing classification Decision Tree Nodes Based on Information Gain
        node = GainNode()
        # Return directly when there are no samples
        if len(y) == 0:
            return node
        y_classes = np.unique(y)
        # If only one category exists in the sample, return that category directly
        if len(y_classes) == 1:
            node.threshold = y_classes[0]
            return node
        # Return the classification with the largest proportion of classification in the sample when the depth of the decision tree reaches the maximum depth limit
        if self.max_depth is not None and depth >= self.max_depth:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        # Return the classification with the largest proportion in the sample when the number of decision leaf node samples reaches the minimum limit
        if self.min_samples_leaf is not None and len(y) <= self.min_samples_leaf:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        max_gain = -np.inf
        max_middle = None
        max_feature = None
        # Iterate over all features to obtain the feature with the maximum information gain
        for i in range(X.shape[1]):
            # Calculate the information gain of the feature
            gain, middle = self.calcGain(X[:,i], y, y_classes)
            if max_gain < gain:
                max_gain = gain
                max_middle = middle
                max_feature = i
        # Feature of maximum information gain
        node.feature = max_feature
        # threshold
        node.threshold = max_middle
        # Information gain
        node.gain = max_gain
        X_lt = X[:,max_feature] < max_middle
        X_gt = X[:,max_feature] > max_middle
        # Recursively process the left set
        node.left = self.buildNode(X[X_lt,:], y[X_lt], depth + 1)
        # Recursively handle the right set
        node.right = self.buildNode(X[X_gt,:], y[X_gt], depth + 1)
        return node
    
    def calcMiddle(self, x) :
        "" calculate the average value of two continuous features. ""
        middle = []
        if len(x) == 0:
            return np.array(middle)
        start = x[0]
        for i in range(len(x) - 1):
            if x[i] == x[i + 1]:
                continue
            middle.append((start + x[i + 1]) / 2)
            start = x[i + 1]
        return np.array(middle)

    def calcEnt(self, y, y_classes) :
        """ Calculating information entropy """
        ent = 0
        for j in range(len(y_classes)):
            p = len(y[y == y_classes[j]])/ len(y)
            if p != 0:
                ent = ent + p * np.log2(p)
        return -ent

    def calcGain(self, x, y, y_classes) :
        """ Compute information gain """
        x_sort = np.sort(x)
        middle = self.calcMiddle(x_sort)
        max_middle = -np.inf
        max_gain = -np.inf
        ent = self.calcEnt(y, y_classes)
        # Iterate over each average
        for i in range(len(middle)):
            y_gt = y[x > middle[i]]
            y_lt = y[x < middle[i]]
            ent_gt = self.calcEnt(y_gt, y_classes)
            ent_lt = self.calcEnt(y_lt, y_classes)
            # Calculate information gain
            gain = ent - (ent_gt * len(y_gt) / len(x) + ent_lt * len(y_lt) / len(x))
            if max_gain < gain:
                max_gain = gain
                max_middle = middle[i]
        return max_gain, max_middle
    
    def predict(self, X) :
        """Classification decision tree prediction"""
        y = np.zeros(X.shape[0])
        root_value = self.checkNode(X, y, self.root)
        # When the root itself is a leaf, every sample shares its value
        if root_value is not None:
            y[:] = root_value
        return y
    
    def checkNode(self, X, y, node, cond = None) :
        "" Judging classification by classification decision tree nodes ""
        # If there are no children, return the current threshold
        if node.left is None and node.right is None:
            return node.threshold
        X_lt = X[:,node.feature] < node.threshold
        if cond is not None:
            X_lt = X_lt & cond
        # recursively determine the left node
        lt = self.checkNode(X, y, node.left, X_lt)
        if lt is not None:
            y[X_lt] = lt
        X_gt = X[:,node.feature] > node.threshold
        if cond is not None:
            X_gt = X_gt & cond
        # Recursively determine the right node
        gt = self.checkNode(X, y, node.right, X_gt)
        if gt is not None:
            y[X_gt] = gt
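
A short usage sketch of the GainTree class above, with made-up data (the expected output follows from this toy data only):

import numpy as np

# Toy training data: two features (e.g. distance and price) and two classes
X = np.array([[150.0, 12.0], [180.0, 18.0], [220.0, 25.0], [300.0, 30.0]])
y = np.array([1, 1, 0, 0])

clf = GainTree(max_depth=3, min_samples_leaf=1).fit(X, y)
print(clf.predict(np.array([[160.0, 15.0], [280.0, 28.0]])))   # [1. 0.]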

Implementation of Gini index based decision tree classification using Python:

import numpy as np

class GiniNode:
    """Node in the classification decision tree based on Gini Index"""
    
    def __init__(self, feature=None, threshold=None, gini_index=None, left=None, right=None) :
        # Feature subscript of node division
        self.feature = feature
        # The critical value of node division, when the node is leaf node, is the classification value
        self.threshold = threshold
        # Gini index value of node
        self.gini_index = gini_index
        # left node
        self.left = left
        # right node
        self.right = right

class GiniTree:
    "" Classification decision tree based on Gini Index
    
    def __init__(self, max_depth = None, min_samples_leaf = None) :
        # Maximum depth of decision tree
        self.max_depth = max_depth
        # Determine the minimum sample number of leaf nodes
        self.min_samples_leaf = min_samples_leaf
        
    def fit(self, X, y) :
        """Classification decision tree fitting based on Gini Index"""
        y = np.array(y)
        self.root = self.buildNode(X, y, 0)
        return self

    def buildNode(self, X, y, depth) :
        "" Construction of Classification Decision Tree Nodes Based on Gini Index
        node = GiniNode()
        # Return directly when there are no samples
        if len(y) == 0:
            return node
        y_classes = np.unique(y)
        # If only one category exists in the sample, return that category directly
        if len(y_classes) == 1:
            node.threshold = y_classes[0]
            return node
        # Return the classification with the largest proportion of classification in the sample when the depth of the decision tree reaches the maximum depth limit
        if self.max_depth is not None and depth >= self.max_depth:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        # Return the classification with the largest proportion in the sample when the number of decision leaf node samples reaches the minimum limit
        if self.min_samples_leaf is not None and len(y) <= self.min_samples_leaf:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        min_gini_index = np.inf
        min_middle = None
        min_feature = None
        # Iterate over all features to obtain the feature with the smallest Gini index
        for i in range(X.shape[1]):
            # Calculate the Gini index of the feature
            gini_index, middle = self.calcGiniIndex(X[:,i], y, y_classes)
            if min_gini_index > gini_index:
                min_gini_index = gini_index
                min_middle = middle
                min_feature = i
        # Features of the smallest Gini index
        node.feature = min_feature
        # threshold
        node.threshold = min_middle
        # Gini index
        node.gini_index = min_gini_index
        X_lt = X[:,min_feature] < min_middle
        X_gt = X[:,min_feature] > min_middle
        # Recursively process the left set
        node.left = self.buildNode(X[X_lt,:], y[X_lt], depth + 1)
        # Recursively handle the right set
        node.right = self.buildNode(X[X_gt,:], y[X_gt], depth + 1)
        return node
    
    def calcMiddle(self, x) :
        "" calculate the average value of two continuous features. ""
        middle = []
        if len(x) == 0:
            return np.array(middle)
        start = x[0]
        for i in range(len(x) - 1):
            if x[i] == x[i + 1]:
                continue
            middle.append((start + x[i + 1]) / 2)
            start = x[i + 1]
        return np.array(middle)

    def calcGiniIndex(self, x, y, y_classes) :
        """ Calculate gini index """
        x_sort = np.sort(x)
        middle = self.calcMiddle(x_sort)
        min_middle = np.inf
        min_gini_index = np.inf
        for i in range(len(middle)):
            y_gt = y[x > middle[i]]
            y_lt = y[x < middle[i]]
            gini_gt = self.calcGini(y_gt, y_classes)
            gini_lt = self.calcGini(y_lt, y_classes)
            gini_index = gini_gt * len(y_gt) / len(x) + gini_lt * len(y_lt) / len(x)
            if min_gini_index > gini_index:
                min_gini_index = gini_index
                min_middle = middle[i]
        return min_gini_index, min_middle

    def calcGini(self, y, y_classes) :
        """ Calculate the Gini value """
        gini = 1
        for j in range(len(y_classes)):
            p = len(y[y == y_classes[j]])/ len(y)
            gini = gini - p * p
        return gini
    
    def predict(self, X) :
        """Classification decision tree prediction"""
        y = np.zeros(X.shape[0])
        root_value = self.checkNode(X, y, self.root)
        # When the root itself is a leaf, every sample shares its value
        if root_value is not None:
            y[:] = root_value
        return y
    
    def checkNode(self, X, y, node, cond = None) :
        "" Judging classification by classification decision tree nodes ""
        if node.left is None and node.right is None:
            return node.threshold
        X_lt = X[:,node.feature] < node.threshold
        if cond is not None:
            X_lt = X_lt & cond
        lt = self.checkNode(X, y, node.left, X_lt)
        if lt is not None:
            y[X_lt] = lt
        X_gt = X[:,node.feature] > node.threshold
        if cond is not None:
            X_gt = X_gt & cond
        gt = self.checkNode(X, y, node.right, X_gt)
        if gt is not None:
            y[X_gt] = gt

Implementation of decision tree regression based on mean squared error using Python:

import numpy as np

class RegressorNode:
    "" nodes in regression decision tree ""
    
    def __init__(self, feature=None, threshold=None, mse=None, left=None, right=None) :
        # Feature subscript of node division
        self.feature = feature
        # The threshold for splitting at this node; when the node is a leaf node, this is the predicted value
        self.threshold = threshold
        # mean square error value of node
        self.mse = mse
        # left node
        self.left = left
        # right node
        self.right = right

class RegressorTree:
    """ Regression decision tree """
    
    def __init__(self, max_depth = None, min_samples_leaf = None) :
        # Maximum depth of decision tree
        self.max_depth = max_depth
        # Determine the minimum sample number of leaf nodes
        self.min_samples_leaf = min_samples_leaf
        
    def fit(self, X, y) :
        "" regression decision tree fitting ""
        self.root = self.buildNode(X, y, 0)
        return self

    def buildNode(self, X, y, depth) :
        """ Constructing regression decision tree nodes ""
        node = RegressorNode()
        # Return directly when there are no samples
        if len(y) == 0:
            return node
        y_classes = np.unique(y)
        # If only one distinct label value exists in the sample, return it directly
        if len(y_classes) == 1:
            node.threshold = y_classes[0]
            return node
        # Return the average of the labels when the depth of the decision tree reaches the maximum depth limit
        if self.max_depth is not None and depth >= self.max_depth:
            node.threshold = np.average(y)
            return node
        # Return the average of the labels when the number of samples at the node reaches the minimum leaf size
        if self.min_samples_leaf is not None and len(y) <= self.min_samples_leaf:
            node.threshold = np.average(y)
            return node
        min_mse = np.inf
        min_middle = None
        min_feature = None
        # Iterate over all features to obtain the feature with the smallest mean square deviation
        for i in range(X.shape[1]):
            # Calculate the mean squared error of the feature
            mse, middle = self.calcMse(X[:,i], y)
            if min_mse > mse:
                min_mse = mse
                min_middle = middle
                min_feature = i
        # Feature of minimum mean square deviation
        node.feature = min_feature
        # threshold
        node.threshold = min_middle
        # mean square error
        node.mse = min_mse
        X_lt = X[:,min_feature] < min_middle
        X_gt = X[:,min_feature] > min_middle
        # Recursively process the left set
        node.left = self.buildNode(X[X_lt,:], y[X_lt], depth + 1)
        # Recursively handle the right set
        node.right = self.buildNode(X[X_gt,:], y[X_gt], depth + 1)
        return node
    
    def calcMiddle(self, x) :
        "" calculate the average value of two continuous features. ""
        middle = []
        if len(x) == 0:
            return np.array(middle)
        start = x[0]
        for i in range(len(x) - 1):
            if x[i] == x[i + 1]:
                continue
            middle.append((start + x[i + 1]) / 2)
            start = x[i + 1]
        return np.array(middle)

    def calcMse(self, x, y) :
        """Calculate the mean squared error of a split on the feature"""
        x_sort = np.sort(x)
        middle = self.calcMiddle(x_sort)
        min_middle = np.inf
        min_mse = np.inf
        for i in range(len(middle)):
            y_gt = y[x > middle[i]]
            y_lt = y[x < middle[i]]
            avg_gt = np.average(y_gt)
            avg_lt = np.average(y_lt)
            mse = np.sum((y_lt - avg_lt) ** 2) + np.sum((y_gt - avg_gt) ** 2)
            if min_mse > mse:
                min_mse = mse
                min_middle = middle[i]
        return min_mse, min_middle
    
    def predict(self, X) :
        """Regression decision tree prediction"""
        y = np.zeros(X.shape[0])
        root_value = self.checkNode(X, y, self.root)
        # When the root itself is a leaf, every sample shares its value
        if root_value is not None:
            y[:] = root_value
        return y
    
    def checkNode(self, X, y, node, cond = None) :
        "" Judging classification by regression decision tree nodes ""
        if node.left is None and node.right is None:
            return node.threshold
        X_lt = X[:,node.feature] < node.threshold
        if cond is not None:
            X_lt = X_lt & cond
        lt = self.checkNode(X, y, node.left, X_lt)
        if lt is not None:
            y[X_lt] = lt
        X_gt = X[:,node.feature] > node.threshold
        if cond is not None:
            X_gt = X_gt & cond
        gt = self.checkNode(X, y, node.right, X_gt)
        if gt is not None:
            y[X_gt] = gt
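
A short usage sketch of the RegressorTree class above, again with made-up data:

import numpy as np

# Toy training data: one feature whose labels form two flat segments
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 0.9, 3.0, 3.2])

reg = RegressorTree(max_depth=2, min_samples_leaf=1).fit(X, y)
print(reg.predict(np.array([[1.2], [3.6]])))   # [1.1 3.2]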

VI. Third-party library implementation

Decision tree classification implementation using scikit-learn [2]:

from sklearn import tree

# Decision tree classification
clf = tree.DecisionTreeClassifier()
# Fit data
clf = clf.fit(X, y)

Decision tree regression implementation using scikit-learn [3]:

from sklearn import tree

# Decision tree regression
clf = tree.DecisionTreeRegressor()
# Fit data
clf = clf.fit(X, y)

VII. Animation demonstration

Figure 7-1 shows the classification results of an unregularized decision tree, and Figure 7-2 shows the classification results of a regularized decision tree (max_depth = 3, min_samples_leaf = 5).

Figure 7-1

Figure 7-2

Figure 7-3 shows the regression results of an unregularized decision tree, and Figure 7-4 shows the regression results of a regularized decision tree (max_depth = 3, min_samples_leaf = 5).

Figure 7-3

Figure 7-4

It can be seen that the unregularized decision tree clearly over-fits the training data set, while the regularized decision tree performs comparatively better.

VIII. Mind map

Figure 8-1

IX. References

  1. En.wikipedia.org/wiki/Decisi…
  2. Scikit-learn.org/stable/modu…
  3. Scikit-learn.org/stable/modu…

The full demo can be found here

Note: this article strives to be accurate and easy to understand, but as the author is a beginner, his level is limited. If there are errors or omissions in the article, readers are welcome to point them out in the comments.