4.6 CART algorithm

CART (Classification and Regression Trees) was proposed by L. Breiman, J. Friedman, R. Olshen, and C. Stone in 1984.

The previous sections introduced the ID3 and C4.5 algorithms for generating decision trees. Because the test data and the models built there were relatively simple, their generalization ability was good. However, when the amount of training data is large, the resulting decision tree model is often very complex and deep. In that case, although the training data is fitted well, the ability to predict new data is not necessarily good; in other words, overfitting occurs. The decision tree then needs to be pruned to simplify the model.

CART (Classification and Regression Tree) is another widely used decision tree learning method, and it is more versatile: it can be used both as a classification tree and as a regression tree. As a classification tree it is essentially similar to ID3 and C4.5, but it uses a different criterion for selecting features. In addition, the decision tree built by the CART algorithm is generally a binary tree, i.e., each test on a feature value answers only "yes" or "no" (in my view this is not absolute, but depends on the actual needs). When CART is used as a regression tree, the least squares error is used as the criterion for splitting samples.
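As a brief aside on the regression case (following the standard treatment in [2]): for a regression tree, the splitting feature $j$ and split point $s$ are chosen by least squares,

$$\min_{j,\,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\right]$$

where $R_1(j,s)$ and $R_2(j,s)$ are the two regions produced by the split, and the optimal $c_1$, $c_2$ are simply the mean outputs of the samples falling in each region.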

CART can be used for both classification and regression; this article mainly discusses classification, i.e., the classification tree. The classification tree uses the Gini index to select the optimal feature. Suppose there are $K$ classes and the probability that a sample point belongs to class $k$ is $p_k$. The Gini index of the probability distribution is defined as

$$Gini(p) = \sum_{k=1}^{K} p_k(1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

For a binary classification problem, if the probability that a sample point belongs to the first class is $p$, the Gini index of the probability distribution is

$$Gini(p) = 2p(1 - p)$$
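For example, if $p = 0.5$ then $Gini(p) = 2 \times 0.5 \times 0.5 = 0.5$, the largest possible value for two classes (maximum impurity), whereas $p = 0.9$ gives $Gini(p) = 2 \times 0.9 \times 0.1 = 0.18$, a much purer distribution.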

For a given sample set $D$, the Gini index is

$$Gini(D) = 1 - \sum_{k=1}^{K}\left(\frac{|C_k|}{|D|}\right)^2$$

where $C_k$ is the subset of samples in $D$ that belong to class $k$, and $K$ is the number of classes.
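As a quick numerical illustration (with hypothetical counts, not necessarily those of the data set used later): if $D$ contains 14 samples, 9 labelled "yes" and 5 labelled "no", then $Gini(D) = 1 - (9/14)^2 - (5/14)^2 \approx 0.459$.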

The Python code is as follows:

"""
Function description: calculate the Gini index of a data set
Parameters:
    dataSet - the data set
Returns:
    Gini - the Gini index of the data set
"""
def calcGini(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    # Iterate over each instance and count the frequency of each class label
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    Gini = 1.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        Gini -= prob * prob  # subtract p_k^2; unlike entropy, no logarithm is involved
    return Gini
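A quick check of calcGini on a hand-made toy set (the rows below are illustrative and are not taken from the data set used later):

# Hypothetical toy set: the last column is the class label
toySet = [[1, 'yes'], [1, 'yes'], [0, 'no'], [1, 'no']]
print(calcGini(toySet))  # 1 - (2/4)**2 - (2/4)**2 = 0.5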

If the sample set $D$ is split into two parts $D_1$ and $D_2$ according to whether feature $A$ takes a certain value $a$, i.e.

$$D_1 = \{(x, y) \in D \mid A(x) = a\}, \qquad D_2 = D - D_1$$

then, under the condition of feature $A$, the Gini index of set $D$ is defined as

$$Gini(D, A) = \frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)$$

The Gini index $Gini(D)$ represents the uncertainty (impurity) of the set $D$, and the Gini index $Gini(D, A)$ represents the uncertainty of $D$ after it has been partitioned by the test $A = a$. The larger the Gini index, the greater the uncertainty of the sample set, which is similar to entropy.
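Continuing the hypothetical 14-sample illustration from above: if a test $A = a$ splits $D$ into $D_1$ with 5 samples (3 "yes" / 2 "no") and $D_2$ with 9 samples (6 "yes" / 3 "no"), then $Gini(D_1) = 1 - (3/5)^2 - (2/5)^2 = 0.48$, $Gini(D_2) = 1 - (6/9)^2 - (3/9)^2 \approx 0.444$, and $Gini(D, A) = \frac{5}{14} \times 0.48 + \frac{9}{14} \times 0.444 \approx 0.457$.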

The Python code is as follows:

"""
Function description: calculate the Gini index of the data set under a given feature
Parameters:
    dataSet - the data set
    feature - the index of the feature used for splitting
    value - the feature value used as the split point
Returns:
    Gini - the Gini index of the split data set
"""
def calcGiniWithFeat(dataSet, feature, value):
    D0 = []; D1 = []
    # Split the data set according to whether the feature equals the given value
    for featVec in dataSet:
        if featVec[feature] == value:
            D0.append(featVec)
        else:
            D1.append(featVec)
    # Weighted sum of the Gini indices of the two parts
    Gini = len(D0) / len(dataSet) * calcGini(D0) + len(D1) / len(dataSet) * calcGini(D1)
    return Gini
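Applying calcGiniWithFeat to the same hypothetical toy set, a split on feature 0 with value 1 gives:

# Hypothetical toy set: column 0 is the only feature, the last column is the class label
toySet = [[1, 'yes'], [1, 'yes'], [0, 'no'], [1, 'no']]
print(calcGiniWithFeat(toySet, 0, 1))  # 3/4 * (1 - (2/3)**2 - (1/3)**2) + 1/4 * 0 ≈ 0.333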

Algorithm (CART generation algorithm). Input: the training data set $D$ and the conditions for stopping the computation. Output: a CART decision tree. Starting from the root node, the following operations are performed recursively on each feature of the training data set to build a binary decision tree.

  • Step.1 Let D be the training data set at the current node, and compute the Gini index of this data set for each existing feature. For each feature A and each of its possible values a, split D into two parts D1 and D2 according to whether the test A = a answers "yes" or "no", and compute the Gini index of the split, Gini(D, A).
  • Step.2 Among all possible features A and all their possible split points a, select the feature with the smallest Gini index and its corresponding split point as the optimal feature and the optimal split point. Generate two child nodes from the current node according to the optimal feature and split point, and distribute the training data to the two child nodes by feature value.
  • Step.3 Recursively call Step.1 and Step.2 on the two child nodes until the stopping conditions are met (a sketch of typical stopping conditions is given after this list).
  • Step.4 The CART decision tree is generated.
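As an illustration only, the stopping conditions are commonly implemented along the following lines; the function and threshold names below are hypothetical and are not part of the code later in this article:

# A minimal sketch of common CART stopping conditions (hypothetical thresholds)
MIN_SAMPLES = 2      # stop if the node holds too few samples
MIN_GINI = 0.01      # stop if the node is already almost pure

def shouldStop(dataSet, numFeatures):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):   # all samples share one class
        return True
    if numFeatures == 0:                                   # no features left to split on
        return True
    if len(dataSet) < MIN_SAMPLES or calcGini(dataSet) < MIN_GINI:
        return True
    return False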

The Python code is shown below.

"""
Function description: choose the optimal feature according to the Gini index
Parameters:
    dataSet - the data set
Returns:
    bestFeat - the index of the optimal feature
"""
def chooseBestSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    bestGini = float('inf'); bestFeat = 0; newGini = 0
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        for splitVal in uniqueVals:
            newGini = calcGiniWithFeat(dataSet, i, splitVal)
            if newGini < bestGini:   # keep the split with the smallest Gini index
                bestFeat = i
                bestGini = newGini
    return bestFeat

"""
Function description: split the data set on a given feature
Parameters:
    dataSet - the data set to be split
    axis - the index of the feature used for splitting
    value - the feature value to keep
Returns:
    retDataSet - the split data set
"""
def splitDataSet(dataSet, axis, value):
    # Create the list that holds the returned data set
    retDataSet = []
    # Walk through the data set
    for featVec in dataSet:
        if featVec[axis] == value:
            # Remove the axis feature
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            # Add the rows that meet the criterion to the returned data set
            retDataSet.append(reducedFeatVec)
    # Return the split data set
    return retDataSet
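# Example (illustrative toy data, not the data set used later):
#   >>> toySet = [[1, 'yes'], [1, 'yes'], [0, 'no'], [1, 'no']]
#   >>> splitDataSet(toySet, 0, 1)
#   [['yes'], ['yes'], ['no']]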

import operator

"""
Function description: return the class label that appears most often in classList
Parameters:
    classList - list of class labels
Returns:
    sortedClassCount[0][0] - the most frequent class label
"""
def majorityCnt(classList):
    classCount = {}
    # Count the number of occurrences of each element in classList
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort by dictionary value in descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # Return the most common element in classList
    return sortedClassCount[0][0]

"""
Function description: build the decision tree
Parameters:
    dataSet - the training data set
    labels - the list of feature names (classification attribute labels)
    featLabels - list used to store the selected optimal feature names
Returns:
    myTree - the decision tree
"""
def createTree(dataSet, labels, featLabels):
    # Take the class labels (whether to buy a computer: yes or no)
    classList = [example[-1] for example in dataSet]
    # Stop splitting if all samples belong to the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # When all features have been used up, return the most common class label
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestSplit(dataSet)     # select the optimal feature
    bestFeatLabel = labels[bestFeat]        # name of the optimal feature
    featLabels.append(bestFeatLabel)
    myTree = {bestFeatLabel: {}}            # grow the tree from the optimal feature name
    del(labels[bestFeat])                   # remove the feature name that has been used
    # Get all values of the optimal feature that appear in the training set
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)            # remove duplicate values
    # Iterate over the feature values and build the subtrees recursively
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels, featLabels)
    return myTree

"""
Function description: classify a test sample with the trained decision tree
Parameters:
    inputTree - the trained decision tree
    featLabels - the stored optimal feature names, in the order they were selected
    testVec - list of test data, ordered to correspond to the optimal feature names
Returns:
    classLabel - the classification result
"""
def classify(inputTree, featLabels, testVec):
    firstStr = next(iter(inputTree))        # get the root node of the decision tree
    secondDict = inputTree[firstStr]        # the next-level dictionary
    featIndex = featLabels.index(firstStr)
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel
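To make the nested-dictionary representation concrete, here is how createTree and classify behave on a small hand-made data set (the rows, feature names, and expected output below are illustrative only):

# Hypothetical toy set; the last column is the class label
toySet = [['sunny', 'high', 'no'],
          ['sunny', 'normal', 'yes'],
          ['rainy', 'high', 'no'],
          ['rainy', 'normal', 'no']]
toyLabels = ['outlook', 'humidity']
toyFeatLabels = []
toyTree = createTree(toySet, toyLabels, toyFeatLabels)
print(toyTree)   # e.g. {'outlook': {'sunny': {'humidity': {'high': 'no', 'normal': 'yes'}}, 'rainy': 'no'}}
print(classify(toyTree, toyFeatLabels, ['sunny', 'normal']))   # 'yes'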

Its code structure is similar to that of ID3 and C4.5. The following test again uses the computer-purchase data.

import pandas as pd

# test
if __name__ == '__main__':
    ## Step 1: load data
    print("Step 1: load data...")

    df = pd.read_csv('data.csv')
    data = df.values[:-1, 1:].tolist()

    dataSet = data[:]
    label = df.columns.values[1:-1].tolist()
    labels = label[:]

    #print(dataSet)
    #print(labels)

    ## Step 2: training
    print("Step 2: training...")

    featLabels = []
    myTree = createTree(dataSet, labels, featLabels)
    #print(myTree)

    ## Step 3: testing
    print("Step 3: testing...")
    # Test data
    testVec = ['middle_aged', 'yes', 'excellent', 'low']

    print("Test Example: " + str(testVec))
    result = classify(myTree, featLabels, testVec)

    ## Step 4: show the result
    print("Step 4: show the result...")
    print("result: " + str(result))
    if result == 'yes':
        print("To buy")
    else:
        print("Don't buy")

The results are shown below.



(Console output of DT-CART-buys_computer_classifty_v1.0.py)

The decision tree can also be visualized. The visualization procedure is exactly the same as in the ID3 code, and the result is shown below.



(Decision tree visualization from DT-CART-buys_computer_classifty_v1.py)

Reference documentation: English: scikit-learn.org/stable/modu… Chinese: sklearn.apachecn.org/cn/0.19.0/m…
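For comparison, here is a minimal sketch of fitting the same kind of Gini-based classification tree with scikit-learn; the feature encoding below is made up for illustration, since the library expects numeric inputs:

# Minimal scikit-learn sketch (assumes scikit-learn is installed; the data below is hypothetical)
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, already label-encoded features, e.g. [age, student, credit_rating, income]
X = [[0, 0, 0, 2], [1, 0, 1, 1], [2, 1, 0, 0], [1, 1, 1, 0]]
y = ['no', 'yes', 'yes', 'no']

clf = DecisionTreeClassifier(criterion='gini')   # CART-style tree split by the Gini index
clf.fit(X, y)
print(clf.predict([[1, 1, 1, 0]]))               # predicted label for one encoded sample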

References:
[1] Zhou Zhihua. Machine Learning.
[2] Li Hang. Statistical Learning Methods.
[3] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
[4] J. R. Quinlan. Induction of Decision Trees. Machine Learning, 1:81-106, 1986.
[5] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[6] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2009.

For the complete code of this chapter, please refer to the attachment.