## 1. Introduction to the decision tree method

A decision tree is a basic classification and regression method. It is composed of nodes and directed edges; the nodes include a root node, internal nodes, and leaf nodes.

  • Rule-set property: the paths are mutually exclusive and exhaustive (each instance is covered by exactly one path, i.e., exactly one rule).

## 2. Related concepts, using loan-application data as an example

  • Concepts: entropy, empirical entropy, conditional entropy, empirical conditional entropy, information gain, split information, information gain ratio, Gini index

  • Calculation of information gain: the entropy of the parent node minus the weighted average of the entropies of the child nodes, computed as follows:
  1. The loan-application data are divided into two groups by "whether to lend": the first group ("lend") accounts for 9/15 of the samples and the second group ("do not lend") for 6/15. The entropy of the root node is therefore H(D) = -(9/15)log2(9/15) - (6/15)log2(6/15) ≈ 0.971.

2. Consider the "age" feature, which takes three values: youth, middle age, and old age. Among all samples, youth accounts for 5/15, middle age for 5/15, and old age for 5/15.

(1) Youth: among the youth samples, 2/5 fall in the first group ("lend") and 3/5 in the second group ("do not lend").
(2) Middle age: among the middle-aged samples, 3/5 fall in the first group and 2/5 in the second.
(3) Old age: among the elderly samples, 4/5 fall in the first group and 1/5 in the second.
The conditional entropy of "age" is the weighted average of the group entropies, (5/15)×0.971 + (5/15)×0.971 + (5/15)×0.722 ≈ 0.888, so the information gain of "age" = 0.971 - 0.888 = 0.083. The information gain of the other features (real estate, work, and credit) can be calculated in the same way, and the feature with the maximum information gain is chosen to split the decision tree (a short Python check of this calculation is given at the end of this section).

  • Stopping conditions for decision tree growth:

1. All samples at the current node belong to the same class.

2. The sample set at the current node is empty (the node becomes a leaf whose class is the majority class of its parent node).

3. The attribute set at the current node is empty, or all samples take the same value on every remaining attribute (the node becomes a leaf whose class is the majority class of the samples at this node).
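
The information-gain calculation in the loan example above can be checked with a few lines of Python. This is a minimal sketch using only the proportions listed earlier (9/15 vs. 6/15 at the root, and 2/5 / 3/5, 3/5 / 2/5, 4/5 / 1/5 within the three age groups); the function name is illustrative.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# loan data: 15 samples, 9 "lend" and 6 "do not lend"
parent = entropy([9, 6])                                    # ~0.971

# "age": youth / middle age / old age, 5 samples each;
# (lend, do not lend) counts within each group
groups = [[2, 3], [3, 2], [4, 1]]
n = sum(sum(g) for g in groups)                             # 15
conditional = sum(sum(g) / n * entropy(g) for g in groups)  # ~0.888

gain = parent - conditional                                 # ~0.083
print(round(parent, 3), round(conditional, 3), round(gain, 3))
```

Running it reproduces the values quoted above: parent entropy ≈ 0.971, conditional entropy ≈ 0.888, information gain ≈ 0.083.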

## 3. Methods of constructing a decision tree: ID3, C4.5, and CART

#### (1) ID3 algorithm

  • The core of the algorithm is to select features at the nodes of the decision tree according to the maximum-information-gain criterion. Starting from the root node, the information gain of every candidate feature is computed for that node, the feature with the maximum information gain is chosen as the node's splitting feature and child nodes are created, and the same procedure is then applied recursively to each child node.
  • Recursive stopping conditions: 1. All class labels are identical, in which case that label is returned directly. 2. All features have been used but the data still cannot be separated into groups containing a single class, i.e., tree construction cannot continue (not enough features or dimensions); in this case the majority class is returned. A minimal sketch of this recursion is given after the table below.
  • Decision tree pruning: to prevent overfitting, the decision tree can be pruned, either by pre-pruning or post-pruning. Pre-pruning sets rules that limit the growth of the tree, reducing the risk of overfitting and shortening training time, but it may cause underfitting. Post-pruning is a global optimization method: after the full tree is built, some subtrees are removed (a subtree can be cut if pruning it does not increase the error compared with keeping it); its effect is generally better.
  • Summary of advantages and disadvantages

| Advantages / disadvantages | Conclusion |
| --- | --- |
| Disadvantages | 1. No pruning strategy, so it overfits easily; 2. Information gain is biased toward attributes with many values; 3. Handles only discrete features and only classification tasks; 4. Missing values are not handled. |
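
As referenced above, here is a compact sketch of the ID3 recursion. It is not the exact code used later in this article, just an illustration that assumes the data are given as a list of dicts carrying a 'label' key; all names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feature, label_key='label'):
    """Information gain obtained by splitting `rows` on `feature`."""
    parent = entropy([r[label_key] for r in rows])
    cond = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[label_key] for r in rows if r[feature] == value]
        cond += len(subset) / len(rows) * entropy(subset)
    return parent - cond

def id3(rows, features, label_key='label'):
    labels = [r[label_key] for r in rows]
    # stop 1: all labels identical -> return that label
    if len(set(labels)) == 1:
        return labels[0]
    # stop 2: no features left -> return the majority class
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # choose the feature with maximum information gain and recurse
    best = max(features, key=lambda f: info_gain(rows, f, label_key))
    node = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        node[best][value] = id3(subset, [f for f in features if f != best], label_key)
    return node
```

The two early `return` branches correspond to the stopping conditions listed above: identical labels, and an exhausted feature set resolved by majority vote.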
#### (2) C4.5 algorithm
  • Algorithm introduction: the core is to select the splitting feature by the maximum information gain ratio, unlike ID3, which tends to select attributes with many attribute values as split attributes. C4.5 can handle both discrete and continuous attributes, as well as training data with missing attribute values.

  • Discretization of continuous features: when an attribute is continuous, it must be discretized. The values of attribute A are sorted in ascending order and split into two parts by a binary threshold (with n values there are n-1 candidate splits, each threshold being the midpoint of two adjacent values), and the threshold with the maximum information gain is chosen for the partition. A sketch of this procedure is given after the table below.
  • Summary of advantages and disadvantages

| Advantages / disadvantages | Conclusion |
| --- | --- |
| Advantages | 1. Overcomes ID3's bias toward attributes with many attribute values; 2. Handles both discrete and continuous attributes; 3. Supports pruning to prevent overfitting; 4. Handles data with missing attribute values. |
| Disadvantages | 1. Low computational efficiency, especially for samples with continuous attribute values. |
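
The threshold search for continuous attributes described above can be sketched as follows (illustrative names; information gain is used as the criterion, as stated in the text): candidate thresholds are the midpoints of adjacent sorted values, and the one with the largest gain is kept.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the binary split threshold for a continuous attribute.

    Candidate thresholds are the midpoints of adjacent sorted values
    (at most n - 1 of them); the one with the largest information gain
    is returned together with that gain."""
    pairs = sorted(zip(values, labels))
    parent = entropy(labels)
    best_gain, best_t = -1.0, None
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # equal neighbours do not yield a new threshold
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = parent - cond
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```
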
#### (3) CART algorithm
  • Introduction: CART is the typical binary decision tree. It can perform classification (discrete data, Gini-index minimization) or regression (continuous data, squared-error minimization), and continuous variables do not need to be discretized: when the split condition holds, the sample goes to the left branch, otherwise to the right. For classification, a predicted sample falls onto some leaf node and the majority class of the samples in that leaf is output. For regression, the predicted sample falls onto a leaf node and the mean value of all samples in that leaf is output.

  • The evaluation principle for the optimal binary split is as follows:

(1) Classification tree: for each value of attribute A, split the data set into two parts and compute the weighted Gini index; the value giving the minimum Gini index defines the optimal split for A. Then compute the optimal binary split for every attribute of the sample set S and select the one with the minimum Gini index as the optimal split for the whole set. (2) Regression tree: similar to the classification tree, except that the optimal binary split of attribute A is selected by minimum squared error, and the optimal split of the sample set S is then chosen in the same way. A sketch of the Gini-based split search is given below.
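
A minimal sketch of the Gini-based split search for a classification tree, under the "one attribute value vs. the rest" binarization described above; names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(rows, feature, label_key='label'):
    """Evaluate every 'this value vs. the rest' binary split of a
    discrete attribute and return the one with the smallest weighted
    Gini index."""
    best_score, best_value = float('inf'), None
    for value in {r[feature] for r in rows}:
        left = [r[label_key] for r in rows if r[feature] == value]
        right = [r[label_key] for r in rows if r[feature] != value]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best_score:
            best_score, best_value = score, value
    return best_value, best_score
```

For a regression tree, the same search would be run with the weighted squared error around each branch's mean in place of the Gini index.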

The advantages and disadvantages of decision trees

| Advantages / disadvantages | Conclusion |
| --- | --- |
| Advantages | 1. White-box model, easy to understand and interpret; 2. Requires relatively little data to build a model; 3. Can be used for both classification and regression. |
| Disadvantages | 1. Overfits easily, requiring tuning of several parameters; 2. Sensitive to the data, which can be mitigated with ensemble algorithms; 3. Construction is a greedy, local optimization and may not reach the global optimum. |
## 4. Python code implementation: Iris classification and Boston housing price forecast
  • Case 1: Iris classification
```python
from sklearn.datasets import load_iris
from sklearn import tree
import graphviz
import os

iris = load_iris()
x = iris.data
y = iris.target

# maximum depth of the tree = 3
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=10)
clf = clf.fit(x, y)

# plot the tree
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True, special_characters=True)
os.chdir('data\\graphviz-2.38\\release\\bin')
graph = graphviz.Source(dot_data)
graph.view()
```

```python
# data features and class labels
iris.feature_names
iris.target_names
```

Setting Chinese display: modify the plotting code as follows.

```python
with open('tree.dot', 'w', encoding='utf-8') as f:
    f = tree.export_graphviz(clf, out_file=f,
                             feature_names=['sepal length', 'sepal width', 'petal length', 'petal width'],
                             class_names=iris.target_names,
                             filled=True, rounded=True, special_characters=True)
```

Then, in the bin folder, change the font in the tree.dot file to fontname="Microsoft YaHei". Open the Anaconda Prompt, change into the bin folder, and run the command `dot -Tpng tree.dot -o iris.png` to save the result as an image (iris.png).

  • Case 2: Boston housing price forecast
```python
from sklearn.datasets import load_boston
from sklearn import tree
from sklearn.model_selection import train_test_split
import graphviz
import numpy as np

boston = load_boston()
x = boston.data
y = boston.target

# split the data set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

dtr = tree.DecisionTreeRegressor(random_state=1)
dtr.fit(x_train, y_train)
dtr.score(x_test, y_test)
```

  • Parameter tuning: Grid search cross validation
```python
from sklearn.model_selection import GridSearchCV

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# candidate min_impurity_decrease values between 0 and 0.01
gini_impure = np.linspace(0, 0.01, 10)
param_grid = {'min_impurity_decrease': gini_impure,
              'max_depth': range(2, 10),
              'min_samples_split': range(2, 50, 2)}

reg = GridSearchCV(tree.DecisionTreeRegressor(), param_grid=param_grid, cv=10)
# fit the data
reg.fit(x_train, y_train)
# best parameters and model score
print(reg.best_params_)
print(reg.score(x_test, y_test))
```

  • Save and read the model: two methods
```python
# method 1: joblib
from sklearn.externals import joblib
joblib.dump(reg, 'regtree.pkl')

# method 2: pickle
import pickle
with open('regtree.pk2', 'wb') as f:
    pickle.dump(clf, f)
with open('regtree.pk2', 'rb') as r:
    new_clf = pickle.load(r)
```