Organized according to the course, arranged to be easy to remember and understand

Decision trees in sklearn

The sklearn.tree module

The decision tree classes in sklearn all live under the "tree" module. This module contains five classes in total:

tree.DecisionTreeClassifier Classification tree
tree.DecisionTreeRegressor Regression tree
tree.export_graphviz Export the generated decision tree to DOT format for drawing
tree.ExtraTreeClassifier Highly random version of the classification tree
tree.ExtraTreeRegressor Highly random version of the regression tree
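All five live under the same module and are reached through one import. A minimal sketch just to confirm where they come from (it trains nothing):

from sklearn import tree

# the five members of sklearn.tree listed above
print(tree.DecisionTreeClassifier)   # classification tree
print(tree.DecisionTreeRegressor)    # regression tree
print(tree.export_graphviz)          # export a fitted tree to DOT format
print(tree.ExtraTreeClassifier)      # highly randomized classification tree
print(tree.ExtraTreeRegressor)       # highly randomized regression tree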

The basic modeling process in sklearn

from sklearn import tree             # import the required module
clf = tree.DecisionTreeClassifier()  # instantiate the model
clf = clf.fit(X_train, y_train)      # train the model on the training set
result = clf.score(X_test, y_test)   # evaluate on the test set (returns accuracy)

DecisionTreeClassifier with the wine data set

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

The important parameters

criterion
  • To turn a table of data into a tree, the decision tree needs to find the best nodes and the best way to branch. For classification trees, the indicator of this "best" is called "impurity". Generally speaking, the lower the impurity, the better the decision tree fits the training set. The core of every decision tree algorithm, whatever its branching method, essentially revolves around optimizing some impurity-related index.
  • Impurity is computed per node: every node in the tree has an impurity, and a child node's impurity must be lower than its parent's. In other words, within the same decision tree, the leaf nodes must have the lowest impurity.

Criterion is the parameter used to determine how impurity is calculated. Sklearn offers two options:

  • Input “entropy” to use information entropy.

  • Input “gini” to use the Gini coefficient.

Here $t$ denotes a given node, $i$ denotes any class label, and $p(i \mid t)$ denotes the proportion of samples with label $i$ at node $t$:

$$\mathrm{Entropy}(t) = -\sum_{i=0}^{c-1} p(i \mid t)\,\log_2 p(i \mid t)$$

$$\mathrm{Gini}(t) = 1 - \sum_{i=0}^{c-1} \bigl(p(i \mid t)\bigr)^2$$

Note that when using information entropy, sklearn actually computes the information gain based on it, that is, the difference between the information entropy of the parent node and that of the child nodes.

Compared with the Gini coefficient, information entropy is more sensitive to impurity and penalizes it most strongly. In practice, however, the two usually give very similar results. Information entropy is slower to compute than the Gini coefficient, because the Gini coefficient involves no logarithms. In addition, since information entropy is more sensitive to impurity, the decision tree grows more "finely" when it is used as the criterion, so for high-dimensional data or very noisy data, information entropy tends to overfit; in that case the Gini coefficient usually works better. When the model underfits, that is, when it performs poorly on both the training set and the test set, use information entropy. Of course, none of this is absolute.
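As a quick numerical illustration (this is not sklearn code, just a hand-rolled sketch of the two formulas above), the snippet below evaluates both measures for a perfectly mixed node and a fairly pure node; entropy moves over a wider range, which is what "more sensitive to impurity" means in practice:

import numpy as np

def entropy(p):
    # information entropy of a class-proportion vector; 0 * log2(0) is treated as 0
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    # Gini coefficient (Gini impurity) of a class-proportion vector
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

for p in ([0.5, 0.5], [0.9, 0.1]):
    print(p, "entropy =", round(entropy(p), 3), "gini =", round(gini(p), 3))
# [0.5, 0.5] entropy = 1.0    gini = 0.5
# [0.9, 0.1] entropy = 0.469  gini = 0.18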

Parameter: criterion
  • How does it affect the model? It determines how impurity is calculated, which is what lets the tree find the best nodes and branches; the lower the impurity, the better the decision tree fits the training set.
  • What inputs are possible? The default is “gini”, which uses the Gini coefficient; “entropy” uses information gain.
  • How do you choose? Usually just use the Gini coefficient. When the data is high-dimensional or very noisy, use the Gini coefficient. When the dimensionality is low and the data is clean, information entropy and the Gini coefficient make little difference. When the decision tree underfits, use information entropy. In any case, try both, and if one does not work well, switch to the other.

So far, the basic workflow of a decision tree can be summarized as follows: at each node, compute the impurity-related index for the candidate splits, choose the best one, branch, and repeat.

When no more features are available, or the overall impurity-related index can no longer be improved, the decision tree stops growing.
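To make that process concrete, here is a highly simplified, self-contained sketch of greedy tree growth. It is an illustration only, not how sklearn is implemented: it uses the Gini coefficient, binary "feature <= threshold" splits, a depth limit plus node purity as stopping rules, and it assumes integer class labels.

import numpy as np

def gini(y):
    # Gini impurity of an array of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # try every feature and every threshold; keep the split with the lowest
    # weighted impurity of the two resulting children
    best_j, best_t, best_imp = None, None, np.inf
    n = len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:      # candidate thresholds
            left = X[:, j] <= t
            imp = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
            if imp < best_imp:
                best_j, best_t, best_imp = j, t, imp
    return best_j, best_t

def grow(X, y, depth=0, max_depth=3):
    # stop when the node is pure, the depth limit is reached, or no split exists
    if gini(y) == 0.0 or depth >= max_depth:
        return {"leaf": int(np.bincount(y).argmax())}
    j, t = best_split(X, y)
    if j is None:
        return {"leaf": int(np.bincount(y).argmax())}
    left = X[:, j] <= t
    return {"feature": j, "threshold": float(t),
            "left": grow(X[left], y[left], depth + 1, max_depth),
            "right": grow(X[~left], y[~left], depth + 1, max_depth)}

# example usage with the wine data introduced below:
# from sklearn.datasets import load_wine
# wine = load_wine()
# print(grow(wine.data, wine.target))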

Build a tree

The code for this section can be viewed at this location:


Import the required libraries and modules
# import packages
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
Explore the data
# use the default sklearn dataset - the wine dataset
wine = load_wine()

# display all the attributes of the dataset (the whole object behaves like a dictionary)
wine

# keys: data (feature data), target (labels), target_names (label names), DESCR (description), feature_names (feature names)
wine.keys()

# shape of the data: a two-dimensional array with 178 samples and 13 features
wine.data.shape

# data display: how dictionaries are called
wine.data

# label
wine.target

# Beautify the above DataFrame data
import pandas as pd
df = pd.concat([pd.DataFrame(wine.data),pd.DataFrame(wine.target)],axis=1)

# you can pass a list of names of equal length, or rename a single column
df.columns = list(wine.feature_names) + ["label"]

# Data display
df

# List of feature names
wine.feature_names

# List of categories
wine.target_names
Split into a training set and a test set
# the data is passed in by calling wine.data and wine.target directly
# split the data into a training set and a test set; test_size=0.3 gives a 7:3 split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data,wine.target,test_size=0.3)

Xtrain.shape
Xtest.shape
Build a model
# the actual call takes only three steps
# Step 1: instantiate the decision tree model
# Step 2: feed in the training data and train with fit
# Step 3: use score to evaluate (since we have the true labels, we can compare the predicted labels against them to compute an accuracy)
clf = tree.DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(Xtrain, Ytrain)
score = clf.score(Xtest, Ytest) # return accuracy of prediction
score
# e.g. 0.925925... (about 92.6% accuracy; the exact value varies between runs)
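For reference, score here is just classification accuracy; the same number can be reproduced by predicting on the test set and comparing with the true labels yourself (a small check using the clf fitted above):

import numpy as np

# score is equivalent to the fraction of predicted labels that match the true labels
y_pred = clf.predict(Xtest)
print(np.mean(y_pred == Ytest))   # same value as clf.score(Xtest, Ytest)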
Draw a tree
feature_name = ["Alcohol", "Malic acid", "Ash", "Alkalinity of ash", "Magnesium", "Total phenols", "Flavonoids", "Non-flavanoid phenols", "Anthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"]

import graphviz
dot_data = tree.export_graphviz(clf
                             ,out_file = None
                             ,feature_names= feature_name
                             ,class_names=["Gin", "Sherry", "Vermouth"]
                             ,filled=True
                             ,rounded=True
)
graph = graphviz.Source(dot_data)
graph
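If you want to keep the figure as a file instead of only displaying it inline, the graphviz.Source object can also be rendered to disk (this assumes the Graphviz binaries are installed on the system; the file name here is just an example):

# writes the DOT source and a rendered PDF next to it, e.g. wine_tree.pdf
graph.render("wine_tree")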

Explore the decision tree
# feature importances: how much each attribute contributes to the decision tree's branching
clf.feature_importances_

# pair each feature name with its importance, i.e. how much it contributes to reducing impurity in the tree
[*zip(feature_name,clf.feature_importances_)]
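If a ranking is more convenient than the raw feature order, the same pairs can be sorted by importance (a small convenience on top of the code above, not part of the original notebook):

# sort the (feature name, importance) pairs, most important feature first
sorted(zip(feature_name, clf.feature_importances_), key=lambda x: x[1], reverse=True)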

We have now built a complete decision tree with only one parameter set explicitly. However, if we go back to the model-building step and rerun it, the score fluctuates around a certain value, and the tree drawn in the tree-drawing step comes out different each time. Why is it unstable? Would it still be unstable on other datasets?

As mentioned earlier, no matter how decision tree models evolve, branching is in essence the pursuit of optimizing some impurity-related index, and, as also mentioned, impurity is computed per node. In other words, when a decision tree is built, it pursues an optimal tree by optimizing one node at a time; but can locally optimal nodes guarantee a globally optimal tree? Ensemble methods were introduced to address this problem: since a single tree cannot be guaranteed to be optimal, sklearn builds several different trees and takes the best of them. How can different trees be built from the same data? At each branch, not all features are used; instead a subset of features is randomly selected, and among them the node with the best impurity-related index is chosen for branching. In this way the tree comes out different each time.
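To see this instability concretely, you can refit the same model several times without fixing random_state and watch the test score move around (this reuses the Xtrain/Xtest split from above; the exact numbers will differ from run to run):

# refit the same tree several times; the score fluctuates because, without a
# fixed random_state, the random choices made at each branch differ between runs
for i in range(5):
    clf = tree.DecisionTreeClassifier(criterion="entropy")
    clf = clf.fit(Xtrain, Ytrain)
    print(clf.score(Xtest, Ytest))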

Conclusion

The sources of randomness:

  • First layer of randomness: the randomness of the random seed (random_state);
  • Second layer of randomness: many trees are grown, and the best one is taken;
  • Third layer of randomness: the randomness of the features. A different subset of the features is considered at each branch, which guarantees that the trees differ.
clf = tree.DecisionTreeClassifier(criterion="entropy",random_state=30)
clf = clf.fit(Xtrain, Ytrain)
score = clf.score(Xtest, Ytest) # return accuracy of prediction

score