These notes are organized following the order of the course, so they are easy to remember and understand.


DecisionTreeClassifier with the wine data set

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)
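
Before going through the individual parameters, here is a minimal end-to-end sketch with all parameters left at their defaults; it uses the wine dataset, which the rest of this section also uses.

# A minimal sketch: DecisionTreeClassifier with default parameters on the wine dataset
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
Xtrain,Xtest,Ytrain,Ytest = train_test_split(wine.data,wine.target,test_size=0.3)

clf = tree.DecisionTreeClassifier()   # all parameters at their default values
clf.fit(Xtrain,Ytrain)
clf.score(Xtest,Ytest)                # mean accuracy on the test set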

The important parameters

random_state & splitter
  • random_state: the parameter used to set the randomness of branching. The default is None. The randomness is more apparent in high-dimensional data and hardly noticeable in low-dimensional data (such as the iris dataset). Passing in any integer will always grow the same tree and stabilize the model.

  • splitter: also used to control the random options in the decision tree, and it has two values. This is another way to prevent overfitting: if you expect your model to overfit, these two parameters can help you reduce the likelihood of overfitting once the tree is built. Of course, after the tree is built, we still use the pruning parameters to prevent overfitting.

    • best: the decision tree still branches randomly, but it gives priority to the more important features when branching (the importance can be viewed with the attribute feature_importances_)
    • random: the decision tree branches more randomly; the tree becomes deeper and larger because it takes in more unnecessary information, and the fit to the training set drops because of that unnecessary information.
# splitter also helps prevent over-fitting: a more random splitter makes the model simpler and may lower the score
clf = tree.DecisionTreeClassifier(criterion="gini",random_state=30,splitter="best")
Xtrain,Xtest,Ytrain,Ytest = train_test_split(wine.data,wine.target,test_size=0.3)
clf.fit(Xtrain,Ytrain)
score = clf.score(Xtest,Ytest)
score

clf = tree.DecisionTreeClassifier(criterion="gini",random_state=30,splitter="random")
Xtrain,Xtest,Ytrain,Ytest = train_test_split(wine.data,wine.target,test_size=0.3)
clf.fit(Xtrain,Ytrain)
score = clf.score(Xtest,Ytest)
score
Pruning parameters

Left unchecked, a decision tree will grow until the impurity measure is optimal or no more features are available. Such a tree tends to overfit, that is, it performs well on the training set and poorly on the test set. The sample data we collect can never be fully consistent with the overall population, so when a decision tree explains the training data too well, the rules it finds necessarily contain the noise in the training samples, and its fit to unknown data becomes insufficient.

To give the decision tree better generalization, we need to prune it. The pruning strategy has a great influence on the decision tree, and a correct pruning strategy is the core of the decision tree optimization algorithms. sklearn provides us with different pruning strategies:

max_depth

Limit the maximum depth of the tree and cut off any branches that exceed the set depth

This is the most widely used pruning parameter and is very effective with high dimensionality and small sample sizes. Each additional layer of the decision tree roughly doubles the demand for samples, so limiting the tree depth can effectively limit overfitting. It is also very useful in ensemble algorithms. In practice, it is recommended to start with max_depth=3, look at the fit, and then decide whether to increase the depth.
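
To see how much max_depth actually constrains the tree, one hedged sketch is to compare the depth and test score of an unrestricted tree with a restricted one, reusing the wine split from the example above. get_depth() is available in newer sklearn versions; older versions expose clf.tree_.max_depth instead.

# A minimal sketch: how deep does the tree grow with and without max_depth?
clf_full = tree.DecisionTreeClassifier(criterion="gini",random_state=30).fit(Xtrain,Ytrain)
clf_cut  = tree.DecisionTreeClassifier(criterion="gini",random_state=30,max_depth=3).fit(Xtrain,Ytrain)

print(clf_full.get_depth(), clf_full.score(Xtest,Ytest))   # depth of the unrestricted tree and its test score
print(clf_cut.get_depth(), clf_cut.score(Xtest,Ytest))     # depth is capped at 3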

min_samples_leaf & min_samples_split
  • min_samples_leaf: after a node is branched, each of its child nodes must contain at least min_samples_leaf training samples, otherwise the branching will not happen, or the branching will happen in a direction where each child node contains at least min_samples_leaf training samples.

    • Usually used together with max_depth, it has a magical effect in regression trees and can make the model smoother. Setting this parameter too low causes overfitting, and setting it too high prevents the model from learning the data. In general, it is recommended to start with min_samples_leaf=5. If the sample sizes contained in the leaf nodes vary greatly, it is recommended to enter a floating-point number as a percentage of the total sample size. At the same time, this parameter guarantees a minimum size for each leaf, which can avoid low-variance, over-fitted leaf nodes in regression problems. For classification problems with few categories, min_samples_leaf=1 is usually the best choice.
  • min_samples_split: a node must contain at least min_samples_split training samples before it is allowed to branch, otherwise the branching will not happen.

# max_depth is used together with min_samples_split
# We can clearly see that the node at the bottom containing 44 samples has no branches and is not expanded further
import graphviz

clf = tree.DecisionTreeClassifier(criterion="gini",random_state=30,max_depth=3,min_samples_split=15)
Xtrain,Xtest,Ytrain,Ytest = train_test_split(wine.data,wine.target,test_size=0.3)
clf.fit(Xtrain,Ytrain)
dot_data = tree.export_graphviz(
                                 clf
                                 ,feature_names=wine.feature_names
                                 ,class_names=["Gin","Wine","Maotai"]
                                 ,filled=True
                                 ,rounded=True
)
graph = graphviz.Source(dot_data)
graph

max_features & min_impurity_decrease

Together with max_depth, these two parameters are used to "fine-tune" the tree (a short sketch follows the list below).

  • max_features: limits the number of features considered when branching; features beyond the limit are discarded. Similar to max_depth, max_features is a pruning parameter used to limit overfitting on high-dimensional data, but its method is more brute-force: it directly limits the number of features that can be used and forces the decision tree to stop growing. Without knowing how important each feature is to the decision tree, imposing this parameter may lead to insufficient learning. If you want to avoid overfitting through dimensionality reduction, PCA, ICA, or the dimensionality-reduction algorithms in the feature-selection module are recommended instead.
  • min_impurity_decrease: limits the information gain; branches whose information gain is smaller than the set value will not happen. This parameter was introduced in version 0.19; before version 0.19, min_impurity_split was used instead.
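
A minimal sketch of "fine-tuning" with these two parameters, reusing the wine split from the earlier examples; the specific values 8 and 0.01 are only illustrative, not recommendations.

# A minimal sketch: limiting the features considered per split and the minimum information gain
clf = tree.DecisionTreeClassifier(criterion="gini"
                                  ,random_state=30
                                  ,max_depth=3
                                  ,max_features=8                # consider at most 8 of the 13 wine features at each split
                                  ,min_impurity_decrease=0.01    # only split if impurity decreases by at least 0.01
                                 )
clf.fit(Xtrain,Ytrain)
clf.score(Xtest,Ytest)
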
Target weight parameter
class_weight & min_weight_fraction_leaf
  • class_weight: the parameter used to balance the sample labels. Sample imbalance means that in a data set one class of labels naturally takes up a large proportion. For example, when a bank has to judge whether "a person with a credit card will default", the ratio of yes to no may be 1% vs 99%. Even if the model does nothing but predict "no", it is still right 99% of the time. Therefore, we use the class_weight parameter to balance the sample labels to some extent, giving more weight to the minority labels so that the model leans towards the minority classes and models in the direction of capturing them. The default is None, which means all labels in the dataset are given the same weight. A short sketch follows this list.
  • min_weight_fraction_leaf: once weights are attached, the sample size is no longer simply a count of records but is affected by the input weights. In that case, pruning should use the weight-based pruning parameter min_weight_fraction_leaf.
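
A hedged sketch of class_weight on the wine data (the wine classes are not strongly imbalanced, so this only shows the syntax): class_weight="balanced" weights each class inversely to its frequency, and an explicit dict such as {0: 1, 1: 10} is also accepted.

# A minimal sketch: giving minority classes more weight
clf = tree.DecisionTreeClassifier(criterion="gini",random_state=30,class_weight="balanced")
clf.fit(Xtrain,Ytrain)
clf.score(Xtest,Ytest)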

Learning curve (key point)

So how do we determine what value to fill in for each parameter? This is where we use the learning curve of the hyperparameter to judge, continuing to use the decision tree model clf that we have already trained. The learning curve of a hyperparameter is a curve with the value of the hyperparameter on the horizontal axis and the model's evaluation metric on the vertical axis; it is used to measure the model's performance under different hyperparameter values. In the decision tree we built, our evaluation metric is score.

Find the optimal hyperparameter max_depth
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
wine = load_wine()

Xtrain,Xtest,Ytrain,Ytest = train_test_split(wine.data,wine.target,test_size=0.3)

test = []
for i in range(10):
    clf = tree.DecisionTreeClassifier(criterion="gini",random_state=30,max_depth=i+1,min_samples_leaf=10)
    clf.fit(Xtrain,Ytrain)
    score = clf.score(Xtest,Ytest)
    test.append(score)

plt.plot(range(1,11),test,color="red",label="max_depth")
plt.legend()
plt.show()

Important properties and interfaces

Properties are the attributes of the model that can be queried after it has been trained. The most important one for decision trees is feature_importances_, which lets you see how important each feature is to the model.
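
For the wine example above, a quick way to read the importances is to pair them with the feature names; this sketch assumes clf has already been fitted.

# A minimal sketch: pair each feature name with its importance
[*zip(wine.feature_names,clf.feature_importances_)]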

Many algorithms in sklearn share similar interfaces; for example, fit and score are available for almost every algorithm. Besides these two, the most commonly used interfaces for decision trees are apply and predict.

  • apply: takes the test set as input and returns the index of the leaf node where each test sample ends up.
  • predict: takes the test set as input and returns the predicted label of each test sample.

It must be mentioned here that the X_train and X_test passed to any interface must be at least two-dimensional; sklearn does not accept a one-dimensional array as a feature matrix. If your data really has only one feature, use reshape(-1, 1) to add a dimension to it; if your data has only one feature and only one sample, use reshape(1, -1) to reshape it.
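
A small sketch of the reshaping, using plain NumPy; the array values are only for illustration.

import numpy as np

x = np.array([1,2,3,4,5])
x_as_feature = x.reshape(-1,1)   # one feature, five samples -> shape (5, 1)
x_as_sample  = x.reshape(1,-1)   # one sample, five features -> shape (1, 5)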

# returns the index of the predicted node
clf.apply(Xtest)
# array ([3, 6, 3, 10, 3, 3, 7, 10, 3, 10, 3, 10, 3, 3, 10, 10, 3, 10, 3, 3, 6, 3, 3, 10, 10, 6, 3, 6, 3, 3, 6, 9, 9, 10,10, 7, 4, 9, 9, 6, 10, 3, 7, 10, 4, 4, 10, 9, 3], dtype=int64)

# returns the predicted value of the category
clf.predict(Xtest)
# array([1, 2, 1, 0, 1, 1, 2, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 1, 1, 0, 0, 2, 1, 2, 1, 1, 2, 0, 0, 0, 0, 2, 1, 0, 0, 2, 0, 1, 2, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0])

Conclusion

  • Eight parameters: criterion, two randomness-related parameters (random_state, splitter), five pruning parameters (max_depth, min_samples_split, min_samples_leaf, max_features, min_impurity_decrease)
  • One property: feature_importances_
  • Four interfaces: fit, score, apply and predict