Decision Trees

Introduction to Decision Trees

A decision tree is a supervised learning method that makes predictions by applying a series of if-then-else decision rules learned from the data.

Its three elements are internal (branch) nodes, leaf nodes, and branch conditions. To reduce overfitting there is also one method: pruning. As a memory aid, this can be summarized as "one method and three elements".
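To make the if-then-else rules and the three elements concrete, here is a minimal hand-written sketch (not produced by any learning algorithm; the feature names and threshold values are chosen purely for illustration):

# A hand-written decision tree: each "if" is a branch condition at an internal
# node, each "return" is a leaf node.
def classify_flower(petal_length, petal_width):
    if petal_length < 2.45:       # branch condition at the root node
        return "setosa"           # leaf node
    if petal_width < 1.75:        # branch condition at an internal node
        return "versicolor"       # leaf node
    return "virginica"            # leaf node

print(classify_flower(1.4, 0.2))  # -> setosa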

Advantages of decision trees

  • Easy to understand and explain. The structure of the tree can be visualized (see the sketch after this list).
  • Requires little data preparation. Other machine learning techniques typically require data normalization, such as creating dummy variables and removing missing values; note, however, that this model does not support missing values.
  • The cost of using (predicting with) a trained tree is logarithmic in the number of data points used to train it.
  • Able to handle both numerical and categorical data. Other techniques are usually limited to analyzing datasets containing only one type of variable.
  • Able to handle multiple output problems.
  • Uses a white-box model. If a given situation is observable in the model, the explanation is easily stated in Boolean logic. By contrast, results from a black-box model are hard to interpret.
  • This model can be verified by numerical statistical tests. This makes it possible to verify the reliability of the model.
  • Performs well even if its assumptions are somewhat violated by the true model that generated the data.
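As a small illustration of the visualization and white-box points above, here is a minimal sketch that prints the learned rules as text (export_text requires scikit-learn >= 0.21; older versions can use tree.export_graphviz instead):

from sklearn import datasets, tree

iris = datasets.load_iris()
clf = tree.DecisionTreeClassifier(max_depth=2)
clf.fit(iris.data, iris.target)

# Print the learned if-then-else rules in plain text
print(tree.export_text(clf, feature_names=list(iris.feature_names)))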

Disadvantages of decision trees

  • Decision tree learners tend to produce overly complex models that generalize poorly to new data. This is called overfitting. Strategies such as pruning, setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are the most effective ways to avoid this problem.
  • Decision trees can be unstable, because small changes in the data can produce a completely different tree. This problem can be mitigated by using an ensemble of decision trees.
  • Learning an optimal decision tree is known to be NP-hard under several aspects of optimality, even for simple concepts. Practical decision tree learning algorithms are therefore based on heuristics, such as greedy algorithms that make locally optimal decisions at each node; such algorithms cannot guarantee a globally optimal tree. This can be mitigated by training multiple trees in an ensemble, typically on features and samples drawn at random with replacement (a minimal sketch follows this list).
  • Some concepts are hard for decision trees to learn because they are difficult to express with splits, for example XOR, parity, or multiplexer problems.
  • If some classes dominate the problem, the resulting decision tree will be biased. It is therefore recommended to balance the dataset before fitting.
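As mentioned in the list above, instability and the limits of greedy learning can be mitigated with ensembles. Below is a minimal sketch: a random forest trains many trees on bootstrap samples (drawn with replacement) and random feature subsets, and usually has lower variance than a single tree.

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Compare a single tree with an ensemble of 100 trees by 5-fold cross-validation
print("single tree:", cross_val_score(single_tree, iris.data, iris.target, cv=5).mean())
print("forest:     ", cross_val_score(forest, iris.data, iris.target, cv=5).mean())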

Classification

%matplotlib inline
from sklearn import tree
X = [[0.0], [1.1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

Predict the class of a sample

print(clf.predict([[2.2]]))
[1]

Show the predicted probability and log-probability of each class

print(clf.predict_proba([[2.2]]))
print(clf.predict_log_proba([[2.2]]))
[[1. 0. 0.]]
[[  0. -inf -inf]]
/home/fonttian/anaconda3/lib/python3.6/site-packages/sklearn/tree/tree.py:864: RuntimeWarning: divide by zero encountered in log
  return np.log(proba)

At the very beginning I proposed "one method and three elements". The three elements describe what a decision tree is made of, while the one method is pruning.

Because of how the algorithm works, a decision tree can fit the training set perfectly unless it contains two or more points with identical features but different class labels. It is exactly this property that makes a fully grown decision tree overfit severely.

Therefore, we need to cut back the excess branches to improve the tree's generalization ability. Although there are many ways to do this, they all revolve around the branches and leaves. The relevant hyperparameters in sklearn are:

  • max_depth: the maximum depth of the tree. Once the tree reaches this depth it stops splitting, no matter how many features could still be split on.
  • min_samples_split: the minimum number of samples a node must contain before it can be split. A node with fewer samples is not split further, and its label is the majority class among its samples.
  • min_samples_leaf: the minimum number of samples required at a leaf. If a split would create a leaf with fewer samples than this hyperparameter, the split is rolled back and not performed.
  • min_weight_fraction_leaf: the minimum weighted fraction of samples required at a leaf node.
  • max_leaf_nodes: the maximum number of leaf nodes; "None" means unlimited.
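Below is a minimal sketch of how these hyperparameters act as pre-pruning; the particular values (max_depth=3, min_samples_leaf=5) are chosen only for illustration:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown tree versus a tree constrained by the hyperparameters above
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X_train, y_train)

for name, clf in [("full", full_tree), ("pruned", pruned_tree)]:
    print(name, "depth:", clf.tree_.max_depth, "test score:", clf.score(X_test, y_test))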

Branching criteria

For decision trees there are three common splitting criteria. The first two are based on information entropy: ID3 (information gain) and C4.5 (information gain ratio); the third, the CART decision tree, is based on the Gini index.

However, since sklearn has so far only implemented the ID3 (entropy) and CART criteria, these are the two we can use for the time being. The splitting criterion is chosen with the hyperparameter criterion:

  • gini: the default, based on the Gini index (CART)
  • entropy: based on information entropy, i.e. our ID3
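To see what the two criteria actually measure, here is a minimal worked example computing both impurities by hand, using the standard definitions gini = 1 - sum(p_i^2) and entropy = -sum(p_i * log2(p_i)) (ID3/C4.5 split on gains derived from the entropy; CART splits on the Gini index):

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = [0, 0, 0, 1, 1, 2]          # a toy node containing three classes
print("gini   :", gini(labels))      # 1 - (1/2)^2 - (1/3)^2 - (1/6)^2 ≈ 0.611
print("entropy:", entropy(labels))   # ≈ 1.459 bits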

* Using the classification decision tree on the iris dataset *

* Load data *

from sklearn import datasets,model_selection
def load_data():
    iris=datasets.load_iris() # scikit-learn built-in IRIS dataset
    X_train=iris.data
    y_train=iris.target
    return model_selection.train_test_split(X_train, y_train,test_size=0.25,random_state=0,stratify=y_train)

* Testing with Gini (the CART decision tree) and information entropy (ID3); on this split of the iris dataset the two criteria reach the same test score *

* It also shows that a decision tree has no trouble fitting the data completely: a depth of at most about log2(n_samples) is needed. For the 150 samples of iris a max_depth of roughly 8 (log2(150) ≈ 7.2) is enough, but increasing max_depth further generally means reduced generalization ability *

from sklearn.tree import DecisionTreeClassifier

X_train,X_test,y_train,y_test=load_data()
criterions=['gini', 'entropy']
for criterion in criterions:
    clf = DecisionTreeClassifier(criterion=criterion)
    clf.fit(X_train, y_train)
    print(criterion,"Training score:%f"%(clf.score(X_train,y_train)))
    print(criterion,"Testing score:%f"%(clf.score(X_test,y_test)))
gini Training score:1.000000
gini Testing score:0.947368
entropy Training score:1.000000
entropy Testing score:0.947368

* Below, the training and testing score curves are drawn as max_depth varies; you can just as easily control the growth of the decision tree by modifying the other hyperparameters that govern branches and leaves *

import numpy as np
import matplotlib.pyplot as plt

maxdepth = 40

X_train,X_test,y_train,y_test=load_data()
depths=np.arange(1,maxdepth)
training_scores=[]
testing_scores=[]
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    training_scores.append(clf.score(X_train,y_train))
    testing_scores.append(clf.score(X_test,y_test))

# plot the training and testing curves
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
ax.plot(depths,training_scores,label="training score",marker='o')
ax.plot(depths,testing_scores,label="testing score",marker='*')
ax.set_xlabel("maxdepth")
ax.set_ylabel("score")
ax.set_title("Decision Tree Classification")
ax.legend(framealpha=0.5,loc='best')
plt.show()

References

  • Sklearn official documentation: decision tree