I. Decision tree

Decision tree learning is a common algorithm in machine learning. In a decision tree, the root node contains the complete set of samples. Each non-leaf node corresponds to a partitioning attribute that splits its samples into different child nodes, and each leaf node corresponds to a decision result. The path from the root node to a leaf node therefore corresponds to a sequence of decisions.

As an example, consider a decision tree whose partitioning attributes are age, student status and credit rating; a rough sketch of such a tree is given below in place of the original figure.
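The following is a hypothetical sketch of such a tree written as nested conditions; the attribute values and leaf labels are assumed for illustration and are not taken from the original figure.

# Hypothetical reconstruction of the decision tree described above.
# Each "if" is a non-leaf (partitioning) node; each return is a leaf (decision).
def predict(age, is_student, credit_rating):
    # Root node: partition on "age".
    if age == "youth":
        # Non-leaf node: partition on "student status".
        return "yes" if is_student else "no"
    elif age == "middle-aged":
        # Leaf node: decision result.
        return "yes"
    else:  # "senior"
        # Non-leaf node: partition on "credit rating".
        return "yes" if credit_rating == "excellent" else "no"

print(predict("youth", True, "fair"))  # follows one root-to-leaf decision path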

A partitioning attribute is generally a single feature of the sample or a combination of several features.

Decision tree learning is a form of supervised learning: starting from the root, the samples are recursively partitioned from top to bottom according to their attributes. The basic idea is to construct the tree so that entropy decreases as fast as possible, until the entropy at each leaf node is 0.

II. Information entropy

1. Definition

Information entropy is one of the most commonly used indicators for measuring the purity of a sample set. Suppose the current sample set D contains k classes of samples (the sample labels take k distinct values), and the proportion of class i samples in the set is p_i (i = 1, 2, ..., k). The information entropy of sample set D is then defined as

Ent(D) = -\sum_{i=1}^{k} p_i \log_2 p_i

(Ent stands for entropy.)

2. The relationship between information entropy and purity

From the formula of information entropy we can see that when the sample set contains only one class of samples (k = 1, p_1 = 1), the information entropy Ent(D) is 0. When the sample set contains two classes of samples, each with proportion 1/2, Ent(D) is 1. In general, the smaller the value of Ent(D), the higher the purity of sample set D.
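As a quick check of these two cases, here is a minimal sketch of the entropy computation in plain Python (using log base 2, matching the formula above):

import math

def entropy(proportions):
    # Ent(D) = -sum_i p_i * log2(p_i); terms with p_i = 0 contribute 0.
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([1.0]))       # a single class -> 0.0 (highest purity)
print(entropy([0.5, 0.5]))  # two equally likely classes -> 1.0
print(entropy([0.9, 0.1]))  # mostly one class -> about 0.47, purer than 50/50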

If we measure the decision tree by information entropy, we want to construct a tree in which the entropy of the samples decreases as fast as possible and the purity of the nodes keeps increasing. So how do we split the sample set, and how do we select the optimal partitioning attribute (feature) for a non-leaf node when splitting?

We therefore need to evaluate each feature of the samples and select the current best partitioning attribute as the splitting node of the tree, so that the resulting partition is as pure as possible.

The usual measurement methods are as follows:

  • Information gain (ID3)

  • Information gain rate (C4.5)

  • Gini coefficient (CART)

III. Information gain

1. Information entropy of a single branch

Suppose a discrete feature a of the samples has K possible values. If we use feature a as the splitting node to partition sample set D, K branch nodes will be generated.

Among these K branch nodes, the i-th branch node contains all the samples in D whose value of feature a is the i-th value. We denote the sample set of the i-th branch node by D^i.

From the entropy formula we can then compute the information entropy of D^i, namely Ent(D^i).

2. Weighted information entropy of all branches

Similarly, we can compute the entropy of the other branches. The weighted information entropy of all branch sample sets obtained by partitioning D on feature a is therefore

\sum_{i=1}^{K} \frac{|D^i|}{|D|} Ent(D^i)

Here |D| denotes the number of samples in D, and |D^i| denotes the number of samples in the sample set of the i-th branch.

So we now have the information entropy Ent(D) of sample set D before partitioning, and the weighted information entropy of all branch sample sets after partitioning on feature a.

3. Information gain

We define the "information gain" as the difference between the information entropy before and after the partition:

Gain(D, a) = Ent(D) - \sum_{i=1}^{K} \frac{|D^i|}{|D|} Ent(D^i)

where a is the feature used to partition the sample set.

Generally speaking, the larger the information gain is, the faster the information entropy decreases and the higher the purity of the branching sample set is.

Since we want the entropy of the decision tree to decrease as fast as possible, the optimal partitioning attribute is chosen as

a_* = \arg\max_{a_i} Gain(D, a_i)

where a_i is the i-th candidate feature used to partition sample set D.

However, when a feature has many branches (its values are very finely divided, such as a date), the purity of each branch's sample set tends to be very high, the weighted sum of the branch entropies becomes very small, and the information gain becomes very large. Information gain therefore has a preference for features with many values, which can sometimes lead to over-fitting.
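The following is a minimal sketch of computing information gain for one candidate feature, with the samples given as a list of feature values and a list of labels; the toy data are assumed purely for illustration. The last line also shows the bias described above: a feature with a unique value per sample (such as a date) gets the maximum possible gain.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    # Gain(D, a) = Ent(D) - sum_i |D^i|/|D| * Ent(D^i),
    # where D^i groups the samples by their value of feature a.
    n = len(labels)
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    weighted = sum(len(ys) / n * entropy(ys) for ys in branches.values())
    return entropy(labels) - weighted

age    = ["youth", "youth", "middle", "senior", "senior", "middle"]
labels = ["no",    "no",    "yes",    "yes",    "no",     "yes"]
print(information_gain(age, labels))                       # larger gain = faster entropy decrease
print(information_gain(list(range(len(labels))), labels))  # unique value per sample -> gain equals Ent(D)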

IV. Information gain rate (gain ratio)

1. Definition

To reduce the possible adverse effect of this preference of information gain, the "information gain rate" (gain ratio) can be used to select the optimal partitioning feature.

Using the same notation as above, the information gain rate is defined as

Gain\_ratio(D, a) = \frac{Gain(D, a)}{IV(a)}

The numerator is the information gain obtained when feature a is used as the splitting node. The denominator IV(a) is called the intrinsic value of feature a.

2. Intrinsic value of a feature

The intrinsic value of a feature is defined as

IV(a) = -\sum_{i=1}^{K} \frac{|D^i|}{|D|} \log_2 \frac{|D^i|}{|D|}

where K is the number of possible values of feature a, so that feature a splits sample set D into K branch nodes; |D| denotes the number of samples in D, and |D^i| denotes the number of samples in the sample set of the i-th branch. (I personally understand it as a measure of how finely the feature itself splits the data.)

When feature a has more possible values, its intrinsic value IV(a) is usually larger, which lowers the information gain rate. This alleviates, to some extent, the problem caused by information gain's preference for features with many branches. Note, however, that the information gain rate may in turn have a preference for attributes with fewer values.
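A minimal, self-contained sketch of the gain ratio, using the same toy data as before (assumed for illustration only):

import math
from collections import Counter

def _entropy_from_counts(counts, n):
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(values, labels):
    # Gain_ratio(D, a) = Gain(D, a) / IV(a)
    n = len(labels)
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    ent_d = _entropy_from_counts(Counter(labels).values(), n)
    weighted = sum(len(ys) / n * _entropy_from_counts(Counter(ys).values(), len(ys))
                   for ys in branches.values())
    gain = ent_d - weighted
    # IV(a) = -sum_i |D^i|/|D| * log2(|D^i|/|D|); it grows when a has many values.
    iv = _entropy_from_counts([len(ys) for ys in branches.values()], n)
    return gain / iv  # note: iv is 0 if the feature takes a single value

age    = ["youth", "youth", "middle", "senior", "senior", "middle"]
labels = ["no",    "no",    "yes",    "yes",    "no",     "yes"]
print(gain_ratio(age, labels))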

V. Gini value

1. Definition

Besides information entropy, the purity of sample set D can also be measured by Gini value.

Suppose the current sample set D contains k label classes and the proportion of class i samples in the set is p_i (i = 1, 2, ..., k). The Gini value of sample set D is then defined as

Gini(D) = \sum_{i=1}^{k} p_i (1 - p_i) = 1 - \sum_{i=1}^{k} p_i^2

As the formula shows, Gini(D) reflects the probability that two samples drawn at random from D carry inconsistent class labels (one drawn with probability p_i, the other from the remaining 1 - p_i).

Therefore, the smaller the Gini value Gini(D), the higher the purity of sample set D.

2. Weighted Gini values of all branches

Similarly, suppose a discrete feature a of the samples has K possible values. If we partition sample set D on feature a, K branch child nodes are generated.

The weighted Gini value of the branch sample sets obtained by partitioning on feature a is then

Gini\_index(D, a) = \sum_{i=1}^{K} \frac{|D^i|}{|D|} Gini(D^i)

where K is the number of possible values of feature a, so that feature a splits D into K branch nodes; |D| denotes the number of samples in D, and |D^i| denotes the number of samples in the sample set of the i-th branch.

To make the classification of the generated decision tree as pure as possible, the Gini-based method selects the optimal partitioning attribute as

a_* = \arg\min_{a_i} Gini\_index(D, a_i)

where a_i is the i-th candidate feature used to partition sample set D.
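A minimal sketch of the Gini value and of picking the feature with the smallest weighted Gini value; the two toy features are assumed for illustration only:

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum_i p_i^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(values, labels):
    # Gini_index(D, a) = sum_i |D^i|/|D| * Gini(D^i)
    n = len(labels)
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    return sum(len(ys) / n * gini(ys) for ys in branches.values())

features = {
    "age":     ["youth", "youth", "middle", "senior", "senior", "middle"],
    "student": [True,    False,   True,     True,     False,    False],
}
labels = ["no", "no", "yes", "yes", "no", "yes"]

# Choose the feature with the smallest weighted Gini value (argmin).
best = min(features, key=lambda a: gini_index(features[a], labels))
print(best, {a: round(gini_index(v, labels), 3) for a, v in features.items()})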

VI. Pruning the decision tree

1. The complete tree problem

A decision tree has good classification ability on the training samples. However, a decision tree that classifies the training samples completely (a complete tree) may not predict well on unseen test data: its generalization ability is poor and it is prone to over-fitting.

Complete tree: in decision tree T, the samples in the sample set of each leaf node all belong to the same class.

Over-fitting leads to an overly complex decision tree. To address this, pruning is used to simplify the generated tree: the branches below a splitting node are removed, and that node becomes a leaf node carrying a decision result (the decision can be taken as the majority label of the samples at that node). So how do we decide whether a splitting node needs pruning? (Pruning reduces the number of leaf nodes.)

2. Pruning algorithm

To do this, we design an evaluation function for classification error and perform pruning by minimizing the loss function of the whole decision tree.

Let decision tree T have n leaf nodes, and let the sample set D_t of the t-th leaf node contain N_t samples, which need not all have the same label. Suppose the samples in D_t cover k label classes, and the i-th class contains N_{ti} samples.

The empirical entropy of the sample set of the t-th leaf node is then defined as

H_t(T) = -\sum_{i=1}^{k} \frac{N_{ti}}{N_t} \log_2 \frac{N_{ti}}{N_t}

It can be seen that when the sample set of the t-th leaf node contains only one label class, its empirical entropy is 0.

The loss function of decision tree T is defined as

C_\alpha(T) = \sum_{t=1}^{n} N_t H_t(T) + \alpha n = C(T) + \alpha |T|

where n is the number of leaf nodes (also written |T|) and \alpha is an empirical parameter (\alpha \ge 0).

C(T) = \sum_{t=1}^{n} N_t H_t(T) represents the prediction error of the decision tree on the training data, and the parameter \alpha controls the trade-off between the two terms: the larger \alpha is, the simpler the model is required to be.

Pruning, then, is the process of choosing the model with the smallest loss function for a given \alpha. The finer the classification, the smaller the empirical entropy of each leaf node and hence the smaller C(T); but a finer classification also produces many leaf nodes and therefore a larger \alpha |T| (or \alpha n) term. The loss function reflects the balance between the two.

Tree generation only considers the purity of the partitioned sample sets and a better fit to the training data, whereas pruning reduces the overall complexity of the model.

Pruning algorithm: fix an empirical value of \alpha and, for each node with a partitioning attribute, evaluate the effect of pruning it. If the loss function of the decision tree decreases after pruning, remove that node's leaf nodes and turn the node itself into a new leaf, labelled with the majority class of the samples in its sample set. (Alternatively, build several candidate trees for different values of \alpha and select the optimal tree by cross-validation.) A sketch of this pruning check is given below.
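The following is a minimal sketch of the pruning check for a single splitting node, assuming the tree is summarized by the label sets of its leaves (the example labels and the value of alpha are assumed for illustration):

import math
from collections import Counter

def empirical_entropy(labels):
    # H_t(T) = -sum_i (N_ti / N_t) * log2(N_ti / N_t)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def loss(leaf_label_sets, alpha):
    # C_alpha(T) = sum_t N_t * H_t(T) + alpha * |T|
    c = sum(len(ls) * empirical_entropy(ls) for ls in leaf_label_sets)
    return c + alpha * len(leaf_label_sets)

# Two leaves under one splitting node before pruning...
before = [["yes", "yes", "no"], ["no", "no"]]
# ...and the single merged leaf that would replace them after pruning.
after = [["yes", "yes", "no", "no", "no"]]

alpha = 3.0  # a larger alpha favours simpler trees
if loss(after, alpha) <= loss(before, alpha):
    print("prune: turn the splitting node into a leaf (majority label = 'no')")
else:
    print("keep the split")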

VII. Random forest

1. Bagging

Bootstrap: a sampling method with replacement.

Bagging (bootstrap aggregation) strategy: draw n samples from the sample set with replacement, and build a classifier on these n samples using all of their features. Repeat these two steps m times to obtain m classifiers. Finally, run the test data through all m classifiers to obtain m classification results, and decide which class each sample belongs to by majority vote.
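As a sketch of the Bagging strategy using scikit-learn's BaggingClassifier (the dataset and parameter values are illustrative, not from the original article; in older scikit-learn versions the parameter is named base_estimator instead of estimator):

from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# m = 20 base classifiers, each trained on a bootstrap sample drawn with
# replacement; their predictions are aggregated (majority vote / averaged
# probabilities) to give the final class.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=8),
    n_estimators=20,
    bootstrap=True,
    random_state=8,
)
print(cross_val_score(bagging, X, y, cv=5).mean())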

2. Random forest

Random forest adopts the Bagging strategy with some modifications, introducing two sources of randomness:

  1. Select n samples from the training set using bootstrap sampling (random sampling with replacement).

  2. Suppose each sample has b features; randomly select only K of the b features as split candidates, choose the optimal partitioning feature among them by calculation at each node, and grow a decision tree. (Unlike plain Bagging, not all features are used, which helps avoid over-fitting features; the decision trees are not pruned.)

  3. Repeat the above two steps m times to build m decision trees.

  4. These m decision trees form the forest, and the forest's output class is determined by simple majority voting (or another voting mechanism). (For regression problems, the output can instead be the average of the individual trees' results.)

Random forests improve generalization ability, and the individual trees can be built in parallel.
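In scikit-learn the two sources of randomness correspond roughly to the bootstrap and max_features parameters of RandomForestClassifier; a minimal sketch with illustrative values:

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=20,      # m: the number of trees in the forest
    bootstrap=True,       # randomness 1: bootstrap sampling of the training samples
    max_features="sqrt",  # randomness 2: only K of the b features are candidates at each split
    random_state=8,
)
print(cross_val_score(forest, X, y, cv=5).mean())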

VIII. Code examples

Predicting handwritten digits (the digits dataset in sklearn) with a decision tree and a random forest:

from sklearn import datasets

digits = datasets.load_digits()
X = digits.data    # attribute (feature) matrix
y = digits.target  # label vector

from sklearn.model_selection import train_test_split

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=8)

from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

estimators = {}

# criterion: gini/entropy
estimators['tree'] = tree.DecisionTreeClassifier(criterion='gini', random_state=8)  # decision tree

# n_estimators: the number of trees
# bootstrap: whether to sample with replacement
# n_jobs: the number of jobs run in parallel
estimators['forest'] = RandomForestClassifier(n_estimators=20, criterion='gini', bootstrap=True, n_jobs=2, random_state=8)  # random forest

from sklearn.model_selection import cross_val_score
import datetime

for k in estimators.keys():
    start_time = datetime.datetime.now()
    print('----%s----' % k)
    estimators[k] = estimators[k].fit(X_train, y_train)
    pred = estimators[k].predict(X_test)
    print(pred[:10])
    print("%s Score: %0.2f" % (k, estimators[k].score(X_test, y_test)))
    scores = cross_val_score(estimators[k], X_train, y_train, scoring='accuracy', cv=10)
    print("%s Cross Avg. Score: %0.2f (+/- %0.2f)" % (k, scores.mean(), scores.std() * 2))
    end_time = datetime.datetime.now()
    time_spend = end_time - start_time
    print("%s Time: %0.2f" % (k, time_spend.total_seconds()))
----tree----
[6 2 1 50 9 0 1 8 7]
tree Score: 0.84
tree Cross Avg. Score: 0.84 (+/- 0.08)
tree Time: 0.22
----forest----
[6 21 1 50 9 0 11 6]
forest Score: 0.97
forest Cross Avg. Score: 0.96 (+/- 0.04)
forest Time: 3.87

It can be seen that the random forest performs noticeably better than the single decision tree, but constructing multiple trees takes more time.
