1. What is a decision tree

A decision tree is a supervised learning algorithm and a basic method for classification and regression. Decision trees can be divided into regression trees and classification trees; this article discusses classification trees. If you know or have studied data structures, you will be familiar with the concept of a "tree", which makes decision trees easier to learn and master. Here is a small example to help you understand what a decision tree is.

The flowchart shown in the figure below is a decision tree. Rectangles represent judgment modules and ovals represent terminating modules, which indicate that a conclusion has been reached and the process stops. The left and right arrows are branches, through which you reach either another judgment module or a terminating module. This flowchart is basically a hypothetical dating system, playing on the line popular on the internet, "Auntie, I don't want to work hard anymore". The tree first judges whether you want to keep working hard: if you don't, you choose to find a "rich woman"; if instead you want to find a girlfriend to strive with, the next judgment is based on the girl's personality: if you like a gentle personality, you choose a "gentle girl"; if you like a cool personality, you choose a "cool girl".

The entire decision tree can be viewed as a set of if-then rules, i.e. "if the condition holds, then ...". Pay attention to the following three points (a small code sketch follows the list):

  1. Each path from the root node to a leaf node forms a rule.
  2. The features of the internal nodes on a path correspond to the conditions of the rule, and the class label of the leaf node corresponds to the conclusion of the rule.
  3. Each instance is covered by one and only one path (rule), that is, the features of the instance are consistent with the features on that path.
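
To make the if-then view concrete, here is a minimal sketch of the dating flowchart above written as nested conditionals in Python; the function name, arguments, and returned labels are my own assumptions based on the description, not part of the original article:

# A minimal sketch (my own illustration) of the dating flowchart as if-then rules.
# The argument names and returned labels are assumptions based on the description above.
def dating_tree(keep_trying, personality):
    if not keep_trying:              # root judgment: do you want to keep trying?
        return "rich woman"          # leaf node: conclusion reached
    if personality == "gentle":      # second judgment: the girl's personality
        return "gentle girl"
    return "cool girl"

print(dating_tree(False, "gentle"))  # rich woman
print(dating_tree(True, "cool"))     # cool girl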

2. The process of building a decision tree

  1. Collect data: any method can be used, such as public data sources or web crawlers.
  2. Prepare data: the tree-construction algorithm only works with nominal data, so numerical data must be discretized.
  3. Analyze the data: any method can be used; after the tree is built, check whether the resulting graph matches expectations.
  4. Train the algorithm: construct the data structure of the tree.
  5. Test the algorithm: calculate the accuracy of the tree model.
  6. Use the algorithm: this step applies to any supervised learning algorithm; visualizing the decision tree helps us better understand the inherent meaning of the data.

The data used to construct a decision tree must be sufficient. A data set with too few features may lead to low accuracy; if there are too many features, failing to select among them will also hurt accuracy. Constructing a good decision tree can be roughly divided into three steps: feature selection, decision tree generation, and decision tree pruning.

3. Feature selection

Feature selection means deciding which feature of the data set is used to divide the feature space; the main goal is to select the features that have classification ability for the training data, which improves the efficiency of the decision tree. If splitting on a feature gives results that differ little from random splitting, the feature has no classification ability; generally speaking, removing such a feature has little influence on the accuracy of the decision tree.

The following table shows data about marine creatures. The data have two features, "no surfacing" and "flippers", and a class label, "fish", indicating whether the creature is a fish.

no surfacing    flippers    fish
yes             yes         yes
yes             yes         yes
yes             no          no
no              yes         no
no              yes         no

No matter how many features a data set has, only one feature can be used each time the data set is split. So which feature should be chosen as the reference attribute for the first split so that the data are classified faster? The answer must be the feature that is best at classifying, but how do we determine which feature that is? To answer this, a new concept, information gain, is introduced.

What is information gain? The change in information before and after splitting the data set is called the information gain. Once we know how to calculate it, we can compute the information gain obtained by splitting the data set on each feature; the feature with the highest information gain is the best choice.

3.1 Entropy (Shannon entropy)

Before calculating the information gain, we need to know the concept of "entropy". There is no need to dig too deeply into what entropy is; we just need to understand that entropy is defined as the expected value of information. In information theory and in probability and statistics, entropy measures the uncertainty of a random variable; in plain words, it describes how chaotic a system is.

Suppose a data set contains N samples and x_i denotes the i-th class. The information of the symbol x_i is defined as

l(x_i) = -log2 p(x_i)

where p(x_i) is the probability of choosing that class. By varying i, the information of every class in the data set can be obtained.

To calculate the entropy, we need the expected value (mathematical expectation) of the information over all possible values of all classes, given by the following formula:

H = -Σ_{i=1}^{n} p(x_i) log2 p(x_i)

The higher the entropy, the greater the uncertainty of the variable, that is, the more mixed the data are.
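
As a quick sanity check of this formula (an illustrative example of my own, not from the marine-creature data), compare a fair coin with a heavily biased one; the more even the distribution, the higher the entropy:

import numpy as np

# Entropy of a fair coin: -(0.5*log2(0.5) + 0.5*log2(0.5)) = 1.0 bit
p_fair = np.array([0.5, 0.5])
print((-p_fair * np.log2(p_fair)).sum())      # 1.0

# Entropy of a biased coin (90% / 10%): about 0.469 bits, i.e. less uncertainty
p_biased = np.array([0.9, 0.1])
print((-p_biased * np.log2(p_biased)).sum())  # ~0.469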

3.2 Calculating entropy

Before creating the data set, simplify the data by using 1 for "yes" and 0 for "no". The code for calculating entropy is as follows:

import pandas as pd
import numpy as np

# Create the data set
def createDataSet():
    DataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    # Convert the nested list into a DataFrame
    DataSet = pd.DataFrame(DataSet, columns=['no surfacing', 'flippers', 'fish'])
    return DataSet

# Calculate information entropy
def calculatent(DataSet):
    # Count how many samples belong to each class label
    fish_type = DataSet['fish'].value_counts()
    '''
    no     3
    yes    2
    '''
    # Number of samples
    m = DataSet.shape[0]
    # The probability of each category, i.e. p(xi)
    p = fish_type / m
    # Entropy: -sum(p * log2(p))
    ent = (-p * np.log2(p)).sum()
    return ent

'''
0.9709505944546686
'''



Finally, the returned entropy is 0.9709505944546686
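
For reference, a short usage sketch (the call sequence is my own addition, assuming the imports and functions defined above) that reproduces this value:

DataSet = createDataSet()
print(DataSet)               # the 5-sample marine-creature table
print(calculatent(DataSet))  # 0.9709505944546686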

In this data set, the class label "fish" takes two values, "yes" (a fish) and "no" (not a fish); the former accounts for 2/5 of the labels and the latter for 3/5. The core of this code is therefore to apply the entropy formula:

H = -(2/5) log2(2/5) - (3/5) log2(3/5) ≈ 0.9710

Once the entropy is obtained, the data set can be split according to the maximum information gain, which is calculated in the next section.

3.3 Information gain

So once you can compute entropy, how do you compute information gain? The information gain is the entropy of the parent node minus the weighted sum of the entropies of all its child nodes; the weighting is needed because the child nodes contain different proportions of the samples.

Take "no surfacing" as an example: among the 5 samples, 3 have the value "1" and 2 have the value "0", so the weights are 3/5 and 2/5 respectively.

For the three samples with value "1", the label "fish" is "yes" for two and "no" for one, so the class probabilities are 2/3 and 1/3. Similarly, for the two samples with value "0", the label "fish" is always "no", so the class probability is 1.

The information gain of this column is calculated as follows:

Gain(no surfacing) = H - [ (3/5) × H(value 1) + (2/5) × H(value 0) ]
                   = 0.9710 - [ (3/5) × 0.9183 + (2/5) × 0 ]
                   ≈ 0.4200

The information gain results of the two features are:

Gain(no surfacing) ≈ 0.4200
Gain(flippers)     = 0.9710 - [ (4/5) × 1.0 + (1/5) × 0 ] ≈ 0.1710

The purpose of calculating the information gain of each feature is to select the best feature for the current split, so there must be a comparison step that finds the maximum information gain and returns the index of that feature; this optimal feature can then be used to split the data set.
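
Before writing the general function, here is a short verification sketch (my own addition, reusing createDataSet and calculatent from above) that reproduces the two gains by hand:

DataSet = createDataSet()
all_ent = calculatent(DataSet)   # ~0.9710

# Information gain of "no surfacing" (weights 3/5 and 2/5)
gain0 = all_ent - (3/5 * calculatent(DataSet[DataSet['no surfacing'] == 1])
                   + 2/5 * calculatent(DataSet[DataSet['no surfacing'] == 0]))
print(gain0)   # ~0.4200

# Information gain of "flippers" (weights 4/5 and 1/5)
gain1 = all_ent - (4/5 * calculatent(DataSet[DataSet['flippers'] == 1])
                   + 1/5 * calculatent(DataSet[DataSet['flippers'] == 0]))
print(gain1)   # ~0.1710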

Therefore, two functions are constructed here: one selects the optimal feature, and the other splits the data set. The code is as follows:

# Choose the best (optimal) feature
def ChooseBF(DataSet):
    # Base information entropy of the whole data set
    all_ent = calculatent(DataSet)
    BestGain = 0  # initialize the best information gain
    col = -1      # initialize the index of the optimal feature
    for i in range(DataSet.shape[1] - 1):  # walk through all features
        # All values taken by column i
        index_ = DataSet.iloc[:, i].value_counts().index
        child_ent = 0  # initialize the weighted entropy of column i
        for j in index_:
            # Rows where column i equals value j
            child_DataSet = DataSet[DataSet.iloc[:, i] == j]
            # Information entropy of this subset
            ent = calculatent(child_DataSet)
            # Weighted sum of the subsets' entropies
            child_ent += (child_DataSet.shape[0] / DataSet.shape[0]) * ent
        print("The entropy of column %s is %s" % (i, child_ent))
        TheGain = all_ent - child_ent  # information gain of column i
        print("The information gain of column %s is %s" % (i, TheGain))
        # Keep the index of the feature with the largest gain
        if TheGain > BestGain:
            BestGain = TheGain
            col = i
    print("Optimal feature index is: %s" % col)
    return col


This function calls the information entropy function calculatent twice: once to obtain the base entropy, i.e. H of the whole data set in the formula above, and once to compute the entropy of each subset of a feature, i.e. H(value 1) and H(value 0). Running the code prints the entropy and information gain of each column. The returned optimal feature index is 0, i.e. the "no surfacing" column, and the information gains of the two columns match the handwritten calculation above, so the code works as expected.
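
For reference, a short usage sketch (the call itself is my own addition) of the function above:

DataSet = createDataSet()
best = ChooseBF(DataSet)
# Prints the entropy and information gain of each column,
# e.g. gain of column 0 ~ 0.4200 and column 1 ~ 0.1710, then the optimal index
print(best)   # 0, i.e. the "no surfacing" column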

Next, use the optimal feature to split the data set. The code is as follows:

# Split the data set
def splitSet(DataSet, col, value):
    # Name of the optimal feature column
    index = DataSet.iloc[:, col].name
    # Keep the rows with the given value and drop the optimal feature column
    resetdata = DataSet[DataSet.iloc[:, col] == value].drop(index, axis=1)
    return resetdata


This function takes three parameters: the data set to be split, the index of the optimal feature, and the value of that feature to keep.

'''
col = 0, value = 1
   flippers fish
0         1  yes
1         1  yes
2         0   no

col = 0, value = 0
   flippers fish
3         1   no
4         1   no
'''



The two DataFrames above are the results returned by the data-set splitting function. When value = 1, the three samples kept are those whose "no surfacing" value is 1; when value = 0, the two remaining samples are those whose "no surfacing" value is 0. Comparing the two results makes the role of the value parameter easy to understand.
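
The two results above can be reproduced with calls like the following (a usage sketch of my own, combining the functions defined earlier):

DataSet = createDataSet()
col = ChooseBF(DataSet)             # 0, the "no surfacing" column
print(splitSet(DataSet, col, 1))    # the three samples where "no surfacing" == 1
print(splitSet(DataSet, col, 0))    # the two samples where "no surfacing" == 0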

Summary

This article has introduced the general way of calculating entropy and information gain. Because the data set used here has very few features, there are only a few possible splits. When the data have more features, after the first split the data set is passed down the branches of the decision tree to the next node, where it can be split again; therefore recursion is needed to process the data set.

The following articles will introduce how to build a decision tree recursively and how to visualize it. After all, one of the great advantages of decision trees is that they can be visualized, which helps us better understand how the data are divided step by step.

You can get the source code for reference by replying "information gain" in the background of the public account [milk candy cat]. Thank you for reading.