This article was first published on my personal public account, TechFlow. Original writing is not easy; a follow is much appreciated.


Today, in the 21st article of our machine learning series, we’ll look at a new model — decision trees.

Definition of decision tree

The decision tree is my favorite machine learning model. It is intuitive, easy to understand, and closely tied to data structures. The learning threshold is also low: compared with models built on piles of formulas, it is much simpler.

We use decision trees all the time in our lives without realizing it. A decision tree is essentially a bunch of if-else rules combined together. For example, suppose we go to a stand to buy a watermelon. What does the fruit vendor do? Pick it up, roll it around, look at it, and pat it to judge whether it is sweet. If we strip away the motions and extract the core logic, it boils down to three things:

  1. The color of the watermelon's surface: brighter colors usually mean a sweeter melon
  2. The sound when you tap the watermelon: a crisp sound usually means it is sweeter
  3. Whether the watermelon still has its vine: a melon with its vine attached tends to be sweeter

The three are obviously not equally important: the sound of the tap matters most, then the vine, and finally the surface color. So when choosing a melon, we listen to the sound first, then check the vine, and finally look at the color. If we abstract this logic and organize it into a tree structure, we get a decision tree.
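To make the if-else nature concrete, here is a tiny sketch in Python. The branching below is made up for illustration (it simply follows the priority order described above) and is not taken from the figure:

def looks_sweet(crisp_sound, has_vine, shiny):
    # Most important test first: the sound of the tap
    if crisp_sound:
        # Next, check whether the vine is still attached
        if has_vine:
            return True
        # No vine: fall back on the surface color
        return shiny
    return False

Each call, such as looks_sweet(True, False, True), just walks down one branch of nested conditions, which is exactly what a decision tree does.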


What the decision tree does is essentially classify watermelons into sweet and not sweet. In other words, a decision tree is a tree-structured classifier, which is also its basic definition. The figure also reveals something else: in this problem, the features used by the decision tree are discrete values rather than continuous ones. That is, a decision tree can accept non-numerical features such as categories and identifiers, which models like logistic regression cannot handle directly.

If some details are still unclear, that is fine for now; at least we have an idea of what a decision tree looks like and how it works.

For each piece of data, classification is simply the process of traversing the decision tree. At each intermediate node a judgment is made, and the result decides which subtree to descend into. Each leaf node of the tree represents a class: when a data point reaches a leaf, the value of that leaf is its classification result.

Decision tree training

Now that we understand the structure of a decision tree and how it works, let's look at how it is trained.

Before we get to the principles, let's take a look at the data. Given the structure of the decision tree above, it is easy to see that the training data should look like the following table:

Classification    Is the sound crisp?    Does it have a vine?    Is it shiny?
Sweet             Yes                    Yes                     Yes
Sweet             Yes                    Yes                     No
Not sweet         No                     No                      Yes
Not sweet         No                     No                      No

So what do we want to achieve in the end? Obviously, the higher the accuracy the better. By the principle of the decision tree, each leaf node represents a class, so we want the data that ends up at each leaf to be as pure as possible. For example, if a leaf node is labeled "sweet", we want as much of the data routed there by the tree structure as possible to actually be sweet, and the proportion of not-sweet samples to be as low as possible.

How do we achieve this? When extracting a rule at the top of the tree, we need to choose a highly discriminative feature as the basis for the split. A discriminative feature is simply one that separates the data well. This is clearly a greedy strategy: it only guarantees the best possible split at the current level, not that the resulting model is optimal. In fact, generating the optimal decision tree is an NP-hard problem. Our approach guarantees a good-enough solution in a short time, but not the optimal one.

Back to the problem: we want to split the data using discriminative features, so we need to define and quantify "discriminative power" before we can make a choice; we cannot measure it by feel. Fortunately this is easy to solve: we just need to bring in the concept of information entropy again.

Information entropy and information gain

The term information entropy sounds confusing, and its English name, information entropy, is just as hard to grasp. Entropy itself is a concept from physics and thermodynamics that measures how disordered a system is: roughly speaking, the higher the entropy, the more scattered and disordered the state. For example, if we knock over a box of ping-pong balls in a room, the balls scatter to every corner of the room, and this process of growing disorder can be understood as entropy increasing.

Information entropy is similar: it measures how disordered a piece of information is. The higher the entropy, the more chaotic the information; the lower, the more organized. The concept comes from information theory, founded by the famous Shannon. But Shannon did not coin the word: it is said to have been suggested by von Neumann, the father of computing, precisely because nobody really knew what entropy meant anyway.

We explained this concept in detail earlier when we introduced cross entropy, so let's just review it briefly here. For an event X that occurs with probability P(X), the amount of information carried by the event itself is:

$$I(X) = -\log_2 P(X)$$
For example, if the probability of China winning the World Cup is 1/128, then encoding this event takes log2(128) = 7 bits, which shows it carries a lot of information. If Brazil has a one-in-four chance of winning, 2 bits are enough, and the event tells us relatively little. The same kind of event carries different amounts of information depending on its probability.

So information entropy is actually the expectation of the amount of information, that is, each event's information weighted by its probability:

$$H(X) = -\sum_{i} P(x_i)\log_2 P(x_i)$$
Similarly, suppose we have a dataset containing K classes of samples, and the proportion of samples in class k is p_k. If we treat this proportion as a probability, we can write the information entropy of the whole set as:

$$H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k$$
Once you understand information entropy, information gain is easy. In a nutshell, it is the change in information entropy before and after a split. Suppose we select a feature A and use it to split the dataset D into subsets D_1 and D_2. Then

$$g(D, A) = H(D) - \sum_{i} \frac{|D_i|}{|D|} H(D_i)$$

is called the information gain, i.e., the reduction in entropy brought about by the split.

From the definition of entropy we know that if the data becomes purer, its information entropy should decrease, and the more it decreases, the better the split. So we have found a way to measure the quality of a split: information gain. With it, the whole tree-building process follows naturally. Each time we need to split, we iterate over all remaining features (each value of a feature corresponds to one subtree), compute the information gain of splitting on each feature, and choose the feature with the maximum gain. Once this level of the tree is created, we continue building recursively until the split subsets become pure or all features have been used up.
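As a quick check, apply this to the four watermelons in the table above. The set contains 2 sweet and 2 not-sweet samples, so

$$H(D) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$$

Splitting on whether the sound is crisp puts the two sweet melons in one subset and the two not-sweet melons in the other; both subsets are pure, their entropy is 0, and the gain is 1 - 0 = 1. Splitting on whether the surface is shiny instead gives two subsets that are each half sweet, so the entropy after the split is still 1 and the gain is 0. The crisp-sound feature would therefore be chosen first.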

This algorithm is called ID3, and it is the most basic decision-tree construction algorithm. There is a small detail here: by the definition of ID3, each split selects a feature, not a single value of a feature, and every value of the selected feature gets its own subtree. As a result, each feature appears at most once along any path in the decision tree, because once it has been used, all of its values have been consumed.

For example, whether the tap sounds crisp is the feature we chose at the root, and we build one subtree for each of its two values, yes and no. It then makes no sense to split the data inside those subtrees on this feature again, because all the data in a subtree shares the same value. In addition, all features used by the ID3 algorithm must be discrete, because continuous values cannot be enumerated: if the weight of a watermelon were a feature, then in theory any rational number could be the weight of some watermelon, and we obviously cannot exhaust them all.

This point matters, both for implementing the decision tree correctly and for understanding other tree-building algorithms later.

Code implementation

After understanding the principle and flow of the algorithm, we come to the exciting coding part. To be honest, a decision tree is not hard to implement; it is simpler than the FP-growth algorithm from an earlier article, so don't stress.

First, let’s create experimental data:

import numpy as np
import math

def create_data():
    # Simulate 50 students' scores on three exams, each in [0, 100)
    X1 = np.random.rand(50, 1) * 100
    X2 = np.random.rand(50, 1) * 100
    X3 = np.random.rand(50, 1) * 100

    # Discretize a score into a grade: 2 if above 70, 1 if above 40, else 0
    def f(x):
        return 2 if x > 70 else 1 if x > 40 else 0

    # A student passes (Y = 1) if the total score is above 150
    y = X1 + X2 + X3
    Y = y > 150
    Y = Y + 0

    X1 = list(map(f, X1))
    X2 = list(map(f, X2))
    X3 = list(map(f, X3))

    x = np.c_[X1, X2, X3, Y]
    return x, ['courseA', 'courseB', 'courseC']

The data simulates 50 students taking three exams; a total score above 150 is required to pass. Since the ID3 algorithm only accepts discrete features, we first convert the continuous scores into discrete grades: scores above 70 become grade 2, scores between 40 and 70 become grade 1, and scores of 40 or below become grade 0.

To make the coding easier, we put the label Y after the features as the last column, and we also return the names of the three features for use when building the tree.

Let’s run the data to see the result:
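A minimal way to inspect it (the exact rows will differ on every run, since the scores are random) is something like:

dataset, feature_names = create_data()
print(feature_names)   # ['courseA', 'courseB', 'courseC']
print(dataset.shape)   # (50, 4): three discretized grades plus the pass/fail label
print(dataset[:5])     # a peek at the first few rows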


Next, we implement a function that computes the information entropy of a dataset. This function is also very simple: we just compute the proportion of each class and plug it into the information entropy formula.

from collections import Counter

def calculate_info_entropy(dataset):
    # Convert to a numpy array so the function also accepts the plain lists
    # of rows returned by split_dataset below
    dataset = np.array(dataset)
    n = len(dataset)
    # Use Counter to count how many samples fall into each class (the last column is Y)
    labels = Counter(dataset[:, -1])
    entropy = 0.0
    # Apply the information entropy formula
    for k, v in labels.items():
        prob = v / n
        entropy -= prob * math.log(prob, 2)
    return entropy

With the information entropy function in place, we next implement the split function, which divides the dataset according to the values of a chosen feature.

from collections import defaultdict

def split_dataset(dataset, idx):
    # idx is the index of the feature to split on
    splitData = defaultdict(list)
    for data in dataset:
        # Group each row by its value at column idx, and drop that column
        splitData[data[idx]].append(np.delete(data, idx))
    return list(splitData.values()), list(splitData.keys())

It is essentially a process of grouping the rows by feature value, so we can test it however we like:
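For example, splitting on the first feature (courseA) can be checked like this; the concrete grouping depends on the random data:

dataset, feature_names = create_data()
parts, values = split_dataset(dataset, 0)
for val, part in zip(values, parts):
    print(val, len(part))   # each grade of courseA and how many rows fall into that group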


As expected, the data is divided into several parts according to the values of the feature. Next, we implement the core feature-selection function, which chooses the feature with the maximum information gain to split the data.

def choose_feature_to_split(dataset):
    n = len(dataset[0]) - 1
    m = len(dataset)
    # Information entropy before the split
    entropy = calculate_info_entropy(dataset)
    bestGain = 0.0
    feature = -1
    for i in range(n):
        # Split according to feature i
        split_data, _ = split_dataset(dataset, i)
        new_entropy = 0.0
        # Weighted information entropy after the split
        for data in split_data:
            prob = len(data) / m
            new_entropy += prob * calculate_info_entropy(data)
        # Information gain = entropy before minus entropy after
        gain = entropy - new_entropy
        if gain > bestGain:
            bestGain = gain
            feature = i
    return feature

At this point, all of our tools and methods are ready, so now comes the exciting construction part. Building the tree is nothing more than creating each branch node by recursively calling the methods above. If you are familiar with tree data structures, you will find that it is built just like any other tree.

If we look at the code, it’s only about a dozen lines.

def create_decision_tree(dataset, feature_names):
    dataset = np.array(dataset)
    counter = Counter(dataset[:, -1])
    # If only one class is left in the dataset, return it directly
    if len(counter) == 1:
        return dataset[0, -1]

    # If all features have been used up, return the majority class
    if len(dataset[0]) == 1:
        return counter.most_common(1)[0][0]

    # Choose the best feature to split on
    fidx = choose_feature_to_split(dataset)
    fname = feature_names[fidx]

    node = {fname: {}}
    feature_names.remove(fname)

    # Recursive call: build a subtree for each value produced by the split
    split_data, vals = split_dataset(dataset, fidx)
    for data, val in zip(split_data, vals):
        node[fname][val] = create_decision_tree(data, feature_names[:])
    return node

When we run this code, we get a dict, where the hierarchy is the structure of a decision tree:
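For reference, one way to run it looks like this. Note that we pass a copy of the feature-name list, because create_decision_tree removes each feature it uses from the list it is given:

dataset, feature_names = create_data()
tree = create_decision_tree(dataset, feature_names[:])
print(tree)   # a nested dict: {feature_name: {feature_value: subtree or class label, ...}}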


It may not be clear to look at it this way, but we expand the dict to get the tree structure shown below:


Look at the node circled in red in the middle of the figure: it has only two branches, while all the other nodes have three. This is not a bug in the code; that case simply does not appear in the data, so the corresponding branch is missing. This is quite normal: when the training sample is small, it is likely that not all cases are covered, and branches like this will be missing.

The decision tree is now built, but we are not finished yet: one key piece is still missing, namely prediction. Once training is done we obviously want to use the model, so we need a prediction function. This function is also simple: it takes a piece of data and the tree structure we trained, and returns the classification result. It too is a recursive call:

def classify(node, feature_names, data):
    # Get the feature that the current node tests
    key = list(node.keys())[0]
    node = node[key]
    idx = feature_names.index(key)

    # Recurse according to the value of that feature
    pred = None
    for key in node:
        # Find the branch that matches the data's value
        if data[idx] == key:
            # If there is a subtree below, recurse; otherwise the branch holds the label
            if isinstance(node[key], dict):
                pred = classify(node[key], feature_names, data)
            else:
                pred = node[key]

    # If no branch matches, fall back to an arbitrary leaf and return its label
    if pred is None:
        for key in node:
            if not isinstance(node[key], dict):
                pred = node[key]
                break
    return pred

Let’s create some simple data to test:
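A minimal sanity check might look like the following. The test sample is made up for illustration, and its order must match ['courseA', 'courseB', 'courseC']:

dataset, feature_names = create_data()
tree = create_decision_tree(dataset, ['courseA', 'courseB', 'courseC'])
# Three grade-2 scores mean every exam was above 70, so the underlying rule labels this 1 (pass),
# and a reasonably trained tree should predict the same
test_data = [2, 2, 2]
print(classify(tree, ['courseA', 'courseB', 'courseC'], test_data))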



Basically what we expected, which means we’re done with the decision tree.

Conclusion

Although our decision tree works, it still has many shortcomings. For example, the current model can only accept discrete features and cannot split on continuous values. Also, each feature can be used at most once, while sometimes we would like to reuse a feature; ID3 cannot do that either. So we still need to introduce other optimizations.

We will discuss these optimizations, along with some characteristics of the decision tree model itself, in later articles. If you're interested, don't miss them.

That's all for today's article. Original writing is not easy; scan the QR code to follow me and get more articles.