This article was originally published on my personal public account: TechFlow. Original content is not easy to produce, so your follow means a lot.

Today's article is the 22nd in our machine learning series, and we continue the topic of decision trees.

In the last article, I introduced the simplest way to construct a decision tree, the ID3 algorithm, which selects one feature at a time to split the data: the feature has as many values as the split has branches, so the whole tree-building process is very simple. If you missed the previous article, you can review it through the portal below:

If you don’t already know decision trees, you should definitely check them out

Since we already have the ID3 algorithm to build decision trees, why do we need a new one? Obviously there must be some optimization or improvement, otherwise the new algorithm would be meaningless. So before learning the new algorithm, we need to understand what improvements it makes and why.

Generally speaking, improvements are driven by shortcomings, so let's first look at some of the problems with the ID3 algorithm.

The biggest problem, obviously, is that it cannot handle continuous features. The reason is simple: each time ID3 splits the data, it selects a specific feature rather than a feature value, and creates as many branches as the feature has distinct values. If we use a continuous feature, say the diameter of a watermelon, then in theory every watermelon has a different diameter, and feeding such data into ID3 produces as many branches as there are samples, which is obviously meaningless.

There is another, slightly deeper problem, and it concerns information gain. We use the difference in information entropy before and after a split as the information gain, and then choose the split that brings the maximum gain. The problem is that this leads the model to prefer features with more branches. In the extreme case, say a continuous feature where every branch ends up with only one sample, the entropy of each branch is zero and we get a very large information gain. This is not reasonable, because a feature with many branches does not necessarily split the data well, and is not necessarily advantageous overall.

To address these two problems, an improved scheme was proposed: the C4.5 algorithm. Strictly speaking, it is not an independent algorithm, but an improved version of ID3.

Let’s take a look at how the C4.5 algorithm solves these two problems.

Information gain ratio

First, let's look at the problem with information gain. As mentioned earlier, if we simply use information gain to select the feature to split on, it is easy to fall into the trap of preferring features with more values.

So we can adjust this a little: replace the information gain with the information gain ratio. The so-called information gain ratio is the information gain divided by the intrinsic information entropy of the split itself. For a feature with many branches, its own entropy will be very high, because many branches mean low purity. Dividing by it balances out the bias caused by heavy branching, so the model can make a more sensible choice.

Let's look at the formula. It's really simple:

Gain_ratio(D, A) = Gain(D, A) / IV(A)

Here D is our training sample set, A is the selected feature, and IV(A) is the intrinsic information entropy of the feature's value distribution (defined below).

Let's look at the formula for IV:

IV(A) = - Σ_{v=1..V} (|D^v| / |D|) * log2(|D^v| / |D|)

To explain the notation: feature A takes V distinct values, D^v is the subset of samples whose value of A is the v-th value, and |D^v| / |D| is the proportion of samples falling into that branch. In other words, this is simply the information entropy of feature A's own value distribution.
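As a quick illustration (the numbers here are my own, not from the original article): suppose feature A splits D into three branches containing 1/2, 1/4 and 1/4 of the samples, while feature B splits D into two equal halves. Then

IV(A) = -(1/2) log2(1/2) - (1/4) log2(1/4) - (1/4) log2(1/4) = 0.5 + 0.5 + 0.5 = 1.5
IV(B) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1.0

The feature with more branches gets the larger denominator, so its information gain is discounted more heavily, which is exactly the correction we want.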

Processing continuous values

The C4.5 algorithm also adds support for continuous values, and it does so in a very simple way. For the value set V of feature A, we select a threshold t and divide the samples into two parts: those with values less than t and those with values greater than or equal to t.

That is to say, C4.5 splits continuous values differently from discrete ones. For a discrete variable we create one branch per value, while for a continuous variable we always create exactly two branches. This design makes sense: in most cases, each sample's continuous value is different from the others, and we have no good way to decide how many parts a continuous value should be divided into, so a fixed binary split is the most intuitive choice and far better than not supporting continuous values at all.

In the extreme case, a continuous feature has as many distinct values as there are samples, so how do we choose the threshold? Even if we try every possible split point, there are n - 1 of them, which is a lot of work, especially when the sample size is large.

There is a solution to this problem too: sort the samples by feature value and only consider the true segmentation points, i.e. the places where the label actually changes. What does that mean? Let's look at some data:

Diameter    Sweet?
3           Sweet
4           Sweet
5           Not sweet
6           Not sweet
7           Sweet
8           Not sweet
9           Not sweet

This data is the result of sorting the watermelons by diameter. We can see that the training target (whether the melon is sweet) changes at only three places, namely at diameters 5, 7 and 8. We only need to consider these three candidate thresholds; all the others can be ignored.
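To make this concrete, here is a back-of-the-envelope calculation of my own (not from the original article). The overall entropy of the seven melons is

H(D) = -(3/7) log2(3/7) - (4/7) log2(4/7) ≈ 0.985

For the candidate threshold t = 5, the split is {3, 4} (all sweet) versus {5, 6, 7, 8, 9} (1 sweet, 4 not sweet), so

Gain       ≈ 0.985 - (2/7) * 0 - (5/7) * 0.722 ≈ 0.47
IV         ≈ -(2/7) log2(2/7) - (5/7) log2(5/7) ≈ 0.863
Gain ratio ≈ 0.47 / 0.863 ≈ 0.54

The same calculation gives roughly 0.02 for t = 7 and 0.34 for t = 8, so among the three candidates, t = 5 is the best split.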

Now let's combine these two points and add them to our previous ID3 implementation.

Code implementation

Talk is cheap: now that we understand the principle, we have to implement it ourselves to really understand it, and many of the pitfalls only become clear once you write the code. We can basically reuse the old code, but we need to make some changes.

First of all, we transform the part that constructs the data. We still use the same data as last time: the grades of three courses for each student and whether the student passes. A student passes if the total score of the three courses exceeds 150; a single course above 70 is level 2, a course between 40 and 70 is level 1, and a course below 40 is level 0. On this basis, we add the total score as a feature, with some noise added so that the model cannot read off the answer directly.

import numpy as np
import math

def create_data():
    X1 = np.random.rand(50, 1) * 100
    X2 = np.random.rand(50, 1) * 100
    X3 = np.random.rand(50, 1) * 100

    def f(x):
        return 2 if x > 70 else 1 if x > 40 else 0

    # The student's total score is a feature; to keep things fair, a certain amount of noise is added
    X4 = X1 + X2 + X3 + np.random.rand(50, 1) * 20

    y = X1 + X2 + X3
    Y = y > 150
    Y = Y + 0

    # Map the raw course scores to discrete levels 0/1/2
    r = map(f, X1)
    X1 = list(r)

    r = map(f, X2)
    X2 = list(r)

    r = map(f, X3)
    X3 = list(r)

    x = np.c_[X1, X2, X3, X4, Y]
    return x, ['courseA', 'courseB', 'courseC', 'score']
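As a quick sanity check of what this returns (a small sketch of my own; the exact numbers will vary because of the random noise):

X, feature_names = create_data()
print(X.shape)        # (50, 5): three course levels, the noisy total score, and the label
print(X[0])           # e.g. [  2.   1.   0.  153.2   1.], the last column is the label
print(feature_names)  # ['courseA', 'courseB', 'courseC', 'score']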

Since we need the information gain ratio, we develop a dedicated function to compute it. Because the data this time contains continuous features, we pass an extra threshold parameter to indicate whether the feature is continuous: for a discrete feature the threshold is None, otherwise it is a concrete value.

def info_gain(dataset, idx, thred=None):
    # Calculate the base information entropy before the split
    entropy = calculate_info_entropy(dataset)
    m = len(dataset)
    # Split the data by the feature (or by the threshold, for continuous features)
    split_data, _ = split_dataset(dataset, idx, thred)
    new_entropy = 0.0
    # Calculate the information entropy after splitting
    for data in split_data:
        prob = len(data) / m
        # weighted sum of the entropy of each branch
        new_entropy += prob * calculate_info_entropy(data)
    return entropy - new_entropy


def info_gain_ratio(dataset, idx, thred=None):
    # If the threshold is not None, divide the data by the threshold,
    # otherwise divide according to the feature values
    split_data, _ = split_dataset(dataset, idx, thred)
    base_entropy = 1e-5
    m = len(dataset)
    # Calculate the information entropy of the feature itself (IV); the 1e-5 avoids division by zero
    for data in split_data:
        prob = len(data) / m
        base_entropy -= prob * math.log(prob, 2)
    # The threshold is passed through so that the gain and IV are computed on the same split
    return info_gain(dataset, idx, thred) / base_entropy, thred
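The code above relies on calculate_info_entropy and split_dataset from the previous ID3 article. split_dataset is modified below; calculate_info_entropy is reused as-is, and for completeness here is a minimal sketch of what it needs to do (my own reconstruction, assuming the label is the last column of each row, not necessarily the original implementation):

import math
from collections import Counter

def calculate_info_entropy(dataset):
    # Entropy of the label distribution; the label is assumed to be the last column
    m = len(dataset)
    if m == 0:
        return 0.0
    counter = Counter(row[-1] for row in dataset)
    entropy = 0.0
    for cnt in counter.values():
        prob = cnt / m
        entropy -= prob * math.log(prob, 2)
    return entropy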

The split_dataset function also needs to be modified, because we have added a threshold-based way of splitting. We check whether the threshold is None to decide whether to split by feature value or by threshold.

from collections import defaultdict

def split_dataset(dataset, idx, thread=None):
    splitData = defaultdict(list)
    # If the threshold is None, partition according to the feature values
    if thread is None:
        for data in dataset:
            splitData[data[idx]].append(np.delete(data, idx))
        return list(splitData.values()), list(splitData.keys())
    else:
        # Otherwise split into two groups: less than the threshold, and greater than or equal to it
        for data in dataset:
            splitData[data[idx] < thread].append(np.delete(data, idx))
        return list(splitData.values()), list(splitData.keys())
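A tiny usage sketch with toy data of my own (not from the original article), just to show the two modes:

toy = np.array([[0, 1.2, 1],
                [1, 3.5, 0],
                [0, 2.8, 1]])

groups, keys = split_dataset(toy, 0)        # discrete: one group per distinct value of column 0
groups, keys = split_dataset(toy, 1, 2.0)   # continuous: two groups, column 1 < 2.0 and >= 2.0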

As mentioned above, we do not need to try every value when selecting a threshold, because some values do not change the label distribution and can be ignored. So we need a function that returns all the candidate thresholds. This is also very simple: we sort by feature value, then iterate and record every feature value at which the label changes:

def get_thresholds(X, idx):
    # numpy multi-dimensional indexing: take the feature column and the label column
    new_data = X[:, [idx, -1]].tolist()
    # Sort by feature value
    new_data = sorted(new_data, key=lambda x: x[0], reverse=True)
    base = new_data[0][1]
    threads = []

    for i in range(1, len(new_data)):
        f, l = new_data[i]
        # Record the feature value whenever the label changes
        if l != base:
            base = l
            threads.append(f)

    return threads

With these methods in place, we need a function that selects the split, i.e. computes the information gain ratio of every feature (and, for continuous features, every candidate threshold) and picks the combination with the maximum gain ratio. Since we have already developed the splitting and threshold functions, finding the best split point is easy: we basically reuse the code above and search over all the possibilities:

def choose_feature_to_split(dataset):
    n = len(dataset[0]) - 1
    m = len(dataset)
    # Record the best gain ratio, the feature, and the threshold
    bestGain = 0.0
    feature = -1
    thred = None
    for i in range(n):
        # By default, a feature whose values are whole numbers is treated as discrete, not continuous.
        # This is just my personal judgment logic, feel free to DIY
        if not dataset[0][i].is_integer():
            threds = get_thresholds(dataset, i)
            for t in threds:
                # Traverse all thresholds and calculate the information gain ratio of each one
                ratio, th = info_gain_ratio(dataset, i, t)
                if ratio > bestGain:
                    bestGain, feature, thred = ratio, i, t
        else:
            # Otherwise follow the normal discrete-feature splitting logic and calculate the gain ratio
            ratio, _ = info_gain_ratio(dataset, i)
            if ratio > bestGain:
                bestGain = ratio
                feature, thred = i, None
    return feature, thred
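For clarity, here is a small illustration of that discrete-vs-continuous check on a toy row of my own (not from the original article): the course levels are whole numbers, while the noisy total score almost never is.

row = np.array([2.0, 1.0, 0.0, 187.35, 1.0])
print(row[0].is_integer())   # True  -> treated as a discrete feature
print(row[3].is_integer())   # False -> treated as a continuous feature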

At this point, the basic building blocks are done, leaving only two methods: tree construction and prediction. Both differ only slightly from the previous version. Let's start with building the tree, where the only difference is that the threshold is additionally stored in the node's dict: if it is None, the node splits on a discrete feature; if not, it splits on a continuous one.

from collections import Counter

def create_decision_tree(dataset, feature_names):
    dataset = np.array(dataset)
    # If all samples belong to the same class, return that class directly
    counter = Counter(dataset[:, -1])
    if len(counter) == 1:
        return dataset[0, -1]

    # If only the label column is left, return the majority class
    if len(dataset[0]) == 1:
        return counter.most_common(1)[0][0]

    # Record the feature and threshold of the optimal split
    fidx, th = choose_feature_to_split(dataset)
    fname = feature_names[fidx]

    node = {fname: {'threshold': th}}
    feature_names.remove(fname)

    split_data, vals = split_dataset(dataset, fidx, th)
    for data, val in zip(split_data, vals):
        node[fname][val] = create_decision_tree(data, feature_names[:])
    return node
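A minimal sketch of building the tree (my own usage example; note that create_decision_tree mutates the feature name list, so we pass a copy and keep the original around for prediction):

X, feature_names = create_data()
tree = create_decision_tree(X, feature_names[:])
print(tree)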

Finally, the prediction function follows the same logic as before, except for an extra check on whether the threshold is None, so it should be very simple:

def classify(node, feature_names, data):
    key = list(node.keys())[0]
    node = node[key]
    idx = feature_names.index(key)

    pred = None
    thred = node['threshold']
    # If the threshold is None, traverse the dict directly to find the matching branch
    if thred is None:
        for key in node:
            if key != 'threshold' and data[idx] == key:
                if isinstance(node[key], dict):
                    pred = classify(node[key], feature_names, data)
                else:
                    pred = node[key]
    else:
        # Otherwise pick the branch by comparing against the threshold
        if isinstance(node[data[idx] < thred], dict):
            pred = classify(node[data[idx] < thred], feature_names, data)
        else:
            pred = node[data[idx] < thred]

    # If pred is still empty, pick some leaf node as a fallback
    if pred is None:
        for key in node:
            if key != 'threshold' and not isinstance(node[key], dict):
                pred = node[key]
                break
    return pred
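To close the loop, a rough training-set check of my own using the tree built above (not part of the original article; the accuracy will vary from run to run because of the random data):

correct = 0
for row in X:
    if classify(tree, feature_names, row[:-1]) == row[-1]:
        correct += 1
print('training accuracy:', correct / len(X))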

Conclusion

By now, the whole C4.5 decision tree has been implemented. Overall, with the addition of the information gain ratio and the logic for continuous features, the code is more complex than before, but the basic logic and structure follow the same lines, with no major changes.

The principle of a decision tree is simple enough, but there are many details you only notice once you implement it yourself: how to find the candidate thresholds of a continuous feature, how to tell continuous and discrete features apart in the code, and so on. Although we rarely use a plain decision tree model directly these days, it is the foundation of many more advanced models and very helpful for learning and progressing, so if you have time, I recommend giving it a try yourself.

That's all for today's article. Original content is not easy to produce; your follow and a simple tap of support mean a lot to me.