This article originally appeared on my personal public account, TechFlow. Original writing is not easy; your follow is appreciated.
This is the 24th article in the machine learning series; today we'll talk about regression tree models.
In fact, the so-called regression tree model simply uses a tree model to solve a regression problem. The most classic tree model is the decision tree, which is also the basis of almost all tree models. Although the basic structure is a decision tree, there are two variants depending on how predictions are made. In the first, each leaf corresponds to a single predicted value, just like in a classification tree; this is called a regression tree. In the second, each leaf corresponds to a linear model that produces the final result; this is called a model tree.
Today we will look at the first of the two: the regression tree.
Regression tree model
The core algorithm of the regression tree model, that is, the algorithm used to build the decision tree, is the CART algorithm we discussed in the previous article. If you are unfamiliar with it or missed that article, you can review it through the link below:
Machine learning — Decision tree CART algorithm, one of the top ten data mining algorithms
The essence of the CART algorithm is that every split is binary: each time we pick a feature, the dataset is split into exactly two parts, regardless of whether the feature is discrete or continuous. Another characteristic of CART is that it uses the Gini index, rather than information gain or the information gain ratio, to choose the split feature, but this does not apply to regression problems. Since the loss function of a regression problem is the mean squared error rather than cross-entropy, entropy is a poor measure of accuracy for continuous values.
In a classification tree, a leaf node corresponds to a predicted category, namely the mode of the categories of the training samples that fall into that leaf, i.e. the most frequent category. In a regression tree, a leaf node corresponds to a continuous value: the mean of the training samples that fall into that node, and its error is the mean squared error of those samples against that mean.
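As a tiny illustration of a regression-tree leaf (the target values below are made up), the leaf's prediction is the mean of the samples that reach it, and its error is their squared deviation from that mean:

import numpy as np

# Hypothetical target values of the training samples falling into one leaf
leaf_targets = np.array([21.0, 23.5, 22.0, 24.5])
prediction = leaf_targets.mean()                   # the leaf's predicted value
error = np.sum((leaf_targets - prediction) ** 2)   # its squared error on these samples
print(prediction, error)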
In addition, in the classification tree we optimized the choice of split thresholds for a feature by considering only those thresholds at which the predicted category changes. In regression problems this optimization no longer exists, because the predicted value is a floating-point number and essentially every sample value is a candidate threshold. On the whole, a regression tree is actually easier to implement than a classification tree.
Hands-on implementation
Let's start by loading the data; this time we use the classic Boston house price dataset from the scikit-learn library. For house price prediction, Kaggle has a similar competition called House Prices: Advanced Regression Techniques. It has more features and messier data, though, which calls for a lot of feature engineering. If you're interested, you can explore it on your own.
First, let's get the data. Since the sklearn library already ships with this dataset, we can call the API directly. It's very simple:
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
Let's print the first few rows of the data to take a look:
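One way to take that look, using the boston object and the pandas import from above (the PRICE column name for the target is my own choice):

# Wrap the features and target in a DataFrame for a quick peek
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['PRICE'] = boston.target
print(df.head())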
The quality of this data is very high: the sklearn library has already done the data cleaning and feature engineering for us, so it can be used directly. To make it easier to pass the data around, we merge X and y together. Since y is a 1-D array, it cannot be merged with the 2-D X directly, so we need to reshape y before merging:
y = y.reshape(-1, 1)
X = np.hstack((X, y))
The hstack function concatenates two NumPy arrays horizontally; the corresponding vstack function concatenates two arrays vertically, which is also a common operation. After the merge, y is appended to X as a new column. With the data ready, it's time to implement the model.
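A quick illustration of the two functions on tiny made-up arrays (using the numpy import from above):

a = np.array([[1, 2], [3, 4]])
b = np.array([[5], [6]])
print(np.hstack((a, b)))  # shape (2, 3): b is appended as a new column
print(np.vstack((a, a)))  # shape (4, 2): the rows are stacked on top of each other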
Before implementing the main body of the decision tree, let's implement two helper functions. One computes the mean of the target values of a batch of samples, which serves as the predicted value of a leaf node; the other computes the total variance (the sum of squared deviations) of those target values.
def node_mean(X):
    # Predicted value of a leaf: the mean of the target (last column)
    return np.mean(X[:, -1])

def node_variance(X):
    # Total squared deviation of the target: variance times sample count
    return np.var(X[:, -1]) * X.shape[0]
With that out of the way, we move on to implementing a function that splits the data based on thresholds. This can also be reused from previous code:
from collections import defaultdict
def split_dataset(X, idx, thred):
    # Group samples by whether feature idx is below the threshold
    split_data = defaultdict(list)
    for x in X:
        split_data[x[idx] < thred].append(x)
    return list(split_data.values()), list(split_data.keys())
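To make the return format concrete, here is a tiny hypothetical example of how split_dataset groups rows (the toy array and threshold are made up):

# Toy data: each row is a sample, the last column is the target
toy = np.array([[1.0, 5.0], [2.0, 7.0], [3.0, 9.0]])
groups, keys = split_dataset(toy, 0, 2.5)
# keys holds the boolean outcomes of (row[0] < 2.5) in the order they first appear,
# e.g. [True, False]; groups[i] holds the rows whose outcome equals keys[i]
print(keys)
print(groups)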
The next two important functions are get_thresholds and split_variance. As the names imply, the first is used to obtain the candidate thresholds; as mentioned earlier, since we are building a regression model, in principle every value a feature takes can serve as a split point. However, several samples may share the same feature value, so we deduplicate them. The second function splits the data at a given threshold and returns the total variance after the split.
def get_thresholds(X, i):
    # Every distinct value of feature i is a candidate threshold
    return set(X[:, i].tolist())
# Minimum variance improvement required for a split in each iteration
MINIMUM_IMPROVE = 2.0
# Minimum number of samples per leaf node
MINIMUM_SAMPLES = 10

def split_variance(dataset, idx, threshold):
    left, right = [], []
    n = dataset.shape[0]
    for data in dataset:
        if data[idx] < threshold:
            left.append(data)
        else:
            right.append(data)
    left, right = np.array(left), np.array(right)
    # Pre-pruning: if either side of the split has too few samples,
    # return None to prevent overfitting
    if len(left) < MINIMUM_SAMPLES or len(right) < MINIMUM_SAMPLES:
        return None
    # The total variance is the left subtree's variance sum plus the right subtree's;
    # since these are sums of squared deviations rather than means, they can simply be added
    return node_variance(left) + node_variance(right)
Here we use the MINIMUM_SAMPLES parameter for pre-pruning. Since this is a regression model, if the growth of the decision tree is not restricted, we are likely to end up with as many leaf nodes as there are training samples. That is clearly the trap of overfitting and hurts the model. So we limit the number of samples per node; this is a parameter we can tune as needed.
Next comes the function for selecting the feature and threshold. We need a function that iterates over all features and all candidate thresholds, splits the data, and finds the best possible split among them.
def choose_feature_to_split(dataset):
    n = len(dataset[0]) - 1
    m = len(dataset)
    # Record the best variance, feature, and threshold
    var_ = node_variance(dataset)
    bestVar = float('inf')
    feature = -1
    thred = None
    for i in range(n):
        threds = get_thresholds(dataset, i)
        for t in threds:
            # Iterate over all thresholds and calculate the variance for each one
            v = split_variance(dataset, i, t)
            # If v is None, the split would overfit, so skip it
            if v is None:
                continue
            if v < bestVar:
                bestVar, feature, thred = v, i, t
    # If even the best split does not improve enough, do not split; this controls the number of subtrees
    if var_ - bestVar < MINIMUM_IMPROVE:
        return None, None
    return feature, thred
As above, this function also uses a pre-pruning parameter, MINIMUM_IMPROVE, which measures the gain produced each time a subtree is generated. When the gain from a split falls below this value, the split is not worth it, so we give up generating that subtree. This is another form of pre-pruning.
Once that's done, we're ready to build the tree. The building process is similar to before, except that this time our data has no feature names, so we remove the logic related to feature names.
def create_decision_tree(dataset):
    dataset = np.array(dataset)
    # If the current number of samples is below the minimum, do not split further
    if dataset.shape[0] < MINIMUM_SAMPLES:
        return node_mean(dataset)
    # Record the feature and threshold of the optimal split
    fidx, th = choose_feature_to_split(dataset)
    if fidx is None:
        return th
    node = {}
    node['feature'] = fidx
    node['threshold'] = th
    # Recursively build the subtrees
    split_data, vals = split_dataset(dataset, fidx, th)
    for data, val in zip(split_data, vals):
        node[val] = create_decision_tree(data)
    return node
Let's now test the tree end to end. First we need to split the raw data into training data and test data; since our scenario is relatively simple, we won't set aside validation data.
We don't need to split the data ourselves; sklearn provides the corresponding tool, which we can call directly:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)
Two parameters are worth noting. The first is test_size, which can be either an integer or a float: an integer specifies the number of samples in the test set, while a float between 0 and 1.0 specifies the proportion of the test set. random_state is the random seed used for the split.
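For comparison with the call above, which used a fraction, passing an integer reserves an exact number of test samples; the 100 below is just an arbitrary example:

# Equivalent call that reserves exactly 100 samples for the test set
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=100, random_state=23)
print(X_test2.shape)  # (100, 14) here, since the target was merged into X as a 14th column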
Let's print the generated tree. Because of the amount of data, we get a huge tree structure. With the building part done, only the prediction part is left.
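A minimal sketch of that step, assuming the training split X_train from above (remember that its last column is the merged-in target):

# Build the regression tree on the training split and print the nested dict;
# the output is large, so you may only want to inspect the top few levels
tree = create_decision_tree(X_train)
print(tree)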
The prediction part of the code is the same as the previous classification tree, with the logic of feature_names removed.
def classify(node, data):
    key = node['feature']
    pred = None
    thred = node['threshold']
    if isinstance(node[data[key] < thred], dict):
        pred = classify(node[data[key] < thred], data)
    else:
        pred = node[data[key] < thred]
    # If pred is still None, pick any leaf node under this node as a substitute
    if pred is None:
        for key in node:
            if not isinstance(node[key], dict):
                pred = node[key]
                break
    return pred
Since this function accepts only one sample at a time, it is inconvenient for batch prediction, so it is better to also implement a predict function that handles a batch.
def predict(node, X):
    y_pred = []
    for x in X:
        y = classify(node, x)
        y_pred.append(y)
    return np.array(y_pred)
Post-pruning
The original English term is post-prune; the translated name sounds a bit awkward, so we'll simply use the word post-pruning.
In the regression tree we use a very simple idea: first grow the tree as large and as complex as possible, then prune it back using the test set. The pruning logic is also simple: for each subtree whose children are leaves, we compare the error when the split is kept against the error when the subtree is collapsed into a single leaf node.
The whole pruning process, like tree building, is executed recursively from top to bottom.
The logic is easy to understand, so let’s go straight to the code:
def is_dict(node):
    return isinstance(node, dict)

def prune(node, testData):
    testData = np.array(testData)
    if testData.shape[0] == 0:
        return node
    # Split the test data using the node's feature and threshold
    split_data, _ = split_dataset(testData, node['feature'], node['threshold'])
    # Recursively prune the left and right subtrees
    if is_dict(node[0]):
        node[0] = prune(node[0], split_data[0])
    if is_dict(node[1]) and len(split_data) > 1:
        node[1] = prune(node[1], split_data[1])
    # If both children are leaves, decide whether the current subtree should be pruned
    if len(split_data) > 1 and not is_dict(node[0]) and not is_dict(node[1]):
        # Calculate the sum of squared errors before pruning
        baseError = np.sum(np.power(np.array(split_data[0])[:, -1] - node[0], 2)) + \
                    np.sum(np.power(np.array(split_data[1])[:, -1] - node[1], 2))
        # Calculate the sum of squared errors after pruning
        meanVal = (node[0] + node[1]) / 2
        mergeError = np.sum(np.power(meanVal - testData[:, -1], 2))
        if mergeError < baseError:
            return meanVal
        else:
            return node
    return node
Finally, let’s verify the effect after pruning:
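A minimal sketch of how that check might look, using mean_squared_error from sklearn.metrics and the tree built above (the variable names are my own):

from sklearn.metrics import mean_squared_error

# MSE on the test split before pruning
y_pred = predict(tree, X_test)
print('MSE before pruning:', mean_squared_error(y_test, y_pred))

# Post-prune with the test split as pruning data, then measure again
pruned_tree = prune(tree, X_test)
y_pred_pruned = predict(pruned_tree, X_test)
print('MSE after pruning:', mean_squared_error(y_test, y_pred_pruned))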
As the results show, the mean squared error on the test data was 19.65 before pruning and dropped to 19.48 after pruning. Numerically, the method works, but because we have relatively little training data and also applied pre-pruning, the benefit of post-pruning is limited. Still, for practical machine learning engineering, as long as a method has a clear effect and an affordable cost, it is valuable; never dismiss a method just because the improvement seems small.
Here the mean squared error between two NumPy arrays is computed with sklearn's mean_squared_error function.
Conclusion
So that's the end of the regression tree model: we not only implemented it ourselves but also experimented with it on a real dataset. If you implement the model's code yourself, I believe you will learn a lot.
Although in practice we rarely use a single tree model for regression tasks, the regression tree itself is very meaningful, because many better-performing models were developed on top of it, such as the famous GBDT. Understanding the regression tree is therefore very important for further learning. Before deep learning became widespread, most high-performing models were based on tree models, such as random forest, GBDT, AdaBoost and so on. You could say that tree models carried half an era of machine learning, so their importance should be clear.
That's all for today's article. If you liked it, please consider following; it's a small encouragement for me and makes it easy for you to get more articles.