Introduction to the random forest algorithm

Random forest is a classifier that trains and predicts with multiple decision trees. It is composed of multiple CARTs (Classification And Regression Trees). For each tree, the training set is sampled with replacement from the full training set, so a given sample may appear several times in one tree's training set or not appear at all. When training the nodes of each tree, the features used are a random subset of all features, drawn in a fixed proportion. Assuming the total number of features is M, this proportion is commonly √M, (1/2)√M, or 2√M.
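These two kinds of randomness (bootstrap samples and per-node feature subsets) are easy to express in code. Below is a minimal sketch using numpy; the sizes and variable names are illustrative, not from the original text.

import numpy as np

rng = np.random.default_rng(0)
n, M = 100, 16                                    # number of samples and features (illustrative)
boot_idx = rng.integers(0, n, size=n)             # bootstrap: draw n row indices with replacement
feat_idx = rng.choice(M, size=int(np.sqrt(M)), replace=False)  # e.g. sqrt(M) features, without replacement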

The training process

The training process of random forest can be summarized as follows:

(1) Given a training set S, a test set T, and the feature dimension F, determine the parameters: the number of CARTs t, the maximum depth of each tree d, the number of features f used at each node, and the termination conditions: the minimum number of samples on a node s and the minimum information gain on a node m.

For each tree i = 1, …, t:

(2) A training set S(i) of the same size as S is sampled with replacement from S and used as the samples of the root node; training starts from the root node.

(3) If the termination condition is met on the current node, mark the current node as a leaf node. For a classification problem, the leaf's prediction is the class c(j) with the largest count in the node's sample set, and the probability p is the proportion of c(j) in that sample set; for a regression problem, the prediction is the average of the target values of the samples on the node. Then continue training the other nodes. If the current node does not meet the termination condition, randomly select f features without replacement from the F available features. Using these f features, find the single feature k and threshold th with the best splitting quality; send the samples whose value on feature k is less than th to the left child node and the rest to the right child node, then continue training the other nodes. The splitting criterion is discussed later, and a sketch of this whole procedure appears after these steps.

(4) Repeat (2) and (3) until all nodes have been trained or marked as leaf nodes.

(5) Repeat (2), (3), and (4) until all t trees have been trained.
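The steps above translate almost directly into code. The following is a minimal sketch of the training loop for the classification case, assuming numeric features and integer class labels; it uses the unweighted Gini criterion exactly as defined later in this article, and every name (build_tree, best_split, train_forest) is illustrative rather than a library API.

import numpy as np

def gini(y):
    # Gini(p) = 1 - sum_k p_k^2 over the class proportions on this node
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, feat_ids):
    # Find the feature k and threshold th maximizing Gini - GiniLeft - GiniRight
    best, base = None, gini(y)
    for k in feat_ids:
        for th in np.unique(X[:, k]):
            left = X[:, k] < th
            if left.all() or (~left).all():
                continue                      # skip splits that leave a child empty
            gain = base - gini(y[left]) - gini(y[~left])
            if best is None or gain > best[0]:
                best = (gain, k, th)
    return best                               # (gain, k, th) or None

def leaf(y):
    # Store the majority class c(j) and its proportion p, as in step (3)
    counts = np.bincount(y)                   # assumes labels are 0..K-1
    c = counts.argmax()
    return ('leaf', c, counts[c] / len(y))

def build_tree(X, y, depth, d, f, s, m, rng):
    # Termination: max depth d, min samples s, or a pure node
    if depth >= d or len(y) <= s or len(np.unique(y)) == 1:
        return leaf(y)
    feat_ids = rng.choice(X.shape[1], size=f, replace=False)  # f of the F features, no replacement
    split = best_split(X, y, feat_ids)
    if split is None or split[0] < m:         # minimum information gain m
        return leaf(y)
    _, k, th = split
    mask = X[:, k] < th
    return ('node', k, th,
            build_tree(X[mask], y[mask], depth + 1, d, f, s, m, rng),
            build_tree(X[~mask], y[~mask], depth + 1, d, f, s, m, rng))

def train_forest(X, y, t=10, d=5, f=None, s=2, m=0.0, seed=0):
    rng = np.random.default_rng(seed)
    f = f or max(1, int(np.sqrt(X.shape[1])))  # e.g. sqrt(M) features per node
    forest = []
    for _ in range(t):
        idx = rng.integers(0, len(y), size=len(y))   # step (2): bootstrap with replacement
        forest.append(build_tree(X[idx], y[idx], 0, d, f, s, m, rng))
    return forest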

The prediction process

The prediction process is as follows:

For each tree i = 1, …, t:

(1) Starting from the root node of the current tree, compare the sample's value on the node's split feature with the node's threshold th: enter the left child if it is less than th (< th), otherwise the right child (>= th). Repeat until a leaf node is reached, then output that leaf's predicted value.

(2) Repeat (1) until all t trees have output a prediction. For a classification problem, the output is the class with the largest sum of predicted probabilities across all trees, i.e., the probabilities p for each class c(j) are accumulated over the trees. For a regression problem, the output is the average of the outputs of all trees.
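Continuing the training sketch above, prediction walks each tree from the root and then aggregates the leaf outputs: summed class probabilities for classification, as described. The names are again illustrative.

def predict_tree(tree, x):
    # Walk from the root: left if x[k] < th, otherwise right, until a leaf
    while tree[0] == 'node':
        _, k, th, left, right = tree
        tree = left if x[k] < th else right
    return tree[1], tree[2]                   # leaf's class c(j) and probability p

def predict_forest(forest, x):
    # Accumulate p for each class across all t trees; output the argmax
    scores = {}
    for tree in forest:
        c, p = predict_tree(tree, x)
        scores[c] = scores.get(c, 0.0) + p
    return max(scores, key=scores.get)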

The evaluation standard for splitting quality differs from those of ID3 and C4.5: since the trees are CARTs, the CART criterion is used.

For classification problems (classifying a sample into one of a set of discrete categories), CART uses the Gini value as the criterion, defined as $\mathrm{Gini}(p) = 1 - \sum_{k=1}^{K} p_k^2$, where $p_k$ is the proportion of class-k samples in the data set on the current node.

For example, suppose there are two classes and 100 samples on the current node, 70 belonging to the first class and 30 to the second. Then $\mathrm{Gini} = 1 - 0.7^2 - 0.3^2 = 0.42$. The more even the class distribution, the larger the Gini value; the more uneven the distribution, the smaller it is. When searching for the best split feature and threshold, the criterion is $\arg\max_{f, th}(\mathrm{Gini} - \mathrm{Gini}_{Left} - \mathrm{Gini}_{Right})$: find the feature f and threshold th that maximize the Gini value of the current node minus the Gini values of the left and right child nodes.
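A two-line check of this worked example, with the class proportions passed in directly (a hypothetical helper, distinct from the label-based gini above):

def gini_from_proportions(p):
    return 1 - sum(pk ** 2 for pk in p)

print(gini_from_proportions([0.7, 0.3]))   # ≈ 0.42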

For regression problems, $\arg\max_{f, th}(\mathrm{Var} - \mathrm{Var}_{Left} - \mathrm{Var}_{Right})$ is used directly as the criterion: find the feature f and threshold th for which the variance Var of the current node's training samples, minus the variance $\mathrm{Var}_{Left}$ of the left child node and the variance $\mathrm{Var}_{Right}$ of the right child node, is largest.
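The regression criterion from this paragraph as a sketch, unweighted to follow the formula above (in practice libraries weight the children by their sample counts):

import numpy as np

def var_gain(y, left_mask):
    # Var - VarLeft - VarRight on the current node's targets y;
    # assumes both children are non-empty
    return np.var(y) - np.var(y[left_mask]) - np.var(y[~left_mask])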

Advantages

  • Performs well on many data sets
  • On many current data sets it has a clear advantage over other algorithms
  • It can handle high-dimensional data (many features) without feature selection
  • After training, it can report which features are important (see the snippet after this list)
  • When building a random forest, the out-of-bag samples give an unbiased estimate of the generalization error
  • It trains fast
  • Interactions between features can be detected during training
  • It is easy to parallelize
  • It is easy to implement
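Two of the points above, feature importance and the unbiased out-of-bag error estimate, are directly available in sklearn. A small self-contained example (the data set here is synthetic, just for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
model.fit(X, y)
print(model.oob_score_)            # out-of-bag estimate of generalization accuracy
print(model.feature_importances_)  # which features matter, available after training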

Code implementation

A simple example of using the random forest algorithm in Sklearn:

# Import library
from sklearn.ensemble import RandomForestClassifier
# Assumes you have X (predictors) and y (targets) for the training set
# and x_test (predictors) for the test set

# Create the random forest object
model = RandomForestClassifier()

# Train the model using the training set
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)
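For the regression case discussed earlier, sklearn provides RandomForestRegressor with the same interface; a sketch reusing X, y, and x_test from the example above, assuming the targets are numeric:

from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(n_estimators=100)
reg.fit(X, y)                      # y must be numeric for regression
predicted = reg.predict(x_test)    # the average of the tree outputs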

In addition, the random forest algorithm is also implemented in OpenCV; see the OpenCV RandomForest summary for examples.