Abstract: Random forest is one of the state-of-the-art ensemble algorithms. It extends Bagging, and its main difference from Bagging is the introduction of random feature selection.

This article, by Chengxiaoli, introduces the random forest method in ensemble learning.

Random forest is one of the state-of-the-art ensemble algorithms. It extends Bagging, and its main difference from Bagging is the introduction of random feature selection: when choosing a split point for each decision tree, the random forest first randomly selects a subset of features, and then performs traditional split-point selection on that subset.

Random forests

Construction of a random forest: if the training set contains N samples, N samples are drawn with replacement (one sample is drawn at random each time and then put back before the next draw). The N selected samples form the data at the root node and are used to train one decision tree.

If each sample has M attributes, then whenever a node of the decision tree needs to be split, m attributes are randomly selected from the M attributes, with m << M. A strategy such as information gain is then used to select one of the m attributes as the split attribute of the node.
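As an illustration, here is a minimal sketch (in Python with NumPy) of how information gain could be computed for one candidate split on a numeric feature; the function names and the toy data are purely illustrative.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(X, y, feature, threshold):
    """Entropy reduction obtained by splitting `feature` at `threshold`."""
    left = X[:, feature] < threshold
    right = ~left
    if left.sum() == 0 or right.sum() == 0:
        return 0.0
    n = len(y)
    child = (left.sum() / n) * entropy(y[left]) + (right.sum() / n) * entropy(y[right])
    return entropy(y) - child

# Toy example: evaluate a split on feature 0 at threshold 2.5
X = np.array([[1.0, 0.2], [2.0, 0.4], [3.0, 0.1], [4.0, 0.9]])
y = np.array([0, 0, 1, 1])
print(information_gain(X, y, feature=0, threshold=2.5))  # 1.0: a perfect split
```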

During tree construction, every node is split according to the steps above until no further split is possible (intuitively, if the attribute selected for a node is the one its parent was already split on and it can separate the samples no further, the node becomes a leaf and no more splitting is needed). Note that no pruning is performed while the tree is grown.

Following the steps above, a large number of decision trees are built, and together they form the random forest. Two points deserve attention when building each tree: sampling and complete splitting.

The first point is the two random sampling processes: random forest performs both row sampling and column sampling on the input data. Row sampling is done with replacement, so the resulting sample set may contain duplicates; if the input has N samples, N samples are drawn. In this way, no single tree is trained on all of the samples, which makes over-fitting less likely. Column sampling then selects m features (m << M) from the M available features.
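The two sampling steps can be sketched in a few lines of NumPy; the variable names and sizes below are illustrative, not part of any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples, M features
N, M = 8, 5
X = rng.normal(size=(N, M))
y = rng.integers(0, 2, size=N)

# Row sampling: draw N samples with replacement (bootstrap),
# so some rows may repeat and some are left out.
row_idx = rng.integers(0, N, size=N)
X_boot, y_boot = X[row_idx], y[row_idx]

# Column sampling: at each split, pick m << M features without replacement.
m = max(1, int(np.log2(M)))
col_idx = rng.choice(M, size=m, replace=False)
candidate_features = X_boot[:, col_idx]

print("bootstrap rows:", row_idx)
print("candidate features at this split:", col_idx)
```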

A decision tree is then built by fully splitting the sampled data, so that every leaf either cannot be split further or contains samples that all belong to the same class. Pruning is an important step in many decision tree algorithms, but not here: the randomness introduced by the two sampling processes is enough to keep the forest from over-fitting even without pruning.

Input: data set D = {(x1, y1), (x2, y2), …, (xm, ym)};
Feature subset size K.

Steps:
1. N ← a node constructed for the given data set D;
2. if all samples in D belong to the same class then return N;
3. F ← the set of features that can still be split on;
4. if F is empty then return N;
5. F' ← K features randomly selected from F;
6. N.f ← the feature in F' with the best split point;
7. N.p ← the best split point on N.f;
8. Dl ← the subset of samples in D whose value on N.f is less than N.p;
9. Dr ← the subset of samples in D whose value on N.f is not less than N.p;
10. Nl ← call this procedure recursively with parameters (Dl, K);
11. Nr ← call this procedure recursively with parameters (Dr, K);
12. return N.

Output: a random decision tree.
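Below is a minimal Python sketch of the procedure above. It follows the same steps but uses Gini impurity to pick the best split among the K randomly selected features; the class and function names are illustrative, and class labels are assumed to be non-negative integers.

```python
import numpy as np

class Node:
    def __init__(self, label=None, f=None, p=None, left=None, right=None):
        self.label, self.f, self.p, self.left, self.right = label, f, p, left, right

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    prob = counts / counts.sum()
    return 1.0 - np.sum(prob ** 2)

def build_random_tree(X, y, K, rng):
    # Steps 1-2: construct a node; stop if all samples share one class
    if len(np.unique(y)) == 1:
        return Node(label=y[0])
    # Steps 3-4: features that can still be split on (more than one value)
    F = [j for j in range(X.shape[1]) if len(np.unique(X[:, j])) > 1]
    if not F:
        return Node(label=np.bincount(y).argmax())
    # Step 5: randomly select K features from F
    subset = rng.choice(F, size=min(K, len(F)), replace=False)
    # Steps 6-7: best split point among the selected features
    best = None
    for j in subset:
        for p in np.unique(X[:, j])[1:]:          # candidate thresholds
            left = X[:, j] < p
            score = (left.mean() * gini(y[left]) +
                     (1 - left.mean()) * gini(y[~left]))
            if best is None or score < best[0]:
                best = (score, j, p)
    _, f, p = best
    # Steps 8-11: split D into Dl / Dr and recurse
    left = X[:, f] < p
    node = Node(f=f, p=p)
    node.left = build_random_tree(X[left], y[left], K, rng)
    node.right = build_random_tree(X[~left], y[~left], K, rng)
    return node  # Step 12

# Toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
tree = build_random_tree(X, y, K=2, rng=rng)
```

Building many such trees, each on a bootstrap sample of the data, and aggregating their votes yields the random forest.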

Random forest and Bagging

The above is the random decision tree algorithm used in random forest. The parameter K controls the amount of randomness. When K equals the total number of features, the constructed tree is equivalent to a traditional deterministic decision tree; when K = 1, a single feature is chosen at random for each split. A recommended value is K = log2(d), where d is the number of features. Note that the random tree introduces randomness only in the feature-selection stage, not in the split-point selection stage.
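For reference, in scikit-learn the parameter K corresponds to max_features of RandomForestClassifier. A rough sketch of the three settings discussed above (assuming scikit-learn is installed; the data set is just for illustration) might look like this:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# max_features=None   -> K = total number of features: each tree behaves like
#                        a traditional deterministic decision tree
# max_features=1      -> K = 1: a single feature chosen at random per split
# max_features="log2" -> K = log2(d), the recommended compromise
for k in [None, 1, "log2"]:
    rf = RandomForestClassifier(n_estimators=100, max_features=k, random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_features={k!r}: accuracy {score:.3f}")
```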

Figure: (A) the 10 base learners of Bagging; (B) the 10 base learners of random forest; (C) Bagging; (D) random forest.

Comparison of the prediction errors of random forest and Bagging on 40 UCI data sets. Each point is one data set, and its coordinates are the prediction errors of the two algorithms; the diagonal marks equal prediction error.

The figures compare the decision boundaries of random forest, Bagging, and their base classifiers. The decision boundaries of random forest and its base learners are more flexible, which gives better generalization ability: on the Gaussian data set, the test error rate of random forest is 7.85%, while that of Bagging is 8.3%. The UCI comparison above shows that the prediction accuracy of random forest is better than that of Bagging, regardless of whether pruned or unpruned decision trees are used as base learners.

Effect of ensemble size on random forest and Bagging on two UCI data sets.

The convergence behavior of random forest is similar to that of Bagging. Random forest usually starts from a worse point: in particular, when the ensemble size is 1, random feature selection degrades the performance of the single decision tree. However, it usually converges to a very low test error. In the training phase, random forest is also more efficient than Bagging, because when constructing a decision tree Bagging examines all features to choose each split (a deterministic decision tree), whereas random forest only needs to examine a randomly selected subset of features (a random decision tree).
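A rough way to observe this behavior (assuming scikit-learn; the synthetic data set and ensemble sizes are arbitrary choices for illustration) is to compare the test error of both methods as the number of base learners grows:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Test error of Bagging vs. random forest for increasing ensemble sizes
for n in [1, 5, 20, 100]:
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n,
                            random_state=0).fit(X_tr, y_tr)
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    print(f"n_estimators={n:3d}  Bagging error={1 - bag.score(X_te, y_te):.3f}"
          f"  RandomForest error={1 - rf.score(X_te, y_te):.3f}")
```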

Conclusion:

An ensemble of classifiers achieves better classification performance than a single classifier. Random forest classifies data using many classification trees; while classifying, it can also produce an importance score for each variable (for example, each gene) and thereby evaluate the role of each variable in the classification.
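As an example of the variable-importance scores mentioned above, a fitted scikit-learn forest exposes them through feature_importances_; the data set below is just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importance score of each variable; scores sum to 1
for name, score in sorted(zip(feature_names, rf.feature_importances_),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```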

