0x00 Introduction

This article tries to explain ensemble learning in an easy-to-understand way, avoiding mathematical formulas as much as possible and relying instead on the overall ideas and on intuitive, perceptual thinking. It also extends the concepts into concrete scenarios drawn from a famous classic novel to help us understand them.

In the course of studying machine learning we run into many obscure concepts and related mathematical formulas that are very hard to understand. In such situations we should think more intuitively and supplement the theory with analogies or examples, which often works much better.

My own requirement during explanation and exposition is: find an example from everyday life or from a famous classic, and then explain it in my own words.

0x01 Related Concepts

1. Bias vs variance

  • Squared bias represents the difference between the average prediction over all data sets and the desired regression function.
  • Variance measures how much the solution the model gives for an individual data set fluctuates around that average.

Variance measures the change in learning performance caused by changes among training sets of the same size; that is, it measures how much the learning algorithm's estimates vary when facing different training sets of the same size. It is related to the error from the observed sample and describes the precision of a learning algorithm: a high variance means a weak match.

Bias measures how far the learning algorithm's expected prediction deviates from the true result, i.e., how closely the average estimate produced by the algorithm can approach the learning target. It is independent of the error in the training sample and describes the accuracy of the match: a high bias means a bad match. In other words, bias reflects the fitting ability of the algorithm itself.

One of the better examples on the web is the target-shooting example; you can picture it as follows:

  • "High variance" and "low bias" correspond to shots that land near the bull's-eye but are scattered: the aim is accurate, but the hand is not steady.
  • "Low variance" and "high bias" correspond to shots that land very close together, but not necessarily near the bull's-eye: the hand is steady, but the aim is off.

Below is an attempt to explain these two concepts in terms of Water Margin.

2. Using Liangshan as an example to explain variance intuitively

For example, among the Five Tiger Generals the one of average force is Huyan Zhuo, followed by Qin Ming and Dong Ping; above them are Lin Chong and Guan Sheng. Say the force values are: Qin Ming 95, Dong Ping 96, Huyan Zhuo 97, Guan Sheng 98, Lin Chong 99. For this sample, the variance reflects how far the force values of Qin Ming, Dong Ping, Lin Chong and Guan Sheng are from Huyan Zhuo's. If a duel with Huyan Zhuo takes 800 rounds to decide the outcome, the variance is small; if the winner is decided within a few rounds, the variance is large.
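To make the analogy concrete, here is a tiny numeric sketch in Python (the force values are of course just the made-up numbers above):

```python
# Mean and variance of the Five Tiger Generals' force values (illustrative numbers only)
forces = [95, 96, 97, 98, 99]      # Qin Ming, Dong Ping, Huyan Zhuo, Guan Sheng, Lin Chong
mean = sum(forces) / len(forces)   # 97.0, roughly Huyan Zhuo's level
variance = sum((f - mean) ** 2 for f in forces) / len(forces)
print(mean, variance)              # 97.0 2.0 -> a small spread: duels drag on for hundreds of rounds
```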

3. Using Liangshan as an example to explain bias intuitively

Assume the force values of the Eight Mounted Vanguard Generals are: Xu Ning 92, Suo Chao 90, Zhang Qing 86, Zhu Tong 90, Shi Jin 88, Mu Hong 85, Hua Rong 90, Yang Zhi 95. Suppose we have a model that uses the eight vanguard generals to fit the Five Tiger Generals, and the model is: arbitrarily pick one of the eight. (In real work the estimate would more likely be the average of the eight; here we pick a single one for convenience.) If Yang Zhi is used to fit the Five Tiger Generals, he comes closest, so the bias is smallest; if Mu Hong is chosen, the bias is largest. As for Mu Hong the Unrestrained, most of us know only his name and none of his deeds, which is why his force value is rated lowest. Of course, Mu Hong's story may well have existed in an older edition of Water Margin and was removed from later editions, which is why he is underrated. The Water Margin story was first systematically recorded in Xuanhe Yishi (The Legacy of the Xuanhe Era). That book tells the stories of Taihang Mountain and Liangshan; the later Water Margin speaks of a band of 38 at Liangshanbo in western Shandong, and the "Mu Heng the Unrestrained" there is undoubtedly the Mu Hong of Water Margin. In Xuanhe Yishi, Mu Heng was one of the twelve escort officers of the flower-and-rock convoys. These twelve became sworn brothers through their official duty; after rescuing Yang Zhi they took to Taihang Mountain as outlaws and became founding members of the stronghold. Of these twelve escort officers, Water Margin gives eight of them exactly the same treatment as Mu Heng (Lin Chong, Hua Rong, Zhang Qing, Xu Ning, Li Ying, Guan Sheng, Sun Li, Yang Zhi), and all eight became leading figures in the novel. Had everything developed according to that tradition, Mu Hong would not have ended up as an empty name.

The following is an excerpt from Xuanhe Yishi (The Legacy of the Xuanhe Era).

At first, Zhu Mian was transporting the flower-and-rock stones, and twelve escort officers, Yang Zhi, Li Jinyi, Lin Chong, Wang Xiong, Hua Rong, Chai Jin, Zhang Qing, Xu Ning, Li Ying, Mu Heng, Guan Sheng and Sun Li, went to Taihu Lake and other places to escort the men carrying the stones.

The twelve who received this commission became sworn brothers, vowing to help one another in times of trouble. Li Jinyi and the other ten carried the flower-and-rock stones to the capital; only Yang Zhi stayed in Yingzhou waiting for Sun Li and was held up by snow. What a snow scene it was: "Drifting wildly, it dampens the tea-smoke over the monks' quarters; falling thickly, it thins the strength of the wine in the singing-houses." Since Sun Li had not arrived and the snow kept falling, Yang Zhi, poor and stranded and short of money for food, had no choice but to take a precious sword to the market to sell. All day long no one would offer a price. Then a young rogue came to haggle over the sword; the two quarreled and came to blows, and Yang Zhi cut him down with a single stroke of the blade. Yang Zhi was put in a cangue, made his confession, and was sent to prison for investigation. When the case was concluded and reported, the prefect ruled: "Yang Zhi's offense is serious, but the circumstances deserve some mercy. Yang Zhi shall be tattooed and exiled to serve in the army of Wei Prefecture." After the sentence, two guards escorted him toward Wei Prefecture. On the way he bumped into a man who called out loudly: "Escort Yang!" Yang Zhi looked up, startled, and recognized escort officer Sun Li. Sun Li asked in astonishment: "Brother, how did you come to such a crime?" Yang Zhi told him the whole story of selling the sword and killing the man, and then each went his own way. Sun Li thought to himself: "Yang Zhi committed this crime because he was waiting for me. When we took our oath we vowed to help one another in trouble." So he hurried back to the capital that night and told Li Jinyi why Yang Zhi had been convicted. After consulting with Sun Li, Li Jinyi and the other eleven brothers went to the bank of the Yellow River and waited for Yang Zhi to arrive; then they killed the escorting soldiers and went off to Mount Taihang to become outlaws.

0x02 Ensemble Learning

1. Why ensemble learning

In ensemble learning theory, we call "weak learners" (or base models) the models that can be used as building blocks for designing more complex models. In most cases these base models do not perform very well by themselves, either because they have high bias (for example, low-degree-of-freedom models) or because their variance is too large for them to be robust (for example, high-degree-of-freedom models).

For example, there are certain differences among the weak classifiers (different algorithms, or different parameter configurations of the same algorithm), which lead to different decision boundaries; in other words, they make different mistakes.

The idea of the ensemble approach is to combine several of these weak learners so that their bias and/or variance is reduced, creating a "strong learner" (or "ensemble model") that achieves better performance.

Ensemble learning is a machine learning paradigm. The basic idea is that “two heads are better than one”. “Unity is strength”. “Learn from others.”

In ensemble learning, we train multiple models (often referred to as “weak learners”) to solve the same problem and combine them to get better results. The most important assumption is that when weak models are combined correctly, we can get more accurate and/or more robust models.

We can generalize the idea of ensemble learning. For the training set data, we can form a strong learner by training several individual learners and combining certain strategies, so as to achieve the purpose of learning from others.

  • The ensemble learning method builds a set of base classifiers from the training data and then classifies new samples by voting on the predictions of the base classifiers.

  • Strictly speaking, ensemble learning is not a classifier, but a classifier combination method.

  • If a single classifier is compared to a decision maker, ensemble learning is equivalent to multiple decision makers making a decision together.

2. Classification

For the classification of ensemble learning, there are two common classification methods:

Category 1

Individual learners can be divided into two categories according to whether there is a dependency relationship between individual learners:

  • Individual learners are highly dependent on each other, and a series of individual learners need to be generated sequentially, representing boosting series algorithms.
  • There is no strong dependence between individual learners, and a series of individual learners can be generated in parallel, representing bagging and Random Forest algorithms.

Category 2

Ensemble learning can also be divided into heterogeneous and homogeneous ensemble learning according to the relationship between the base classifiers.

  • Heterogeneous ensemble learning means the weak classifiers are of different types.

  • Homogeneous ensemble learning means the weak classifiers are of the same type but with different parameters.

3. Main issues

There are two main problems that ensemble learning needs to solve:

1) How to train each algorithm? How to get a number of individual learners.

2) How to fuse each algorithm? That is, how to choose a combination strategy to aggregate these individual learners into a strong learner.

How do we obtain the individual learners?

Mainly from the following five aspects:

  • The choice of the base classifier itself, i.e., using different underlying algorithms;
  • Boosting, Bagging, stacking, Cross-validation, hold-out test;
  • Processing and selection of input features;
  • Processing the output results, such as the error correction code proposed by some scholars;
  • Introduce random disturbance;

It is important that our choice of weak learners be consistent with how we aggregate these models. If we choose the base model with low bias and high variance, we should use an aggregation approach that tends to reduce the variance; If we choose a base model with low variance and high bias, we should use an aggregation approach that tends to reduce the bias.

How to choose a combination strategy

Once the weak learners have been selected, we still need to define how they are fitted (what information from previous models should be taken into account when fitting the current model?) and how they are aggregated (how do we combine the current model with the previous ones?).

This leads to the question of how to combine these models. We can use three main “meta-algorithms” designed to combine weak learners:

  • Bagging, which generally considers homogeneous weak learners, learns these weak learners independently in parallel and combines them according to some deterministic average process.

  • Boosting, the method usually considers homogeneous weak learners as well. It learns these weak learners sequentially (each base model depends on the previous model) in a highly adaptive way, and combines them according to some deterministic strategy.

  • Stacking, this approach typically considers heterogeneous weak learners, learns them in parallel, and combines them by training a “metamodel” to output a final prediction based on the predictions of the different weak models.

Very roughly speaking, we can say that bagging focuses on getting an ensemble model with a smaller variance than its components, while Boosting and Stacking will primarily generate strong models with a lower bias than their components (even though the variance can be reduced).
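To make the stacking idea more tangible, here is a minimal hand-rolled sketch; it assumes scikit-learn estimators, a binary classification dataset (X, y) stored as NumPy arrays, and illustrative function names rather than any standard API (a real implementation would use out-of-fold predictions to avoid leakage):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def stacking_fit(X, y):
    # Hold out part of the data so the meta-model is not trained on the base models' training set
    X_base, X_meta, y_base, y_meta = train_test_split(X, y, test_size=0.5, random_state=0)
    base_models = [DecisionTreeClassifier(max_depth=3), SVC(probability=True), GaussianNB()]
    for m in base_models:
        m.fit(X_base, y_base)            # heterogeneous weak learners, independent of each other
    # The meta-model learns how to combine the base models' predicted probabilities
    meta_features = np.column_stack([m.predict_proba(X_meta)[:, 1] for m in base_models])
    meta_model = LogisticRegression().fit(meta_features, y_meta)
    return base_models, meta_model

def stacking_predict(base_models, meta_model, X):
    meta_features = np.column_stack([m.predict_proba(X)[:, 1] for m in base_models])
    return meta_model.predict(meta_features)
```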

Let's look at how these combination strategies would play out in Water Margin.

Liangshan wants to attack a certain prefecture city. What method should be used? Just listen to Wu Yong? Wu the Scholar's record of scheming is not that good, with too many loopholes; the attack on Daming Prefecture and similar battles basically all used one trick: infiltrate the city beforehand and then coordinate from inside and outside. So it is better to pool everyone's ideas and use ensemble learning. There are two approaches.

Method 1: Zhu Wu, Li Jun, Wu Yong, Gongsun Sheng and Fan Rui each come up with an idea, and then a vote produces a comprehensive plan.

Method 2: Wu Yong first comes up with an idea, then Zhu Wu makes some adjustments to it and gives a new idea, then Li Jun tinkers with Zhu Wu's idea and gives a third idea... and so on until a final plan is produced.

Method 1 is an approximate simulation of bagging; method 2 is an approximate simulation of boosting.

0x03 Bootstrap

First of all, I need to introduce Bootstrap, which is not ensemble learning itself but a statistical method, and a forerunner of ensemble learning.

Bootstrap literally refers to the straps on a boot; the name comes from the phrase "pull yourself up by your own bootstraps", originally describing something impossible (lifting yourself by pulling on your own boot straps), which later evolved to mean making things better through one's own effort. In combined classifiers, it means improving classification performance by means of the classifier itself.

Bootstrap is a resampling method.

Ideally, people would like complete information about the whole population; if the whole population were known, why would we need samples at all? In practice, however, it is difficult or even impossible to obtain information about the entire population, hence the statistical notion of "sampling". In other words, we can only obtain information from some samples of the population, and we hope to estimate the population as accurately as possible from this limited sample information so as to support decisions. The idea of the Bootstrap method is: since the samples we have were drawn from the population, why not treat these samples themselves as a population and resample from them with replacement? The method looks simple, but it is very effective.

Specific methods are as follows:

(1) Draw m samples from the n original samples by sampling with replacement (m is set by yourself).
(2) For these m samples, compute the statistic of interest.
(3) Repeat steps (1) and (2) N times (N is usually greater than 1000), giving N values of the statistic.
(4) Use these N statistics, for example by computing their variance, to judge the estimate.

For example, suppose I want to classify some unknown samples and choose a classification algorithm such as SVM; the quantity I want to estimate is the accuracy. Starting from the n original samples, each bootstrap draw from step (1) is used to train an SVM model, which is then used to classify the unknown samples, giving one accuracy value. Repeating this N times gives N accuracy values, and we then compute the variance of these N values. Why compute the variance of the N statistics rather than their expectation or mean? Because variance measures how much a set of values deviates from their average: if the computed variance is within a certain range, the statistic fluctuates little and can be used to estimate the population parameter; if the variance is large, the sample statistic fluctuates a lot and the estimate of the population parameter is not accurate.
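Here is a minimal Python sketch of this procedure, assuming a generic one-dimensional array of observations and the sample mean as the statistic of interest (the names are illustrative):

```python
import numpy as np

def bootstrap_statistics(sample, statistic=np.mean, n_rounds=1000, m=None, seed=0):
    """Resample the observed sample with replacement n_rounds times and collect the statistic."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    m = m if m is not None else len(sample)      # size of each resample
    return np.array([statistic(rng.choice(sample, size=m, replace=True))
                     for _ in range(n_rounds)])

stats = bootstrap_statistics([2, 4, 4, 5, 7, 9, 10, 12], n_rounds=2000)
print(stats.mean(), stats.var())   # a small variance -> the estimate is stable; a large one -> unreliable
```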

It seems too simple, which is why Efron’s submission to a top statistical journal was rejected as “too simple”.

Bootstrap itself only provides the idea of combining the training results of base classifiers into a comprehensive analysis; names such as Bagging and Boosting refer to concrete derivations of this combination method. So, starting from the Bootstrap method, we move into ensemble learning.

0x04 Bagging method

Bagging, short for bootstrap aggregation (also rendered as "self-help aggregation"), is a technique of repeated sampling with replacement from the data according to a uniform probability distribution. It is called bootstrap aggregation because it uses the bootstrap method to draw the training samples.

The size of each sub-training set is the same as that of the original data set. Because the sampling is done with replacement, the same sample may appear several times in the same sub-training set when constructing the training samples for each sub-classifier.

1. Bagging strategy

  • Use Bootstrap sampling to select n training samples from the sample set (with replacement, since the other classifiers will also draw their training samples from the same set).
  • Train a classifier (CART or SVM or ...) on all attributes with these n samples.
  • Repeat the above two steps m times to obtain m classifiers (CART or SVM or ...).
  • Run the data through these m classifiers and decide the final category by a voting mechanism (the minority obeys the majority) for classification problems.
  • For classification problems the result is produced by voting; for regression problems the mean of the m models' predictions is taken as the final prediction. (All models carry equal importance; a minimal code sketch of this strategy follows.)
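Here is a minimal sketch of the strategy above, assuming scikit-learn decision trees as the base classifiers and a NumPy dataset (X, y); the function names are illustrative, not a standard API:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, m=10, random_state=0):
    """Train m base classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)                  # n indices drawn with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the predictions of all base classifiers."""
    votes = np.array([clf.predict(X) for clf in models])  # shape (m, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```

For regression, replacing the vote with a simple mean of the m predictions gives the averaging variant described in the last bullet.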

2. Summary of the bagging method

  • Bagging reduces the generalization error by reducing the variance of the base classifiers
  • Its performance depends on the stability of base classifier. If the base classifier is unstable, bagging helps to reduce the error caused by the random fluctuation of training data. If stable, the error of ensemble classifier is mainly caused by the bias of base classifier
  • Since each sample is selected with equal probability, Bagging does not focus on any particular instance of the training data set

0x05 Random Forest

Random forest combines Breiman's "bootstrap aggregating" idea with Ho's "random subspace method" to build decision trees: bagging + decision trees = random forest. From a data set with M features in total, each decision tree randomly selects k features; N such trees are built, and the mode of their predictions is taken as the result (for a regression problem, the average is taken).

1. How the bagging idea is applied

Assume there are N samples and M features; let's see where the bagging idea comes in.

Bootstrap sampling (where the bagging idea comes in): each tree draws its own training samples at random (with replacement) as its training set, and then m features are randomly selected as the candidates for splitting that tree.

Randomness: random sample selection (where the bagging idea comes in) plus random feature selection. That is how a single tree is built; the optimal feature is then chosen from among the randomly selected features. In this way the decision trees in the random forest differ from one another, which increases the diversity of the system and thus improves classification performance.

Picking out good features: The real power of a random forest is not that it synthesizes multiple trees to produce a final result, but that the trees in the forest are constantly getting better through iteration (trees in the forest branch with better features).

Iteration produces a number of classifiers (where the bagging idea comes in): repeat the above steps several times, each time gradually removing the relatively poor features and generating a new forest, until the number of remaining features reaches m. If we iterate x times, we get x forests.

Voting (where the bagging idea comes in): use the remaining roughly 1/3 of the samples (the out-of-bag samples) as the test set to evaluate the x forests produced by the iterations. Predict all the samples, compare with the true values, and select the forest with the lowest out-of-bag error rate as the final random forest model.

2. Picking out good features

I need to explain the step of picking out good features in a bit more detail.

The idea of a random forest is to build good trees, and good trees need good features, so we need to know how important each feature is. For each tree and each feature, to find out whether that feature plays a role in the tree, we can randomly permute the feature's values, effectively making the feature irrelevant to that tree, and then compare the test-set error rate before and after the change. The gap in error rate is taken as the importance of that feature in that tree. The test set here consists of the samples not drawn when building the tree (the out-of-bag samples), and the error measured on them is called the out-of-bag error.

By doing this once for each feature in a tree, we get the importance of every feature in that tree, and we can compute this for every tree in the forest. But that only tells us how important these features are within a single tree, not how important they are in the forest as a whole. So how do we get their importance in the forest? Each feature appears in multiple trees, and the mean of its importance across those trees is its importance in the forest. That gives the importance of all the features in the forest. All the features are then sorted by importance, some features with low importance are removed from the forest, and a new feature set is obtained. At this point we are back where we started, having in fact completed one iteration.
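A minimal sketch of this out-of-bag permutation idea, assuming a fitted tree with a scikit-learn-style predict method, its out-of-bag data as NumPy arrays, and a feature index j (all names here are illustrative):

```python
import numpy as np

def oob_permutation_importance(tree, X_oob, y_oob, j, seed=0):
    rng = np.random.default_rng(seed)
    base_error = np.mean(tree.predict(X_oob) != y_oob)    # out-of-bag error before shuffling
    X_perm = X_oob.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])          # break the link between feature j and y
    perm_error = np.mean(tree.predict(X_perm) != y_oob)   # out-of-bag error after shuffling
    return perm_error - base_error                        # importance of feature j in this tree

# Forest-level importance of feature j: average this quantity over all trees that use feature j.
```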

0x06 Boosting method

The basic idea of Boosting is that, during training, each round's base learner pays more attention to the samples that the previous round got wrong.

Boosting and Bagging work in the same way: we build a series of models and aggregate them to get a stronger learner with better performance. Unlike Bagging, however, which focuses on reducing variance, Boosting looks at fitting multiple weak learners sequentially in a highly adaptable way: each model in a sequence, in its fitting process, gives greater weight to observations that previous models in the sequence handled poorly.

Boosting is an iterative algorithm: it trains different classifiers (weak classifiers) on the same training set, and after each round of classification it lowers the weights of correctly classified samples and raises the weights of misclassified samples (usually samples near the decision boundary). The final classifier is a linear, weighted combination of many weak classifiers, each of which is quite simple. It is essentially the process of boosting a weak classification algorithm.

In each round, the samples are given different weights according to how accurately they were classified in the previous round; the data is reweighted to reinforce the learning of previously misclassified data points.

As Boosting trains each model, it keeps track of which data samples are handled well and which are not. The samples that are misclassified most often are given more weight: they are considered harder and need more iterations for the model to learn them properly.

Boosting also differs in the actual classification stage. In Boosting, the error rate of each model is tracked, and better models are given higher weights; that is, the smaller a weak classifier's error, the larger its weight. In this way, when "voting" happens, just as in bagging, the models with better results have a stronger pull on the final output.

0x07 AdaBoost

Boosting algorithms differ according to their loss functions, and AdaBoost is the Boosting algorithm that uses the exponential loss function. The reason for using the exponential loss is that minimizing it in each round is, in effect, fitting something like a logistic regression model, thereby approximating the log odds.

1. Two main issues

The Boosting algorithm is the process of promoting a "weak learning algorithm" into a "strong learning algorithm". To do this, Boosting needs to solve two major problems.

  1. How to select a group of weak learners with different strengths and weaknesses so that they can complement each other.
  2. How to combine the outputs of weak learners to obtain better overall decision performance.

Corresponding to these two problems are the additive model and the forward stagewise algorithm.

  • The additive model means that the strong classifier is a weighted linear sum of a series of weak classifiers.
  • Forward stagewise means that during training, the classifier generated in each iteration is trained on the basis of the previous one.

The forward stagewise algorithm says: "I can provide a framework; no matter what form the basis function and the loss function take, as long as your model is an additive model, you can follow the guidance of my framework to solve it." In other words, the forward stagewise algorithm provides a universal way to learn additive models: different forms of basis function and different forms of loss function can all use this universal method to find the optimal parameters of the additive model. It is a kind of meta-algorithm.

The idea of the forward stagewise algorithm is: the additive model contains M basis functions and their M corresponding coefficients, and we can learn one basis function and its coefficient at a time, from front to back.
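For readers who do want a little notation, the additive model and the forward stagewise step can be written compactly as follows (this is the standard textbook formulation, not something specific to this article):

$$
f(x)=\sum_{m=1}^{M}\beta_m\, b(x;\gamma_m), \qquad
(\beta_m,\gamma_m)=\arg\min_{\beta,\gamma}\sum_{i=1}^{N} L\bigl(y_i,\; f_{m-1}(x_i)+\beta\, b(x_i;\gamma)\bigr), \qquad
f_m(x)=f_{m-1}(x)+\beta_m\, b(x;\gamma_m)
$$

Here $b(x;\gamma)$ is a basis function (a weak learner) with parameters $\gamma$, $\beta_m$ is its coefficient, and $L$ is the loss function; each round fits only one new basis function and its coefficient while keeping all the earlier ones fixed.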

2. AdaBoost solutions

Selecting the weak learners

To solve the first problem, AdaBoost’s solution was to change the weight of the sample at the end of each round. In this way, AdaBoost makes each new weak learner reflect some pattern in the new data.

To achieve this, AdaBoost maintains a weight distribution over the training samples: for any sample x_i there is a weight D(i) that corresponds to the importance of that sample.

When measuring the performance of weak learners, AdaBoost takes into account the weight of each sample. The misclassification samples with larger weight will contribute more to the training error rate than the misclassification samples with smaller weight. In order to obtain smaller weighted error rate, the weak classifier must focus more on the high-weight samples to ensure accurate prediction of them.

The misclassification weights can also be understood this way: if the "next classifier" gets these points wrong again, its overall error rate goes up, so its own weight ends up small, and it carries less weight in the final combined classifier.

By modifying the sample weights D(i), we change the probability distribution over the samples and focus on the samples that were classified incorrectly: the weights of the samples correctly classified in the previous round are reduced, and the weights of the misclassified samples are increased. This lets the weak learners learn different parts of the training sample.

During the training, we update the weight of the training set by using the error rate of the weak learner of the previous iteration, so that the iterations go on.

Combining the weak learners

Now, we have obtained a group of trained weak learners with different advantages and disadvantages. How to combine them effectively so that they complement each other’s advantages to produce more accurate overall prediction effect? That’s the second question.

For the second question, AdaBoost adopts weighted majority voting: it increases the weight of weak classifiers with a small classification error rate and decreases the weight of weak classifiers with a large classification error rate. This is easy to understand: weak classifiers with higher accuracy should certainly have a greater say in the strong classifier.

Each weak learner is trained with a different weight distribution. We can see that different weak learners are assigned different tasks, and each weak learner tries to complete the given task. Intuitively, when it comes to combining the judgments of each weak learner into a final prediction, we will trust it more if it has performed well on previous tasks, and less if it has performed poorly on previous tasks.

In other words, we combine the weak learners on a weighted basis, giving each weak learner a weight w_i that indicates how trustworthy it is, depending on its performance on its assigned task. The better the performance, the larger w_i, and vice versa.
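Putting the two answers together, here is a minimal AdaBoost sketch; it assumes binary labels y in {-1, +1} stored in NumPy arrays and scikit-learn decision stumps as the weak learners (the names are illustrative, and a real implementation such as sklearn.ensemble.AdaBoostClassifier handles many more details):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, rounds=20):
    n = len(X)
    D = np.full(n, 1.0 / n)                         # sample weight distribution D(i)
    learners, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        err = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)  # weighted error rate
        alpha = 0.5 * np.log((1 - err) / err)       # smaller error -> bigger say in the final vote
        D *= np.exp(-alpha * y * pred)              # raise weights of misclassified samples, lower the rest
        D /= D.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    agg = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(agg)                             # weighted majority vote (sign of the weighted sum)
```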

0x08 An example from Outlaws of the Marsh of how to use ensemble learning

Let's work through an example to look at the concepts of ensemble learning in detail. Liangshan now needs to vote on whether to accept the imperial amnesty (zhao'an). Song Jiang says: we want to make the decision scientifically and democratically, so we will use the latest technology, ensemble learning.

1. Bagging

The first method to consider is Bagging.

If bagging is used, samples are drawn with replacement and the voting can be done in parallel. Each draw selects 5 people to vote on whether to accept the amnesty.

First draw: Xu Ning, Suo Chao, Zhu Tong, Hua Rong, Yang Zhi. All 5 votes are to accept the amnesty.
Second draw: Lu Zhishen, Wu Song, Zhu Gui, Liu Tang, Li Kui. All 5 votes are against.
Third draw: Xu Ning, Suo Chao, Lu Zhishen, Wu Song, Ruan Xiao'er. 2 votes to accept, 3 votes against, so this sample is against the amnesty.

The final result: against the amnesty. The odds are already stacked against Song Jiang, and the outcome of a parallel vote is even harder for him to control. In this case, whether to accept the amnesty is genuinely decided by the democratic evaluation of the Liangshan group; Brother Gongming and Wu the Scholar cannot run a black box behind the scenes.

If Song Jiang wants to manipulate things behind the scenes, he cannot use Bagging; he needs to choose Boosting, because in that algorithm the training sets are not selected independently: each training set depends on the result of the previous round of learning, which is exactly what makes Song Jiang's black-box manipulation possible.

2. Boosting

So Song Jiang decided to adopt Boosting. Let's see how he uses Boosting to adjust the bias step by step so that the final result best fits his expectation of "accepting the amnesty".

"Boosting"The basic idea is to somehow make each round of the base learner pay more attention to the sample of the previous round of learning errors during training. So we have two schemes. Solution a: every time every time to eliminate abnormal numerical scheme 2: every time adjust abnormal numerical weights -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- one way: Each time the anomalous values are eliminated, this suitable for song Jiang's sample all out of control of the case of iteration1Sample: Lu Zhishen, Wu Song, Zhu GUI, LIU Tang, Li Kui. the5Song Jiang said: "If the monks are not fighting with the world, they should not participate in the vote. Instead, they should be Xu Ning and Suo Chao brothers." Song Jiang tracking error rate, because the error of this weak classifier is too large, so the weight of this time is reduced. The iteration2Samples: Xu Ning, SUo Chao, ZHU GUI, LIU Tang, LI Kui. the2Ticket to accept,3Vote against, then this sample is against recruitment song Jiang said: the iron cattle brothers are not sensible, disorderly voting, people disorderly beat out, replaced by Yang Zhi brothers. Song Jiang tracking error rate, because the error of this weak classifier is too large, so the weight of this time is reduced. The iteration3Sample: Xu Ning, SUo Chao, ZHU GUI, LIU Tang, Yang Zhi. the3Ticket to accept,2Song Jiang said: Zhu GUI brothers usually run hotels, lack of knowledge about current politics, replace guan Sheng brothers. Song Jiang tracks the error rate. Because the error of this weak classifier is small, the weight is increased this time. The iteration4Samples: Xu Ning, SUo Chao, Guan Sheng, LIU Tang, Yang Zhi. the4Ticket to accept,1Vote against, then this sample is accepted zhaoan Song Jiang said: Liu Tang brothers hair dyed, do not use liangshan image, replaced by Huarong brothers. Song Jiang tracks the error rate. Because the error of this weak classifier is small, the weight is increased this time. The iteration5Samples: Xu Ning, SUo Chao, Guan Sheng, Hua Rong, Yang Zhi. the5Tickets accepted, then this sample is accepted. Song Jiang tracks the error rate. Because there is no error in this weak classifier, the weight is increased this time. That is, the classifier with small error rate is of great importance in the final classifier. Finally the comprehensive5As a result of the election, Liang Shan finally decided to accept the recruitment. -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 2: every time reduce abnormal numerical weights, the suitable samples have sung river can control chief iteration1Samples: Wu Song, Hua Rong, Zhu GUI, Yang Zhi, Li Kui. the2Ticket to accept,3Vote against, then the conclusion is against the recruitment of song Jiang said: monks stand aloof, reduce the weight of Wu Song1/2. Song Jiang tracking error rate, because the error of this weak classifier is too large, so the weight of this time is reduced. The iteration2Sample: Wu Song (weight1/2), Hua Rong, Zhu GUI, Yang Zhi, Li Kui. the2Ticket to accept,2again1/2Vote against, then the conclusion is against the recruitment of an song Jiang said: the iron cattle brothers are not sensible, disorderly voting, reducing the weight of Li Kui1/2. 
Song Jiang tracking error rate, because the error of this weak classifier is too large, so the weight of this time is reduced. The iteration3Sample: Wu Song (weight1/2), Rong Hua, GUI Zhu, Zhi Yang, Kui Li (Weight1/2). the2Ticket to accept,2Song Jiang said: Zhu GUI brothers usually run hotels, lack of awareness of current politics, reduce zhu GUI's weight1/2. Song Jiang tracks the error rate. Since there is no conclusion for this weak classifier, the weight of this time is zero. The iteration4Sample: Wu Song (weight1/2), Hua Rong, Zhu GUI (Weight1/2), Zhi Yang, Kui Li (Weight1/2). the2Ticket to accept,1again1/2Vote against, then the conclusion is to accept zhaoan Song Jiang said: this time good, flower rong did know village, insight, increase flower rong weight. Keep voting. Song Jiang tracks the error rate. Because the error of this weak classifier is small, the weight is increased this time. The iteration5Sample: Wu Song (weight1/2), Hua Rong (weight2), Zhu GUI (Weight1/2), Zhi Yang, Kui Li (Weight1/2). the3Ticket to accept,1again1/2Vote against, then this conclusion is to accept zhaoan Song Jiang said: this is good, Yang Zhi has done the system to make, have insight, increase Yang Zhi weight. Keep voting. Song Jiang tracks the error rate. Because the error of this weak classifier is small, the weight is increased this time. The iteration6Sample: Wu Song (weight1/2), Hua Rong (weight2), Zhu GUI (Weight1/2), Yang Zhi (Weight2), Li Kui (Weight1/2). the4Ticket to accept,1again1/2If the vote is against, the conclusion is to accept the tracking error rate of Song River in Zhaoan. Because the error of weak classifier is small this time, the weight is increased this time. Finally the comprehensive6As a result of the election, Liang Shan finally decided to accept the recruitment. -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- here, Boosting algorithm for sensitivity to outliers of samples, Because the input of each classifier in Boosting algorithm depends on the classification result of the previous one, errors will accumulate exponentially. Each training set song selected depended on the results of the previous study. Eliminate abnormal values or adjust the weight of abnormal values every time. (In the actual Boosting algorithm, the weight of the outlier values is increased). Song Jiang also got the weight of the prediction function according to the training error of each training.Copy the code

3. Why does bagging reduce variance while boosting reduces bias?

Bias describes the gap between the expected prediction of the model fitted on the sample and the true values; simply put, it is how well the model fits the sample. To perform well on bias, i.e., to get low bias, the model has to be made more complex, with more parameters, but that easily leads to overfitting.

Variance describes how the model trained on the sample performs on the test set. To perform well on variance, i.e., to get low variance, the model has to be simplified, with fewer parameters, but that easily leads to underfitting.

We can see this in our example.

1. Bagging does not adjust any individual classifier specifically; it simply increases the number of samples and the number of sampling rounds and approximates the averaged result. So Bagging's base models should be strong in themselves (low bias, high variance): bagging is essentially about averaging many strong (or even too strong) classifiers. Each individual classifier has low bias, and the bias stays low after averaging; each individual classifier is strong enough to overfit, that is, it has high variance, and the role of averaging is to reduce this variance.

2. Boosting is about pushing the result of each round's classifier closer to the desired goal; that is, Song Jiang expects boosting to move step by step toward finally accepting the amnesty. In this way the expected result "accept the amnesty" is fitted best on the sample, moving away from the initial high-bias state of "reject the amnesty". Boosting combines many weak classifiers into a strong one: the weak classifiers have high bias and the strong classifier has low bias, so boosting works by reducing bias. Variance is not the main consideration in boosting. Boosting is an iterative algorithm: each iteration reweights the samples according to the predictions of the previous iteration, so as the iterations continue the error gets smaller and smaller and the bias of the model keeps decreasing. This kind of algorithm cannot be parallelized; AdaBoost (Adaptive Boosting) is an example.
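A tiny simulation of the "averaging reduces variance" intuition, assuming the base predictions are independent and equally noisy (a simplification, since real bagged models are correlated):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 97.0
single = rng.normal(true_value, 3.0, size=100000)                       # one strong but noisy predictor
averaged = rng.normal(true_value, 3.0, size=(100000, 10)).mean(axis=1)  # bagging-like average of 10 such predictors
print(single.var(), averaged.var())   # the averaged predictor's variance is about 10 times smaller
```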

4. Bagging vs Boosting

The differences between Bagging and Boosting:

  • In terms of sample selection: Bagging uses Bootstrap sampling with replacement, and the training sets drawn in each round are independent of one another. Boosting's training sets are not selected independently: each round's training set depends on the previous round's result (or the training set itself stays unchanged and only the weight of each sample changes).
  • Sample weight: Bagging uses uniform sampling, where each sample is equally weighted; Boosting adjusts the sample weight according to the error rate, and the larger the error rate, the larger the sample weight.
  • Prediction function: Bagging all prediction functions are equally weighted; Boosting gets the weight of the prediction function based on the training error of each training, and the smaller the error, the greater the weight of the prediction function.
  • Parallel computing: Bagging prediction functions can be generated in parallel. Boosting each prediction function must be generated iteratively in order.

0x09 Random forest code

For those who are interested, bagging can also be studied through code. Here are two implementations.

One is from blog.csdn.net/colourful_s…

# -*- coding: utf-8 -*-
"""
Created on Thu Jul 26 16:38:18 2018
@author: aoanng
"""

import csv
from random import seed
from random import randrange
from math import sqrt


def loadCSV(filename):  # Load the data line by line into a list
    dataSet = []
    with open(filename, 'r') as file:
        csvReader = csv.reader(file)
        for line in csvReader:
            dataSet.append(line)
    return dataSet

# All columns except the label column are converted to float
def column_to_float(dataSet):
    featLen = len(dataSet[0]) - 1
    for data in dataSet:
        for column in range(featLen):
            data[column] = float(data[column].strip())

# The data set is randomly divided into n blocks to facilitate cross-validation: one block is the test set and the other four blocks are training sets
def spiltDataSet(dataSet, n_folds):
    fold_size = int(len(dataSet) / n_folds)
    dataSet_copy = list(dataSet)
    dataSet_spilt = []
    for i in range(n_folds):
        fold = []
        while len(fold) < fold_size:  # we can't use if, if is only used for the first time, while loop until the condition is not true
            index = randrange(len(dataSet_copy))
            fold.append(dataSet_copy.pop(index))  # pop() removes an element from the list (the last one by default) and returns its value
        dataSet_spilt.append(fold)
    return dataSet_spilt

# Construct a subset of the data
def get_subsample(dataSet, ratio):
    subdataSet = []
    lenSubdata = round(len(dataSet) * ratio)  # returns a float
    while len(subdataSet) < lenSubdata:
        index = randrange(len(dataSet) - 1)
        subdataSet.append(dataSet[index])
    # print len(subdataSet)
    return subdataSet

# Split data set
def data_spilt(dataSet, index, value):
    left = []
    right = []
    for row in dataSet:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Calculate the segmentation cost
def spilt_loss(left, right, class_values):
    loss = 0.0
    for class_value in class_values:
        left_size = len(left)
        if left_size != 0:  # Prevent the divisor from being zero
            prop = [row[-1] for row in left].count(class_value) / float(left_size)
            loss += (prop * (1.0 - prop))
        right_size = len(right)
        if right_size != 0:
            prop = [row[-1] for row in right].count(class_value) / float(right_size)
            loss += (prop * (1.0 - prop))
    return loss

# Select any n features, and among these n features, select the optimal feature during segmentation
def get_best_spilt(dataSet, n_features):
    features = []
    class_values = list(set(row[-1] for row in dataSet))
    b_index, b_value, b_loss, b_left, b_right = 999, 999, 999, None, None
    while len(features) < n_features:
        index = randrange(len(dataSet[0]) - 1)
        if index not in features:
            features.append(index)
    # print 'features:',features
    for index in features:# find the column that is the best index for the node (with the least loss)
        for row in dataSet:
            left, right = data_spilt(dataSet, index, row[index])# the left and right branches of this node
            loss = spilt_loss(left, right, class_values)
            if loss < b_loss:# Find the minimum segmentation cost
                b_index, b_value, b_loss, b_left, b_right = index, row[index], loss, left, right
    # print b_loss
    # print type(b_index)
    return {'index': b_index, 'value': b_value, 'left': b_left, 'right': b_right}

# determine the output tag
def decide_label(data):
    output = [row[-1] for row in data]
    return max(set(output), key=output.count)


# Sub-segmentation, the process of continuously building leaf nodes
def sub_spilt(root, n_features, max_depth, min_size, depth):
    left = root['left']
    # print left
    right = root['right']
    del (root['left'])
    del (root['right'])
    # print depth
    if not left or not right:
        root['left'] = root['right'] = decide_label(left + right)
        # print 'testing'
        return
    if depth > max_depth:
        root['left'] = decide_label(left)
        root['right'] = decide_label(right)
        return
    if len(left) < min_size:
        root['left'] = decide_label(left)
    else:
        root['left'] = get_best_spilt(left, n_features)
        # print 'testing_left'
        sub_spilt(root['left'], n_features, max_depth, min_size, depth + 1)
    if len(right) < min_size:
        root['right'] = decide_label(right)
    else:
        root['right'] = get_best_spilt(right, n_features)
        # print 'testing_right'
        sub_spilt(root['right'], n_features, max_depth, min_size, depth + 1)

# Construct decision tree
def build_tree(dataSet, n_features, max_depth, min_size):
    root = get_best_spilt(dataSet, n_features)
    sub_spilt(root, n_features, max_depth, min_size, 1)
    return root
  
# Predict test set results
def predict(tree, row):
    if row[tree['index']] < tree['value']:
        if isinstance(tree['left'], dict):
            return predict(tree['left'], row)
        else:
            return tree['left']
    else:
        if isinstance(tree['right'], dict):
            return predict(tree['right'], row)
        else:
            return tree['right']

def bagging_predict(trees, row):
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions), key=predictions.count)

# Create a random forest
def random_forest(train, test, ratio, n_features, max_depth, min_size, n_trees):
    trees = []
    for i in range(n_trees):
        train_subsample = get_subsample(train, ratio)  # Draw a bootstrap subset from the original training set (keep `train` intact for the next trees)
        tree = build_tree(train_subsample, n_features, max_depth, min_size)
        # print 'tree %d: '%i,tree
        trees.append(tree)
    # predict_values = [predict(trees,row) for row in test]
    predict_values = [bagging_predict(trees, row) for row in test]
    return predict_values
  
# Calculation accuracy
def accuracy(predict_values, actual):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predict_values[i]:
            correct += 1
    return correct / float(len(actual))


if __name__ == '__main__':
    seed(1) 
    dataSet = loadCSV('sonar-all-data.csv')
    column_to_float(dataSet)#dataSet
    n_folds = 5
    max_depth = 15
    min_size = 1
    ratio = 1.0
    # n_features=sqrt(len(dataSet)-1)
    n_features = 15
    n_trees = 10
    folds = spiltDataSet(dataSet, n_folds)  # Start by splitting the data set
    scores = []
    for fold in folds:
        train_set = folds[:]  # If we wrote train_set = folds, changing train_set would also change folds, so copy the list: L[:] copies a sequence, D.copy() copies a dict, list(L) makes a copy
        train_set.remove(fold)  # Select the training set
        # print len(folds)
        train_set = sum(train_set, [])  # Combine multiple fold lists into a train_set list
        # print len(train_set)
        test_set = []
        for row in fold:
            row_copy = list(row)
            row_copy[-1] = None
            test_set.append(row_copy)
            # for row in test_set:
            # print row[-1]
        actual = [row[-1] for row in fold]
        predict_values = random_forest(train_set, test_set, ratio, n_features, max_depth, min_size, n_trees)
        accur = accuracy(predict_values, actual)
        scores.append(accur)
    print ('Trees is %d' % n_trees)
    print ('scores:%s' % scores)
    print ('mean score:%s' % (sum(scores) / float(len(scores))))

Second comes from github.com/zhaoxingfen…

# -*- coding: utf-8 -*-
"" @env: Python2.7 @time: 2019/10/24 13:31 @author: zhaoxingfeng @function: Random Forest (RF) @version: V1.2 references: [1] UCI. Wine [DB/OL]. https://archive.ics.uci.edu/ml/machine-learning-databases/wine. "" "
import pandas as pd
import numpy as np
import random
import math
import collections
from sklearn.externals.joblib import Parallel, delayed

class Tree(object):
    """Define a decision tree"""
    def __init__(self):
        self.split_feature = None
        self.split_value = None
        self.leaf_value = None
        self.tree_left = None
        self.tree_right = None

    def calc_predict_value(self, dataset):
        """Find the leaf node for a sample by recursing down the decision tree"""
        if self.leaf_value is not None:
            return self.leaf_value
        elif dataset[self.split_feature] <= self.split_value:
            return self.tree_left.calc_predict_value(dataset)
        else:
            return self.tree_right.calc_predict_value(dataset)

    def describe_tree(self):
        """Print the decision tree in JSON form for easy viewing of the tree structure"""
        if not self.tree_left and not self.tree_right:
            leaf_info = "{leaf_value:" + str(self.leaf_value) + "}"
            return leaf_info
        left_info = self.tree_left.describe_tree()
        right_info = self.tree_right.describe_tree()
        tree_structure = "{split_feature:" + str(self.split_feature) + \
                         ",split_value:" + str(self.split_value) + \
                         ",left_tree:" + left_info + \
                         ",right_tree:" + right_info + "}"
        return tree_structure

class RandomForestClassifier(object):
    def __init__(self, n_estimators=10, max_depth=-1, min_samples_split=2, min_samples_leaf=1,
                 min_split_gain=0.0, colsample_bytree=None, subsample=0.8, random_state=None):
        """Random forest parameters
        ----------
        n_estimators:      number of trees
        max_depth:         tree depth, -1 means unlimited depth
        min_samples_split: minimum number of samples required to split a node
        min_samples_leaf:  minimum number of samples required at a leaf node
        min_split_gain:    minimum gain required to split a node
        colsample_bytree:  column sampling setting, one of ["sqrt", "log2"]; "sqrt" randomly selects
                           sqrt(n_features) features, "log2" randomly selects log(n_features) features,
                           any other value means no column sampling
        subsample:         row sampling ratio (with replacement)
        random_state:      random seed; once set, the n_estimators sample sets generated in each run
                           stay the same, so experiments can be reproduced
        """
        self.n_estimators = n_estimators
        self.max_depth = max_depth if max_depth != -1 else float('inf')
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.min_split_gain = min_split_gain
        self.colsample_bytree = colsample_bytree
        self.subsample = subsample
        self.random_state = random_state
        self.trees = None
        self.feature_importances_ = dict()

    def fit(self, dataset, targets):
        """ Model training entrance """
        assert targets.unique().__len__() == 2, "There must be two classes for targets!"
        targets = targets.to_frame(name='label')

        if self.random_state:
            random.seed(self.random_state)
        random_state_stages = random.sample(range(self.n_estimators), self.n_estimators)

        # Two column sampling methods
        if self.colsample_bytree == "sqrt":
            self.colsample_bytree = int(len(dataset.columns) ** 0.5)
        elif self.colsample_bytree == "log2":
            self.colsample_bytree = int(math.log(len(dataset.columns)))
        else:
            self.colsample_bytree = len(dataset.columns)

        # Build multiple decision trees in parallel
        self.trees = Parallel(n_jobs=-1, verbose=0, backend="threading")(
            delayed(self._parallel_build_trees)(dataset, targets, random_state)
                for random_state in random_state_stages)
        
    def _parallel_build_trees(self, dataset, targets, random_state):
        """Bootstrap sampling with replacement to generate the training sample set, then build one decision tree"""
        subcol_index = random.sample(dataset.columns.tolist(), self.colsample_bytree)
        dataset_stage = dataset.sample(n=int(self.subsample * len(dataset)), replace=True, 
                                        random_state=random_state).reset_index(drop=True)
        dataset_stage = dataset_stage.loc[:, subcol_index]
        targets_stage = targets.sample(n=int(self.subsample * len(dataset)), replace=True, 
                                        random_state=random_state).reset_index(drop=True)

        tree = self._build_single_tree(dataset_stage, targets_stage, depth=0)
        print(tree.describe_tree())
        return tree

    def _build_single_tree(self, dataset, targets, depth):
        """ Recursive establishment of decision tree """
        # If the categories of this node are all the same/the samples are less than the minimum number of samples required for splitting, then the category with the most occurrences is selected. Termination of division
        if len(targets['label'].unique()) <= 1 or dataset.__len__() <= self.min_samples_split:
            tree = Tree()
            tree.leaf_value = self.calc_leaf_value(targets['label'])
            return tree

        if depth < self.max_depth:
            best_split_feature, best_split_value, best_split_gain = self.choose_best_feature(dataset, targets)
            left_dataset, right_dataset, left_targets, right_targets = \
                self.split_dataset(dataset, targets, best_split_feature, best_split_value)

            tree = Tree()
            # If the number of samples in the left/right child after the parent splits is smaller than the minimum leaf size, stop splitting the parent node
            if left_dataset.__len__() <= self.min_samples_leaf or \
                    right_dataset.__len__() <= self.min_samples_leaf or \
                    best_split_gain <= self.min_split_gain:
                tree.leaf_value = self.calc_leaf_value(targets['label'])
                return tree
            else:
                # If this feature is used when splitting, then the importance of this feature is increased by 1
                self.feature_importances_[best_split_feature] = \
                    self.feature_importances_.get(best_split_feature, 0) + 1

                tree.split_feature = best_split_feature
                tree.split_value = best_split_value
                tree.tree_left = self._build_single_tree(left_dataset, left_targets, depth+1)
                tree.tree_right = self._build_single_tree(right_dataset, right_targets, depth+1)
                return tree
        # If the tree depth exceeds the preset value, stop splitting
        else:
            tree = Tree()
            tree.leaf_value = self.calc_leaf_value(targets['label'])
            return tree

    def choose_best_feature(self, dataset, targets):
        """Find the best split of the data set: the optimal splitting feature, splitting threshold, and splitting gain"""
        best_split_gain = 1
        best_split_feature = None
        best_split_value = None

        for feature in dataset.columns:
            if dataset[feature].unique().__len__() <= 100:
                unique_values = sorted(dataset[feature].unique().tolist())
            # If the dimension features too many values, select 100 percentile values as the split threshold to be selected
            else:
                unique_values = np.unique([np.percentile(dataset[feature], x)
                                           for x in np.linspace(0, 100, 100)])

            # Calculate the splitting gain for the possible splitting threshold and select the threshold with the maximum gain
            for split_value in unique_values:
                left_targets = targets[dataset[feature] <= split_value]
                right_targets = targets[dataset[feature] > split_value]
                split_gain = self.calc_gini(left_targets['label'], right_targets['label'])

                if split_gain < best_split_gain:
                    best_split_feature = feature
                    best_split_value = split_value
                    best_split_gain = split_gain
        return best_split_feature, best_split_value, best_split_gain

    @staticmethod
    def calc_leaf_value(targets):
        """Select the category that occurs most frequently in the sample as the value of the leaf node"""
        label_counts = collections.Counter(targets)
        major_label = max(zip(label_counts.values(), label_counts.keys()))
        return major_label[1]

    @staticmethod
    def calc_gini(left_targets, right_targets):
        """Use the Gini index as the criterion for selecting the optimal split point in a classification tree"""
        split_gain = 0
        for targets in [left_targets, right_targets]:
            gini = 1
            # Count how many samples there are for each category, then calculate gini
            label_counts = collections.Counter(targets)
            for key in label_counts:
                prob = label_counts[key] * 1.0 / len(targets)
                gini -= prob ** 2
            split_gain += len(targets) * 1.0 / (len(left_targets) + len(right_targets)) * gini
        return split_gain

    @staticmethod
    def split_dataset(dataset, targets, split_feature, split_value):
        """Split the samples into left and right parts according to the feature and threshold: left side <= threshold, right side > threshold"""
        left_dataset = dataset[dataset[split_feature] <= split_value]
        left_targets = targets[dataset[split_feature] <= split_value]
        right_dataset = dataset[dataset[split_feature] > split_value]
        right_targets = targets[dataset[split_feature] > split_value]
        return left_dataset, right_dataset, left_targets, right_targets

    def predict(self, dataset):
        """Predict the category of each input sample"""
        res = []
        for _, row in dataset.iterrows():
            pred_list = []
            # Count the predicted results of each tree and select the results with the most occurrences as the final category
            for tree in self.trees:
                pred_list.append(tree.calc_predict_value(row))

            pred_label_counts = collections.Counter(pred_list)
            pred_label = max(zip(pred_label_counts.values(), pred_label_counts.keys()))
            res.append(pred_label[1])
        return np.array(res)


if __name__ == '__main__':
    df = pd.read_csv("source/wine.txt")
    df = df[df['label'].isin([1, 2])].sample(frac=1, random_state=66).reset_index(drop=True)
    clf = RandomForestClassifier(n_estimators=5,
                                 max_depth=5,
                                 min_samples_split=6,
                                 min_samples_leaf=2,
                                 min_split_gain=0.0,
                                 colsample_bytree="sqrt",
                                 subsample=0.8,
                                 random_state=66)
    train_count = int(0.7 * len(df))
    feature_list = ["Alcohol"."Malic acid"."Ash"."Alcalinity of ash"."Magnesium"."Total phenols"."Flavanoids"."Nonflavanoid phenols"."Proanthocyanins"."Color intensity"."Hue"."OD280/OD315 of diluted wines"."Proline"]
    clf.fit(df.loc[:train_count, feature_list], df.loc[:train_count, 'label'])

    from sklearn import metrics
    print(metrics.accuracy_score(df.loc[:train_count, 'label'], clf.predict(df.loc[:train_count, feature_list])))
    print(metrics.accuracy_score(df.loc[train_count:, 'label'], clf.predict(df.loc[train_count:, feature_list])))


0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

If you want to get timely updates of my articles, or to see the technical material I recommend, please follow the account.

0xFF Reference

The “simplest” Bootstrap method in statistics and its application blog.csdn.net/SunJW_2017/…

The Bootstrap classifier combination methods, Boosting, Bagging, random forests (a) blog.csdn.net/zjsghww/art…

Ensemble Learning Algorithm and Boosting algorithm principle baijiahao.baidu.com/s?id=161998…

Machine learning algorithm GBDT

Finally, someone said it — The XGBoost algorithm

The principle of XGBoost is not as difficult as you think www.jianshu.com/p/7467e616f…

Why hasn’t anyone applied Boosting’s idea to deep learning? www.zhihu.com/question/53…

Ensemble learning: the Bagging method and the Boosting method blog.csdn.net/qq_30189255…

Summary: the Bootstrap method, Bagging, Boosting www.jianshu.com/p/708dff71d…

Blog.csdn.net/zjsghww/art…

Integrated Learning (Ensemble Learning) blog.csdn.net/wydbyxr/art…

AdaBoost introductory tutorial: the most accessible introduction to the principle www.uml.org.cn/sjjmwj/2019…

Boosting and Bagging: How to Develop a Robust machine learning algorithm ai.51cto.com/art/201906/…

Why bagging reduces variance and Boosting reduces bias? Blog.csdn.net/sinat_25394…

Understanding the bias and variance of the two ensemble models, bagging and boosting blog.csdn.net/shenxiaomin…

Analyzing random forest in a comprehensible way blog.csdn.net/cg896406166…

Python implementation of random forest blog.csdn.net/colourful_s…

Github.com/zhaoxingfen…

Data analysis (tool) – Boosting zhuanlan.zhihu.com/p/26215100

AdaBoost principle explained in detail

Knowledge – based on AdaBoost classification problem www.jianshu.com/p/a6426f4c4…

Interview data mining problem of gradient increase tree www.jianshu.com/p/0e5ccc88d…