
1 Important parameters

1.1 criterion

Mean squared error (MSE) is used to measure the quality of a split in a regression tree: the reduction in mean squared error between the parent node and its leaf nodes is the criterion for feature selection, and the L2 loss is minimized by using the mean of each leaf node. Besides this, criterion also supports an improved version of the mean squared error ("friedman_mse") and the mean absolute error ("mae"), which uses the median of each leaf node to minimize the L1 loss. The most important attribute is still feature_importances_, and the core interfaces are still apply, fit, predict, and score.
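As a minimal sketch (assuming the same older sklearn version as the cross-validation example below, where load_boston and criterion="mse" are still available), the criterion is simply passed to DecisionTreeRegressor, and the usual attribute and interfaces are available after fitting:

from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor

boston = load_boston()
# "mse" is the default criterion; "friedman_mse" and "mae" are the other options
regressor = DecisionTreeRegressor(criterion="mse", random_state=42)
regressor.fit(boston.data, boston.target)
print(regressor.feature_importances_)               # importance of each feature
print(regressor.score(boston.data, boston.target))  # R squared on the training data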

MSE = (1/N) · Σ (fᵢ − yᵢ)², where N is the number of samples, i indexes each sample, fᵢ is the value predicted by the regression model, and yᵢ is the actual label of sample i. The essence of MSE is the difference between the real sample labels and the regression results. In regression trees, MSE is not only a branching-quality criterion but also the most commonly used metric for measuring the quality of the regression itself: when using cross-validation, mean squared error is often chosen as the scoring metric (in classification trees, the score is the prediction accuracy). In regression, the smaller the MSE, the better. Note, however, that score, the interface of the regression tree, returns R squared, not MSE.
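To make the definition concrete, here is a small sketch with made-up numbers that computes the MSE by hand and checks it against sklearn.metrics.mean_squared_error:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # yi: actual labels (made-up values)
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # fi: the model's regression values (made-up)
mse = np.mean((y_pred - y_true) ** 2)     # (1/N) * sum of (fi - yi)^2
print(mse, mean_squared_error(y_true, y_pred))  # the two computations agree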

R² = 1 − u/v, where u = Σ (yᵢ − fᵢ)² is the residual sum of squares (equal to MSE × N), v = Σ (yᵢ − ȳ)² is the total sum of squares, N is the number of samples, i indexes each sample, fᵢ is the value predicted by the regression model, yᵢ is the actual label of sample i, and ȳ is the mean of the true labels. R squared can be positive or negative (if the model's residual sum of squares is far greater than its total sum of squares, the model is truly poor and R squared will be negative), whereas the mean squared error is always positive. When computing a model's evaluation metric, sklearn takes the nature of the metric into account: since the mean squared error is a kind of error, sklearn treats it as a loss of the model and therefore reports it as a negative number, neg_mean_squared_error. The true MSE is obtained by dropping the negative sign from neg_mean_squared_error.
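The relationship between the two metrics is easy to verify; the following sketch (using the same made-up arrays as above) computes R squared from u and v by hand and shows that neg_mean_squared_error is simply the MSE with the sign flipped:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])     # made-up labels
y_pred = np.array([2.5, 5.0, 4.0, 8.0])     # made-up regression values
u = ((y_true - y_pred) ** 2).sum()          # residual sum of squares = MSE * N
v = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
print(1 - u / v, r2_score(y_true, y_pred))  # both give R squared
neg_mse = -u / len(y_true)                  # what 'neg_mean_squared_error' reports
print(-neg_mse)                             # drop the sign to recover the true MSE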

2 How cross-validation of regression trees works

from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

boston = load_boston()
regressor = DecisionTreeRegressor(random_state=42)
cross_val_score(regressor, boston.data, boston.target, cv=10,
                scoring='neg_mean_squared_error')

Even if we find parameter values that work very well on the particular training and test sets we happened to use, we do not know whether the model would also do well on unseen data, so cross-validation is applied. Cross-validation is a method for observing the stability of a model: the data is divided into n folds, each fold is used in turn as the test set while the other n − 1 folds are used as the training set, and the model's accuracy is computed several times to evaluate its average performance. Because the particular split into training and test sets can influence the results, the average over n rounds of cross-validation is a better measure of the model's effectiveness.
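Following this idea, the n results returned by cross_val_score can simply be averaged. This sketch continues the Boston example above (the sign is flipped to turn neg_mean_squared_error back into MSE):

scores = cross_val_score(regressor, boston.data, boston.target,
                         cv=10, scoring='neg_mean_squared_error')
print(-scores)         # MSE of each of the 10 folds
print(-scores.mean())  # average MSE over the 10 rounds of cross-validation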

For the other parameters, refer to the decision tree classifier.

3 Advantages and disadvantages of decision trees

3.1 Advantages of decision trees

  • Easy to understand and interpret; trees can be drawn and visualized.
  • Requires very little data preparation. Many other algorithms typically require data normalization, creating dummy variables, removing null values, and so on. Note, however, that the decision tree module in sklearn does not support missing values.
  • The cost of using the tree (for example, when making predictions) is logarithmic in the number of data points used to train it, which is very low compared with other algorithms.
  • Can handle both numerical and categorical data, and can do both regression and classification. Other techniques are often specialized for analyzing data sets that contain only one type of variable.
  • Can handle multi-output problems, that is, problems with multiple labels.
  • Is a white-box model whose results are easy to interpret. If a given situation can be observed in the model, the condition can be easily explained by Boolean logic. In a black-box model (for example, an artificial neural network), by contrast, the results may be harder to interpret.
  • Statistical tests can be used to validate the model, which allows us to assess the model's reliability.
  • Can perform well even if its assumptions are somewhat violated by the true model from which the data were generated.

3.2 Disadvantages of decision trees

  • Decision tree learners may create overly complex trees that do not generalize the data well; this is overfitting. Pruning, setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree is necessary to avoid this problem, and integrating and tuning these parameters can be obscure for beginners (see the sketch after this list).
  • Decision trees can be unstable: small changes in the data can lead to a completely different tree. This problem needs to be addressed with ensemble algorithms.
  • Decision tree learning is based on a greedy algorithm, which tries to reach the global optimum by optimizing locally (the optimum at each node), but this approach cannot guarantee a globally optimal decision tree. This problem can also be mitigated by ensemble algorithms, in which features and samples are randomly sampled during branching.
  • Some concepts are difficult to learn because they cannot easily be expressed by decision trees.
  • If some classes in the label are dominant, the decision tree learner will create a tree biased toward the dominant classes. It is therefore recommended to balance the data set before fitting a decision tree.
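As mentioned in the first point above, a minimal sketch of restraining tree growth to curb overfitting might look like the following (the specific values of max_depth and min_samples_leaf are arbitrary and only illustrative, and the older load_boston data set is reused from the earlier examples):

from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor

boston = load_boston()
# limit the depth of the tree and require a minimum number of samples per leaf
pruned = DecisionTreeRegressor(max_depth=5,          # illustrative value, not a recommendation
                               min_samples_leaf=10,  # illustrative value, not a recommendation
                               random_state=42)
pruned.fit(boston.data, boston.target)
print(pruned.get_depth())  # the fitted tree does not exceed the depth limit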