This is the third installment of the Machine Learning Bible series. After reading it, you will be able to use the evaluation metrics for classification and regression algorithms.

PS: Exercises are attached at the end of the article.

After reading about machine learning algorithms, you already know what underfitting and overfitting, bias and variance, and Bayes error are. In this article, I will introduce some metrics used to evaluate model performance offline in machine learning.

After we have trained multiple models, how do we measure their performance? We need a measure of how “good” a model is, which we call an evaluation metric. When comparing different models, different evaluation metrics often lead to different conclusions, which means that the quality of a model is relative.

Different types of learning tasks use different evaluation metrics. Here we introduce the metrics most commonly used for classification and regression.

Classification metrics

Most classification problems in practice are binary classification problems, so we use binary classification as the example to explain the metrics related to classification.

Before formally introducing the metrics, a few basic conventions: “positive,” “true,” and “1” all refer to the same thing, and “negative,” “false,” and “0” all refer to the same thing. For example, if the model predicts 1 for a sample, we can equally say that the model predicts the sample to be true, or positive; these all mean the same thing.

Confusion matrix

The confusion matrix is a commonly used tool for evaluating classification problems. For k-class classification, it is a k × k table that records the classifier's predictions against the true labels. For the common binary case, the confusion matrix is 2 × 2.

In binary classification, samples can be divided into true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) based on the combination of their true labels and the model's predictions. From TP, TN, FP, and FN, the binary confusion matrix can be built.
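As an illustration, the minimal Python sketch below (with made-up labels) counts TP, TN, FP, and FN and assembles the 2 × 2 confusion matrix.

```python
# A minimal sketch with made-up labels: counting TP, TN, FP, FN
# for a binary problem, where 1 is the positive class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical model predictions

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# 2x2 confusion matrix: rows are actual classes, columns are predicted classes
confusion = [[TN, FP],
             [FN, TP]]
print(confusion)
```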

Accuracy

Accuracy is the proportion of samples correctly predicted by the model (both positive and negative) among the total number of samples, that is

$$\text{Accuracy} = \frac{n_{\text{correct}}}{n_{\text{total}}}$$

where $n_{\text{correct}}$ is the number of samples correctly classified by the model and $n_{\text{total}}$ is the total number of samples.

In binary classification, accuracy can be computed with the following formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Accuracy is one of the simplest and most intuitive evaluation metrics for classification, but it has limitations. For example, in a binary problem where negative samples make up 99 percent of the data, a model that predicts every sample as negative achieves 99 percent accuracy. Although the accuracy looks high, such a model is actually useless because it never finds a positive sample.
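A tiny sketch of this pitfall, assuming a hypothetical dataset with 99 negatives and 1 positive:

```python
# Sketch of the class-imbalance pitfall described above: 99% negative samples,
# so a "model" that always predicts the negative class still scores 99% accuracy.
y_true = [0] * 99 + [1]          # hypothetical data: 99 negatives, 1 positive
y_pred = [0] * 100               # predicts negative for everything

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # 0.99, yet the model never finds a positive
```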

Precision

Precision is the proportion of samples that are actually positive among all samples the model predicts as positive, i.e.

$$\text{Precision} = \frac{TP}{TP + FP}$$

For example, if the police arrest 10 people while trying to catch thieves and 6 of them turn out to be thieves, the precision is 6/10 = 0.6.

Recall

Recall, sometimes called the recall rate, is the proportion of samples the model predicts as positive and that are actually positive among all samples that are actually positive, i.e.

$$\text{Recall} = \frac{TP}{TP + FN}$$

Continuing the example of the police catching thieves: of the 10 people arrested, 6 are thieves, and another 3 thieves escape, so the recall is 6 / (6 + 3) ≈ 0.67.

F1 / Fα score

Generally speaking, precision and recall are in tension: when precision is high, recall tends to be lower, and when recall is high, precision tends to be lower. The F1 score was therefore designed to take both into account. F1 is the harmonic mean of precision and recall, i.e.

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

In some cases we weight precision and recall differently, and the more general form Fα of the F1 score covers this. Fα is defined as

$$F_\alpha = \frac{(1 + \alpha^2) \times \text{Precision} \times \text{Recall}}{\alpha^2 \times \text{Precision} + \text{Recall}}$$

where the size of α reflects the relative importance of recall with respect to precision (α = 1 recovers F1).
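To make these formulas concrete, here is a small Python sketch (the helper name precision_recall_f is purely illustrative) that computes precision, recall, and Fα from TP, FP, and FN, checked against the thief-catching numbers above.

```python
def precision_recall_f(tp, fp, fn, alpha=1.0):
    """Sketch of precision, recall and the F-alpha score (alpha=1 gives F1).
    Assumes the definition Fα = (1 + α²)·P·R / (α²·P + R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_alpha = (1 + alpha**2) * precision * recall / (alpha**2 * precision + recall)
    return precision, recall, f_alpha

# The thief-catching example: 10 people arrested, 6 are thieves, 3 thieves escaped.
print(precision_recall_f(tp=6, fp=4, fn=3))   # approximately (0.6, 0.667, 0.632)
```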

Multi-class classification

In practice we often encounter multi-class problems, where every pairwise combination of classes corresponds to a binary confusion matrix. Suppose we obtain n binary confusion matrices; how do we average the n results?

Macro average

The first method is to compute the metric separately on each confusion matrix and then average the results; this is called the “macro average”.

Micro average

In addition to the macro average, we can also average the corresponding elements of the binary confusion matrices to obtain average values of TP, TN, FP, and FN, and then compute the metric from these averaged values; this is called the “micro average”.
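A rough sketch of the difference, using precision as the metric and three hypothetical (TP, FP, FN) triples in place of real confusion matrices:

```python
# Sketch of macro vs. micro averaging over several binary confusion matrices.
# Each tuple is a hypothetical (TP, FP, FN) from one binary confusion matrix.
matrices = [(10, 2, 3), (4, 5, 1), (7, 1, 6)]

# Macro average: compute precision per matrix, then average the results.
macro_p = sum(tp / (tp + fp) for tp, fp, _ in matrices) / len(matrices)

# Micro average: aggregate the matrix elements first, then compute precision once.
tp_sum = sum(tp for tp, fp, fn in matrices)
fp_sum = sum(fp for tp, fp, fn in matrices)
micro_p = tp_sum / (tp_sum + fp_sum)

print(macro_p, micro_p)
```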

ROC

The metrics above (accuracy, precision, recall, etc.) all require the model's predicted class (positive or negative). For many models, however, the raw output is a probability of belonging to the positive class, so a threshold must be specified: samples above the threshold are predicted positive, otherwise negative. The setting of this threshold directly affects the model's measured generalization ability.

There is an evaluation tool called the Receiver Operating Characteristic (ROC) curve that does not require specifying a threshold. The vertical axis of the ROC curve is the true positive rate (TPR) and the horizontal axis is the false positive rate (FPR).

The true positive rate and false positive rate are defined as follows:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

Note that the formula for TPR is the same as that for recall. So how is the ROC curve plotted? The ROC curve is made up of a series of (FPR, TPR) points, but a model with a fixed threshold yields only one classification result, and therefore only a single (FPR, TPR) point. How do we get multiple points?

We sort the model's predicted values (the probabilities of belonging to the positive class) for all samples in descending order, then take each predicted probability in turn as the threshold. Each threshold gives the numbers of samples predicted positive and negative, which yields one (FPR, TPR) pair, i.e. one point on the ROC curve. Connecting all these points produces the ROC curve. Clearly, the more thresholds we evaluate, the more (FPR, TPR) pairs we generate and the smoother the drawn curve. In other words, the smoothness of the ROC curve depends on the number of threshold settings, not directly on the number of samples. In practice, most ROC curves we draw are not smooth.
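The procedure can be sketched in a few lines of Python; the labels and scores below are made up purely for illustration, and ties in the scores are not handled:

```python
# Sketch of how ROC points are generated: sort samples by predicted probability
# (descending) and take each score in turn as the threshold.
y_true = [1, 1, 0, 1, 0, 0]                  # hypothetical labels
scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.2]     # hypothetical predicted P(positive)

P = sum(y_true)              # number of actual positives
N = len(y_true) - P          # number of actual negatives

pairs = sorted(zip(scores, y_true), reverse=True)
roc_points = [(0.0, 0.0)]
tp = fp = 0
for score, label in pairs:   # lowering the threshold one sample at a time
    if label == 1:
        tp += 1
    else:
        fp += 1
    roc_points.append((fp / N, tp / P))      # one (FPR, TPR) point per threshold

print(roc_points)
```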

The closer the ROC curve is to the upper-left corner, the better the model. The upper-left coordinate is (0, 1), i.e. FPR = 0 and TPR = 1, which means FP = 0 and FN = 0; this is a perfect model, because it classifies every sample correctly. Points on the diagonal y = x indicate that the model's discriminating ability is no better than random guessing.

AUC

AUC (Area Under Curve) is defined as the area under the ROC curve. Obviously, the AUC cannot exceed 1. The ROC curve usually lies above the line y = x, so the AUC generally falls between 0.5 and 1.

How should AUC be interpreted? Randomly select one positive sample and one negative sample, and let the model predict the probability that each belongs to the positive class. The probability that the positive sample is ranked ahead of the negative sample (by predicted probability) is the AUC value.

AUC can be calculated with the following formula:

$$AUC = \frac{\sum_{i \in \text{positive}} rank_i - \frac{|P|(|P|+1)}{2}}{|P| \times |N|}$$

where $rank_i$ is the rank of the i-th positive sample after sorting all samples by the model's predicted probability in ascending order (ranks start from 1), $|P|$ is the number of positive samples, and $|N|$ is the number of negative samples.

Note that if several samples receive the same predicted probability, their original ranks are summed and averaged, so the order among samples with equal probability scores does not matter.
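One possible implementation of this rank-based formula, including the average-rank treatment of ties described above (the helper name auc_by_rank is illustrative, not a library function):

```python
# Sketch of the rank-based AUC formula above, with average ranks for ties.
def auc_by_rank(y_true, scores):
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])   # ascending by probability
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + 1 + j + 1) / 2          # ranks start from 1; ties share the average
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_ranks = [r for r, y in zip(ranks, y_true) if y == 1]
    n_pos, n_neg = len(pos_ranks), n - len(pos_ranks)
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Usage on hypothetical data (two tied scores of 0.6):
print(auc_by_rank([1, 0, 1, 0], [0.9, 0.4, 0.6, 0.6]))   # 0.875
```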

Logarithmic loss

Logarithmic loss (logloss) is based on the likelihood of the predicted probabilities, and its standard form is:

$$L(Y, P(Y \mid X)) = -\log P(Y \mid X)$$

Minimizing the logarithmic loss is essentially maximum likelihood estimation: it uses the known distribution of the samples to find the model parameters under which that distribution is most probable.

For binary classification, the logarithmic loss is

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where N is the number of samples, $y_i$ is the true label of the i-th sample, and $p_i$ is the predicted probability that the i-th sample is 1.

Logarithmic loss can also be used in multi-class problems, where it is computed as

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\log p_{ij}$$

where N is the number of samples, C is the number of classes, $y_{ij}$ indicates whether the i-th sample belongs to class j, and $p_{ij}$ is the predicted probability that the i-th sample belongs to class j.

Logloss measures the difference between the predicted probability distribution and the real probability distribution. The smaller the value, the better.
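Both log-loss formulas can be sketched directly in Python; clipping the probabilities with a small eps is an implementation detail added here to avoid log(0), not part of the formulas above:

```python
import math

def binary_logloss(y_true, p_pred, eps=1e-15):
    """Sketch of the binary log-loss formula."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)        # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def multiclass_logloss(y_true, p_pred, eps=1e-15):
    """Sketch of the multi-class log-loss formula.
    y_true[i] is the class index of sample i; p_pred[i][j] is P(sample i is class j)."""
    total = 0.0
    for y, probs in zip(y_true, p_pred):
        p = min(max(probs[y], eps), 1 - eps)
        total += math.log(p)
    return -total / len(y_true)

print(binary_logloss([1, 0, 1], [0.9, 0.2, 0.7]))
print(multiclass_logloss([0, 2, 1], [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]))
```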

Regression metrics

Regression tasks also have their own evaluation metrics. Let's take a look.

Mean absolute error

The Mean Absolute Error (MAE) formula is:

$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$

where N is the number of samples, $y_i$ is the true value of the i-th sample, and $\hat{y}_i$ is the predicted value of the i-th sample.

Mean squared error

The Mean Squared Error (MSE) formula is:

$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

Mean absolute percentage error

The Mean Absolute Percentage Error (MAPE) formula is:

$$MAPE = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100$$

MAPE measures prediction quality as an absolute percentage error; the smaller, the better. If MAPE = 10, the forecast is off by 10% on average.

Since MAPE is dimensionless, it can be used to compare different problems in some scenarios. However, its drawbacks are also obvious: it is undefined when $y_i = 0$. It should also be noted that MAPE penalizes negative errors more heavily than positive errors. For example, for a predicted hotel spend of 200 yuan, a true value of 150 yuan gives a larger MAPE term (|200 − 150| / 150 ≈ 33%) than a true value of 250 yuan (|200 − 250| / 250 = 20%).

Root mean square error

The formula of Root Mean Squared Error (RMSE) is:

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

RMSE represents the sample standard deviation of the differences between predicted and true values. Compared with MAE, RMSE penalizes samples with large errors more heavily. Its drawback is that it is sensitive to outliers, which can make the RMSE very large.

There is also a commonly used variant of RMSE called the Root Mean Squared Logarithmic Error (RMSLE), which is expressed as:

$$RMSLE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log(y_i + 1) - \log(\hat{y}_i + 1)\right)^2}$$

RMSLE penalizes under-prediction more severely than over-prediction. For example, if a hotel's true average price is 200 yuan, a prediction of 150 yuan is penalized more than a prediction of 250 yuan.
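For reference, here is a compact Python sketch of the regression metrics discussed above (MAE, MSE, RMSE, MAPE, RMSLE), implemented directly from their formulas and run on made-up values:

```python
import math

def regression_metrics(y_true, y_pred):
    """Sketch of MAE, MSE, RMSE, MAPE and RMSLE computed from their formulas."""
    n = len(y_true)
    mae  = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
    mse  = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(mse)
    mape = sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / n * 100   # undefined if any y == 0
    rmsle = math.sqrt(sum((math.log(y + 1) - math.log(p + 1)) ** 2
                          for y, p in zip(y_true, y_pred)) / n)
    return mae, mse, rmse, mape, rmsle

# Hypothetical hotel-price style values, just to exercise the function.
print(regression_metrics([200, 150, 250], [180, 160, 230]))
```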

R2

The formula of R2 (R-squared) is:

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$$

R2 measures the proportion of the variation in the dependent variable that can be explained by the independent variables; its value usually ranges from 0 to 1. The closer R2 is to 1, the larger the share of the total sum of squares accounted for by the regression sum of squares, the closer the regression line is to the observed points, the more of the variation in y is explained by changes in x, and the better the regression fit.
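A minimal sketch of this computation, expressing R2 as one minus the ratio of the residual sum of squares to the total sum of squares, on hypothetical values:

```python
# Sketch of R² computed as 1 - SS_res / SS_tot.
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)              # total sum of squares
    return 1 - ss_res / ss_tot

print(r_squared([3.0, 5.0, 7.0, 9.0], [2.8, 5.3, 6.9, 9.2]))     # close to 1 for a good fit
```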

exercises

After reading this article, let’s do some exercises to test the learning results:

  1. Why is there no absolute relationship between the smoothness of the ROC curve and the number of samples?

  2. If the AUC of a model is less than 0.5, what might be the cause?

  3. In a traffic-prediction scenario, several regression models were tried, but the RMSE values obtained were all very high. What might be the reasons?

  4. In a binary classification problem, the true labels of 15 samples are [0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0] and the model's predictions are [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1]. Compute the accuracy, precision, recall, and F1 score.

  5. In a binary classification problem, the true labels of 7 samples [A, B, C, D, E, F, G] are [1, 1, 0, 0, 1, 1, 0] and the model's predicted probabilities are [0.8, 0.7, 0.5, 0.5, 0.5, 0.5, 0.3]. Compute the AUC value.

All the answers to the above exercises will be published in my Knowledge Planet. In addition, if you have any questions about this article or want to study and discuss further, you are welcome to join my Knowledge Planet.
