
What are evaluation metrics?

An evaluation metric is a quantitative measure of model performance. A single metric can only reflect part of a model's behaviour, and a poorly chosen metric can lead to wrong conclusions. Different metrics should therefore be selected according to the specific data and model.

Different types of learning tasks call for different evaluation metrics. Here we introduce the most common metrics for classification algorithms: Accuracy, Precision, Recall, F1 Score, the P-R (precision-recall) curve, ROC and AUC.

Basic concept – confusion matrix

The confusion matrix is a commonly used tool for evaluating classification problems. For K-class classification it is a K x K table that records the classifier's predictions against the true labels. For the common binary case, the confusion matrix is 2 x 2.

In binary classification, each sample falls into one of four groups according to the combination of its true label and the model's prediction: true positive (TP), true negative (TN), false positive (FP) or false negative (FN). From these four counts we obtain the confusion matrix of the binary problem, described below.

  • TP: True Positives, the number of samples whose true label is positive and which the classifier predicts as positive
  • FP: False Positives, the number of samples whose true label is negative but which the classifier predicts as positive
  • FN: False Negatives, the number of samples whose true label is positive but which the classifier predicts as negative
  • TN: True Negatives, the number of samples whose true label is negative and which the classifier predicts as negative

Note:

The first letter indicates whether the prediction is correct: T (True) means the classifier's judgment matches the true label, and F (False) means it does not.

The second letter indicates the classifier's prediction: P for a positive example and N for a negative example.
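As a minimal sketch (assuming scikit-learn is available and using made-up labels purely for illustration), the four counts can be read straight off a confusion matrix:

```python
from sklearn.metrics import confusion_matrix

# Toy binary labels: 1 = positive, 0 = negative (illustrative values only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels sorted as [0, 1], the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=3, FP=1, FN=1, TN=3
```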

Evaluation metrics for classification algorithms

Accuracy

Accuracy is the proportion of correctly classified samples among all samples. It is a statistic over the whole sample set and is defined as:


$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} = \frac{\text{number of correctly predicted samples}}{\text{total number of samples}}$$

Accuracy gives a clear overall judgment of model performance, but it has a serious defect: when positive and negative samples are imbalanced, the majority class dominates the metric, so Accuracy no longer reflects how the model really behaves.

For example, suppose a test set contains 99 positive samples and 1 negative sample, and the model predicts every sample as positive. Its Accuracy is 99%, which looks excellent on paper, yet the model has no real predictive power.
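A quick sketch of this pitfall with scikit-learn (assumed installed), reproducing the 99%-accuracy example above with synthetic labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 99 positive samples, 1 negative sample, and a model that blindly predicts "positive".
y_true = np.array([1] * 99 + [0])
y_pred = np.ones(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.99, even though the model has no real skill
```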

Precision

Precision (also called the precision rate) is a metric focused on the prediction results: it is the proportion of correctly classified positive samples among all samples that the classifier predicts as positive. Precision is a statistic over a subset of the samples, namely those predicted to be positive. It is defined as:


$$Precision = \frac{TP}{TP + FP} = \frac{\text{number of correctly classified positive samples}}{\text{number of samples predicted as positive}}$$

For multiple classes, summed over all labels:


$$Precision = \frac{\sum_{l=1}^{L} TP_l}{\sum_{l=1}^{L} (TP_l + FP_l)}$$

where the sums run over all $L$ labels.
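A minimal sketch (toy labels, scikit-learn assumed) computing precision both from the definition and with `precision_score`:

```python
from sklearn.metrics import confusion_matrix, precision_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fp))                   # 0.75, precision from the definition
print(precision_score(y_true, y_pred))  # 0.75, the same value from scikit-learn
```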

Recall

Recall is the proportion of correctly classified positive samples among all truly positive samples. It is also a statistic over a subset of the samples, in this case the genuinely positive ones. It is defined as:


$$Recall = \frac{TP}{TP + FN} = \frac{\text{number of correctly classified positive samples}}{\text{number of true positive samples}}$$

For multiple classes, summed over all labels:


$$Recall = \frac{\sum_{l=1}^{L} TP_l}{\sum_{l=1}^{L} (TP_l + FN_l)}$$

where the sums run over all $L$ labels.
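A similar sketch for recall (same toy labels, scikit-learn assumed); it also previews the trade-off discussed next by predicting everything as positive:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(recall_score(y_true, y_pred))  # 0.75 = TP / (TP + FN) = 3 / 4

# Predicting every sample as positive drives recall to 1.0 while precision drops.
y_all_pos = [1] * len(y_true)
print(recall_score(y_true, y_all_pos))     # 1.0
print(precision_score(y_true, y_all_pos))  # 0.5 (4 true positives out of 8 predicted positives)
```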

The trade-off between Precision and Recall

High precision means that the classifier only predicts a sample as positive when it is quite "confident", so precision reflects the model's ability to distinguish negative samples: the higher the precision, the better the model separates out the negatives.

High recall means that the classifier predicts as positive as many of the likely positive samples as possible, so recall reflects the model's ability to distinguish positive samples: the higher the recall, the stronger the model is at catching the positives.

From this analysis it is clear that precision and recall pull against each other. If the classifier only labels the highest-probability samples as positive, it will miss many true positives whose predicted probability is lower, and recall drops.

So how do we choose between models when one has better Recall and the other better Precision? This is where the F1 Score comes in.

F1 Score

The F1 Score is the harmonic mean of precision and recall, so it takes both into account at the same time. In statistics it is used to measure the accuracy of binary (and, class by class, multi-class) models. It ranges from 0 to 1, and the larger the value, the better the model. It is defined as:


$$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$
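A short sketch (toy labels, scikit-learn assumed) showing that `f1_score` matches the harmonic-mean formula above:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)  # 0.75
r = recall_score(y_true, y_pred)     # 0.75
print(2 * p * r / (p + r))           # harmonic mean from the definition
print(f1_score(y_true, y_pred))      # 0.75, the same value from scikit-learn
```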

F-Beta Score

The more general $F_\beta$ score combines precision and recall into a single number; in the combination, recall is given $\beta$ times the weight of precision. We define the $F_\beta$ score as:


$$F_\beta = (1 + \beta^2) \cdot \frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$$

$\beta$ is essentially the weight ratio of Recall to Precision. When $\beta = 2$, the $F_2$ score weights Recall more heavily than Precision; when $\beta = 0.5$, the $F_{0.5}$ score weights Recall less heavily than Precision.
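A small sketch (toy labels chosen so that precision and recall differ, scikit-learn assumed) showing how $\beta$ pulls the score towards recall or precision:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Labels chosen so that precision (2/3) and recall (1/2) differ.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

print(precision_score(y_true, y_pred))        # ~0.667
print(recall_score(y_true, y_pred))           # 0.5
print(fbeta_score(y_true, y_pred, beta=2))    # ~0.526, pulled towards recall
print(fbeta_score(y_true, y_pred, beta=0.5))  # ~0.625, pulled towards precision
```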

Macro-averaged F1 score (Macro F1)

The macro-averaged F1 first computes the Precision and Recall of each category separately, and then averages them:


$$Precision_i = \frac{TP_i}{TP_i + FP_i}$$

$$Precision_{macro} = \frac{\sum_{i=1}^{L} Precision_i}{|L|}$$

$$Recall_i = \frac{TP_i}{TP_i + FN_i}$$

$$Recall_{macro} = \frac{\sum_{i=1}^{L} Recall_i}{|L|}$$

The macro-averaged F1 score is then:


$$F1_{macro} = 2 \cdot \frac{Precision_{macro} \cdot Recall_{macro}}{Precision_{macro} + Recall_{macro}}$$

Note: Macro F1 is essentially an arithmetic mean of the per-class statistics, so this simple average ignores any large imbalance in the class distribution.
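A sketch of the macro average on a toy 3-class problem (labels made up for illustration, scikit-learn assumed). Note that scikit-learn's own `average='macro'` F1 averages the per-class F1 scores rather than combining macro-averaged precision and recall, so the two numbers can differ slightly:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy 3-class problem (classes 0, 1, 2), values chosen only for illustration.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 1]

# Per-class precision/recall averaged with equal class weights.
p_macro = precision_score(y_true, y_pred, average='macro')
r_macro = recall_score(y_true, y_pred, average='macro')
print(2 * p_macro * r_macro / (p_macro + r_macro))  # F1 built from the macro-averaged P and R

# scikit-learn's macro F1 averages the per-class F1 scores instead.
print(f1_score(y_true, y_pred, average='macro'))
```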

Micro-averaged F1 score (Micro F1)

The micro-averaged F1 instead pools all classes together when computing Precision and Recall, summing the TP, FP and FN counts over every class:


$$Precision_{micro} = \frac{\sum_{i=1}^{L} TP_i}{\sum_{i=1}^{L} TP_i + \sum_{i=1}^{L} FP_i}$$

$$Recall_{micro} = \frac{\sum_{i=1}^{L} TP_i}{\sum_{i=1}^{L} TP_i + \sum_{i=1}^{L} FN_i}$$

The micro-averaged F1 score is then:


$$F1_{micro} = 2 \cdot \frac{Precision_{micro} \cdot Recall_{micro}}{Precision_{micro} + Recall_{micro}}$$
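A sketch of the micro average on the same toy 3-class labels (scikit-learn assumed). Because micro-averaging pools TP, FP and FN over all classes, micro precision, micro recall, micro F1 and plain accuracy all coincide for single-label multi-class data:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 1]

print(precision_score(y_true, y_pred, average='micro'))  # 0.6
print(recall_score(y_true, y_pred, average='micro'))     # 0.6
print(f1_score(y_true, y_pred, average='micro'))         # 0.6
print(accuracy_score(y_true, y_pred))                    # 0.6
```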

The difference between Macro and Micro

Macro averages over classes (every class counts equally), while Micro averages over samples (every sample counts equally), so the two can diverge sharply. For example, consider a four-class problem with:

  • Class A: 1 TP, 1 FP
  • Class B: 10 TP, 90 FP
  • Class C: 1 TP, 1 FP
  • Class D: 1 TP, 1 FP

Then the calculation for Precision is as follows:


$$P_A = P_C = P_D = \frac{1}{1+1} = 0.5, \qquad P_B = \frac{10}{10+90} = 0.1$$

$$P_{macro} = \frac{0.5 + 0.1 + 0.5 + 0.5}{4} = 0.4$$

$$P_{micro} = \frac{1 + 10 + 1 + 1}{2 + 100 + 2 + 2} \approx 0.123$$

As we can see, under Macro the small classes greatly inflate the Precision value even though not that many samples are actually classified correctly. Since in practice the class distribution of real data matches that of the training data, this is clearly problematic: the small classes carry too much weight, and poor performance on the large class is hidden. Micro, by contrast, accounts for the class imbalance and is the better choice in this situation. The sketch below reproduces these numbers.
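A few lines of plain Python that reproduce the numbers in this example from the raw TP/FP counts:

```python
# TP and FP counts per class, taken from the example above.
tp = {"A": 1, "B": 10, "C": 1, "D": 1}
fp = {"A": 1, "B": 90, "C": 1, "D": 1}

per_class_precision = {c: tp[c] / (tp[c] + fp[c]) for c in tp}
p_macro = sum(per_class_precision.values()) / len(per_class_precision)
p_micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

print(per_class_precision)  # {'A': 0.5, 'B': 0.1, 'C': 0.5, 'D': 0.5}
print(p_macro)              # 0.4
print(p_micro)              # ~0.123 (13 / 106)
```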

Summary:

  1. If your classes are balanced, either Micro or Macro is fine;
  2. If you think large-sample classes should matter more, use Micro;
  3. If you think small-sample classes should matter more, use Macro;
  4. If Micro << Macro, the large-sample classes suffer from serious classification errors;
  5. If Macro << Micro, the small-sample classes suffer from serious classification errors.

Weighted F1 score

To address the fact that Macro ignores class imbalance, a natural remedy is to weight the macro average, which leads to the Weighted F1.

The weighted F1 is an improved version of the macro average designed to account for class imbalance: when computing Precision and Recall, each class's Precision and Recall is multiplied by that class's proportion of the total samples before summing.


$$Precision_i = \frac{TP_i}{TP_i + FP_i}$$

$$Precision_{weighted} = \sum_{i=1}^{L} w_i \cdot Precision_i$$

$$Recall_i = \frac{TP_i}{TP_i + FN_i}$$

$$Recall_{weighted} = \sum_{i=1}^{L} w_i \cdot Recall_i$$

where $w_i$ is the proportion of class $i$ in the total number of samples.

The weighted F1 score is then:


$$F1_{weighted} = 2 \cdot \frac{Precision_{weighted} \cdot Recall_{weighted}}{Precision_{weighted} + Recall_{weighted}}$$
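A sketch of the weighted average on the same toy 3-class labels (scikit-learn assumed). Note that scikit-learn's `average='weighted'` F1 weights the per-class F1 scores by class support rather than combining weighted precision and recall, so it may differ slightly from the formula above:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 1]

# Per-class precision/recall weighted by each class's share of the samples.
p_w = precision_score(y_true, y_pred, average='weighted')
r_w = recall_score(y_true, y_pred, average='weighted')
print(2 * p_w * r_w / (p_w + r_w))  # F1 built from the weighted P and R

# scikit-learn's weighted F1 averages the per-class F1 scores by support instead.
print(f1_score(y_true, y_pred, average='weighted'))
```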

The Matthews correlation coefficient — MCC

MCC is mainly used for binary classification. It takes TP, TN, FP and FN into account simultaneously, making it a relatively balanced metric that can also be used when the classes are imbalanced.

MCC takes values in the range [-1, 1]. A value of 1 means the predictions agree perfectly with the actual results; 0 means the predictions are no better than random guessing; -1 means the predictions completely disagree with the actual results.

So we see that MCC essentially describes the correlation coefficient between the predicted results and the actual results.

The Matthews correlation coefficient formula is:


$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}}$$

The four factors under the square root are the numbers of predicted positives, actual positives, actual negatives and predicted negatives, respectively.
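A minimal sketch (toy labels, scikit-learn assumed) checking the formula against `matthews_corrcoef`:

```python
import math
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(mcc)                                # 0.5, from the formula
print(matthews_corrcoef(y_true, y_pred))  # 0.5, the same value from scikit-learn
```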

The ROC curve

In a classification task, the model usually outputs for each test sample a probability that the sample belongs to the positive class. We then pick a threshold: samples scoring above it are predicted positive and the rest negative. Lowering the threshold lets more samples be recognised as positive, which raises the recognition rate of the positive class but lowers the recognition rate of the negative class.

To describe this trade-off vividly, the ROC curve is introduced to evaluate the quality of a classifier. The ROC curve, short for receiver operating characteristic curve, is another metric that evaluates the model as a whole. It originated in the military field and has since been widely used in medicine, which is where the name comes from.

The x-axis (abscissa) of the ROC curve is the False Positive Rate (FPR): the probability of wrongly classifying a negative sample as positive, known in medicine as the misdiagnosis rate. The y-axis (ordinate) is the True Positive Rate (TPR): the probability of correctly classifying a positive sample as positive.

x-axis:

$$FPR = \frac{FP}{FP + TN} = \frac{\text{number of negative samples predicted as positive}}{\text{number of true negative samples}}$$

y-axis:

$$TPR = \frac{TP}{TP + FN} = \frac{\text{number of positive samples predicted as positive}}{\text{number of true positive samples}}$$

Different thresholds yield different (FPR, TPR) points on the ROC curve. As the threshold decreases, more and more instances are classified as positive, but more negatives are swept in as well, so TPR and FPR rise together.

  • When the threshold is at its maximum, every positive sample and every negative sample is predicted negative, so both numerators are 0: FPR = 0 and TPR = 0, giving the point (0, 0).
  • When the threshold is at its minimum, every negative sample and every positive sample is predicted positive, so FPR = 1 and TPR = 1, giving the point (1, 1).
  • The point FPR = 0, TPR = 1 is the ideal classification point, so a good classifier's ROC curve should hug the top-left corner of the axes as closely as possible, while a curve along the diagonal means the classifier is no better than random guessing.

The ROC curve stays essentially unchanged when the class distribution of the test set changes. Unfortunately, in many cases the ROC curve alone cannot clearly show which classifier is better, whereas AUC gives an intuitive single-number evaluation of the classifier.
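A sketch of how the curve is obtained in practice (toy scores made up for illustration, scikit-learn assumed): `roc_curve` sweeps the threshold and returns one (FPR, TPR) pair per threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve

# y_score is the predicted probability of the positive class (made-up numbers).
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```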

AUC: area under the ROC curve

AUC is the area under the ROC curve. The value of this area is between 0 and 1, which can intuitively evaluate the quality of the classifier. The greater the value of AUC, the better the effect of the classifier.

  • AUC = 1: a perfect classifier. Whatever threshold is chosen, the model predicts perfectly (this rarely, if ever, exists in practice).
  • 0.5 < AUC < 1: better than random guessing. With a well-chosen threshold, the model has predictive value.
  • AUC = 0.5: the same as random guessing; the model has no predictive value.
  • AUC < 0.5: worse than random guessing, but better than random if the predictions are inverted.

It is worth mentioning that two models with the same AUC do not necessarily perform the same, since their ROC curves can have different shapes.

In practical scenarios, AUC is indeed a very common indicator.

Note: in a multi-class scenario there is one ROC curve (and one AUC value) per class rather than a single curve.

The overall AUC can then be computed as:

$$AUC = \frac{2}{|C|(|C|-1)} \sum_{i=1}^{|C|} AUC_i$$

where $|C|$ is the number of categories.
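A sketch with scikit-learn (toy scores and probabilities, made up for illustration): `roc_auc_score` handles the binary case directly, and for multi-class input it averages one-vs-one (or one-vs-rest) AUCs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Binary case: AUC from the predicted probability of the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])
print(roc_auc_score(y_true, y_score))  # 0.875

# Multi-class case: one row of class probabilities per sample.
y_true_mc = np.array([0, 1, 2, 2, 1, 0])
y_proba_mc = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
    [0.4, 0.4, 0.2],
    [0.6, 0.3, 0.1],
])
print(roc_auc_score(y_true_mc, y_proba_mc, multi_class='ovo'))  # average of pairwise AUCs
```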

P-R curve

As we know, the final output of a classification model is usually a probability, which we generally need to convert into a concrete class. For binary classification we set a threshold: samples scoring above it are judged positive and the rest negative.

The metrics above (Accuracy, Precision and Recall) are all computed at one specific threshold, so how do we compare models comprehensively when each can be run at different thresholds? The P-R curve describes how precision and recall change as the threshold varies.

How is the P-R curve plotted?

By setting different thresholds, we obtain different sets of predicted positives and compute the corresponding precision and recall for each; the horizontal axis is recall and the vertical axis is precision, as described below.

From the resulting curves, we find that:

  • For two different classifiers, if curve A completely encloses curve C, then A's Precision and Recall are higher than C's at every operating point, so A is better than C. When two curves cross, the area under the curve is used instead: the larger the area, the better the performance, and here A is better than B.
  • For a single classifier, Precision and Recall trade off against each other, and the closer the curve comes to the top-right corner, the better the performance. The area under the P-R curve is called the AP score and reflects, to some extent, how well the model combines high precision with high recall. However, this area is not convenient to compute, so in practice the F1 value or the AUC is usually used to weigh precision and recall jointly (the ROC curve is easy to draw and the area under it easy to calculate).
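A sketch of how the P-R curve is computed (toy scores, scikit-learn assumed): `precision_recall_curve` sweeps the threshold, and `average_precision_score` gives the AP score mentioned above.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Predicted probabilities of the positive class (made-up numbers).
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(precision)  # one precision value per threshold (plus the final point)
print(recall)     # the matching recall values

print(average_precision_score(y_true, y_score))  # AP, a summary of the area under the curve
```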

Logarithmic loss

Logarithmic Loss (Logistic Loss) evaluates the likelihood of the predicted probabilities; its standard form is:

$$LogLoss = -\log P(Y|X)$$

LogLoss measures the difference between the predicted probability distribution and the true one; the smaller the value, the better. Unlike AUC, LogLoss is sensitive to the predicted probabilities themselves.

For binary classification, the log loss is computed as:

$$LogLoss = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \cdot \log p_i + (1 - y_i) \cdot \log(1 - p_i) \right)$$

where $N$ is the number of samples, $y_i \in \{0, 1\}$, and $p_i$ is the predicted probability that the $i$-th sample belongs to class 1.

Logarithmic loss is also widely used in multi-classification problems. Its calculation formula is as follows:


$$LogLoss = -\frac{1}{N} \cdot \frac{1}{C} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \cdot \log p_{ij}$$

where $N$ is the number of samples, $C$ is the number of categories, $y_{ij}$ indicates whether the $i$-th sample belongs to category $j$, and $p_{ij}$ is the predicted probability that the $i$-th sample belongs to category $j$.
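A sketch with scikit-learn (toy probabilities, made up for illustration) checking the binary formula and showing the multi-class call. Note that scikit-learn's multi-class `log_loss` does not include the extra 1/C factor used in the formula above:

```python
import numpy as np
from sklearn.metrics import log_loss

# Binary case: p_pred is the predicted probability of class 1.
y_true = [1, 0, 1, 1]
p_pred = [0.9, 0.2, 0.7, 0.4]
print(log_loss(y_true, p_pred))

# The same value from the binary formula above.
y = np.array(y_true)
p = np.array(p_pred)
print(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Multi-class case: one row of class probabilities per sample
# (scikit-learn averages over samples only, without the 1/C factor).
y_true_mc = [0, 2, 1]
p_mc = [[0.8, 0.1, 0.1],
        [0.2, 0.2, 0.6],
        [0.3, 0.5, 0.2]]
print(log_loss(y_true_mc, p_mc))
```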

The difference between LogLoss and AUC

  • LogLoss mainly evaluates how accurate the predicted probabilities are, while AUC evaluates the model's ability to rank positive samples ahead of negative ones; they assess different aspects.
  • LogLoss measures overall accuracy and suits balanced data, whereas AUC is used to assess a model's accuracy when the data are imbalanced.
  • For a balanced classification problem, either AUC or LogLoss is fine.

Conclusion

By comparing the above evaluation indicators, the summary is as follows:

  • Precision is the proportion of samples that are actually positive among all samples the model judges to be positive.
  • Recall is the proportion of samples judged to be positive among all samples that are actually positive.
  • The F1 value is a metric designed to take precision and recall into account simultaneously.
  • The MCC describes the correlation coefficient between the predicted results and the actual results.
  • TPR (true positive rate) is defined the same as Recall.
  • The FPR (false positive rate), also known as the misdiagnosis rate, is the proportion of actually negative samples that are wrongly predicted as positive.
  • The ROC curve plots TPR against FPR; the corresponding P-R curve plots Precision against Recall.
  • AUC is the area under the ROC curve. The value of this area is between 0 and 1, which can intuitively evaluate the quality of the classifier. The greater the value of AUC, the better the effect of the classifier.
  • Logarithmic loss is a likelihood estimate of predicted probability, which measures the difference between the predicted probability distribution and the real probability distribution.

The choice of the final classification metric differs across datasets, scenarios and stages of a project. For binary classification AUC is commonly used, while for multi-class problems F1 is usually the main reference.