Time: 2019.09.14

This article was first published at www.f1renze.me/; please indicate the source when reproducing it!

Confusion Matrix

Unless otherwise specified, all classifications mentioned in this article refer to binary tasks

Confusion Matrix (it really confused me the first time I saw it, xd). In fact, the confusion matrix helps us understand the tendencies of a classification model, such as whether it makes more correct (True) or incorrect (False) judgments, and whether it predicts more Positive or Negative labels (which depends on the threshold value).

More technically, it is used to measure the performance of a machine learning classifier, that is, the model's generalization ability.

The Confusion Matrix consists of four terms. Suppose the classification problem is a doctor diagnosing patients for the flu:

  • True Positive (TP)

    If the patient actually has the flu and the doctor confirms it, it’s called TP.

  • True Negative (TN)

    If the patient does not have the flu and the doctor determines that the patient does not have the flu, it is called TN.

  • False Positive (FP)

    If a patient does not have the flu but is misdiagnosed as having the flu, it is called FP, also known as “Type I error.”

  • False Negative (FN)

    If a patient has the flu but is misdiagnosed as not having the flu, it is called FN, also known as a “Type II error.”

So the Confusion Matrix is actually not difficult to understand: T/F indicates whether the model's prediction matches the actual value, and P/N indicates the category the model predicted.
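
These four counts are easy to tally in code. Below is a minimal Python sketch (illustrative only, not code from this article); labels are assumed to be encoded as 1 for Positive (has the flu) and 0 for Negative.

    # Minimal sketch: counting the four confusion-matrix cells of a binary task.
    # Encoding assumption: 1 = has the flu (Positive), 0 = healthy (Negative).
    def confusion_counts(y_true, y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        return tp, tn, fp, fn

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual diagnoses
    y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # the model's (doctor's) judgments
    print(confusion_counts(y_true, y_pred))  # -> (3, 3, 1, 1)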

Several metrics derived from the Confusion Matrix

  • Accuracy, the most commonly used metric.

    Represents the proportion of correctly predicted samples among all samples.

    Accuracy = (TP + TN) / (TP + FP + TN + FN)

  • Recall, also known as the recall rate or sensitivity.

    Represents the proportion of actually Positive samples that are correctly predicted as Positive.

    Recall = TP / (TP + FN)

  • Precision, also known as the positive predictive value.

    Represents the proportion of samples predicted as Positive that are actually Positive.

    Precision = TP / (TP + FP)

    Generally speaking, when Precision is high, Recall tends to be low, and when Recall is high, Precision tends to be low. Therefore, the F-Score is introduced to compare the performance of different models.

  • F-score, also known as F1-score.

    Is the harmonic mean of Precision and Recall; the closer the value is to 1, the better, and the closer it is to 0, the worse.

    F-Score = 2*(Recall * Precision) / (Recall + Precision)

Returning to the flu-diagnosis example above to illustrate these metrics: if Accuracy is 90%, then when this model is used for prediction, 9 out of 10 samples are classified correctly and 1 is wrong. If Precision is 80%, then 2 out of every 10 patients the model diagnoses with the flu do not actually have it. If Recall is 70%, then 3 out of every 10 patients who actually have the flu will be told they do not. Recall is therefore the more important metric in this scenario, and the general form of the F-score, Fβ, can express a model's different preferences for Precision versus Recall.

When β = 1 it is the standard F1; when β > 1 Recall carries more weight, and when β < 1 Precision carries more weight.
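
Continuing the sketch above (again illustrative, not from the original article), all of these metrics follow directly from the four counts; the general form is Fβ = (1 + β²) * Precision * Recall / (β² * Precision + Recall).

    # Minimal sketch: deriving the metrics from the four confusion-matrix counts.
    def metrics(tp, tn, fp, fn, beta=1.0):
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        recall = tp / (tp + fn)
        precision = tp / (tp + fp)
        # General F-beta: beta > 1 favours Recall, beta < 1 favours Precision,
        # and beta = 1 gives the standard F1.
        f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
        return accuracy, recall, precision, f_beta

    acc, rec, prec, f1 = metrics(tp=3, tn=3, fp=1, fn=1)
    print(acc, rec, prec, f1)  # -> 0.75 0.75 0.75 0.75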

What is the ROC curve?

A classifier actually works by computing a predicted probability for each sample and comparing it with a pre-defined threshold (usually 0.5): if the probability is greater than the threshold, the sample is predicted Positive, otherwise Negative. If the test samples are sorted by predicted probability, the sample with the largest probability is the one most likely to be a positive example, and the sample with the smallest probability is the one most likely to be a negative example.

Different thresholds suit different application scenarios. In the flu-diagnosis scenario above, since we value Recall more, adopting a lower threshold reduces the chance that a patient who actually has the flu is judged healthy (at the cost of more healthy patients being misdiagnosed as having the flu). In a product recommendation system, on the other hand, we want to disturb users as little as possible and expect the recommended products to be ones users are actually interested in; here we value Precision more, so raising the threshold better matches expectations.
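
To make this trade-off concrete, here is a small illustrative Python snippet (the probabilities and labels are invented) showing how lowering the threshold raises Recall at the cost of Precision:

    # Illustrative only: how the decision threshold shifts Recall vs. Precision.
    probs  = [0.95, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10]  # predicted flu probabilities
    y_true = [1,    1,    0,    1,    1,    0,    0,    0]     # actual diagnoses

    def recall_precision(threshold):
        y_pred = [1 if p >= threshold else 0 for p in probs]
        tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
        fp = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 1)
        fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)
        return tp / (tp + fn), tp / (tp + fp)

    print(recall_precision(0.5))   # -> (0.75, 0.75): one flu patient is missed
    print(recall_precision(0.25))  # -> (1.0, ~0.67): no flu patient missed, more false alarms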

Therefore, the quality of this ranking determines the classifier's expected generalization performance under different tasks, and the ROC curve is a good way to show the classifier's overall generalization performance. A typical ROC curve takes the False Positive Rate as the X-axis and the True Positive Rate as the Y-axis, defined as follows:

  • TPR (identical to Recall)

    TPR = TP / (TP + FN)

  • FPR

    FPR = FP / (TN + FP)

How the ROC curve is drawn: given the predicted probabilities computed for the n test samples, each predicted value is taken in turn as the threshold, and the TPR and FPR at that threshold give one (x, y) coordinate. The curve fitted through the n coordinate points is the ROC curve.
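
A minimal Python sketch of this drawing procedure (my own illustration, not code from the article):

    # Sketch of the ROC drawing principle: each distinct predicted probability
    # serves in turn as the threshold, yielding one (FPR, TPR) point.
    def roc_points(y_true, probs):
        points = [(0.0, 0.0)]  # start of the curve: threshold above every score
        for thr in sorted(set(probs), reverse=True):
            y_pred = [1 if p >= thr else 0 for p in probs]
            tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
            fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)
            fp = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 1)
            tn = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 0)
            points.append((fp / (fp + tn), tp / (tp + fn)))
        return points

    y_true = [1, 1, 0, 1, 0, 0]
    probs  = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    print(roc_points(y_true, probs))  # a staircase of points from (0, 0) up to (1, 1)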

What is the AUC?

The full name of AUC is Area Under Curve; as the name implies, it is the area under the ROC curve. Different AUC values reflect how well a model separates the two classes of samples.

Ideally, the model separates the samples into the two classes perfectly; the ROC curve then hugs the top-left corner of the plot, and AUC = 1.

Most of the time the ROC curve bows toward the upper left without reaching the corner. An AUC of 0.7 means that if one positive sample and one negative sample are picked at random, there is a 70% chance the model ranks the positive one higher; FN and FP samples still occur.

When the ROC curve lies on the diagonal, AUC = 0.5, which suggests the model probably has a problem: its ability to separate the classes is no better than random guessing.

When the AUC is below 0.5, or even close to 0, it is not necessarily a bad thing: it means the model has learned the distinction between the classes in reverse. If 1 − probability is used as the new predicted probability, the AUC becomes close to 1! (Don't ask me how I know.)

Python implementation

The code address: gist.github.com/F1renze/6ae…

As shown below:
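
Along the same lines, here is a minimal stand-in sketch (my own code, not the linked gist; the function name auc_score is just for illustration). It computes AUC via its ranking interpretation, which is equivalent to the area under the ROC curve: the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, with ties counted as 0.5.

    # Stand-in sketch (not the linked gist): AUC as the probability that a random
    # positive sample outscores a random negative sample (ties count as 0.5).
    def auc_score(y_true, probs):
        pos = [p for t, p in zip(y_true, probs) if t == 1]
        neg = [p for t, p in zip(y_true, probs) if t == 0]
        wins = 0.0
        for p in pos:
            for n in neg:
                if p > n:
                    wins += 1.0
                elif p == n:
                    wins += 0.5
        return wins / (len(pos) * len(neg))

    y_true = [1, 1, 0, 1, 0, 0]
    probs  = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    print(auc_score(y_true, probs))                   # -> 0.888...
    print(auc_score(y_true, [1 - p for p in probs]))  # -> 0.111..., i.e. 1 - AUC

The second call mirrors the remark above: replacing each probability with 1 − probability simply flips the AUC to 1 − AUC.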

Further reading

  • Chapter 2 of the Watermelon Book (Zhou Zhihua, Machine Learning)
  • Wikipedia
  • Source of the figures in the AUC section