  • 0x00 Summary
  • 0x01 Why this article
  • 0x02 Example setup
  • 0x03 Confusion matrix
    • 3.1 The four outcomes
    • 3.2 Confusion Matrix
  • 0x04 Accuracy
    • 4.1 Formula
    • 4.2 Characteristics
  • 0x05 Precision
    • 5.1 Formula
    • 5.2 Characteristics
    • 5.3 Application Scenarios
  • 0x06 Accuracy vs Precision
  • 0x07 Recall
    • 7.1 Formula
    • 7.2 Characteristics
    • 7.3 Application Scenarios
    • 7.4 Sensitivity
  • 0x08 Precision vs Recall
    • 8.1 Concept Differences
    • 8.2 Different Concerns
    • 8.3 Why precision and recall affect each other
    • 8.4 A worked example
  • 0x09 F-Measure / F1 Score
    • 9.1 Formula
    • 9.2 Characteristics
    • 9.3 Application Scenarios
  • 0x10 TPR, FPR, TNR, FNR
  • 0x11 TPR vs FPR

0x00 Summary

Binary classification evaluation assesses the prediction results of a binary classification algorithm. This article builds a concrete example set among the heroes of Liangshan (the Water Margin) to walk you through the related concepts.

0x01 Why this article

Recently I have been studying the Alink source code. I originally planned to analyze its binary classification evaluation, but when I opened the relevant Alink documentation I found that it lists many concepts and formulas. So this article first walks through those concepts and formulas, laying the groundwork for the subsequent source-code analysis.

The following is how Alink describes "binary classification evaluation".

  • Supports plotting the ROC curve, LiftChart curve, and Recall-Precision curve.

  • Streaming experiments support cumulative statistics and window statistics.

  • The metrics include AUC, K-S, PRC, Precision, Recall, F-measure, Sensitivity, Accuracy, Specificity and Kappa.

These concepts are essentially evaluation metrics: quantitative measures of a model's performance. A single metric only reflects part of a model's behavior, and an ill-chosen metric can lead to wrong conclusions, so metrics should be selected to match the specific data and model.

Next, let's work through these concepts one by one.

0x02 Example setup

  • **Goal**: Lu Zhishen voiced opposition to the recruitment (amnesty) plan, so Song Gongming wants to find the people inside Liangshan who are associated with Lu Zhishen.
  • **Sample**: Lin Chong, Wu Song, Shi Jin, Yang Zhi, Zhang Qing, Sun Erniang, Hu Yanzhuo, Guan Sheng, Shi Xiu, Yang Xiong.

Song Jiang said to Jiang Jing: "Brother, you are a master of calculation. Please help me figure out how to find the people connected with Lu Zhishen."

Jiang Jing said: Brother, what you ask is a "binary classification problem" (like predicting whether someone has heart disease, or whether a stock will rise or fall: problems with only two possible outcomes). There is a lot to it, brother, so let me explain it step by step.

0x03 Confusion matrix

Jiang Jing said: First, let me introduce the confusion matrix. It is a 2x2 square matrix, mainly used to evaluate the quality of a binary classifier.

3.1 The four outcomes

In a binary classification problem each instance is either positive or negative, so when we compare predictions with reality there are four possible outcomes:

  • TN (True Negative): the algorithm predicts negative (N) and the instance really is negative (N), i.e. the prediction is correct (True).

  • FP (False Positive): the algorithm predicts positive (P) but the instance is actually negative (N), i.e. the prediction is wrong (False).

    This is the number of instances that are actually negative but classified as positive.

  • FN (False Negative): the algorithm predicts negative (N) but the instance is actually positive (P), i.e. the prediction is wrong (False).

    This is the number of instances that are actually positive but classified as negative.

  • TP (True Positive): the algorithm predicts positive (P) and the instance really is positive (P), i.e. the prediction is correct (True).

Song Jiang said: So the people actually associated with Lu Zhishen are TP + FP?

Jiang Jing said: Brother, that is a misunderstanding; the number of actual positive samples is TP + FN (TP + FP is the number predicted to be positive).

Here is a memory trick: each of the four labels is made up of two letters:

  • The first letter (T/F) tells you whether the algorithm was right or wrong: True and False describe whether the classifier's prediction is correct.
  • The second letter (P/N) is the algorithm's prediction: Positive and Negative are the class output by the classifier.

So to clarify again:

  • TP: predicted positive, actually positive.
  • FP: predicted positive, actually negative.
  • FN: predicted negative, actually positive.
  • TN: predicted negative, actually negative.
  • P = TP + FN: the number of samples that are actually positive.
  • N = FP + TN: the number of samples that are actually negative.
  • P~ = TP + FP: the number of samples predicted to be positive.
  • N~ = TN + FN: the number of samples predicted to be negative.
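To make these counts concrete, here is a minimal Python sketch (the labels are made up for illustration; 1 = positive, 0 = negative) that tallies TP, FP, FN and TN from a list of true labels and a list of predictions:

```python
# Made-up labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Tally the four outcomes by comparing each true label with its prediction.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(tp, fp, fn, tn)    # 3 1 1 3
print(tp + fn, fp + tn)  # P (actual positives) and N (actual negatives)
```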

3.2 Confusion Matrix

A confusion matrix is a cross-tabulation of the number of samples for each combination of actual value and predicted value. All correct predictions lie on the diagonal, so the confusion matrix makes it easy to see where the errors are.

Each row of the matrix corresponds to the true class of the samples and each column to the predicted class (some texts use the opposite convention).

|                                         | Predicted 0 (unrelated to Lu Zhishen) | Predicted 1 (associated with Lu Zhishen) |
| --------------------------------------- | ------------------------------------- | ---------------------------------------- |
| **True 0 (unrelated to Lu Zhishen)**    | TN                                    | FP                                       |
| **True 1 (associated with Lu Zhishen)** | FN                                    | TP                                       |

Mnemonic: the true value is the more important one, so it takes the first dimension, i.e. the rows.
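If scikit-learn happens to be available, its `confusion_matrix` helper produces exactly this layout, with true labels on the rows and predicted labels on the columns; a small sketch:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = unrelated to Lu Zhishen, 1 = associated with him.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
# cm[0][0] = TN, cm[0][1] = FP
# cm[1][0] = FN, cm[1][1] = TP
print(cm)  # [[2 2]
           #  [1 3]]
```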

0x04 Accuracy

Jiang Jing said: The second concept is Accuracy, which is the proportion of all samples whose prediction is correct.

4.1 Formula

Recall our convention: the first letter says whether the prediction was correct, and the second letter is the predicted class.

The denominator is all four outcomes; the numerator is the outcomes whose first letter is T, i.e. the correct predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

4.2 Characteristics

Accuracy has a drawback: when the classes in the data are imbalanced, this metric cannot meaningfully evaluate the model.

Suppose a test set has 99 positive samples and 1 negative sample, and we design a "brainless" model that predicts every sample as positive. Its accuracy is 99%, which looks excellent according to this metric, yet the model has no real predictive ability.
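A quick sketch of this brainless model on the same made-up 99/1 split:

```python
# 99 positive samples, 1 negative sample.
y_true = [1] * 99 + [0]
# The "brainless" model predicts positive for everything.
y_pred = [1] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99 -- looks great, yet the model never finds the negative class
```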

0x05 Precision

Jiang Jing said: The third concept is Precision, sometimes also translated as the "precision rate". It is the proportion of the samples predicted to be positive that really are positive; in other words, how accurate the positive predictions are.

5.1 Formula

Again: the first letter says whether the prediction was correct, and the second letter is the predicted class.

In the denominator, TP means "predicted positive and correct" and FP means "predicted positive but wrong (actually negative)", so the denominator is everything that was predicted positive:

Precision = TP / (TP + FP)
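As a minimal sketch, the formula as a helper function (the counts are assumed to come from a confusion matrix such as the one above):

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of the samples predicted positive that are actually positive."""
    return tp / (tp + fp)

print(precision(tp=4, fp=2))  # 0.666... = 2/3
```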

5.2 Characteristics

This metric is cautious: it corresponds to a high classification threshold, where the model only calls a sample positive when it is confident.

5.3 Application Scenarios

Use precision when the detections you make for the target category must be as accurate as possible, even if you do not catch every case. For example, when predicting criminals we want the predictions to be very accurate: better to let some real criminals go than to wrongly accuse an innocent person.

0x06 Accuracy vs Precision

Song Jiang said: "The Chinese terms for accuracy and precision are so similar that I can hardly tell them apart."

Jiang Jing said: Brother, these two terms are translated from English; let us work through them slowly.

Let’s look at the English meaning.

  • Accuracy is defined in the dictionary as: the quality or state of being correct or precise
  • Precision is defined in the dictionary as: the quality, condition, or fact of being exact and accurate

Accuracy is about being correct, precision about being exact: first be accurate, then be precise. A truly good result must satisfy both conditions, accuracy and precision.

These two terms are also analogous to bias and variance.

  • Bias reflects the gap between the model's expected output on a sample and the true label, i.e. how correct the model itself is; it reflects the model's fitting ability. This is close in spirit to accuracy.
  • Variance reflects the spread between the outputs of models learned from different training sets and the expected output, i.e. the model's stability; it reflects how much the model fluctuates. This is close in spirit to precision.

Song Jiang said: "Brother, this bias and variance of yours still sounds like gibberish to me."

Jiang Jing said: "Then let me find a suitable example for you, brother."

Take target shooting as an example: accuracy depends on the probability of hitting the bullseye; precision depends on where within the bullseye area your shots land.

0x07 Recall

Recall is evaluated against the original (actual) samples: it is the proportion of the actual positive samples that are predicted to be positive, i.e. how many of the positives are correctly identified as positive.

7.1 Formula

Again: the first letter says whether the prediction was correct, and the second letter is the predicted class.

In the denominator, TP + FN means "predicted positive and correct" plus "predicted negative but wrong (actually positive)", which is exactly the number of samples that are actually positive. The numerator is TP, the correctly predicted positives:

Recall = TP / (TP + FN)
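And the corresponding sketch for recall:

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of the actual positive samples that the model finds."""
    return tp / (tp + fn)

print(recall(tp=4, fn=4))  # 0.5
```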

7.2 Characteristics

Recall corresponds to a low classification threshold: try to detect everything and miss nothing; as the saying goes, "better to wrongly flag a thousand than to let a single one slip away."

Recall (dictionary): to remember sth; to make sb think of sth; to order sb to return; to ask for sth to be returned, often because there is sth wrong with it.

Since recall carries the sense of "remembering", you can think of Recall as a "memory rate": when we ask a retrieval system for all the details of an event (the query), how many details can it "recall"? The number of details it recalls, divided by all the details the system knows about the event, is the recall rate.

7.3 Application Scenarios

Recall is used when the target category must be detected as completely as possible, regardless of how accurate the individual results are.

For earthquake prediction, for example, we want every earthquake to be predicted, even at the expense of precision. Suppose there are 10 earthquakes. We would rather issue 1,000 alerts that cover all 10 (recall 100%, precision 1%) than issue 100 alerts that catch 8 and miss 2 (recall 80%, precision 8%).
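The arithmetic behind those two alert strategies, written out as a small sketch:

```python
quakes = 10  # real earthquakes

# Strategy A: 1000 alerts that happen to cover all 10 earthquakes.
recall_a, precision_a = 10 / quakes, 10 / 1000   # 1.0 and 0.01

# Strategy B: 100 alerts that catch 8 earthquakes and miss 2.
recall_b, precision_b = 8 / quakes, 8 / 100      # 0.8 and 0.08

print(recall_a, precision_a)
print(recall_b, precision_b)
```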

7.4 Sensitivity

Sensitivity = TP / P = TP / (TP + FN): the proportion of all actual positives that are matched, which measures the classifier's ability to recognize positive cases. As you can see, Sensitivity and Recall are the same thing.

0x08 Precision vs Recall

Song Jiang said: "Brother, please explain the two concepts of precision and recall to me."

Jiang Jing said: Let me explain step by step.

8.1 Concept Differences

First, the conceptual difference. In the usual diagram, an ellipse encloses the samples predicted to be positive, and the two definitions can be read directly from it.

8.2 Different Concerns

Recall is a measure of coverage: how many of the real positive cases are predicted as positive. Precision is a measure of exactness: what fraction of the samples predicted to be positive are actually positive.

In different application scenarios, our focus is different, for example:

  • When picking stocks, we care more about precision: how many of the stocks we predict to rise actually rise, because those are the ones we put money into.
  • When predicting patients, we care more about recall: we want to miss as few of the people who actually have the disease as possible, because an undetected disease has serious consequences. A brainless model that predicts "no disease" for everyone would have a recall of zero here.

In information retrieval, precision and recall affect each other. Ideally both would be high, but in practice a high threshold gives high precision while missing a lot of data, and a low threshold gives high recall while making many inaccurate predictions. So in practice we often have to trade one off against the other depending on the situation. For example:

  • For general-purpose search, improve precision while keeping recall at a guaranteed level.

  • For disease surveillance, anti-spam filtering and the like, improve recall while keeping precision at a guaranteed level.

  • Sometimes both need to be high, in which case you can use the F-score metric.

8.3 Why precision and recall affect each other

Song Jiang said: "Seeing this, I have a question: why do precision and recall affect each other?"

Jiang Jing said: "That is a somewhat involved question."

First, the general principle.

  • Recall and precision pull against each other. If we want higher recall, the model's positive predictions have to cover more samples, but then the model is more likely to be wrong, so precision drops. If the model is conservative and only flags samples it is sure about, precision will be high but recall will be relatively low.
  • The denominator of recall (TPR) is the number of actual positives in the sample, so once the sample is fixed the denominator is fixed and recall increases monotonically as the numerator (TP) grows. The denominator of precision, however, is the number of samples predicted positive, which changes with the classification threshold; precision is therefore affected by both TP and FP, is not monotonic, and can move either way, as the sketch below shows.
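A small sketch with made-up scores that shows this behaviour: as the threshold is lowered, recall rises monotonically while precision wanders up and down:

```python
# Made-up true labels and model scores, sorted by score.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]

for threshold in (0.9, 0.7, 0.5, 0.3):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    print(threshold, "precision:", tp / (tp + fp), "recall:", tp / (tp + fn))
# Recall climbs 0.2 -> 0.6 -> 0.8 -> 1.0, while precision moves 0.50 -> 0.60 -> 0.57 -> 0.56.
```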

8.4 A worked example

Jiang Jing said: This is best illustrated with concrete data, so let us actually work it out.

8.4.1 Confusion matrix

Sample: Lin Chong, Wu Song, Shi Jin, Yang Zhi, Zhang Qing, Sun Erniang, Hu Yanzhuo, Guan Sheng, Shi Xiu, Yang Xiong.

These heroes will be divided into four categories:

  • TP: found, relevant (found and wanted)
  • FP: found, but not relevant (found, but not useful)
  • FN: not found, but relevant (not found, but actually wanted)
  • TN: Not found, not relevant (not found, useless)

Let’s look at the confusion matrix again:

|                                         | Predicted 0 (unrelated to Lu Zhishen) | Predicted 1 (associated with Lu Zhishen) |
| --------------------------------------- | ------------------------------------- | ---------------------------------------- |
| **True 0 (unrelated to Lu Zhishen)**    | TN                                    | FP                                       |
| **True 1 (associated with Lu Zhishen)** | FN                                    | TP                                       |

8.4.2 Why do we need recall

First of all, why do we need recall at all?

Suppose we search the Liangshan heroes for people associated with Lu Zhishen: say 18 people are associated with him and 90 people are not.

Now take a trivial prediction algorithm: predict "no connection" for every hero. It is then right about all 90 unrelated people, which looks very accurate.

However, this algorithm predicts none of the "people associated with Lu Zhishen" correctly, so it is meaningless.

That is why we introduce Recall: out of those 18, how many can be found? For example, if 12 of them are found, then Recall = 12/18 ≈ 66%.

8.4.3 Analysis of precision and recall

Let’s review the definition:

  • Precision = TP / (TP + FP): relevant heroes found / all heroes found. Pursuing precision means that among the heroes found, as many as possible should be relevant and as few as possible irrelevant.
  • Recall = TP / (TP + FN): relevant heroes found / all relevant heroes in the whole sample. Pursuing recall means finding as many of the relevant heroes in the sample as possible.

Why do they constrain each other?

  • Because no “search strategy” is perfect: when we want to retrieve more of the relevant heroes, relaxing the “search strategy” usually drags in some irrelevant results as well, which lowers the precision.
  • Conversely, to remove the irrelevant heroes from the results, the “search strategy” has to be made stricter, which also causes some relevant heroes to be missed, which lowers the recall.

8.4.4 The original search strategy

Jiang Jing first sets the “search strategy” as: Peach Blossom Mountain + army officers. This is easy to understand: Lu Zhishen once stayed at Peach Blossom Mountain, and he himself had been a garrison officer (tixia), so he probably had personal ties with former army officers.

This gives the following confusion matrix:

|                                         | Predicted 0 (unrelated to Lu Zhishen)                                | Predicted 1 (associated with Lu Zhishen)       |
| --------------------------------------- | --------------------------------------------------------------------- | ----------------------------------------------- |
| **True 0 (unrelated to Lu Zhishen)**    | TN: Shi Xiu, Yang Xiong, Pei Xuan, Tang Long, Liu Tang, Tao Zongwang   | FP: Guan Sheng, Hu Yanzhuo                      |
| **True 1 (associated with Lu Zhishen)** | FN: Shi Jin, Yang Chun, Chen Da, Zhou Tong                             | TP: Wu Song, Yang Zhi, Zhang Qing, Sun Erniang  |

Then it is calculated that:

Precision = TP / (TP + FP) = 4 / (4 + 2) = 2/3

Recall    = TP / (TP + FN) = 4 / (4 + 4) = 1/2
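The same numbers can be reproduced with a small sketch over the hero names (the two sets below simply restate the confusion matrix above):

```python
# Heroes actually associated with Lu Zhishen (the actual positives in the sample).
relevant = {"Wu Song", "Yang Zhi", "Zhang Qing", "Sun Erniang",
            "Shi Jin", "Yang Chun", "Chen Da", "Zhou Tong"}
# Heroes returned by the original "search strategy".
found = {"Wu Song", "Yang Zhi", "Zhang Qing", "Sun Erniang",
         "Guan Sheng", "Hu Yanzhuo"}

tp = len(found & relevant)   # 4
fp = len(found - relevant)   # 2
fn = len(relevant - found)   # 4

print("Precision:", tp / (tp + fp))  # 2/3
print("Recall:   ", tp / (tp + fn))  # 1/2
```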

8.4.5 Pursuing recall: relaxing the “search strategy”

Song Jiang said: Too few people have been found; we must improve the recall.

Pursuing recall means finding as many of the relevant heroes in the sample as possible, so the “search strategy” should be relaxed. When the “search strategy” is relaxed, some irrelevant results usually come along with it, which hurts precision.

So brother Gongming needs to relax the “search strategy”: Shi Jin of Shaohua Mountain was an old friend of Lu Zhishen; and Lu Zhishen once served under the old Commander Zhong, whose command was in Shaanxi, so Lu Zhishen probably had dealings with people from the northwest.

The new “search strategy” is therefore: Peach Blossom Mountain + Shaohua Mountain + army officers + people from the northwest (Pei Xuan, Tang Long, Liu Tang, Tao Zongwang).

Thus, the confusion matrix is:

|                                         | Predicted 0 (unrelated to Lu Zhishen) | Predicted 1 (associated with Lu Zhishen)                                    |
| --------------------------------------- | -------------------------------------- | ----------------------------------------------------------------------------- |
| **True 0 (unrelated to Lu Zhishen)**    | TN: Shi Xiu, Yang Xiong                | FP: Guan Sheng, Hu Yanzhuo, Pei Xuan, Tang Long, Liu Tang, Tao Zongwang       |
| **True 1 (associated with Lu Zhishen)** | FN: Zhou Tong                          | TP: Wu Song, Yang Zhi, Zhang Qing, Sun Erniang, Shi Jin, Yang Chun, Chen Da   |

This gives:

Precision = TP / (TP + FP) = 7 / (7 + 6) = 7/13   ---- reduced
Recall    = TP / (TP + FN) = 7 / (7 + 1) = 7/8    ---- improved

As you can see, to increase TP and decrease FN we relaxed the “search strategy”, and as a result FP increased.

8.4.6 Pursuing precision: tightening the “search strategy”

Song Jiang said: Too many people have been found; we need to improve the precision.

To remove the irrelevant heroes from the results, the “search strategy” must be made stricter, which also makes some relevant heroes impossible to find, thus hurting the recall.

So the “search strategy” is tightened. The new “search strategy” is: male heroes of Peach Blossom Mountain.

Thus, the confusion matrix is:

|                                         | Predicted 0 (unrelated to Lu Zhishen)                                                        | Predicted 1 (associated with Lu Zhishen) |
| --------------------------------------- | ---------------------------------------------------------------------------------------------- | ----------------------------------------- |
| **True 0 (unrelated to Lu Zhishen)**    | TN: Shi Xiu, Yang Xiong, Pei Xuan, Tang Long, Liu Tang, Tao Zongwang, Guan Sheng, Hu Yanzhuo   | FP: (none)                                |
| **True 1 (associated with Lu Zhishen)** | FN: Zhou Tong, Sun Erniang, Shi Jin, Yang Chun, Chen Da                                        | TP: Wu Song, Yang Zhi, Zhang Qing         |

This gives:

Precision = TP / (TP + FP) = 3 / (3 + 0) = 1     ---- improved
Recall    = TP / (TP + FN) = 3 / (3 + 5) = 3/8   ---- reduced

As you can see, to decrease FP we tightened the “search strategy”, and as a result FN increased as well.

0x09 F-Measure / F1 Score

Song Jiang said: Precision and recall seem to trade off against each other; what should we do?

Jiang Jing said: We have other metrics for that, for example the F1 Score.

The F1 score is used in scenarios where precision and recall need to be balanced. The F1 value is the harmonic mean of precision and recall.

9.1 Formula

F1 = 2 * Precision * Recall / (Precision + Recall)

9.2 Characteristics

Precision reflects the model's ability to limit false positives (FP), Recall reflects its detection rate on positive samples, and the F1 value combines the two.

The F1 score is the harmonic mean of precision and recall. The nature of the harmonic mean is that it is high only when both precision and recall are high; if either one is very low, the harmonic mean is pulled down toward that low value.

Why? Because the numerator of the harmonic mean is a product, the result always stays close to the smaller of the two values: whichever of precision and recall is smaller dominates the F1 score, which makes it a stricter measure than a simple average.
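A small sketch of this "strictness":

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))     # 0.90 -- both high, F1 high
print(f1(0.9, 0.1))     # 0.18 -- one is low, F1 collapses toward it
print((0.9 + 0.1) / 2)  # 0.50 -- the arithmetic mean would hide the weakness
```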

Mnemonic: think of Golden Arowana blended ("harmonized") cooking oil; F1 is likewise a harmonic blend of the two values.

9.3 Application Scenarios

When precision and recall are equally important, F1 can be used.

0x10 TPR, FPR, TNR, FNR

Finally, the four rates, which are easy to confuse.

True Positive Rate, TPR = TP/(TP+FN);

It describes the proportion of all actual positive instances that the classifier classifies correctly, i.e. how many of the positives are correctly identified as positive. As you can see, TPR equals Sensitivity (and Recall).

False Positive Rate, FPR = FP/(TN+FP);

It describes the proportion of all actually negative samples that are wrongly predicted as positive. In medicine it is also called the misdiagnosis rate (a person without the disease is diagnosed as having it), and it equals 1 - Specificity.

True Negative Rate, TNR = TN/(TN+FP) = 1 - FPR;

It describes the proportion of all actual negative instances that the classifier classifies correctly, also known as Specificity. It measures the classifier's ability to recognize negative cases.

False Negative Rate, FNR = FN/(TP + FN);

It is the proportion of actual positives that are wrongly predicted as negative; in medicine this is the missed-diagnosis rate, equal to 1 - Sensitivity.

Let’s sum it up with a table

| Abbreviation | Name                | Equivalent      | Medical meaning                                                                             |
| ------------ | ------------------- | --------------- | -------------------------------------------------------------------------------------------- |
| TPR          | True positive rate  | Sensitivity     | The bigger the better; a value of 1 means every sick person is detected                      |
| FPR          | False positive rate | 1 - Specificity | Misdiagnosis rate (disease detected in people who do not have it); the smaller the better    |
| TNR          | True negative rate  | Specificity     |                                                                                               |
| FNR          | False negative rate | 1 - Sensitivity | Missed-diagnosis rate (sick people not detected)                                              |
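A minimal sketch that computes all four rates from confusion-matrix counts (the example numbers reuse the original hero search strategy above):

```python
def rates(tp: int, fp: int, fn: int, tn: int):
    tpr = tp / (tp + fn)  # true positive rate = sensitivity = recall
    fpr = fp / (fp + tn)  # false positive rate = 1 - specificity
    tnr = tn / (fp + tn)  # true negative rate = specificity
    fnr = fn / (tp + fn)  # false negative rate = 1 - sensitivity
    return tpr, fpr, tnr, fnr

print(rates(tp=4, fp=2, fn=4, tn=6))  # (0.5, 0.25, 0.75, 0.5)
```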

Here’s another trick:

  • The Chinese word for "sensitivity" is close to the word for "allergy", which you can associate with having the disease, so it is easy to remember.
  • The Chinese word for "specificity" can be associated with "immunity", i.e. not having the disease, so it is also easy to remember.

0x11 TPR vs FPR

A bigger TPR is better and a smaller FPR is better, but the two metrics are often at odds: to increase TPR, the model has to predict more samples as positive, and in doing so more negative cases get misjudged as positive.

Let us understand the two metrics in a concrete context: judging whether a specimen is diseased, as in medical diagnosis.

Finding the sick is the main task, so the first metric, TPR, should be as high as possible.

The second indicator, FPR, should be as low as possible.

It is not hard to see that the two metrics constrain each other. If a doctor is very sensitive to the symptoms of a disease, the first metric will be high, but the second will be correspondingly high as well. In the most extreme case, he declares every sample diseased, so the first metric is 1 and so is the second.

  • (TPR = 1, FPR = 0): perfect classification; the doctor is excellent and every diagnosis is correct.

  • (TPR > FPR): the doctor's judgment is right more often than it is wrong.

  • (TPR = FPR): points on the diagonal; the doctor is just guessing, half right and half wrong.

  • (TPR < FPR): if this doctor says you are sick, you are probably not; you should take his words in reverse, for he is a true quack.

Unlike recall and precision, TPR and FPR are positively correlated: as the threshold is lowered, when TPR increases, FPR increases as well. We would like TPR to be as large as possible (1) and FPR as small as possible (0), but that usually cannot be achieved at the same time, as the sketch below shows.
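A small sketch with the same made-up scores as before: as the threshold drops, TPR and FPR rise together:

```python
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]

for threshold in (0.9, 0.7, 0.5, 0.3, 0.0):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    print(threshold, "TPR:", tp / (tp + fn), "FPR:", fp / (fp + tn))
# TPR climbs 0.2 -> 0.6 -> 0.8 -> 1.0 -> 1.0 while FPR climbs 0.2 -> 0.4 -> 0.6 -> 0.8 -> 1.0.
```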

