Object detection evaluation metrics:

Accuracy, Confusion Matrix, Precision, Recall, Average Precision (AP), mean Average Precision (mAP), Intersection over Union (IoU), ROC and AUC, and Non-Maximum Suppression (NMS).

Assume the original samples contain two classes:

  • There are P samples of class 1 in total, and class 1 is taken as the positive class.
  • There are N samples of class 0 in total, and class 0 is taken as the negative class.

After classification:

  • TP samples of class 1 are correctly judged as class 1 by the system, and FN samples of class 1 are wrongly judged as class 0; clearly P = TP + FN.
  • FP samples of class 0 are wrongly judged as class 1 by the system, and TN samples of class 0 are correctly judged as class 0; clearly N = FP + TN.

Comparing the ground truth against the prediction results:

TP (True Positives): positive samples correctly classified as positive.

TN (True Negatives): negative samples correctly classified as negative.

FP (False Positives): negative samples misclassified as positive.

FN (False Negatives): positive samples misclassified as negative.

 

1. Accuracy

A = (TP + TN)/(P + N) = (TP + TN)/(TP + FN + FP + TN). It reflects the classifier's ability to judge the whole sample set: to classify positives as positive and negatives as negative.

2. Precision

        P = TP/(TP+FP);

Note the difference from accuracy: accuracy is the number of correctly classified samples (both positive and negative) divided by the total number of samples. Accuracy is generally used to evaluate the global performance of a model and does not carry enough information to evaluate a model comprehensively.

Precision, in contrast, reflects the proportion of the samples judged positive by the classifier that are truly positive.

3. Recall

R = TP/(TP + FN) = 1 - FN/P. It reflects the proportion of positive samples that are correctly identified among all positive samples.

4. F1 score

F1 = 2 * Precision * Recall / (Precision + Recall);

This is the traditional F1 measure, the harmonic mean of precision and recall.

5. Probability of Missing Alarm

MA = FN/(TP + FN) = 1 - TP/P = 1 - R. It reflects how many positive samples are missed.

6. False Alarm probability

FA = FP/(TP + FP) = 1 - Precision. It reflects how many of the samples judged positive are in fact misjudged negatives.
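To make the formulas above concrete, here is a minimal Python sketch (the function name and example numbers are my own, for illustration only; division-by-zero cases are not handled):

```python
def basic_metrics(tp, fp, tn, fn):
    """Compute the metrics defined above from the four confusion counts."""
    accuracy    = (tp + tn) / (tp + fp + tn + fn)   # A  = (TP + TN) / (P + N)
    precision   = tp / (tp + fp)                    # P  = TP / (TP + FP)
    recall      = tp / (tp + fn)                    # R  = TP / (TP + FN)
    f1          = 2 * precision * recall / (precision + recall)
    miss_rate   = fn / (tp + fn)                    # MA = 1 - R
    false_alarm = fp / (tp + fp)                    # FA = 1 - Precision
    return accuracy, precision, recall, f1, miss_rate, false_alarm

# Hypothetical counts: P = 100 positives (TP=80, FN=20), N = 200 negatives (FP=10, TN=190)
print(basic_metrics(tp=80, fp=10, tn=190, fn=20))
```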

7. Confusion Matrix

The horizontal axis of the confusion matrix shows the counts of the categories predicted by the model, and the vertical axis shows the counts of the actual labels of the data.

The diagonal entries are the numbers of predictions that agree with the data labels, so the sum of the diagonal divided by the total number of test samples is the accuracy. The larger the diagonal entries the better; in a visualization, the darker the color on the diagonal, the higher the model's prediction accuracy for that class. Looking along a row, the off-diagonal entries correspond to the mispredicted categories. In general, we want the diagonal entries to be as large as possible and the off-diagonal entries as small as possible.
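As a rough sketch of how such a matrix can be accumulated and read (the function and variable names are illustrative assumptions; labels are assumed to be integers 0..num_classes-1):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows: actual labels; columns: predicted labels (as described above)."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)
# Sum of the diagonal divided by the total number of samples = accuracy
print(np.trace(cm) / cm.sum())
```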

8. Precision and Recall

  

Some related definitions. Suppose the test set consists only of images of geese and aircraft, and the classification system is designed to retrieve all the aircraft images in the test set while excluding the geese.

  • True positives: a positive sample is correctly identified as positive; here, an aircraft image is correctly identified as an aircraft.
  • True negatives: a negative sample is correctly identified as negative; here, a goose image is not retrieved, because the system correctly treats it as a goose.
  • False positives: a negative sample is wrongly identified as positive; here, a goose image is mistakenly identified as an aircraft.
  • False negatives: a positive sample is wrongly identified as negative; here, an aircraft image is not retrieved, because the system mistakes it for a goose.

Precision is the proportion of true positives among the retrieved images, i.e., what percentage of the images identified as aircraft in this example are actually aircraft. It reflects the proportion of the samples judged positive by the classifier that are truly positive.

Recall is the proportion of all positive samples in the test set that are correctly identified as positive, i.e., the ratio of the number of correctly identified aircraft to the total number of aircraft in the test set.

Precision-recall curve: the recognition threshold is varied so that the system retrieves the top-K images in turn. Each change of the threshold changes the precision and recall values simultaneously, and plotting these pairs yields the curve.

If a classifier performs well, it should behave as follows: precision remains at a high level while recall grows. A poorly performing classifier, by contrast, may have to sacrifice a lot of precision in order to improve recall. Precision-recall curves are generally used in papers to show this tradeoff between precision and recall.
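The threshold sweep described above can be sketched roughly as follows (the names and example data are assumptions for illustration; ties and zero-division edge cases are ignored): each sample has a score, and lowering the threshold retrieves one more sample, giving one (recall, precision) point per threshold.

```python
import numpy as np

def pr_curve(scores, labels):
    """Sweep the threshold over the sorted scores; labels are 1 (positive) / 0 (negative)."""
    order = np.argsort(-np.asarray(scores))   # sort by score, largest first
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels == 1)               # TP when the top-k samples are retrieved
    fp = np.cumsum(labels == 0)               # FP when the top-k samples are retrieved
    precision = tp / (tp + fp)
    recall = tp / max(labels.sum(), 1)
    return precision, recall

p, r = pr_curve(scores=[0.9, 0.8, 0.7, 0.6, 0.5], labels=[1, 0, 1, 1, 0])
print(list(zip(r, p)))                        # (recall, precision) pairs of the curve
```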

9. Average Precision (AP) and mean Average Precision (mAP)

AP is the area under the precision-recall curve. Generally speaking, the better a classifier is, the higher its AP value.

mAP is the average of the APs over multiple categories. The value of mAP always lies in [0, 1]; the larger the better. It is one of the most important evaluation metrics for object detection algorithms.

The PR curve reflects performance better when positive samples are very scarce.
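As a simplified sketch (my own approximation, not the exact interpolation rule used by a particular benchmark such as PASCAL VOC or COCO), AP can be estimated by numerically integrating precision over recall, and mAP is then the mean of the per-class APs:

```python
import numpy as np

def average_precision(precision, recall):
    """Approximate the area under the PR curve with a simple rectangular sum."""
    recall = np.concatenate(([0.0], recall))
    return float(np.sum(np.diff(recall) * precision))

# Hypothetical PR points for two classes; mAP is the mean of the per-class APs.
ap_class0 = average_precision(precision=[1.0, 0.67, 0.75], recall=[0.33, 0.67, 1.0])
ap_class1 = average_precision(precision=[1.0, 0.5, 0.6], recall=[0.5, 0.5, 1.0])
print("mAP =", np.mean([ap_class0, ap_class1]))
```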

  

10. IoU (Intersection over Union)

The value of IoU can be understood as the degree of overlap between the box predicted by the system and the box annotated in the original image. It is computed as the area of the intersection of the detection result and the ground truth divided by the area of their union, and it measures how accurate the detection is.

It is worth noting that when the IoU value exceeds 0.5, the detection generally looks subjectively acceptable.

IoU is the indicator used to express the difference between the predicted bounding box and the ground truth: IoU = area(Detection ∩ GroundTruth) / area(Detection ∪ GroundTruth).
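A minimal IoU computation for axis-aligned boxes, assuming (x1, y1, x2, y2) coordinates (the format and function name are illustrative):

```python
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```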

11. Receiver Operating Characteristic (ROC) Curve and AUC (Area Under Curve)

     

ROC curve:

  • Horizontal axis: false positive rate (FPR), FPR = FP / (FP + TN), the proportion of all negative samples that are wrongly predicted as positive, also called the false alarm rate;
  • Vertical axis: true positive rate (TPR), TPR = TP / (TP + FN), the proportion of all positive samples that are correctly predicted, also called the hit rate.

The diagonal corresponds to a random-guess model, while the point (0, 1) corresponds to an ideal model that ranks all positive examples ahead of all negative examples. The closer the curve is to the top-left corner, the better the classifier performs.

A nice property of the ROC curve is that it remains essentially unchanged when the distribution of positive and negative samples in the test set changes. In real data sets, class imbalance often occurs, i.e., there are far more negative samples than positive ones (or the opposite), and the class distribution of the test data may also change over time.

ROC curve drawing:

(1) Sort the test samples by their predicted probability of belonging to the positive class, from largest to smallest;

(2) Take each score value in turn, from high to low, as the threshold. When a test sample's predicted probability of being positive is greater than or equal to the threshold, it is classified as positive; otherwise it is classified as negative;

(3) Each choice of threshold yields one pair of FPR and TPR values, i.e., one point on the ROC curve.

When the threshold is set to 1 and to 0, the two points (0, 0) and (1, 1) on the ROC curve are obtained respectively. Connecting all the (FPR, TPR) pairs yields the ROC curve. The more threshold values are used, the smoother the ROC curve.
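The three drawing steps can be sketched as follows (the names are illustrative assumptions; each score in descending order is used once as the threshold, and (0, 0) is prepended to represent a threshold above every score):

```python
import numpy as np

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by taking each score as the threshold."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tps = np.cumsum(labels == 1)                # TP when threshold = k-th score
    fps = np.cumsum(labels == 0)                # FP when threshold = k-th score
    tpr = tps / max(labels.sum(), 1)            # TPR = TP / (TP + FN)
    fpr = fps / max((labels == 0).sum(), 1)     # FPR = FP / (FP + TN)
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

fpr, tpr = roc_points(scores=[0.9, 0.8, 0.7, 0.6], labels=[1, 1, 0, 1])
print(list(zip(fpr, tpr)))                      # points of the ROC curve
```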

Area Under Curve (AUC) is the area under the ROC curve. The closer the AUC is to 1, the better the classifier performs.

Physical meaning: the AUC value is a probability. If one positive sample and one negative sample are drawn at random, AUC is the probability that the current classification algorithm ranks the positive sample ahead of the negative sample according to the score it computes. The larger the AUC, the more likely the classifier is to rank positive samples ahead of negative ones, i.e., the better its classification.

Calculation: sum the areas of the rectangles (or trapezoids) under the ROC curve.
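A minimal sketch of that area computation, assuming the (FPR, TPR) points are already sorted by FPR (here with the trapezoidal rule; the function name and example points are my own):

```python
def auc(fpr, tpr):
    """Area under the ROC curve, accumulated trapezoid by trapezoid."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area

# The (FPR, TPR) points from the previous sketch:
print(auc([0.0, 0.0, 0.0, 1.0, 1.0], [0.0, 1/3, 2/3, 2/3, 1.0]))
# = 2/3, matching the rank view: 2 of the 3 positives score above the only negative.
```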

12. Comparison between PR curve and ROC curve

Characteristics of ROC curve:

(1) Advantage: when the distribution of positive and negative samples in the test set changes, the ROC curve remains essentially unchanged, because TPR focuses on positive samples and FPR focuses on negative samples, which makes it a relatively balanced evaluation method.


(2) Disadvantage: the advantage just mentioned, that the ROC curve does not change with the class distribution, is to some extent also its weakness. If the number of negative samples N increases greatly while the curve stays the same, this can hide a large number of false positives, which is unacceptable when, as in information retrieval, the main concern is the prediction accuracy on positive examples. Under class imbalance, the huge number of negative samples means that FPR grows only slightly even when FP grows a lot, so the ROC curve gives an overly optimistic estimate of the performance. The horizontal axis of the ROC curve is FPR; when the number of negatives N far exceeds the number of positives P, a large increase in FP produces only a small change in FPR. As a result, even if a large number of negative samples are wrongly classified as positive, this is not visible on the ROC curve. (Of course, one can also analyze only the left part of the ROC curve.)

PR curve:

(1) The PR curve uses precision, so both of its axes focus on positive samples. Because it mainly concerns positive samples, the PR curve is widely considered better than the ROC curve under class imbalance.

Usage Scenarios:

  1. The ROC curve is suitable for evaluating the overall performance of a classifier because it takes both positive and negative samples into account, while the PR curve focuses entirely on positive samples.
  2. If there are multiple data sets with different class distributions, for example credit-card fraud data in which the ratio of positive to negative samples may differ from month to month, and you simply want to compare classifier performance while eliminating the influence of changes in the class distribution, the ROC curve is more suitable, because such changes can make the PR curve fluctuate strongly and make models hard to compare. Conversely, if you want to examine how different class distributions affect classifier performance, the PR curve is more suitable.
  3. If you want to evaluate the prediction of positive samples under the same class distribution, you should choose the PR curve.
  4. For class-imbalance problems, the ROC curve usually gives an overly optimistic estimate, so the PR curve is better most of the time.
  5. Finally, depending on the specific application, the optimal point on the curve can be found to obtain the corresponding precision, recall, F1 score and other metrics, and the model's threshold can then be adjusted accordingly to obtain a model that fits the application (see the sketch after this list).
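A minimal sketch of point 5 (the names and the choice of F1 as the selection criterion are my own assumptions): sweep candidate thresholds on validation data, compute precision, recall, and F1 at each, and keep the threshold with the best F1.

```python
def best_threshold(scores, labels):
    """Pick the score threshold that maximizes F1 on the given validation data."""
    best = (0.0, None)  # (best F1, corresponding threshold)
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        if tp == 0:
            continue
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (f1, thr)
    return best

print(best_threshold(scores=[0.9, 0.8, 0.6, 0.4, 0.2], labels=[1, 1, 0, 1, 0]))
```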

13. Non-maximum Suppression (NMS)

Non-maximum suppression selects bounding boxes with high confidence based on the score matrix and the coordinate information of the regions: among boxes that overlap heavily, only the one with the highest score is retained.

(1) NMS computes the area of each bounding box and sorts the boxes by score. The bounding box with the highest score is placed first in the queue and serves as the reference for comparison.

(2) Compute the IoU between each of the other bounding boxes and the current highest-scoring box, remove the bounding boxes whose IoU exceeds the set threshold, and keep the prediction boxes with small IoU;

(3) Repeat the above process on the remaining boxes until the set of candidate bounding boxes is empty.

In summary, there are two thresholds in the bounding-box detection process: one is the IoU threshold, and the other is a score threshold used afterwards to delete candidate bounding boxes whose score is below it. Non-maximum suppression is performed one category at a time; if there are N categories, non-maximum suppression is carried out N times.
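A minimal per-class NMS sketch following steps (1) to (3) above (the box format, function names, and default threshold values are illustrative assumptions, not from the original text):

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes, as in section 10."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, iou_threshold=0.5, score_threshold=0.05):
    """Per-class NMS; boxes: list of (x1, y1, x2, y2), scores: confidences for ONE class."""
    # Second threshold mentioned above: drop low-confidence candidates first.
    candidates = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                        key=lambda i: scores[i], reverse=True)
    keep = []
    while candidates:
        best = candidates.pop(0)                  # current highest-scoring box
        keep.append(best)
        # Remove boxes that overlap the best box too much; keep the rest.
        candidates = [i for i in candidates if iou(boxes[i], boxes[best]) < iou_threshold]
    return keep

# If there are N categories, run this once per category.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, scores=[0.9, 0.8, 0.7]))         # box 1 is suppressed by box 0
```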