1. Concept introduction


In a multi-label image classification task, each image can carry more than one label, so the standard metric of ordinary single-label classification, i.e. mean accuracy, cannot be used for evaluation. Instead, this task uses a metric borrowed from information retrieval: mAP (mean Average Precision). AP measures the quality of the learned model on a single category, while mAP measures its quality across all categories. Once the per-category AP values are obtained, computing mAP is very simple: just take the average of all the APs.
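Written out as a formula (the symbols AP_c and C are introduced here for clarity; they do not appear in the original text): if AP_c is the average precision of class c and there are C classes in total, then

mAP = (AP_1 + AP_2 + … + AP_C) / C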


2. Calculation method

Although the name mAP looks similar to mean accuracy, the calculation is much more complicated. mAP is computed as follows:

First, obtain the confidence scores of all test samples with the trained model, and save the scores for each category (such as CAR) in a file (such as comp1_cls_test_CAR.txt). Assume there are 20 test samples in total; the ID, confidence score, and ground-truth label of each sample are as follows:

[Table: ID, confidence score, and ground-truth label of the 20 test samples]

Then sort the samples by confidence score to obtain the ranked table:

[Table: the 20 test samples sorted by confidence score, from highest to lowest]
This sorted table is important: all subsequent precision and recall calculations are based on it.
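Here is a rough Python sketch (not the original author's code) of the loading and sorting step; the "image_id confidence" line format of the score file and the numeric image IDs are assumptions based on the description above:

```python
# Minimal sketch (not the original author's code): load one category's confidence
# scores and sort the test samples by score, highest first.
# Assumption: each line of the score file is "image_id confidence".
def load_ranked_samples(score_file, positive_ids):
    """Return (image_id, confidence, is_positive) tuples sorted by descending confidence."""
    samples = []
    with open(score_file) as f:
        for line in f:
            image_id, score = line.split()
            samples.append((int(image_id), float(score), int(image_id) in positive_ids))
    samples.sort(key=lambda s: s[1], reverse=True)
    return samples

# Hypothetical usage for the CAR category (the 6 positive IDs come from the example below):
# ranked = load_ranked_samples("comp1_cls_test_CAR.txt", positive_ids={2, 4, 7, 9, 16, 20})
```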

Then calculate precision and recall, which are defined as follows:

Precision = true positives / (true positives + false positives)

Recall = true positives / (true positives + false negatives)

Here, true positives + false positives are the elements we select in the classification task. For example, suppose we classify the test samples with a trained CAR model and want the top-5 results:

[Figure: the top-5 samples selected from the sorted table]

In this example, the true positives are images 4 and 2, and the false positives are images 13, 19, and 6. The false negatives and true negatives are the remaining elements, i.e. the samples whose confidence scores fall outside the top-5.


Here, the false negatives are images 9, 16, 7, and 20, and the true negatives are images 1, 18, 5, 15, 10, 17, 12, 14, 8, 11, and 3.

So in this example Precision = 2/5 = 40%: for the CAR category we selected 5 samples, of which 2 are correct, giving a precision of 40%. Recall = 2/6 ≈ 33%: there are 6 cars among all the test samples, but we retrieved only 2 of them, so the recall is about 33%.
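To make the arithmetic concrete, here is a small Python sketch (not from the original post) that reproduces these two numbers; the exact ordering of the samples is not fully specified in the text, so the order used below is an assumption:

```python
# Sketch: precision and recall at a cut-off k, using the numbers from the example.
def precision_recall_at_k(ranked_is_positive, k, total_positives):
    """ranked_is_positive: booleans ordered by descending confidence."""
    true_positives = sum(ranked_is_positive[:k])
    return true_positives / k, true_positives / total_positives

positives = {2, 4, 7, 9, 16, 20}                    # the 6 CAR images in the example
top5 = [4, 13, 2, 19, 6]                            # top-5 by confidence (order assumed)
rest = [i for i in range(1, 21) if i not in top5]   # remaining 15 samples (order assumed)
ranked_is_positive = [i in positives for i in top5 + rest]

print(precision_recall_at_k(ranked_is_positive, k=5, total_positives=len(positives)))
# -> (0.4, 0.3333...)  i.e. precision 40%, recall ~33%
```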

In practical multi-class classification tasks, measuring a model only at top-5 is usually not enough; we need the precision and recall of the model from top-1 to top-N (where N is the total number of test samples, 20 in this example). Clearly, as more and more samples are selected, recall keeps increasing while precision, on the whole, decreases. Plotting recall on the abscissa and precision on the ordinate gives the commonly used precision-recall curve. The precision-recall curve for this example is as follows:

[Figure: precision-recall curve for this example]
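A sketch (not the original author's code) of how the points of such a curve can be generated by sweeping the cut-off k from 1 to N:

```python
# Sketch: collect (recall, precision) points for cut-offs k = 1 .. N.
def pr_curve(ranked_is_positive):
    total_positives = sum(ranked_is_positive)
    points, true_positives = [], 0
    for k, is_pos in enumerate(ranked_is_positive, start=1):
        true_positives += is_pos
        points.append((true_positives / total_positives,  # recall (x-axis)
                       true_positives / k))               # precision (y-axis)
    return points

# points = pr_curve(ranked_is_positive)   # using ranked_is_positive from the sketch above
```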

Next, AP is calculated following the PASCAL VOC Challenge. Start by setting a group of thresholds [0, 0.1, 0.2, …, 1]. For each threshold, take the maximum precision among the points whose recall is greater than or equal to that threshold (e.g. recall ≥ 0.3). This yields 11 precision values, and AP is the average of these 11 values. This method is called 11-point interpolated average precision.
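A sketch of the 11-point computation just described; treating a recall threshold that no point reaches as contributing 0 is a common convention and an assumption here:

```python
# Sketch: 11-point interpolated average precision.
def ap_11_point(points):
    """points: list of (recall, precision) pairs, e.g. from pr_curve() above."""
    total = 0.0
    for threshold in [i / 10 for i in range(11)]:       # 0.0, 0.1, ..., 1.0
        reachable = [p for r, p in points if r >= threshold]
        total += max(reachable) if reachable else 0.0   # assumption: 0 if unreachable
    return total / 11
```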

Since 2010, however, the PASCAL VOC Challenge has calculated AP differently. The new method assumes there are M positive examples among the N samples; these give M recall values (1/M, 2/M, …, M/M). For each recall value r, compute the maximum precision over all points with recall r' ≥ r, and then average these M precision values to obtain the final AP. The calculation is as follows:

AP = (1/M) × Σ_{r ∈ {1/M, 2/M, …, 1}} max{ Precision(r') : r' ≥ r }
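A sketch of this calculation, following the description above rather than the official VOC development kit:

```python
# Sketch: post-2010 style AP with M positives -> recall levels 1/M, 2/M, ..., 1.
def ap_all_recall_levels(ranked_is_positive):
    total_positives = sum(ranked_is_positive)              # M
    points = pr_curve(ranked_is_positive)                  # from the sketch above
    total = 0.0
    for m in range(1, total_positives + 1):
        r = m / total_positives
        total += max(p for rec, p in points if rec >= r)   # max precision with r' >= r
    return total / total_positives
```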

The corresponding precision-recall curve (which is monotonically decreasing) is as follows:

[Figure: the corresponding monotonically decreasing precision-recall curve]
