Abstract: This article gives a brief review of the basics of object detection algorithms, for readers to study and refer back to.

This article is shared from the Huawei Cloud Community “Target Detection Basics” by Lutianfei.

We are already familiar with the image classification task, in which an algorithm classifies the objects that appear in an image. Today we will look at another problem in building neural networks: object detection. Here the algorithm must not only determine whether, say, a car appears in the picture, but also mark the car's position by surrounding it with a bounding box. This article gives a brief review of the basics of object detection algorithms.

Basics of object detection

Stages of a detection network

  • Two-stage: a first-stage network extracts candidate regions; a second-stage network classifies the extracted candidate regions and regresses their precise coordinates, e.g. the R-CNN series.
  • One-stage: the candidate-region extraction step is dropped, and a single network performs both classification and regression, e.g. YOLO and SSD.

Why are one-stage networks inferior to two-stage networks?

Because the positive and negative examples seen during training are imbalanced:

  • There are far too many negative examples and too few positive examples, so the loss from the negatives completely drowns out that of the positives;
  • Most negative examples are easy to distinguish, so the network cannot learn useful information from them. If the training data contains a large number of such samples, it is difficult for the network to converge.

How two-stage networks address the training imbalance

  • In the RPN, the most likely candidate regions are selected according to their foreground confidence, which avoids a large number of easily distinguished negative examples.
  • During training, samples are selected according to their IoU with the ground truth, and the ratio of positive to negative samples is set to 1:3 to prevent an excess of negative examples (a sampling sketch follows).
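
Below is a minimal sketch of such 1:3 sampling, assuming the candidate regions already carry binary positive/negative labels; the function name sample_rois and the batch size of 128 are illustrative, not taken from the article.

```python
import numpy as np

def sample_rois(labels, batch_size=128, pos_fraction=0.25, rng=None):
    """labels: array with 1 for positive and 0 for negative candidate regions."""
    rng = rng or np.random.default_rng()
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)

    num_pos = min(int(batch_size * pos_fraction), len(pos_idx))  # at most 1/4 positives
    num_neg = min(batch_size - num_pos, len(neg_idx))            # the rest are negatives

    keep = np.concatenate([rng.choice(pos_idx, num_pos, replace=False),
                           rng.choice(neg_idx, num_neg, replace=False)])
    return keep  # indices of the regions used in this mini-batch
```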

Common datasets

Pascal VOC dataset

The dataset is available in two versions, 2007 and 2012, and contains 20 categories of objects.

The five main tasks of PASCAL VOC:

  • ① Classification: for each of the 20 categories, determine whether it is present in the test image;
  • ② Detection: detect the positions of the target objects in the test image and give their bounding-box coordinates;
  • ③ Segmentation: for every pixel in the test image, determine which category it belongs to (if none of the 20 categories contains the pixel, it belongs to the background);
  • ④ Human action recognition (given the position of the bounding box);
  • ⑤ Large-scale recognition (hosted by ImageNet).

When the .xml annotation file corresponding to an image is imported, each image's annotations are stored in an individual dict with the following attributes (a parsing sketch follows the list):

  • The attribute ‘boxes’
  • The attribute ‘gt_classes’
  • The attribute ‘gt_overlaps’
  • The attribute ‘flipped’
  • The attribute ‘seg_areas’
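
Below is a rough sketch of how such a dict might be filled from a VOC-style .xml file; the class list is truncated for brevity, the key names mirror the list above, and gt_overlaps is omitted. This is an illustration, not the official PASCAL VOC loader.

```python
import numpy as np
import xml.etree.ElementTree as ET

CLASSES = ('aeroplane', 'bicycle', 'bird', 'car', 'person')  # subset, for illustration
CLASS_TO_IND = {c: i for i, c in enumerate(CLASSES)}

def load_voc_annotation(xml_path):
    objs = ET.parse(xml_path).findall('object')
    boxes = np.zeros((len(objs), 4), dtype=np.float32)
    gt_classes = np.zeros(len(objs), dtype=np.int32)
    seg_areas = np.zeros(len(objs), dtype=np.float32)

    for i, obj in enumerate(objs):
        bb = obj.find('bndbox')
        x1, y1 = float(bb.find('xmin').text), float(bb.find('ymin').text)
        x2, y2 = float(bb.find('xmax').text), float(bb.find('ymax').text)
        boxes[i] = [x1, y1, x2, y2]
        gt_classes[i] = CLASS_TO_IND[obj.find('name').text.strip()]
        seg_areas[i] = (x2 - x1 + 1) * (y2 - y1 + 1)

    return {'boxes': boxes, 'gt_classes': gt_classes,
            'flipped': False, 'seg_areas': seg_areas}
```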

COCO dataset

The dataset has three versions: 2014, 2015 and 2017.

All annotation information is managed uniformly in the annotations folder. For example, the detection and segmentation annotation file for train2014 is instances_train2014.json.

Object Instances, Object Keypoints and Image Captions are the three types of annotation (a loading sketch follows).
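
A minimal sketch of reading the Object Instances annotations with the pycocotools API; the file path is illustrative.

```python
from pycocotools.coco import COCO

coco = COCO('annotations/instances_train2014.json')  # loads and indexes the JSON file

img_ids = coco.getImgIds()                   # all image ids
ann_ids = coco.getAnnIds(imgIds=img_ids[0])  # annotation ids for one image
anns = coco.loadAnns(ann_ids)                # each ann holds 'bbox', 'category_id', ...

for ann in anns:
    x, y, w, h = ann['bbox']                 # COCO boxes are [x, y, width, height]
    cat_name = coco.loadCats(ann['category_id'])[0]['name']
    print(cat_name, x, y, w, h)
```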

Common evaluation metrics

  • True positives (TP): the number of examples correctly classified as positive, i.e. examples that are actually positive and are classified as positive.
  • False positives (FP): the number of examples incorrectly classified as positive, i.e. examples that are actually negative but are classified as positive.
  • False negatives (FN): the number of examples incorrectly classified as negative, i.e. examples that are actually positive but are classified as negative.
  • True negatives (TN): the number of examples correctly classified as negative, i.e. examples that are actually negative and are classified as negative.

Precision = TP / (TP + FP) = TP / all examples predicted as positive by the model

Recall = TP / (TP + FN) = TP / all examples whose true class is positive
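
For example, with hypothetical counts TP = 80, FP = 20 and FN = 40:

```python
TP, FP, FN = 80, 20, 40      # hypothetical counts
precision = TP / (TP + FP)   # 80 / 100 = 0.8
recall = TP / (TP + FN)      # 80 / 120 ≈ 0.667
```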

PR curve

Ideally both P and R would be as high as possible, but in practice the two are at odds in some cases.

So what we need to do is find a balance between precision and recall. One method is to draw the PR curve and then use the area under it (AUC) to judge the model (a sketch follows).
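
Below is a minimal sketch of how a PR curve can be built from detections sorted by confidence, assuming we already know which detections are true positives; the scores, labels and ground-truth count are made up for illustration.

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5])  # hypothetical detection confidences
is_tp  = np.array([1,   1,   0,   1,   0])    # 1 if the detection matches a ground truth
num_gt = 4                                    # total number of ground-truth boxes

order = np.argsort(-scores)                   # descending confidence
tp_cum = np.cumsum(is_tp[order])
fp_cum = np.cumsum(1 - is_tp[order])

precision = tp_cum / (tp_cum + fp_cum)
recall = tp_cum / num_gt
# The area under the (interpolated) precision-recall curve is the AP.
```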

The IoU metric

IoU is the ratio of the intersection to the union of the predicted box and the ground-truth box.

For each class, the area where the predicted box and the ground truth overlap is the intersection, and the total area they cover together is the union (a sketch of the computation follows).
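
A straightforward sketch of the computation for two axis-aligned boxes given in [x1, y1, x2, y2] format:

```python
def iou(box_a, box_b):
    """Boxes are [x1, y1, x2, y2]; returns intersection over union."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)  # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```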

PR in object detection

How mAP is calculated in VOC

Through the PR curve, we can get the corresponding AP value:

Prior to 2010, AP was defined in the PASCAL VOC competition as follows:

  • First, the model's predictions are sorted in descending order of confidence.
  • The recall axis from 0 to 1 is divided into 11 points: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0.
  • For each recall interval (0–0.1, 0.1–0.2, 0.2–0.3, …, 0.9–1.0), take the maximum precision, then average these maxima; the result is the AP value.

Since 2010, the PASCAL VOC competition has replaced these 11 recall points with all of the recall points that appear on the PR curve.

For a given recall value R, the precision is taken as the maximum precision over all recalls >= R (this ensures that the P-R curve is monotonically decreasing and avoids wobble). This method is called all-points interpolation, and the resulting AP value equals the area under the interpolated PR curve (a sketch of both methods follows).
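
A sketch of both definitions, closely following the widely used VOC evaluation code; the recall and precision arrays are assumed to come from cumulative TP/FP counts as in the PR-curve sketch above.

```python
import numpy as np

def voc_ap(recall, precision, use_11_point=False):
    """recall/precision: arrays along the PR curve, with recall non-decreasing."""
    if use_11_point:                               # pre-2010 definition
        ap = 0.0
        for r in np.arange(0.0, 1.1, 0.1):         # the 11 recall points
            p = precision[recall >= r]
            ap += (p.max() if p.size else 0.0) / 11.0
        return ap

    # All-points interpolation (post-2010): make precision monotonically
    # non-increasing, then sum the area under the step-wise curve.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(mpre.size - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```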


How mAP is calculated in COCO

AP is computed 10 times, using IoU thresholds from 0.5 to 0.95 in steps of 0.05 (the IoU threshold decides whether a detection counts as a TP), and the results are averaged (a sketch follows).
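
A sketch of the averaging only: ap_at_iou is a hypothetical helper that evaluates AP for one class at a given IoU threshold, e.g. by matching detections to ground truth and reusing voc_ap above.

```python
import numpy as np

def coco_style_ap(ap_at_iou):
    """ap_at_iou: hypothetical callable mapping an IoU threshold to an AP value."""
    thresholds = np.arange(0.5, 1.0, 0.05)   # 0.5, 0.55, ..., 0.95 (10 thresholds)
    return float(np.mean([ap_at_iou(t) for t in thresholds]))
```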

Non-maximum suppression

The NMS algorithm is used to remove redundant boxes after the model's predictions; an NMS threshold is usually set, e.g. nms_threshold = 0.5.

The specific idea of the implementation is as follows (a code sketch appears after the list):

  1. Among the boxes of this class, select the one with the highest score, mark it as box_best, and keep it.
  2. Compute the IoU of box_best with each of the remaining boxes.
  3. If the IoU is greater than 0.5, discard that box (since the two boxes probably represent the same target, only the one with the higher score is kept).
  4. From the boxes that remain, again find the one with the highest score, and repeat the process.
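
Below is a minimal per-class sketch of this procedure in NumPy; the function is written for illustration and not taken from any particular framework.

```python
import numpy as np

def nms(boxes, scores, nms_threshold=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,). Returns kept indices."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(-scores)                  # highest score first
    keep = []

    while order.size > 0:
        best = order[0]                          # box_best: highest remaining score
        keep.append(int(best))

        # IoU of box_best with every other remaining box
        xx1 = np.maximum(x1[best], x1[order[1:]])
        yy1 = np.maximum(y1[best], y1[order[1:]])
        xx2 = np.minimum(x2[best], x2[order[1:]])
        yy2 = np.minimum(y2[best], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[best] + areas[order[1:]] - inter)

        # Keep only boxes whose IoU with box_best is at or below the threshold
        order = order[1:][iou <= nms_threshold]

    return keep
```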
