Object detection series:

  • Target Detection (1) R-CNN in detail: the first deep-learning work on object detection
  • Target Detection (2) SPP-Net: sharing convolutional computation
  • Target Detection (3) Fast R-CNN: making the R-CNN model trainable end to end
  • Target Detection (4) Faster R-CNN: replacing Selective Search with the RPN network
  • Target Detection (5) YOLOv1: opening the chapter of one-stage detection
  • Target Detection (6) YOLOv2: introducing Anchors, Better, Faster
  • Understanding the bounding-box regression losses of object detection: IoU, GIoU, DIoU, CIoU principles and Python code
  • FPN in detail: multi-scale feature fusion through the feature pyramid network

1. Introduction

YOLOv2 is the second-generation algorithm of the YOLO series. The original paper is titled "YOLO9000: Better, Faster, Stronger" and won the CVPR 2017 Best Paper Honorable Mention. The author made many improvements on top of YOLOv1 and introduced the Anchor mechanism proposed by Faster R-CNN. These improvements significantly raise YOLOv2's mAP compared with YOLOv1, while the speed remains very fast, keeping the advantage of a one-stage method. The comparison between YOLOv2 and Faster R-CNN, SSD and other models is shown in the figure below.

2. Principle and improvement strategy of YOLOv2

This article explains the improvement strategies of YOLOv2 relative to YOLOv1; if you are not familiar with the YOLOv1 algorithm, you can first read a technical blog on YOLOv1. The author made a large amount of improvement and optimization work going from YOLOv1 to YOLOv2, and the effect is significant: the mAP on VOC2007 goes from 63.4% to 78.6%.

2.1 Batch Normalization

By 2017, BN had already proved its usefulness and effectiveness. In the second generation of YOLO, the author adds a BN layer after every convolution layer. BN makes network training easier and has a regularization effect; using it improves mAP by about 2 percentage points. In addition, the author notes that once BN is used, the Dropout layer previously used to prevent overfitting can be removed.
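For illustration, here is a minimal PyTorch sketch of a convolution + BN + LeakyReLU block of the kind used throughout Darknet-style networks. This is not the author's original Darknet code; the class name `ConvBN` and the channel numbers are my own.

```python
import torch
import torch.nn as nn

class ConvBN(nn.Module):
    """Conv + BatchNorm + LeakyReLU: the basic building block of Darknet-19.
    With BN after every conv, the conv bias is redundant and Dropout is not needed."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Example: a 3x3 block mapping 3 -> 32 channels, spatial size unchanged
x = torch.randn(1, 3, 416, 416)
print(ConvBN(3, 32)(x).shape)  # torch.Size([1, 32, 416, 416])
```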

2.2 High Resolution Classifier

  • YOLOv1 uses 224 * 224 images as the network input when the backbone is pre-trained on ImageNet; during detection training the network then has to learn the detection task and adapt to a larger resolution at the same time. In YOLOv2 the author continues to fine-tune the classification network at the higher resolution of 448 * 448 for 10 epochs, letting the network get used to the large resolution first.
  • Meanwhile, the author removes a pooling layer so that the network outputs a denser feature map (7 * 7 –> 13 * 13).
  • The High Resolution Classifier can improve the mAP by nearly 4 percentage points.

2.3 the Anchor

2.3.1 Introducing the Anchor mechanism

In YOLOv1, the input image is finally divided into a 7 * 7 grid, and each cell predicts 2 bounding boxes. YOLOv1 uses fully connected layers to predict the bounding boxes directly, with widths and heights expressed relative to the whole image. Because objects in an image come in different scales and aspect ratios, it is hard for YOLOv1 to learn to adapt to the shapes of different objects during training, which also leads to poor localization accuracy. The author therefore borrows the anchor mechanism of Faster RCNN and generates several anchors for each point of the feature map output by the network:

  • Anchor is a virtual bounding box.
  • The actual Bbox prediction is generated by Anchor.

In YOLOv1 the final bounding box is regressed directly: the predicted value is fit to the GT value. Faster RCNN instead regresses offsets, i.e., it fits the deviation between the prediction and a prior box, which makes the model easier to learn. YOLOv2 therefore removes the fully connected layers of YOLOv1 and predicts bounding boxes with convolutions and anchor boxes. To increase the resolution of the feature map used for detection, one pooling layer is removed. In the detection model, YOLOv2 does not use 448 * 448 images as input but 416 * 416 ones: since the total downsampling stride of the YOLOv2 model is 32, a 416 * 416 input gives a final feature map of size 13 * 13. The dimensions are odd, so the feature map has exactly one central position.

For YOLOv1, each cell predicts 2 boxes, and each box contains 5 values: (x, y, w, h, c). The first four values are the position and size of the bounding box, and the last is the confidence score (combining the probability that an object is present and the IoU between the predicted box and the ground truth). However, each cell predicts only one set of class probabilities (conditional probabilities given an object), shared by the two boxes. After anchor boxes are introduced in YOLOv2, each anchor box at each location predicts its own set of class probabilities.
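To make the difference concrete, here is a quick back-of-the-envelope calculation of the output tensor sizes, assuming the VOC setting of 20 classes and 5 anchors (variable names are only illustrative):

```python
# YOLOv1: each of the S*S cells predicts B boxes (5 values each) plus one shared set of C class probs
S, B, C = 7, 2, 20
yolo_v1_out = S * S * (B * 5 + C)      # 7*7*30 = 1470 values
yolo_v1_boxes = S * S * B              # 98 boxes in total

# YOLOv2: each of the 13*13 cells predicts A anchors, each with 5 box values AND its own C class probs
S2, A = 13, 5
yolo_v2_out = S2 * S2 * A * (5 + C)    # 13*13*125 = 21125 values
yolo_v2_boxes = S2 * S2 * A            # 845 boxes with 5 anchors (over a thousand with more anchors)

print(yolo_v1_out, yolo_v1_boxes)      # 1470 98
print(yolo_v2_out, yolo_v2_boxes)      # 21125 845
```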

After using anchor boxes, YOLOv2's mAP drops slightly (my guess for the drop is that YOLOv2 still largely follows YOLOv1's training scheme even though it uses anchor boxes). YOLOv1 can only predict 98 bounding boxes (7 * 7 * 2), while YOLOv2 with anchor boxes predicts far more (13 * 13 cells times the number of anchors, on the order of a thousand). As a result, the recall of YOLOv2 rises considerably, from 81% to 88%.

2.3.2 Anchor clustering

In Faster RCNN, each point of the feature map generates 9 anchors, and the number and shapes of these anchors come from engineering experience. The author then asks how to obtain better anchors for a new dataset, and proposes running k-means clustering on the GT boxes to obtain better anchor hyperparameters. The figure below shows the author's cluster analysis of how the choice of k affects the final result (k=5 is chosen in the paper to balance accuracy and computation speed):

The author defines the k-means clustering distance as d(box, centroid) = 1 - IoU(box, centroid), so that the clustering is driven by IoU rather than by absolute box size.
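Below is a minimal NumPy sketch of the idea (my own implementation, not the author's code): boxes are represented only by their width and height, and IoU is computed as if all boxes shared the same center.

```python
import numpy as np

def wh_iou(boxes, centroids):
    """IoU between (w, h) pairs, assuming all boxes share the same center."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)

def kmeans_anchors(boxes_wh, k=5, iters=100, seed=0):
    """Cluster GT box (w, h) with distance d = 1 - IoU; returns k anchor (w, h) pairs."""
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(boxes_wh, centroids), axis=1)  # highest IoU = smallest distance
        new = np.array([boxes_wh[assign == i].mean(axis=0) if np.any(assign == i) else centroids[i]
                        for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# Example with random (w, h) pairs already scaled to the 13x13 feature map
boxes = np.random.rand(1000, 2) * 13
print(kmeans_anchors(boxes, k=5))
```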

2.3.3 Anchor, True BBoxes & Predicted BBoxes

This part explains how the anchor mechanism is combined with the GT boxes and the predicted boxes in practice.

The first part explains the meaning of an anchor. Look first at the output of one anchor clustering run, for example:

The clustering outputs 5 anchors; each anchor is a pair of numbers, both in the range [0, 13] (the output feature map is 13*13). The anchor width and height in the original image are mapped onto the feature map scale, i.e., each width is divided by the width of the original image and multiplied by 13, and likewise for the height.

Part 2 then looks at how GT Boxes are handled:

  • Original bbox (pixels): [x_o, y_o, w_o, h_o], where x_o, w_o ∈ [0, W] and y_o, h_o ∈ [0, H]
  • Normalize to 0~1: [x_r, y_r, w_r, h_r] = [x_o / W, y_o / H, w_o / W, h_o / H] ∈ [0, 1]
  • Transfer to feature map size: [x, y, w, h] = [x_r, y_r, w_r, h_r] * 13
  • Transfer to per-grid-cell offsets in 0~1:


\begin{cases}
x_f = x - i\\
y_f = y - j\\
w_f = \log(w / anchors[0])\\
h_f = \log(h / anchors[1])
\end{cases}

In the above formula, i and j are the indices of the grid cell, i.e., i = floor(x) and j = floor(y). x_f and y_f are the offsets of the Bbox center relative to the grid cell it belongs to; since the GT Bbox center must lie inside that cell, x_f, y_f ∈ [0, 1).
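Putting the steps above together, here is a small illustrative sketch (my own code, with hypothetical function and variable names) that encodes one GT box into the training target of its grid cell:

```python
import math

def encode_gt_box(bbox_xywh, img_w, img_h, anchor_wh, grid=13):
    """Encode a GT box (center x, y, width, height in pixels) into YOLOv2-style targets."""
    xo, yo, wo, ho = bbox_xywh
    # normalize to [0, 1], then map onto the 13x13 feature map
    x, y = xo / img_w * grid, yo / img_h * grid
    w, h = wo / img_w * grid, ho / img_h * grid
    i, j = int(math.floor(x)), int(math.floor(y))   # grid cell that owns the box center
    x_f, y_f = x - i, y - j                         # offsets inside the cell, in [0, 1)
    w_f = math.log(w / anchor_wh[0])                # log-scale ratios w.r.t. the matched anchor
    h_f = math.log(h / anchor_wh[1])
    return (i, j), (x_f, y_f, w_f, h_f)

# Example: a 120x80 box centered at (200, 150) in a 416x416 image, matched to a 3x4 anchor
print(encode_gt_box((200, 150, 120, 80), 416, 416, (3.0, 4.0)))
```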

The last part shows how anchors, GT boxes and predicted BBoxes are used together. First, recall how offsets are regressed in Faster RCNN:


\begin{cases}
t_x^p = \frac{x_p - x_a}{w_a},\quad t_y^p = \frac{y_p - y_a}{h_a}\\
t_w^p = \log\left(\frac{w_p}{w_a}\right),\quad t_h^p = \log\left(\frac{h_p}{h_a}\right)\\
t_x^g = \frac{x_g - x_a}{w_a},\quad t_y^g = \frac{y_g - y_a}{h_a}\\
t_w^g = \log\left(\frac{w_g}{w_a}\right),\quad t_h^g = \log\left(\frac{h_g}{h_a}\right)
\end{cases}

If, for example, t_x^p = 3, the predicted center is shifted by 3 * w_a, so the predicted Bbox center may well end up outside the current grid cell. To correct this, the offset has to be constrained. YOLOv2 modifies the original anchor offset regression: it fits the offset of the Bbox center relative to its grid cell and limits that offset to the range [0, 1], as shown in the following formula:


\begin{cases}
\sigma(t_x^p) = b_x - C_x,\quad \sigma(t_y^p) = b_y - C_y\\
t_w^p = \log\left(\frac{w_p}{w_a'}\right),\quad t_h^p = \log\left(\frac{h_p}{h_a'}\right)\\
t_x^g = g_x - \mathrm{floor}(g_x),\quad t_y^g = g_y - \mathrm{floor}(g_y)\\
t_w^g = \log\left(\frac{w_g}{w_a'}\right),\quad t_h^g = \log\left(\frac{h_g}{h_a'}\right)
\end{cases}

In the above formula, σ is the sigmoid function; b_x, b_y ∈ [0, 13] are the center coordinates of the predicted box mapped onto the 13*13 feature map; C_x, C_y are the coordinates of the grid cell containing that center (the integer part of b_x, b_y); w_a', h_a' ∈ [0, 13] are the anchor width and height mapped onto the feature map; g_x, g_y ∈ [0, 13] are the GT box center coordinates mapped onto the feature map; and g_w, g_h ∈ [0, 13] are the GT box width and height mapped in the same way.
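For the decoding side, here is a small sketch (my own code) that follows the formula above and turns raw outputs (t_x, t_y, t_w, t_h) at grid cell (C_x, C_y) with anchor (w_a', h_a') into a box in 13*13 feature-map coordinates:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_pred(t, cell_xy, anchor_wh):
    """t = (t_x, t_y, t_w, t_h) raw outputs; cell_xy = (C_x, C_y); anchor_wh in feature-map units."""
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell_xy
    b_x = sigmoid(t_x) + c_x              # sigmoid keeps the center inside its grid cell
    b_y = sigmoid(t_y) + c_y
    b_w = anchor_wh[0] * math.exp(t_w)    # width/height are scaled from the anchor
    b_h = anchor_wh[1] * math.exp(t_h)
    return b_x, b_y, b_w, b_h             # all in [0, 13] feature-map coordinates

# Example: raw outputs at cell (6, 4) with a 3x4 anchor
print(decode_pred((0.2, 0.8, 0.1, -0.3), (6, 4), (3.0, 4.0)))
```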

2.4 Network Structure

The author no longer uses VGG as the backbone. Although ResNet had already proved very effective at the time, the author does not adopt the ResNet structure either, but designs his own network, named Darknet-19. The figure below shows the Darknet-19 classification network. For the detection task, the author changes the input image size to 416*416 and removes the last pooling layer and the fully connected layer (marked with the red box) to increase the resolution; the network then outputs a feature map of size 13*13*1024.
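To see why a 416*416 input ends up as a 13*13 feature map, here is a toy PyTorch sketch. It is not the real Darknet-19, just five conv + max-pool stages standing in for a backbone with total stride 32:

```python
import torch
import torch.nn as nn

# A toy backbone with total stride 32 (5 max-pools), standing in for Darknet-19
layers, in_ch = [], 3
for out_ch in [32, 64, 128, 256, 512]:
    layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
               nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.1),
               nn.MaxPool2d(2, 2)]
    in_ch = out_ch
backbone = nn.Sequential(*layers)

x = torch.randn(1, 3, 416, 416)
print(backbone(x).shape)  # torch.Size([1, 512, 13, 13]): 416 / 32 = 13, odd, so there is one center cell
```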

2.5 Fine-grained Features

YOLOv1 detects small targets poorly. On the one hand this is related to the loss function design: the loss is biased towards large targets, which hurts the learning of small ones. On the other hand it is related to the coarse features used by the model: the network keeps downsampling and loses many details. The shallow features learned by a deep model are details such as corners and edges of objects, while the deep features tend to be abstract semantics. Both shallow and deep features therefore matter for detection, and the localization task in particular needs detail information to frame objects accurately. YOLOv2 thus borrows the idea of shortcut connections from ResNet and fuses in finer-grained features. The 26*26*512 feature map is turned into a 13*13*256 tensor by the passthrough layer (in the released implementation a 1*1 convolution first compresses the 512 channels to 64 before the passthrough) and concatenated with the original 13*13*1024 coarse-grained features. The network structure is shown in the figure below:

The passthrough layer splits the feature map by taking every other element, so that the spatial size of the feature map is halved while the number of channels becomes 4 times the original, as shown in the figure below:

Such feature fusion design improves the final mAP by 1 point.
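For reference, here is a minimal PyTorch sketch of the passthrough (space-to-depth) operation; this is my own implementation, and the channel ordering may differ from the original Darknet reorg layer.

```python
import torch

def passthrough(x, stride=2):
    """Rearrange a (N, C, H, W) tensor into (N, C*stride^2, H/stride, W/stride) without losing values."""
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 3, 5, 1, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

fine = torch.randn(1, 64, 26, 26)                  # 26x26x64 fine features (after the 1x1 conv)
coarse = torch.randn(1, 1024, 13, 13)              # 13x13x1024 coarse features
fused = torch.cat([passthrough(fine), coarse], 1)  # 13x13x(256 + 1024)
print(fused.shape)                                 # torch.Size([1, 1280, 13, 13])
```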

2.6 Multi – Scale Training

Multi-scale training makes the network adapt to images of different sizes and improves the generalization ability of the model. It improves mAP by about 1 percentage point.

  • Remove the fully connected layers: the network can then accept images of any size, improving the robustness of the model.
  • Multi-scale: the input size is drawn from {320, 352, 384, ..., 608} and changed every 10 batches during training (the paper switches every 10 batches, not epochs); see the sketch below.
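A minimal sketch of this schedule (my own code; the scale set and the 10-batch period follow the paper):

```python
import random

SCALES = list(range(320, 609, 32))   # {320, 352, ..., 608}: all multiples of the stride 32

scale = 416
for batch_idx in range(50):
    if batch_idx % 10 == 0:          # draw a new input resolution every 10 batches
        scale = random.choice(SCALES)
        print(f"batch {batch_idx}: training at {scale}x{scale}")
    # images in this batch would be resized to (scale, scale) before the forward pass
```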

2.7 Loss equation

The author does not give the details of the loss computation or of the positive/negative sample assignment in the original paper (perhaps to push readers toward the source code). You can refer to this technical blog for an explanation of the loss computation in the author's code.

For the loss you can also refer to the technical blog zhuanlan.zhihu.com/p/82099160, whose core formula is transcribed below for reference:

3. Analysis of YOLOv2’s effects and advantages and disadvantages

3.1 YOLOv2 effect

As shown in the figure below, YOLOv2's advantages can be summarized as higher accuracy and higher speed: at comparable scales on the VOC2007 dataset, YOLOv2 improves mAP accuracy by 13.4% and inference speed by about 50%.

3.2 Advantages and Disadvantages of YOLOv2

Advantages:

  • No fully connected layers and fast: the Darknet-19 structure contains only convolution and pooling layers, has fewer layers than YOLOv1's network, and without fully connected layers requires less computation, so the model runs faster.
  • Using convolutions instead of fully connected layers removes the constraint on input size.
  • It is faster and more accurate, thanks to many optimization techniques.

Disadvantages:

  • There is room for improvement in accuracy.
  • The detection performance of small targets is not very good and needs to be improved.
  • Dense objects are hard to detect: although the resolution is higher and the overlap problem of YOLOv1 is greatly alleviated, objects whose centers fall in the same grid cell and match the same anchor still compete for one prediction, so heavily overlapped objects (think Russian nesting dolls) remain difficult.

Reference

  1. arxiv.org/pdf/1612.08…
  2. zhuanlan.zhihu.com/p/35325884
  3. zhangxu.blog.csdn.net/article/det…
  4. www.cnblogs.com/YiXiaoZhou/…
  5. zhuanlan.zhihu.com/p/82099160