
The YOLOv3 paper is only four A4 pages long excluding references. It contains no major changes or innovations, just a handful of small tweaks that nevertheless achieve good results. The author's writing style is rather casual, and many details are left unspecified in the paper, perhaps because he would rather people read his code. The main modifications are the following:

  • Backbone network: switched from Darknet-19 to Darknet-53.
  • Three output feature layers at different scales are used, better adapting the detector to small objects.
  • The number of anchors is increased.
  • Loss function: the softmax cross-entropy loss is replaced with binary cross-entropy loss.

1. Darknet-53 network architecture

As shown in the figure below, Darknet-53 is a classification network designed by the author. As the backbone of YOLOv3, Darknet-53 consists of a series of 3×3 and 1×1 convolutions, each followed by a BN layer and a LeakyReLU activation. Counting the final fully connected layer as a convolution, there are 53 convolutional layers in total, hence the name. The author also borrows the residual structure from ResNet, with each box in the figure representing one residual block. For classification the network takes a 256×256 input, while for detection a 416×416 resolution is used. Section 2 covers which three feature maps are used for prediction.
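As a minimal sketch of the Conv-BN-LeakyReLU unit and the residual block described above (a PyTorch rendering for illustration, not the author's original Darknet code; class names are my own):

```python
import torch
import torch.nn as nn

class ConvBNLeaky(nn.Module):
    """3x3 or 1x1 convolution followed by BatchNorm and LeakyReLU,
    the basic unit Darknet-53 is built from."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResidualBlock(nn.Module):
    """1x1 bottleneck then 3x3 convolution, with a skip connection
    (the ResNet-style structure each box in the figure represents)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = ConvBNLeaky(channels, channels // 2, 1)
        self.conv2 = ConvBNLeaky(channels // 2, channels, 3)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))

x = torch.randn(1, 256, 52, 52)     # e.g. a 52x52x256 feature map
print(ResidualBlock(256)(x).shape)  # torch.Size([1, 256, 52, 52])
```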

2. YOLOv3 model structure

Features learned by shallow layers are fine-grained details such as corners and edges, while features learned by deep layers tend to be abstract semantic features. At the same time, shallow layers have a small receptive field, making them better suited to detecting small objects, whereas deep layers have a large receptive field, making them better suited to predicting large objects. Following the idea of FPN, the author uses features at multiple scales for prediction in YOLOv3.

In YOLOv2, the author fused fine-grained features by concatenating them before making the final prediction. YOLOv3 uses Darknet-53 as the backbone with the fully connected layer removed. The feature maps produced after the last three residual stages serve as the three prediction scales: 52×52×256 (small objects), 26×26×512 (medium objects), and 13×13×1024 (large objects). These three feature layers are fused through a series of convolution and up-sampling layers, concatenating high-level features with low-level ones. I found a figure elsewhere that illustrates this very clearly and have reproduced it below.

At the end of the network, the three output tensors are 52×52×255, 26×26×255, and 13×13×255. This is for the COCO task, which has 80 categories; three anchors are generated at each location, so 255 = 3×(80+4+1).
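A rough sketch of one upsample-and-concat fusion step and the output channel arithmetic, following the shapes described above (the layer choices here are illustrative, not the official implementation):

```python
import torch
import torch.nn as nn

num_classes = 80
anchors_per_scale = 3
out_ch = anchors_per_scale * (num_classes + 4 + 1)  # 3 * 85 = 255

# Backbone features at the three scales (batch, channels, H, W)
f52 = torch.randn(1, 256, 52, 52)   # shallow, small objects
f26 = torch.randn(1, 512, 26, 26)   # medium objects
f13 = torch.randn(1, 1024, 13, 13)  # deep, large objects

# The deep branch is reduced with a 1x1 conv, upsampled 2x, and
# concatenated with the next shallower feature map (FPN-like fusion).
reduce13 = nn.Conv2d(1024, 256, 1)
up = nn.Upsample(scale_factor=2, mode="nearest")
f26_fused = torch.cat([up(reduce13(f13)), f26], dim=1)
print(f26_fused.shape)    # torch.Size([1, 768, 26, 26])

# Each scale ends in a 1x1 conv producing 255 output channels.
head13 = nn.Conv2d(1024, out_ch, 1)
print(head13(f13).shape)  # torch.Size([1, 255, 13, 13])
```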

3. Anchor

YOLOv3 follows the anchor mechanism of YOLOv2; for the relationship between anchors, predicted bboxes, and GT boxes (which is not easy to grasp), refer to section 2.3.3 of the YOLOv2 article. The number of anchors in YOLOv3 grows considerably, mainly because each of the three scales generates its own set of anchors (their sizes are obtained by clustering on the COCO dataset), as listed below.
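For concreteness, these are the nine COCO anchors (width × height in pixels at 416×416 input) reported in the paper, grouped three per scale; the grouping shown follows common reproductions:

```python
# Nine anchors from k-means clustering on COCO, as listed in the paper,
# grouped three per output scale (small boxes go to the 52x52 map, etc.).
anchors = {
    52: [(10, 13), (16, 30), (33, 23)],       # small objects
    26: [(30, 61), (62, 45), (59, 119)],      # medium objects
    13: [(116, 90), (156, 198), (373, 326)],  # large objects
}
```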

4. Design of Loss function

The author does not give an explicit loss function in the paper. The loss below is pieced together from the author's description and from reproduction code; it consists of a localization loss, a confidence loss, and a classification loss.
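Putting the three parts together, the total loss has the following form (the weighting coefficients are an implementation detail, not given in the paper, and vary across reproductions):

$$
L = \lambda_{loc} L_{loc} + \lambda_{conf} L_{conf} + \lambda_{cls} L_{cls}
$$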

4.1 Confidence loss

Target confidence can be understood as confidence = P(obj) × IOU, which covers two things: whether the predicted box contains a target, and how much the predicted box overlaps the ground-truth box. YOLOv3 changes the confidence loss from a sum of squared errors to binary cross-entropy. In the formula below, $o_i$ indicates whether there is a target in bounding box $i$: 1 means there is a target, 0 means there is not (some reproduction code instead treats $o_i$ as the IOU between the predicted and ground-truth boxes). The predicted probability is passed through a sigmoid.
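Written out from the description above (a reconstruction, since the paper gives no explicit formula; per-term weighting between positive and negative samples varies across reproductions and is omitted here):

$$
L_{conf} = -\sum_{i} \left[ o_i \ln(\hat{c}_i) + (1 - o_i)\ln(1 - \hat{c}_i) \right], \qquad \hat{c}_i = \sigma(c_i)
$$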

4.2 Classification loss

Binary cross-entropy is used here as well. The author argues that the same object can belong to multiple categories at once, e.g. a cat can be classified as both "cat" and "animal", which handles more complex scenes (though in my view this situation rarely arises in ordinary object detection). In the formula below, $O_{ij}$ indicates whether a ground-truth object of class $j$ exists in predicted bounding box $i$: 0 means it does not, 1 means it does; the probability in the formula is likewise passed through a sigmoid.
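Again a reconstruction from the description above, summed over the positive boxes and the 80 COCO classes:

$$
L_{cls} = -\sum_{i \in \text{pos}} \sum_{j=1}^{80} \left[ O_{ij} \ln(\hat{C}_{ij}) + (1 - O_{ij}) \ln(1 - \hat{C}_{ij}) \right], \qquad \hat{C}_{ij} = \sigma(C_{ij})
$$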

4.3 Bounding-box localization loss

This part is consistent with YOLOv2: a sum of squared errors is used, regressing offsets, and it is computed only for regions that actually contain a target. The offsets are defined as follows:

$$
\begin{cases}
\sigma(t_x^p) = b_x - C_x, \quad \sigma(t_y^p) = b_y - C_y \\
t_w^p = \log\left(\dfrac{w_p}{w_a'}\right), \quad t_h^p = \log\left(\dfrac{h_p}{h_a'}\right) \\
t_x^g = g_x - \mathrm{floor}(g_x), \quad t_y^g = g_y - \mathrm{floor}(g_y) \\
t_w^g = \log\left(\dfrac{w_g}{w_a'}\right), \quad t_h^g = \log\left(\dfrac{h_g}{h_a'}\right)
\end{cases}
$$

Taking the 13×13 scale as an example: $\sigma$ denotes the sigmoid function; $b_x, b_y \in [0, 13]$ are the predicted box center coordinates mapped onto the 13×13 grid; $(C_x, C_y)$ are the integer coordinates of the grid cell containing $(b_x, b_y)$, i.e. its top-left corner; $w_a', h_a' \in [0, 13]$ are the anchor width and height mapped onto the grid; $g_x, g_y \in [0, 13]$ are the GT box center coordinates mapped onto the grid; and $g_w, g_h \in [0, 13]$ are the GT width and height, likewise mapped.
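A small sketch of computing the ground-truth targets $t^g$ from the equations above, with all coordinates already in grid units as described (the function name is mine, for illustration only):

```python
import math

def gt_offsets(gx, gy, gw, gh, anchor_w, anchor_h):
    """Regression targets for one GT box; all inputs already mapped
    to grid units (e.g. [0, 13] on the 13x13 scale)."""
    tx = gx - math.floor(gx)      # center offset inside its grid cell
    ty = gy - math.floor(gy)
    tw = math.log(gw / anchor_w)  # log-scale size relative to the anchor
    th = math.log(gh / anchor_h)
    return tx, ty, tw, th

# Example: GT center (6.3, 4.8), size 5.2x3.1, anchor 3.6x2.8 (grid units)
print(gt_offsets(6.3, 4.8, 5.2, 3.1, 3.6, 2.8))
```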

5. Effects and contributions of YOLOv3

  • Backbone: YOLOv3 uses a stronger backbone network to improve accuracy.
  • Better on small objects: YOLOv3 predicts on multiple branches, with the largest feature map at 52×52, making small objects easier to detect.
  • Accuracy improves further while inference remains real-time, surpassing other models of comparable speed.
  • Compared with Faster R-CNN, accuracy still falls slightly short of two-stage algorithms.

References:

  1. arxiv.org/pdf/1804.02…
  2. blog.csdn.net/qq_37541097…
  3. zhangxu.blog.csdn.net/article/det…