YOLO v1

Published at CVPR in 2016. With a 448×448 image input it runs at 45 FPS and reaches 63.4 mAP, faster than SSD and Faster R-CNN but less accurate than Faster R-CNN

Main idea

  1. Divide the image into an S×S grid of cells. If the center of an object falls inside a cell, that cell is responsible for predicting the object
  2. Each cell predicts B bounding boxes and the scores of C categories; in addition to its position, each bounding box carries a predicted confidence value (see the sketch below)
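
A minimal NumPy sketch of how the v1 output tensor is laid out, assuming the paper's settings S=7, B=2, C=20 (PASCAL VOC); the array names are illustrative only:

```python
import numpy as np

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes, as in the v1 paper
pred = np.random.rand(S, S, B * 5 + C)   # network output: 7 x 7 x 30

cell = pred[3, 4]                        # predictions of one grid cell
boxes = cell[:B * 5].reshape(B, 5)       # each row: x, y, w, h, confidence
class_probs = cell[B * 5:]               # C conditional class probabilities for this cell

# class-specific score used at test time: confidence * P(class | object)
scores = boxes[:, 4:5] * class_probs     # shape (B, C)
```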

The network structure





Loss function

Limitations

  • Poor detection of small objects that appear in groups, since each grid cell predicts only a few boxes and a single class
  • Detection quality degrades for objects with unusual scales or aspect ratios
  • Inaccurate localization

YOLO v2

Published at CVPR in 2017, it uses Darknet-19 as the backbone

Improvements tried in the paper

  • Batch Normalization. Regularizes the model and helps avoid overfitting; with BN layers, dropout can be removed, and mAP improves by 2%
  • High Resolution Classifier. The classifier is fine-tuned with a 448×448 input size, which improves mAP by 4%
  • Anchor Boxes. Predicting offsets relative to anchor boxes instead of direct coordinates as in YOLO v1 simplifies bounding box prediction and makes the network easier to train. Compared with not using anchor boxes, mAP is slightly lower, but recall increases by 7%
  • Dimension Clusters. K-means clustering on the bounding boxes of the training set automatically finds suitable priors (see the sketch after this list)
  • Direct location prediction. Training is more stable because the coordinate range of the predicted target center is constrained to its grid cell
  • Fine-Grained Features. A passthrough layer fuses high- and low-resolution feature maps, improving small-object detection
  • Multi-Scale Training. The input size is changed periodically during training, which improves robustness
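
A minimal sketch of the dimension-cluster idea, assuming ground-truth boxes are given as (width, height) pairs; it only illustrates k-means with a 1 − IoU distance, not the paper's exact implementation:

```python
import numpy as np

def wh_iou(wh, centers):
    """IoU between (w, h) box shapes and cluster centers, with top-left corners aligned."""
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = wh[:, 0:1] * wh[:, 1:2] + (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(wh, k=5, iters=100):
    """Cluster box shapes using the distance d = 1 - IoU instead of Euclidean distance."""
    centers = wh[np.random.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, centers), axis=1)   # nearest center = largest IoU
        centers = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
                            for i in range(k)])
    return centers

# usage: wh is an (N, 2) array of ground-truth widths/heights from the training set
# anchors = kmeans_anchors(wh, k=5)
```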

The network structure

YOLO v3

Released in 2018, it uses Darknet-53 as the backbone, and 3×3 convolution layers with stride 2 replace the pooling layers used for down-sampling
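
A PyTorch sketch of the kind of block this refers to; the Conv-BN-LeakyReLU pattern follows Darknet, but the channel numbers are illustrative:

```python
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size, stride):
    """Conv + BN + LeakyReLU block used throughout Darknet-53."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

# down-sampling is done by a 3x3 convolution with stride 2 instead of a pooling layer
downsample = conv_bn_leaky(64, 128, kernel_size=3, stride=2)  # halves the spatial resolution
```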

The network structure

Prediction of target bounding box
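
YOLO v3 keeps the YOLO v2 parameterization: for an anchor (prior) of size (p_w, p_h) in the cell whose top-left corner is (c_x, c_y), the network predicts offsets t_x, t_y, t_w, t_h, which are decoded as

b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h}

The sigmoid keeps the predicted center inside its grid cell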

Matching of positive and negative samples

For each ground-truth object, the anchor with the largest IoU is selected as the positive sample; anchors whose IoU exceeds 0.5 but is not the largest are discarded, i.e. treated as neither positive nor negative samples

Loss calculation









YOLO v3 SPP

Mosaic data augmentation

Four images are stitched into a single image and used as one training sample, as sketched after the list below

  • Increase the diversity of data
  • Increase the number of targets
  • BN effectively computes its statistics over several images at once
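
A simplified sketch of the stitching step, assuming four equally sized images; the real implementation also picks a random split point and remaps the bounding-box labels:

```python
import numpy as np

def simple_mosaic(imgs):
    """Stitch four H x W x 3 images of equal size into one 2H x 2W mosaic (labels omitted)."""
    h, w = imgs[0].shape[:2]
    canvas = np.zeros((2 * h, 2 * w, 3), dtype=imgs[0].dtype)
    canvas[:h, :w] = imgs[0]   # top-left
    canvas[:h, w:] = imgs[1]   # top-right
    canvas[h:, :w] = imgs[2]   # bottom-left
    canvas[h:, w:] = imgs[3]   # bottom-right
    return canvas
```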

SPP module

Fuses features at different scales: the input feature map is concatenated with max-pooled copies of itself computed with several kernel sizes
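
A PyTorch sketch of such an SPP block; kernel sizes 5, 9 and 13 are the commonly used choice and should be treated as an assumption here:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Concatenate the input with max-pooled versions of itself at several kernel sizes."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        # stride 1 plus padding keeps the spatial size, so all branches can be concatenated
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# usage: an input with C channels comes out with 4 * C channels
# y = SPP()(torch.randn(1, 512, 19, 19))
```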

The network structure

Localization (regression) loss

IoU Loss


IoU = \frac{Intersection(boxA, boxB)}{Union(boxA, boxB)}

L_{IoU} = -\ln(IoU)

Advantages:

  • Reflects the degree of overlap between the boxes better than separate coordinate losses
  • It is scale invariant

Disadvantages:

  • When the two boxes do not intersect, IoU is 0, so the loss cannot reflect how far apart they are and provides no usable gradient
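
A small numeric sketch, assuming boxes in (x1, y1, x2, y2) corner format; the eps term is only there to keep the logarithm finite:

```python
import math

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def iou_loss(box_a, box_b, eps=1e-7):
    """L_IoU = -ln(IoU)."""
    return -math.log(iou(box_a, box_b) + eps)
```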

GIoU Loss


GIoU = IoU - \frac{A^c - u}{A^c}, \quad -1 \le GIoU \le 1

L_{GIoU} = 1 - GIoU, \quad 0 \le L_{GIoU} \le 2

Where A^c is the area of the smallest rectangle enclosing boxA and boxB, and u is the area of the union of boxA and boxB
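
The same kind of sketch extended to GIoU, again assuming (x1, y1, x2, y2) boxes:

```python
def giou_loss(box_a, box_b):
    """GIoU loss for two boxes given as (x1, y1, x2, y2)."""
    # intersection and union
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing rectangle, area A^c
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    a_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (a_c - union) / a_c
    return 1.0 - giou
```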

DIoU Loss

Disadvantages of L_{IoU} and L_{GIoU}:

  • Slow convergence
  • Inaccurate regression

The DIoU loss directly minimizes the normalized distance between the centers of the two boxes, and therefore converges faster


DIoU = IoU - \frac{\rho^2(b, b^{gt})}{c^2} = IoU - \frac{d^2}{c^2}, \quad -1 \le DIoU \le 1

L_{DIoU} = 1 - DIoU, \quad 0 \le L_{DIoU} \le 2

Where \rho(b, b^{gt}) = d is the distance between the centers of the predicted box b and the ground-truth box b^{gt}, and c is the diagonal length of the smallest rectangle enclosing both boxes
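
A corresponding DIoU sketch with the normalized center-distance penalty:

```python
def diou_loss(box_a, box_b):
    """DIoU loss for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou = inter / (area_a + area_b - inter)
    # squared distance d^2 between the two box centers
    d2 = ((box_a[0] + box_a[2]) / 2 - (box_b[0] + box_b[2]) / 2) ** 2 + \
         ((box_a[1] + box_a[3]) / 2 - (box_b[1] + box_b[3]) / 2) ** 2
    # squared diagonal c^2 of the smallest enclosing rectangle
    c2 = (max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])) ** 2 + \
         (max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])) ** 2
    return 1.0 - (iou - d2 / c2)
```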

CIoU Loss

An excellent regression locating loss should take into account three geometric parameters:

  • Overlapping area
  • Center distance
  • Aspect ratio

CIoU = IoU - \left(\frac{\rho^2(b, b^{gt})}{c^2} + \alpha \upsilon\right)

\upsilon = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2

\alpha = \frac{\upsilon}{(1 - IoU) + \upsilon}

L_{CIoU} = 1 - CIoU
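
A CIoU sketch that adds the aspect-ratio term on top of the DIoU penalty; box_b plays the role of the ground-truth box here:

```python
import math

def ciou_loss(box_a, box_b, eps=1e-7):
    """CIoU loss: IoU term + normalized center distance + aspect-ratio consistency."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    w_a, h_a = box_a[2] - box_a[0], box_a[3] - box_a[1]
    w_b, h_b = box_b[2] - box_b[0], box_b[3] - box_b[1]
    iou = inter / (w_a * h_a + w_b * h_b - inter)
    # normalized center distance (the DIoU term)
    d2 = ((box_a[0] + box_a[2]) / 2 - (box_b[0] + box_b[2]) / 2) ** 2 + \
         ((box_a[1] + box_a[3]) / 2 - (box_b[1] + box_b[3]) / 2) ** 2
    c2 = (max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])) ** 2 + \
         (max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])) ** 2
    # aspect-ratio consistency term v and its weight alpha
    v = 4 / math.pi ** 2 * (math.atan(w_b / h_b) - math.atan(w_a / h_a)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1.0 - (iou - d2 / c2 - alpha * v)
```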

Focal Loss

In a one-stage detection model, positive and negative samples are severely imbalanced


FL(p_t) = -\alpha_t(1 - p_t)^\gamma \ln(p_t)

This focuses training more on the hard samples
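
A minimal PyTorch sketch of the binary focal loss; α = 0.25 and γ = 2 are the defaults from the focal loss paper:

```python
import torch

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss; p holds predicted probabilities, target holds 0/1 labels."""
    p_t = torch.where(target == 1, p, 1 - p)                 # probability of the true class
    alpha_t = torch.where(target == 1,
                          torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-7))).mean()
```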