YOLO v1

Published at CVPR in 2016. With a 448×448 image input it runs at 45 FPS and reaches 63.4 mAP, faster than SSD and Faster R-CNN but less accurate than Faster R-CNN

Main idea

  1. Divide the image into an S×S grid of cells. If the center of an object falls inside a cell, that cell is responsible for predicting the object
  2. Each cell predicts B bounding boxes and the scores of C categories; in addition to its position, each bounding box carries a predicted confidence value (see the sketch below)
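
A minimal NumPy sketch of how the v1 output tensor is laid out, assuming the paper's settings S=7, B=2, C=20 (PASCAL VOC); the array names are illustrative only:

```python
import numpy as np

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes, as in the v1 paper
pred = np.random.rand(S, S, B * 5 + C)   # network output: 7 x 7 x 30

cell = pred[3, 4]                        # predictions of one grid cell
boxes = cell[:B * 5].reshape(B, 5)       # each row: x, y, w, h, confidence
class_probs = cell[B * 5:]               # C conditional class probabilities for this cell

# class-specific score used at test time: confidence * P(class | object)
scores = boxes[:, 4:5] * class_probs     # shape (B, C)
```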

The network structure





Loss function

Limitations

  • Poor detection of small objects that appear in groups, since each grid cell predicts only a few boxes and a single class
  • Detection quality degrades for objects with unusual scales or aspect ratios
  • Inaccurate localization

YOLO v2

Published at CVPR in 2017, it uses Darknet-19 as the backbone

Improvements tried in the paper

  • Batch Normalization. Regularizes the model and helps avoid overfitting; with BN layers, dropout can be removed, and mAP improves by 2%
  • High Resolution Classifier. The classifier is fine-tuned with a 448×448 input size, which improves mAP by 4%
  • Anchor Boxes. Predicting offsets relative to anchor boxes instead of direct coordinates as in YOLO v1 simplifies bounding box prediction and makes the network easier to train. Compared with not using anchor boxes, mAP is slightly lower, but recall increases by 7%
  • Dimension Clusters. K-means clustering on the bounding boxes of the training set automatically finds suitable priors (see the sketch after this list)
  • Direct location prediction. Training is more stable because the coordinate range of the predicted target center is constrained to its grid cell
  • Fine-Grained Features. A passthrough layer fuses high- and low-resolution feature maps, improving small-object detection
  • Multi-Scale Training. The input size is changed periodically during training, which improves robustness
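
A minimal sketch of the dimension-cluster idea, assuming ground-truth boxes are given as (width, height) pairs; it only illustrates k-means with a 1 − IoU distance, not the paper's exact implementation:

```python
import numpy as np

def wh_iou(wh, centers):
    """IoU between (w, h) box shapes and cluster centers, with top-left corners aligned."""
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = wh[:, 0:1] * wh[:, 1:2] + (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(wh, k=5, iters=100):
    """Cluster box shapes using the distance d = 1 - IoU instead of Euclidean distance."""
    centers = wh[np.random.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, centers), axis=1)   # nearest center = largest IoU
        centers = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
                            for i in range(k)])
    return centers

# usage: wh is an (N, 2) array of ground-truth widths/heights from the training set
# anchors = kmeans_anchors(wh, k=5)
```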

The network structure

YOLO v3

Released in 2018, it uses Darknet-53 as the backbone, and 3×3 convolution layers with stride 2 replace the pooling layers used for down-sampling
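
A PyTorch sketch of the kind of block this refers to; the Conv-BN-LeakyReLU pattern follows Darknet, but the channel numbers are illustrative:

```python
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size, stride):
    """Conv + BN + LeakyReLU block used throughout Darknet-53."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

# down-sampling is done by a 3x3 convolution with stride 2 instead of a pooling layer
downsample = conv_bn_leaky(64, 128, kernel_size=3, stride=2)  # halves the spatial resolution
```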

The network structure

Prediction of target bounding box
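
YOLO v3 keeps the YOLO v2 parameterization: for an anchor (prior) of size (p_w, p_h) in the cell whose top-left corner is (c_x, c_y), the network predicts offsets t_x, t_y, t_w, t_h, which are decoded as

b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h}

The sigmoid keeps the predicted center inside its grid cell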

Matching of positive and negative samples

For each ground-truth object, the anchor with the largest IoU is selected as the positive sample; anchors whose IoU exceeds 0.5 but is not the largest are discarded, i.e. treated as neither positive nor negative samples

Loss calculation









YOLO v3 SPP

Mosaic data augmentation

Four images are stitched into a single image and used as one training sample, as sketched after the list below

  • Increase the diversity of data
  • Increase the number of targets
  • BN effectively computes its statistics over several images at once
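
A simplified sketch of the stitching step, assuming four equally sized images; the real implementation also picks a random split point and remaps the bounding-box labels:

```python
import numpy as np

def simple_mosaic(imgs):
    """Stitch four H x W x 3 images of equal size into one 2H x 2W mosaic (labels omitted)."""
    h, w = imgs[0].shape[:2]
    canvas = np.zeros((2 * h, 2 * w, 3), dtype=imgs[0].dtype)
    canvas[:h, :w] = imgs[0]   # top-left
    canvas[:h, w:] = imgs[1]   # top-right
    canvas[h:, :w] = imgs[2]   # bottom-left
    canvas[h:, w:] = imgs[3]   # bottom-right
    return canvas
```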

SPP module

Fuses features at different scales: the input feature map is concatenated with max-pooled copies of itself computed with several kernel sizes
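
A PyTorch sketch of such an SPP block; kernel sizes 5, 9 and 13 are the commonly used choice and should be treated as an assumption here:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Concatenate the input with max-pooled versions of itself at several kernel sizes."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        # stride 1 plus padding keeps the spatial size, so all branches can be concatenated
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# usage: an input with C channels comes out with 4 * C channels
# y = SPP()(torch.randn(1, 512, 19, 19))
```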

The network structure

Localization (regression) loss

IoU Loss


IoU = \frac{Intersection(boxA, boxB)}{Union(boxA, boxB)}

L_{IoU} = -\ln(IoU)

Advantages:

  • Reflects the degree of overlap between the boxes better than separate coordinate losses
  • It is scale invariant

Disadvantages:

  • When the two boxes do not intersect, IoU is 0, so the loss cannot reflect how far apart they are and provides no usable gradient
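
A small numeric sketch, assuming boxes in (x1, y1, x2, y2) corner format; the eps term is only there to keep the logarithm finite:

```python
import math

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def iou_loss(box_a, box_b, eps=1e-7):
    """L_IoU = -ln(IoU)."""
    return -math.log(iou(box_a, box_b) + eps)
```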

GIoU Loss


GIoU = IoU - \frac{A^c - u}{A^c}, \quad -1 \le GIoU \le 1

L_{GIoU} = 1 - GIoU, \quad 0 \le L_{GIoU} \le 2

Where A^c is the area of the smallest rectangle enclosing boxA and boxB, and u is the area of the union of boxA and boxB
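
The same kind of sketch extended to GIoU, again assuming (x1, y1, x2, y2) boxes:

```python
def giou_loss(box_a, box_b):
    """GIoU loss for two boxes given as (x1, y1, x2, y2)."""
    # intersection and union
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing rectangle, area A^c
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    a_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (a_c - union) / a_c
    return 1.0 - giou
```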

DIoU Loss

Disadvantages of L_{IoU} and L_{GIoU}:

  • Slow convergence
  • Inaccurate regression

The DIoU loss directly minimizes the normalized distance between the centers of the two boxes, and therefore converges faster


DIoU = IoU - \frac{\rho^2(b, b^{gt})}{c^2} = IoU - \frac{d^2}{c^2}, \quad -1 \le DIoU \le 1

L_{DIoU} = 1 - DIoU, \quad 0 \le L_{DIoU} \le 2

Where \rho(b, b^{gt}) = d is the distance between the centers of the predicted box b and the ground-truth box b^{gt}, and c is the diagonal length of the smallest rectangle enclosing both boxes
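
A corresponding DIoU sketch with the normalized center-distance penalty:

```python
def diou_loss(box_a, box_b):
    """DIoU loss for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou = inter / (area_a + area_b - inter)
    # squared distance d^2 between the two box centers
    d2 = ((box_a[0] + box_a[2]) / 2 - (box_b[0] + box_b[2]) / 2) ** 2 + \
         ((box_a[1] + box_a[3]) / 2 - (box_b[1] + box_b[3]) / 2) ** 2
    # squared diagonal c^2 of the smallest enclosing rectangle
    c2 = (max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])) ** 2 + \
         (max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])) ** 2
    return 1.0 - (iou - d2 / c2)
```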

CIoU Loss

An excellent regression locating loss should take into account three geometric parameters:

  • Overlapping area
  • Center distance
  • Aspect ratio

CIoU = IoU - \left(\frac{\rho^2(b, b^{gt})}{c^2} + \alpha \upsilon\right)

\upsilon = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2

\alpha = \frac{\upsilon}{(1 - IoU) + \upsilon}

L_{CIoU} = 1 - CIoU
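
A CIoU sketch that adds the aspect-ratio term on top of the DIoU penalty; box_b plays the role of the ground-truth box here:

```python
import math

def ciou_loss(box_a, box_b, eps=1e-7):
    """CIoU loss: IoU term + normalized center distance + aspect-ratio consistency."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    w_a, h_a = box_a[2] - box_a[0], box_a[3] - box_a[1]
    w_b, h_b = box_b[2] - box_b[0], box_b[3] - box_b[1]
    iou = inter / (w_a * h_a + w_b * h_b - inter)
    # normalized center distance (the DIoU term)
    d2 = ((box_a[0] + box_a[2]) / 2 - (box_b[0] + box_b[2]) / 2) ** 2 + \
         ((box_a[1] + box_a[3]) / 2 - (box_b[1] + box_b[3]) / 2) ** 2
    c2 = (max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])) ** 2 + \
         (max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])) ** 2
    # aspect-ratio consistency term v and its weight alpha
    v = 4 / math.pi ** 2 * (math.atan(w_b / h_b) - math.atan(w_a / h_a)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1.0 - (iou - d2 / c2 - alpha * v)
```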

Focal Loss

In a one-stage detection model, positive and negative samples are severely imbalanced


FL(p_t) = -\alpha_t(1 - p_t)^\gamma \ln(p_t)

This focuses training more on the hard samples
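
A minimal PyTorch sketch of the binary focal loss; α = 0.25 and γ = 2 are the defaults from the focal loss paper:

```python
import torch

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss; p holds predicted probabilities, target holds 0/1 labels."""
    p_t = torch.where(target == 1, p, 1 - p)                 # probability of the true class
    alpha_t = torch.where(target == 1,
                          torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-7))).mean()
```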