This is the first article in a YOLO series that will cover algorithm principles, code walkthroughs, model deployment, and more. This article is a reader contribution to the public account; readers are welcome to contribute articles on any topic, so that together we can build a community for sharing computer vision techniques.

This article introduces YOLO, a one-stage object detection algorithm, and traces its development from YOLOv1 to YOLOv3.

This article is from the YOLO series of the public account CV Technical Guide.

Follow the public account CV Technical Guide, which focuses on computer vision technology summaries, tracking of the latest techniques, interpretations of classic papers, and CV recruitment information.

YOLO’s design theory

YOLO stands for You Only Look Once. It is a representative one-stage object detector. The other main families are two-stage detectors, such as the R-CNN series, and anchor-free detectors, such as CornerNet and CenterNet.

YOLO achieves end-to-end object detection through a series of convolution operations. It divides the image into an S x S grid, and each grid cell is responsible for detecting the objects that fall into it. The network finally outputs the bounding boxes (position information), the confidence, and the category scores of the detected objects.

The basic process is as follows:

1. First resize the input image to a fixed size.

2. Feed it into the network to obtain the predicted detections.

3. Apply the non-maximum suppression algorithm to filter out redundant detections.

Non-maximum suppression algorithm (NMS)

NMS is not specific to YOLO; most object detection algorithms use it. It mainly solves the problem of one target being detected repeatedly. In the picture below, the same face is detected several times. Every box is a correct detection, but we only need one box, the best one.

NMS achieves this. First, pick the box with the highest confidence among all detection boxes, then traverse the remaining boxes and compute each one's IoU with that highest-confidence box. If the IoU is greater than a given threshold, the overlap is too high and the box is removed. Then repeat the process on the remaining boxes until all boxes have been processed.
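The procedure just described can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code of any particular YOLO release: boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names and the 0.5 threshold are choices made for this example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy NMS."""
    # Candidate indices sorted by confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Remove every remaining box that overlaps the chosen box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Two near-duplicate boxes on one object, plus one box on a second object.
boxes = [(10, 10, 110, 110), (12, 14, 112, 108), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # prints [0, 2]
```

The second box overlaps the first with an IoU of about 0.90, so it is suppressed, while the third box does not overlap at all and survives.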

YOLOv1

The biggest difference between YOLOv1 and later YOLO versions is that YOLOv1 does not use anchor boxes. There is no concept of an anchor in YOLOv1, so strictly speaking YOLOv1 is anchor-free.

Algorithm idea

1. Divide the image into an S x S grid; the paper uses a 7 x 7 grid. If the center of an object falls in a grid cell, that cell is responsible for predicting the object.

2. Each grid cell then predicts B bounding boxes. The paper sets B = 2, so 7 x 7 x 2 boxes are predicted in total, and each box is responsible for predicting the position of an object.

3. Besides the position of the object, each box also carries a confidence prediction. The confidence here indicates whether the box contains an object at all, regardless of which category the object belongs to.

4. Each grid cell also predicts scores for C categories. For example, C = 20 for the VOC dataset.

5. Therefore, for the VOC dataset, YOLOv1 outputs for every grid cell two boxes, each with a position (x, y, w, h) plus a confidence, and 20 category scores, so the output is 7 x 7 x (5 + 5 + 20).

6. The last fully connected layer uses a linear activation function, and all other layers use Leaky ReLU.
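The output size in step 5 is easy to check with a little arithmetic. The snippet below is only a sanity check of the numbers given above; the variable names are chosen for this example.

```python
S, B, C = 7, 2, 20           # grid size, boxes per cell, VOC categories
values_per_box = 5           # x, y, w, h, confidence

per_cell = B * values_per_box + C     # 5 + 5 + 20
output_shape = (S, S, per_cell)
total_boxes = S * S * B

print(output_shape)  # prints (7, 7, 30)
print(total_boxes)   # prints 98
```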

Loss function

The loss function consists of three parts: bounding-box (localization) loss, confidence loss, and category loss. All three parts use a squared-error loss, which the paper calls sum-squared error and which is often loosely referred to as MSE.
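For reference, the loss can be written out as in the YOLOv1 paper. The notation follows the paper rather than this article: the indicator 1_ij^obj is 1 when box j of cell i is responsible for an object, C is the confidence, p_i(c) is the class score, and the paper sets λ_coord = 5 and λ_noobj = 0.5. The first two lines are the box loss, the third is the confidence loss, and the last is the category loss. Note that width and height enter through their square roots, which is the paper's partial remedy for treating large and small boxes alike.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2
  + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```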

An interesting point

Here is an interesting twist: YOLO treats detection as a regression task rather than a classification task. It is worth thinking about why.

Shortcomings

1. Each grid cell predicts only two boxes and only one category; that is, each cell keeps only the prediction with the highest IoU, so if a cell contains several objects, only one of them can be detected. Detection is therefore poor on images with groups of small objects, such as flocks of birds. This design may reduce duplicate detections of the same target, but it also seems to lead to less accurate localization and a low recall rate.

2. When a target with a new or unusual size appears, performance also deteriorates. The reason is that YOLOv1 predicts bounding boxes from relatively simple features and does not use anchors, which makes localization inaccurate.

3. The last output layer is a fully connected layer. Because the input and output dimensions of a fully connected layer are fixed, the input image size must also be fixed, which is limiting to some extent.

4. The squared-error loss gives errors on large and small boxes the same weight. Suppose two predictions are both 25 pixels off from the ground truth: for a large bounding box such an error is usually insignificant or even negligible, but for a small bounding box it has a large impact.
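Point 4 can be made concrete by measuring the IoU after the same 25-pixel shift on a large and a small box. The box sizes (400 and 50 pixels) are made up for this illustration, and the helper names are not from the paper.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def shifted_iou(size, shift):
    """IoU between a size x size box and the same box moved right by `shift` pixels."""
    truth = (0, 0, size, size)
    pred = (shift, 0, size + shift, size)
    return iou(truth, pred)


large = shifted_iou(400, 25)   # 25 px error on a 400 x 400 box
small = shifted_iou(50, 25)    # the same 25 px error on a 50 x 50 box
print(f"{large:.2f} {small:.2f}")  # prints 0.88 0.33
```

The squared error of the center coordinate is identical in both cases, yet the large box is still a good detection while the small one has lost two thirds of its overlap.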

YOLOv2

YOLOv2 is often referred to together with YOLO9000, because the paper trains jointly on the COCO dataset and the ImageNet dataset, which eventually allows more than 9000 categories to be detected.

Algorithm idea

1. Darknet19 is used as the backbone of the network. Darknet19 is similar to VGG: it mostly uses 3 x 3 convolution kernels, and after every pooling step the number of channels is doubled while the width and height of the feature map are halved. The network has 19 convolutional layers, hence the name Darknet19, and 5 max-pooling layers, so the total downsampling factor is 32.

2. Batch Normalization layers are used to speed up convergence during training. With Batch Normalization, Dropout can be removed without causing overfitting; the Batch Normalization paper notes that it has a regularizing effect similar to that of Dropout.

3. Anchor (prior) boxes are used. A K-means-based clustering method extracts the priors automatically from the labels of the dataset, so the anchors can be adapted to different datasets. With k = 5 clusters, the average IoU is already higher than that of anchors chosen without clustering; with k = 9, the average IoU improves more significantly.

4. A higher input resolution is used: it is raised from 224 x 224 to 448 x 448. Data augmentation includes random cropping, rotation, and color (hue), saturation, and exposure shifts.

5. Multi-scale training is used: every 10 batches, the input size is changed at random among 320, 352… 608. All of these sizes are multiples of 32, because Darknet19 downsamples by a factor of 32.

6. A passthrough layer, which like pixel shuffle rearranges values between the spatial and channel dimensions, merges high-level and low-level information so that some detail is retained and small objects are detected better. Specifically, the higher-resolution feature map is split one-into-four so that it matches the size of the pooled feature map; after convolution, the two are concatenated and output together as the final feature map. Using the passthrough layer to capture fine-grained features improves mAP by 1 point.

7. The network removes the last convolution layer of Darknet19 and adds three 3 x 3 convolution layers, each with 1024 convolution kernels, followed by a final 1 x 1 convolution layer that produces the detection output. Each grid cell predicts 5 bounding boxes, one per anchor, so for the VOC dataset each bounding box outputs 5 coordinate-related values and 20 category scores, giving 5 x (5 + 20) = 125 output channels.
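The anchor clustering in step 3 can be sketched as follows. The YOLOv2 paper clusters box sizes with the distance d(box, centroid) = 1 - IoU(box, centroid); everything else here is an assumption made for illustration: the `(width, height)` labels are made up, the centroid update uses the plain mean, and the function names are hypothetical.

```python
import random


def wh_iou(box, centroid):
    """IoU of two (w, h) boxes that share the same top-left corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with the distance 1 - IoU and return k anchors."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        # Assign each box to the centroid with the highest IoU (smallest 1 - IoU).
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda c: wh_iou(box, centroids[c]))
            clusters[best].append(box)
        # Move each centroid to the mean width and height of its cluster.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_centroids.append((w, h))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster unchanged
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


def avg_iou(boxes, centroids):
    """Mean IoU between each box and its best-matching anchor."""
    return sum(max(wh_iou(b, c) for c in centroids) for b in boxes) / len(boxes)


# Made-up labels forming two obvious size groups.
boxes = [(10, 12), (11, 13), (12, 11), (100, 90), (110, 95), (105, 100)]
anchors = kmeans_anchors(boxes, k=2)
print(anchors, round(avg_iou(boxes, anchors), 2))
```

Using 1 - IoU instead of Euclidean distance keeps large boxes from dominating the clustering, which is the reason the paper gives for this choice. Running the same function with several values of k and comparing `avg_iou` mirrors the k = 5 versus k = 9 comparison mentioned in step 3.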

Position prediction

YOLOv2 adds a sigmoid constraint to the position prediction. Without the constraint, the predicted bounding box can shift by any amount in any direction, so the center of a box predicted from one grid cell may fall in the area of another cell. As a result, the box predicted at any position can end up anywhere in the image, which easily makes training unstable; it takes a long time to learn the correct offsets.

YOLOv2 predicts the offset of the bounding-box center relative to the upper-left corner of the corresponding grid cell. To constrain the center to the current cell, a sigmoid function is applied to the offset, so the predicted offset lies in the range (0, 1). Here the side length of each grid cell is taken as the basic unit of 1.
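A minimal sketch of this decoding step, in grid-cell units. The center formulas follow the description above; the width and height formulas (prior size scaled by an exponential) come from the YOLOv2 paper and are not spelled out in this article. The function and argument names are chosen for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, anchor):
    """Decode raw outputs (tx, ty, tw, th) into a box in grid-cell units."""
    tx, ty, tw, th = t
    cx, cy = cell            # upper-left corner of the responsible grid cell
    pw, ph = anchor          # width and height of the anchor (prior) box
    bx = cx + sigmoid(tx)    # the sigmoid keeps the center inside the cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)   # the anchor size is scaled, never made negative
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# Zero raw outputs put the center in the middle of cell (3, 4) at the anchor's size.
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), anchor=(2.0, 1.5)))
# prints (3.5, 4.5, 2.0, 1.5)

# Even large raw offsets cannot push the center out of the cell.
bx, by, _, _ = decode_box((5.0, -5.0, 0.0, 0.0), cell=(3, 4), anchor=(2.0, 1.5))
```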

YOLOv3

Without further ado, one look at this picture gives you a sense of how powerful YOLOv3 is. As if to say: I am not singling anyone out; I mean, everyone here is XX ~

YOLOv3 uses Darknet-53, designed by the YOLO authors, as the backbone network. Darknet-53 borrows the idea of residual networks; compared with ResNet-101 and ResNet-152 it has similar accuracy and higher speed.

For downsampling, convolutions with a stride of 2 replace the traditional pooling operation. For feature fusion, a multi-scale method similar to FPN is introduced to improve the detection of small targets: after upsampling, the feature map is concatenated with the output of an earlier layer, which fuses shallow and deep features and greatly improves YOLOv3's accuracy on small targets. In addition, logistic regression replaces softmax as the classifier in order to handle multi-label classification, for example an object that belongs to both class A and class B.
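The difference between softmax and independent logistic (sigmoid) outputs is easy to see on a toy example. The logits below are made up; the object is meant to belong to both class A and class B.

```python
import math


def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


classes = ["A", "B", "C"]
logits = [4.0, 4.0, -3.0]                     # strong evidence for both A and B

soft = softmax(logits)                        # scores compete and must sum to 1
indep = [sigmoid(z) for z in logits]          # each class is judged on its own
labels = [c for c, p in zip(classes, indep) if p > 0.5]

print([round(p, 2) for p in soft])   # prints [0.5, 0.5, 0.0]
print([round(p, 2) for p in indep])  # prints [0.98, 0.98, 0.05]
print(labels)                        # prints ['A', 'B']
```

With softmax, the two correct classes can never both get a high score because they share one probability mass; with per-class sigmoids, both can be confidently predicted at the same time.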

Algorithm idea

1. The output of YOLOv3 still has three parts: confidence, coordinate information, and classification information. At inference time, the feature map is divided evenly into S x S grid cells, and the cells are filtered with a confidence threshold. If a target falls on a grid cell, that cell is responsible for predicting the confidence, coordinates, and category of the object.

2. The residual idea is used in the backbone, Darknet-53. It is a ResNet-like structure built from repeatedly stacked residual blocks, with no max-pooling layer. Since there is no pooling, downsampling is done by convolutions with a stride of 2. The whole network is therefore built from convolutions, i.e. it is a fully convolutional network. Its basic components are DBL (Conv + BN + Leaky ReLU) and the residual unit RES_Unit.

3. Multi-scale prediction: an FPN-like fusion is used to make predictions on three feature layers of different sizes. Each feature layer has three anchors, so there are 9 anchors in total, up from 5 in YOLOv2.

4. Output dimension for large targets: 13 x 13 x 255, where 255 = (80 + 5) x 3; for medium targets: 26 x 26 x 255; for small targets: 52 x 52 x 255. The 80 categories come from the COCO dataset.

5. A new strategy for matching positive and negative samples is adopted: a prior box whose overlap with the ground truth is high but not the highest is still discarded. Only the box with the highest overlap is taken as the positive sample.

6. The classification loss uses binary cross-entropy (BCE) instead of softmax, because some targets may have overlapping category labels, i.e. multi-label classification. For example, an SUV is both a car and an SUV, while softmax outputs only the single most likely category.
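The output dimensions in step 4 can be reproduced with a short calculation. The 416 x 416 input size and the strides 32, 16, and 8 are assumptions of this sketch (they are the usual values that yield 13, 26, and 52); the article itself only states the resulting shapes.

```python
num_classes = 80             # COCO
anchors_per_scale = 3
channels = anchors_per_scale * (num_classes + 5)   # 5 = x, y, w, h, confidence

input_size = 416             # assumed input resolution
strides = {"large": 32, "medium": 16, "small": 8}  # assumed downsampling factors

shapes = {name: (input_size // s, input_size // s, channels)
          for name, s in strides.items()}
total_boxes = sum(h * w * anchors_per_scale for h, w, _ in shapes.values())

print(shapes["large"], shapes["medium"], shapes["small"])
# prints (13, 13, 255) (26, 26, 255) (52, 52, 255)
print(total_boxes)  # prints 10647
```

The coarsest 13 x 13 map has the largest receptive field and handles large targets, while the 52 x 52 map handles small ones.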

Loss function

BCE is used for the confidence loss and for the category loss, and MSE is used for the localization loss. Positive samples take part in the category, localization, and confidence losses, while negative samples take part only in the confidence loss.

Balance coefficients are added between the three loss terms to balance them.

Why YOLOv3 is fast

YOLOv3's network is deeper than SSD's. Although it uses far fewer anchors than SSD, the greater depth obviously increases the amount of computation, so why is YOLOv3 about 3 times faster than SSD? The backbone of SSD is VGG16, whereas YOLOv3 uses its own Darknet-53, whose structure resembles ResNet. Darknet-53 first applies a 1 x 1 convolution kernel to the features to reduce the channel dimension, and then a 3 x 3 convolution kernel to raise it again. This greatly reduces the number of parameters and the size of the model, somewhat like a low-rank decomposition. In short, the speed comes from many such optimizations, such as replacing fully connected layers with convolutions and using 1 x 1 convolutions to reduce computation.
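A rough weight count shows how much the 1 x 1 reduction saves. The channel width of 256, the halving to 128 inside the block, and the comparison against two plain 3 x 3 layers at full width are assumptions chosen for this illustration; biases and batch-norm parameters are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution, ignoring biases and BN parameters."""
    return c_in * c_out * k * k


c = 256
# Residual-style block: 1 x 1 reduces the channels to c / 2, then 3 x 3 restores c.
bottleneck = conv_params(c, c // 2, 1) + conv_params(c // 2, c, 3)
# Hypothetical baseline: two plain 3 x 3 convolutions at full width.
plain = conv_params(c, c, 3) + conv_params(c, c, 3)

print(bottleneck, plain, round(plain / bottleneck, 1))
# prints 327680 1179648 3.6
```

Under these assumptions the two-layer block needs about 3.6 times fewer weights, and the multiply-accumulate count per output position shrinks by the same factor.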

Paper links

I used to read other people's write-ups of papers on blogs and public accounts, digesting knowledge that others had already chewed over. Later I was advised to read the original papers and work through them from the very beginning; only then can you form your own views and understand the work more thoroughly. So I still recommend reading the original papers. Here are the links:

YOLOv1: https://arxiv.org/abs/1506.02640
YOLOv2: https://arxiv.org/abs/1612.08242
YOLOv3: https://arxiv.org/abs/1804.02767

Hopefully, after reading this article, you have a better understanding of YOLO. I will continue the series with an analysis of YOLOv4, so please stay tuned.

