
Tesla also uses a YOLO-like one-stage object detector, so this is a good opportunity to review it. What is one-stage object detection? It means that candidate box generation, object prediction, and bounding box regression are all done by a single network.

Today I want to write about YOLOv1. In object detection, YOLO is a school of its own, occupying a position a bit like Vue in today's web front-end world, and it has kept surprising us along the way. YOLO is an acronym for You Only Look Once. As the name implies, YOLO is a one-stage, end-to-end object detection framework. YOLOv1 successfully turns the detection problem into a regression problem. There are many articles and videos about YOLO on the internet; I hope mine brings you something different and helps you better understand this popular multi-object detection framework.

Input and output

A neural network model is essentially a function: you feed data in and it gives you an output. In YOLOv1 the input is an image and the output is structured data describing which objects appear in the image and where they are located. Later in the article we will collectively refer to all the object categories we need to detect as the foreground.

The input is 448 × 448 × 3. Because two fully connected layers are attached at the end of the network, and fully connected layers require a fixed input size, the image must be resized to 448 × 448.

The final fully connected layer outputs a 1470 × 1 vector. Reshaping that output gives a 7 × 7 × 30 tensor, so every grid cell has a 30-dimensional output (a small decoding sketch follows the list below).

  • For each predicted bounding box the output vector represents $[C, x, y, w, h]$
  • $x, y, w, h$ carry the position information: the center point of the bounding box and its width and height
  • $C = Pr(\text{Object}) \times IoU_{pred}^{truth}$ is the confidence, where $Pr(\text{Object})$ is the probability that the box contains foreground. Why multiply by an IoU? Because besides judging the foreground correctly, we also want the predicted bounding box to lie close to the ground-truth box
  • $Pr(\text{Class}_i \mid \text{Object})$ is the class probability, conditioned on the grid cell containing a foreground object (foreground here means a target category appears in this cell); it is used to decide which category the detected object belongs to. Because it is a conditional probability, the class-specific score at inference time is $Pr(\text{Class}_i \mid \text{Object}) \times Pr(\text{Object}) \times IoU_{pred}^{truth} = Pr(\text{Class}_i) \times IoU_{pred}^{truth}$
  • Each grid cell is responsible for detecting the objects whose center falls inside it

  • With $S = 7$, $B = 2$ boxes per cell, and $C = 20$ classes, the output size is $S \times S \times (B \times (1 + 4) + C) = 7 \times 7 \times (2 \times 5 + 20) = 7 \times 7 \times 30 = 1470$
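To make this layout concrete, here is a minimal Python/NumPy sketch of splitting one cell's 30 values into its boxes and class probabilities and forming the class-specific scores described above. The field order and variable names are my own assumptions, not the paper's reference implementation:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

def split_cell(cell_vec):
    """Split one 30-dim cell vector into B boxes and C class probabilities.

    Assumed layout (following this article): [C, x, y, w, h] per box, boxes first,
    then the 20 class probabilities. Real implementations may order fields differently.
    """
    boxes = cell_vec[:B * 5].reshape(B, 5)          # each row: [confidence, x, y, w, h]
    class_probs = cell_vec[B * 5:]                  # Pr(Class_i | Object), 20 values
    # class-specific score = Pr(Class_i | Object) * Pr(Object) * IoU = confidence * class prob
    scores = boxes[:, 0:1] * class_probs[None, :]   # shape (B, C)
    return boxes, class_probs, scores

fc_output = np.random.rand(S * S * (B * 5 + C))     # stand-in for the real 1470-dim FC output
grid = fc_output.reshape(S, S, B * 5 + C)           # 7 x 7 x 30
boxes, class_probs, scores = split_cell(grid[3, 4]) # inspect one grid cell
```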

How are the network's output values mapped back onto the image for inference and for comparison with the ground truth? In YOLOv1 the predicted width and height are decimals between 0 and 1 and have to be multiplied by the real width and height of the image:


$$w_{pred} = 0.25 \times w_{img} = 0.25 \times 448 = 112$$

Similarly, the center point of the bounding box is also expressed as a ratio, but it is relative to its grid cell, so decoding it has to take the cell's position into account.
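A small sketch of that decoding step (hedged: the exact cell-offset convention and field order are assumptions carried over from the sketch above):

```python
def decode_box(box, row, col, S=7, img_w=448, img_h=448):
    """Map one predicted [confidence, x, y, w, h] (all between 0 and 1) to pixels.

    Assumption: x, y are the offset of the box center inside grid cell (row, col),
    while w, h are fractions of the whole image, as described above.
    """
    conf, x, y, w, h = box
    cx = (col + x) / S * img_w   # center x in pixels
    cy = (row + y) / S * img_h   # center y in pixels
    bw = w * img_w               # e.g. 0.25 * 448 = 112, as in the example above
    bh = h * img_h
    return conf, cx, cy, bw, bh
```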

The network structure

First, the first 20 convolutional layers of the network are pre-trained on ImageNet, and then 4 convolutional layers and 2 fully connected layers are appended to them. The first 20 layers are therefore initialized from the pre-trained network, while the last 6 layers are randomly initialized and updated during training. In addition, the input size is enlarged from $224 \times 224$ to $448 \times 448$ for detection training, because detection needs finer-grained image information.

YOLOv1 is mainly a 24-layer convolutional structure: a network composed of convolutional layers followed by fully connected layers. Downsampling is done mostly with $2 \times 2$ max-pooling layers of stride 2, plus a stride-2 convolution near the end of the backbone. The activation function used throughout is LeakyReLU; in the YOLOv1 era Batch Normalization was not yet used. After feature extraction through the convolutions, the fully connected layers produce a feature vector that is finally reshaped into the 7 × 7 × 30 output tensor.
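To make the shapes concrete, here is a minimal PyTorch-style sketch of just the fully connected head at the end of the network. The 7 × 7 × 1024 feature-map size and the 4096-unit hidden layer follow the original design, but this is an illustration under those assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20

class YoloV1Head(nn.Module):
    """Detection head only: assumes the convolutional backbone has already
    produced a 7 x 7 x 1024 feature map."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 1024, 4096),
            nn.LeakyReLU(0.1),                       # LeakyReLU, as described above
            nn.Linear(4096, S * S * (B * 5 + C)),    # 1470 outputs
        )

    def forward(self, features):                     # features: (N, 1024, 7, 7)
        out = self.fc(features)
        return out.view(-1, S, S, B * 5 + C)         # reshape to (N, 7, 7, 30)

head = YoloV1Head()
print(head(torch.randn(1, 1024, 7, 7)).shape)        # torch.Size([1, 7, 7, 30])
```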

Loss function

This is true not only for YOLO but also for other multi-object detection frameworks such as Fast R-CNN: to really understand these frameworks, a thorough understanding of their loss functions is required. So today we will go through the YOLOv1 loss function in detail. Unlike the regression and classification problems we have worked on before, where a single, relatively simple loss is enough, object detection has to measure loss from several aspects: localization loss, objectness (confidence) loss, and classification loss.

Localization Loss


$$\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right]$$
  • The squared error is used as the loss, measuring the deviation between the predicted bounding box and the ground-truth box, i.e. the deviation of the center point and the deviation of the width and height
  • $\mathbb{I}_{ij}^{obj}$ indicates whether the $j$-th bounding box of the $i$-th grid cell is responsible for a prediction: it is 1 if so and 0 otherwise, so only boxes in foreground cells contribute to the localization loss


$$\mathbb{I}_{51}^{obj} \times \left[ (0.7 - 0.7)^2 + (0.5 - 0.5)^2 + (0.5 - 0.5)^2 + (0.75 - 0.7)^2 \right]$$

$$\mathbb{I}_{71}^{obj} \times \left[ (0.6 - 0.6)^2 + (0.5 - 0.5)^2 + (0.2 - 0.2)^2 + (0.3 - 0.25)^2 \right]$$

With the formula above we computed the loss for two targets whose sizes differ greatly: a large box and a small one, each with the same absolute deviation of 0.05 in height between prediction and ground truth, which makes them easy to compare. Because the values are expressed as ratios when computing the loss (values between 0 and 1 are convenient for a neural network to learn), this representation does not respond properly to errors on small objects: the same absolute deviation should hurt a small box far more than a large one. To remove this problem, YOLOv1 takes the square root of the width and height:


$$\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right]$$

$$\mathbb{I}_{51}^{obj} \times \left( \sqrt{0.75} - \sqrt{0.7} \right)^2 + \mathbb{I}_{71}^{obj} \times \left( \sqrt{0.3} - \sqrt{0.25} \right)^2 \approx \mathbb{I}_{51}^{obj} \times (0.03)^2 + \mathbb{I}_{71}^{obj} \times (0.048)^2$$
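In code the localization term might look roughly like the sketch below. The shapes and masking are my own simplification, and choosing which of the B boxes in a cell is "responsible" for an object is left out:

```python
import torch

def localization_loss(pred_boxes, target_boxes, obj_mask):
    """Coordinate loss for responsible boxes only.

    pred_boxes, target_boxes: (..., 4) tensors holding [x, y, w, h] in 0..1;
    obj_mask: matching leading shape, 1 where a box is responsible for an object.
    """
    xy_err = (pred_boxes[..., 0] - target_boxes[..., 0]) ** 2 \
           + (pred_boxes[..., 1] - target_boxes[..., 1]) ** 2
    # square roots shrink the gap between large-box and small-box penalties
    wh_err = (pred_boxes[..., 2].sqrt() - target_boxes[..., 2].sqrt()) ** 2 \
           + (pred_boxes[..., 3].sqrt() - target_boxes[..., 3].sqrt()) ** 2
    return (obj_mask * (xy_err + wh_err)).sum()
```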

Objectness (Confidence) Loss


$$\sum_{i=0}^{S^2} \sum_{j=0}^{B} (C_i - \hat{C}_i)^2$$
For example, suppose 9 predicted boxes have the following ground-truth and predicted confidences:

| $i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $C_i$ | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| $\hat{C}_i$ | 0.1 | 0.1 | 0.1 | 0.1 | 0.6 | 0.1 | 0.6 | 0.1 | 0.1 |
| $\lvert C_i - \hat{C}_i \rvert$ | 0.1 | 0.1 | 0.1 | 0.1 | 0.4 | 0.1 | 0.4 | 0.1 | 0.1 |

The loss is then:


$$2 \times (0.4)^2 + 7 \times (0.1)^2 = 0.32 + 0.07$$

In practice YOLOv1 predicts 7 × 7 × 2 boxes: 49 grid cells, each providing two boxes, so 98 boxes in total. The problem is that boxes which do not contain any object account for a large share of the loss, so during training the network pays more attention to the background and tends to ignore the foreground. With 98 boxes and still only two objects, the same calculation becomes:


$$2 \times (0.4)^2 + 96 \times (0.1)^2 = 0.32 + 0.96$$

To fix this, the YOLOv1 designers simply introduce a coefficient $\lambda_{noobj}$ to balance foreground and background learning; based on their experiments it is set to 0.5:


$$\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{obj} (C_i - \hat{C}_i)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{noobj} (C_i - \hat{C}_i)^2$$
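A hedged sketch of this balancing, where `obj_mask` is my own name for the indicator $\mathbb{I}_{ij}^{obj}$ and all inputs are assumed to be PyTorch tensors of matching shape:

```python
import torch

def confidence_loss(pred_conf, target_conf, obj_mask, lambda_noobj=0.5):
    """Objectness loss: full weight for boxes responsible for an object,
    weight 0.5 for the many background boxes so they do not dominate training."""
    sq_err = (pred_conf - target_conf) ** 2
    obj_term = (obj_mask * sq_err).sum()            # boxes matched to an object
    noobj_term = ((1.0 - obj_mask) * sq_err).sum()  # background boxes
    return obj_term + lambda_noobj * noobj_term
```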

Classification Loss


$$\sum_{i=0}^{S^2} \mathbb{I}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2$$

This term is computed per grid cell: only the cells in which an object's center falls contribute, and for those cells the class-probability distribution is learned.
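A corresponding sketch, again with assumed names and shapes:

```python
import torch

def classification_loss(pred_probs, target_probs, cell_obj_mask):
    """Class-probability loss, counted only for grid cells containing an object center.

    pred_probs, target_probs: (S, S, num_classes); cell_obj_mask: (S, S) of 0/1.
    """
    sq_err = ((pred_probs - target_probs) ** 2).sum(dim=-1)  # sum over the classes
    return (cell_obj_mask * sq_err).sum()
```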

Finally, the sum of these three losses is taken as the total loss, and training minimizes this total loss. In the summation a coefficient $\lambda_{coord}$ is placed in front of the localization loss; it is an empirical value of 5, chosen so that the learning process pays more attention to localization accuracy.
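Putting the pieces together, the total loss can be assembled schematically from the sketches above (this combination mirrors the weighting described here, not the paper's exact code):

```python
LAMBDA_COORD = 5.0  # empirical weight that makes training care more about localization

def total_loss(loc_loss, conf_loss, cls_loss):
    """Weighted sum of the three components sketched above.

    conf_loss is assumed to already include the 0.5 no-object weighting shown earlier.
    """
    return LAMBDA_COORD * loc_loss + conf_loss + cls_loss
```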

Existing problems

  • When the centers of two targets fall in the same grid cell, YOLOv1 has trouble learning. For example, when the center points of a car and a person fall in the same cell, YOLOv1 cannot predict both correctly. Cases like this are in fact quite common, so this shortcoming is a problem YOLO had to solve; in YOLOv3 you will see how the YOLO designers solved it step by step
  • Each grid cell can detect at most one object, so YOLOv1 also struggles when the centers of multiple objects of the same category fall in the same cell
  • These problems are caused by YOLOv1's design itself, so they cannot be solved by tuning parameters
  • YOLOv1 needs a lot of data to learn; for targets whose bounding-box aspect ratios it has never seen in the training data, detection performance will be poor

  • In the error-analysis comparison between YOLOv1 and Fast R-CNN, YOLOv1's rate of misjudging background as foreground is much lower than Fast R-CNN's. YOLOv1 uses only 98 candidate boxes for prediction, while Fast R-CNN uses about 2000, so it is not hard to see why Fast R-CNN more easily mistakes background for foreground
  • In foreground detection ability, YOLOv1 is slightly below Fast R-CNN, for the same reason: Fast R-CNN uses far more candidate boxes for detection than YOLOv1, so its detection performance is naturally better
  • In terms of localization accuracy, the measure is the number of correctly classified detections whose predicted box has an IoU with the ground-truth box between 0.1 and 0.5: the more such detections, the worse the localization. Here YOLOv1 needs more work than Fast R-CNN

Finally, we look at the comparison between Faster R-CNN and YOLOv1 on the PASCAL VOC 2007 dataset. YOLO processes 45 frames per second, far more than Faster R-CNN's 18 frames per second, so in terms of speed YOLO had an absolute advantage at the time.