
Hello everyone, I am Yufeng, from the public account "Yufeng code word". Welcome to follow and chat.

In this article I will walk through the principles of YOLOv1 to YOLOv3 and the implementation of YOLOv3, so that one article gives you the full context of the YOLO family. I hope you find it useful.

Contents

YOLOv1

YOLOv1 structure

YOLOv1 loss function

YOLOv2

Major improvements of YOLOv2 over YOLOv1

The Anchor mechanism

YOLOv3

The improvement of YOLOv3

YOLOv3 code combat

1. Annotation of data set

2. Data preprocessing

3. Training and testing

YOLO series summary

YOLOv1

The YOLOv1 algorithm is the foundation of the YOLO series; understanding YOLOv1 makes the later versions much easier to follow.

YOLOv1 structure

First, we need to understand the network structure of YOLO, as shown in Figure 1.

In fact, the network structure is fairly simple: convolutional layers, pooling operations, and fully connected layers.

**We mainly need to understand the mapping relationship between input and output; the network in the middle is only a tool for finding this mapping.** The input of the network is a 448*448*3 color image, while the output is a 7*7*30 multidimensional vector. This mapping is the basis of YOLOv1 and is explained in detail below.

Figure 1

The input and output of YOLO are shown in Figure 2. On the left is an image, and the circle in the middle can be regarded as the target object. When the image is fed into the network, the first thing YOLOv1 does is divide it into a 7*7 grid; the yellow box represents the ground-truth bounding box of the object.

The most important concept here is that when the center of an object falls inside a grid cell, that cell is responsible for predicting the object. This is the foundation of YOLOv1.

Each grid cell generates two prediction boxes in advance, so YOLOv1 produces a total of 7*7*2 = 98 prediction boxes. Compared with the hundreds or thousands of proposals of Faster R-CNN, YOLO has far fewer boxes, which is one reason for its high speed.

Each grid cell corresponds to a 30-dimensional vector, obtained from 2*5+20, where 20 is the number of categories. It is 20 here because the original paper detects 20 object classes; if your own data set has n categories, the 20 becomes n.

The 2 represents the two boxes, since each cell generates two prediction boxes, and the 5 represents the five parameters of each box: the center coordinates (x, y), the width w, the height h, and the confidence of the box. The confidence is defined as Pr(Object) * IOU between the predicted box and the ground truth. The prediction box with the higher confidence is selected as the cell's predicted bounding box.

The output of the network is a 7*7*30 tensor, and there is a mathematical mapping between it and the input; the YOLO network in the middle is only a tool for finding this mapping. Next, we will focus on YOLO's loss function.

Figure 2 mapping between input and output
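To make the 7*7*30 layout concrete, here is a minimal NumPy sketch (variable names are my own, not from the paper) that slices one grid cell's 30-dimensional vector into its two boxes and 20 class probabilities, assuming the [box1, box2, class probabilities] ordering described above.

```python
import numpy as np

S, B, C = 7, 2, 20                        # grid size, boxes per cell, classes (VOC)
output = np.random.rand(S, S, B * 5 + C)  # stand-in for the 7*7*30 network output

def parse_cell(cell_vec):
    """Split one cell's 30-dim vector into its boxes and class probabilities.
    Assumes the layout [box1(5), box2(5), class_probs(20)]."""
    boxes = cell_vec[:B * 5].reshape(B, 5)  # each row: x, y, w, h, confidence
    class_probs = cell_vec[B * 5:]          # 20 conditional class probabilities
    return boxes, class_probs

boxes, class_probs = parse_cell(output[3, 4])  # the cell in row 3, column 4
best_box = boxes[np.argmax(boxes[:, 4])]       # keep the box with higher confidence
print(best_box, class_probs.argmax())
```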

YOLOv1 loss function

The loss function can be roughly divided into three parts. The first is the coordinate loss, covering x, y, w and h of the box.

The second is the confidence loss of the object.

The third is the category loss of the object.

The loss function corresponds to the 7*7*30 output tensor and is the "mathematical expression" of the error in the mapping between inputs and outputs.

Figure 3 Loss function of YOLOv1

First let’s look at the coordinate loss function, as shown in Figure 4.

The meaning of each parameter is shown in the figure. The square root is applied to the width and height because, after the square root, the loss contribution of a large object is much closer to that of a small object, so the overall loss is not dominated by large objects. Without the square root, the width/height loss of large objects would be far larger than that of small objects, and the network would fit large objects accurately while ignoring small ones.

The coefficient in front of this term is a hyperparameter, set to 5 (λ_coord in the paper). In object detection, the objects we need to detect occupy far less of the image than the background, so this hyperparameter is added to balance the influence of the "non-object" regions on the result.

Figure 4 Coordinate loss function

The confidence loss function is shown in Figure 5, along with the meaning of each parameter.

Why include a confidence term for the "no-object" case? Because if the network learns to classify n kinds of objects, it really has to learn n + 1 categories, where the extra one is the background as opposed to a real object. The background makes up a large proportion of the image, so the network must also learn this class to guarantee its accuracy.

So why is there a hyperparameter in front of the no-object confidence term?

Again, because the target objects are very small relative to the "non-object" background. Without this hyperparameter (λ_noobj = 0.5), the confidence loss of the "non-object" cells would be very large and carry too much weight, so the network would mostly learn background features and ignore the features of the target objects.

Figure 5 Confidence loss function

Finally, there is the category loss function, as shown in Figure 6. The category loss is simply a squared difference between the predicted and true class probabilities, which is a crude choice and one of the weak points of YOLOv1; it is changed in later versions.

Figure 6 Category loss function
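Putting the three parts together, here is a simplified toy version of the per-cell loss (my own sketch, not the paper's full implementation): coordinate loss with the square roots and λ_coord = 5, a down-weighted confidence term with λ_noobj = 0.5 for cells that contain no object, and the squared class-probability error.

```python
import numpy as np

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def yolo_v1_cell_loss(pred, target, has_object):
    """Toy per-cell loss. pred/target: dicts with keys x, y, w, h, conf, classes.
    has_object: whether a ground-truth object center falls in this cell."""
    if has_object:
        coord = (pred["x"] - target["x"]) ** 2 + (pred["y"] - target["y"]) ** 2
        coord += (np.sqrt(pred["w"]) - np.sqrt(target["w"])) ** 2
        coord += (np.sqrt(pred["h"]) - np.sqrt(target["h"])) ** 2
        conf = (pred["conf"] - target["conf"]) ** 2
        cls = np.sum((np.asarray(pred["classes"]) - np.asarray(target["classes"])) ** 2)
        return LAMBDA_COORD * coord + conf + cls
    # cells without an object only contribute a down-weighted confidence term
    return LAMBDA_NOOBJ * pred["conf"] ** 2
```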

In conclusion, the advantage of YOLO is its speed, while the disadvantages of YOLOv1 are obvious:

  1. Detection of crowded objects is poor: the centers of crowded objects may fall inside the same grid cell, so that one cell would have to predict two objects, which it cannot do well.

  2. Detection of small objects is poor. Although the hyperparameters and the square root are used to balance the loss, the loss of small objects still accounts for a small proportion, and the network mainly learns the features of large objects.

  3. Objects with unusual shapes or aspect ratios are detected poorly.

  4. There is no batch normalization.

YOLOv2

Major improvements of YOLOv2 over YOLOv1

The first improvement of YOLOv2 is the network: Darknet-19 replaces the GoogLeNet-style network of YOLOv1. The main change is that the fully connected layers are removed and replaced by convolutions and a softmax.

The second improvement of YOLOv2 is the addition of Batch Normalization, which improves convergence and removes the reliance on other forms of regularization.

The third improvement of YOLOv2 is the High Resolution Classifier: the classification network is first fine-tuned for 10 epochs on ImageNet at the full 448*448 resolution, which gives the network time to adjust its filters to work better on higher-resolution input; the detection network is then fine-tuned on top of it. This high-resolution classifier improves mAP by almost 4%.

The fourth improvement of YOLOv2 is Multi-Scale Training, which lets the same network predict well at different input sizes and therefore detect at different resolutions. With a smaller input image it runs faster; with a larger input image it is more accurate.
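As a small illustration of the idea (a sketch only; the size list follows the paper: multiples of 32 from 320 to 608, re-sampled every 10 batches):

```python
import random

# YOLOv2 randomly picks a new input resolution every few batches;
# the sizes are multiples of 32 because the network downsamples by 32.
SIZES = list(range(320, 609, 32))   # 320, 352, ..., 608

def pick_input_size(batch_index, every=10):
    """Return the training resolution for this batch (re-sampled every `every` batches)."""
    random.seed(batch_index // every)  # keep the size fixed within each 10-batch window
    return random.choice(SIZES)

print([pick_input_size(i) for i in range(0, 40, 10)])
```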

The Anchor mechanism

**The fifth improvement of YOLOv2 is the addition of the Anchor mechanism.** This is the most important one and the one this article will focus on.

First of all, we need to understand what the anchor mechanism is: several virtual boxes (anchors) are preset in advance, and the final prediction box is then obtained by regressing offsets from these anchors.

In YOLOv2, the k-means algorithm is used to generate the anchor boxes, as shown in Figure 7. When k = 5, the model complexity and recall reach a good balance, so YOLOv2 uses 5 anchor boxes.

Figure 7.
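The clustering itself is simple: run k-means on the (width, height) pairs of the labeled boxes, using 1 − IOU as the distance. Below is a minimal sketch of that idea (my own simplified version, not the code from the paper or any particular repository):

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IOU between boxes and anchors compared by width/height only (both centered at the origin).
    boxes: (N, 2), anchors: (k, 2)."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + anchors[None, :, 0] * anchors[None, :, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100):
    """k-means with distance = 1 - IOU; assigning by max IOU is equivalent."""
    anchors = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        anchors = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i) else anchors[i]
                            for i in range(k)])
    return anchors

boxes = np.abs(np.random.randn(200, 2)) + 0.1  # stand-in (width, height) pairs
print(kmeans_anchors(boxes, k=5))
```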

Compare the output of YOLOv1 with that of YOLOv2, as shown in Figure 8.

YOLOv1 outputs a 7*7*30 tensor, where 7*7 is the resolution: the original image is divided into a 7*7 grid, and each cell corresponds to a vector of 30 parameters. Each vector contains two bboxes, and each bbox has five parameters: the center coordinates (x, y) of the bbox, its width and height, and the bbox confidence; the remaining 20 values are category probabilities.

The output of YOLOv2 is a 13*13*5*25 tensor, where 13*13 is the resolution: the network divides the input image into a 13*13 grid, and each cell corresponds to a vector of 5*25 = 125 parameters. The 5 represents the 5 anchor boxes, and each anchor box carries 25 parameters: the center coordinates (x, y) of the bbox, its width and height, the bbox confidence, and the remaining 20 category probabilities.

The benefit for YOLOv2 is that one region can predict multiple labels: for example, a "person" target can belong to the label "person", but also "male" or "female", or "teacher", "student", "worker" and so on, whereas YOLOv1 can only predict one category per cell. The main change that follows is in how the loss for the four position parameters of the bbox is calculated.

Figure 8 output comparison

First, let's understand the relationship among the anchor bbox, the predicted bbox and the ground-truth bbox.

As shown in Figure 9, the red box is the anchor bbox, the blue box is the predicted bbox, and the green box is the ground-truth bbox.

What we want is for the anchor bbox to be close to the ground-truth bbox, but the anchor bbox is preset and cannot be changed.

However, the anchor bbox can generate different predicted bboxes, so we convert our goal into: make the predicted bbox closer to the ground-truth bbox. This goal can be expressed as a mathematical mapping f(x), as shown in the figure, so the goal becomes making t_p closer to t_g. All the formulas are normalized to prevent large objects from dominating the result.

Figure 9 Relationship among the three boxes

Second, we need to understand the coordinate transformation. The coordinates in YOLOv1 are relative to the whole image, while in YOLOv2 they are relative to each grid cell. So how are the grid-relative coordinates obtained, and how is the loss computed?

As shown in Figure 10, the anchor bboxes are generated relative to the whole image at first, so we normalize their coordinates to [0, 1].

The resolution of YOLOv2 is 13*13, so we multiply the coordinates in [0, 1] by 13; the bbox coordinates are then relative to the 13*13 grid and lie in [0, 13]. Next we make the coordinates relative to a single cell: xf = x - i, yf = y - j, wf = log(w / anchors[0]), hf = log(h / anchors[1]). Here is an example: let x = 9.6 (the range of x is [0, 13]); then i is the integer part of x, i.e. i = 9, so **xf = 0.6**, and this 0.6 is the x-coordinate relative to the 10th cell along the x-axis.

Figure 10 Coordinate transformation
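The example above, written out in code (a sketch of the encoding only; the anchor width and height passed in are hypothetical values):

```python
import math

def encode_box(x, y, w, h, anchor_w, anchor_h):
    """Encode a box (already scaled to the 13x13 grid) relative to its grid cell and anchor."""
    i, j = int(x), int(y)          # the grid cell the center falls into
    xf, yf = x - i, y - j          # offset inside that cell, in [0, 1)
    wf = math.log(w / anchor_w)    # width/height encoded as a log ratio to the anchor
    hf = math.log(h / anchor_h)
    return (i, j), (xf, yf, wf, hf)

# x = 9.6 falls in cell i = 9, so xf = 0.6, exactly as in the worked example above
print(encode_box(9.6, 4.2, 3.0, 2.0, anchor_w=2.5, anchor_h=1.8))
```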

As shown in Figure 11, the formula in the middle of the figure is the YOLOv2 loss. The coordinates in this formula are relative to the grid cell, and its corresponding f(x) is relative to the whole image.

The network outputs tx and ty and computes σ(tx) and σ(ty), where σ is the sigmoid function that squashes the output to [0, 1]. This gives the center position relative to a single cell; adding the cell's offset within the whole 13*13 grid gives the center position of the predicted bbox, and its width and height are obtained from the anchors. These values are adjusted during training to bring the predicted bbox closer to the real bbox.

Figure 11 summary
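For reference, the decoding step can be sketched as follows, using the b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w·e^{t_w}, b_h = p_h·e^{t_h} formulas from the YOLOv2 paper (function and variable names are my own):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, grid=13):
    """Decode raw network outputs (tx, ty, tw, th) for the cell at (cx, cy)
    into a box in [0, 1] image coordinates."""
    bx = (sigmoid(tx) + cx) / grid       # sigmoid keeps the center inside its cell
    by = (sigmoid(ty) + cy) / grid
    bw = anchor_w * math.exp(tw) / grid  # width/height scale the anchor prior
    bh = anchor_h * math.exp(th) / grid
    return bx, by, bw, bh

print(decode_box(0.4, -0.2, 0.1, 0.3, cx=9, cy=4, anchor_w=2.5, anchor_h=1.8))
```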

YOLOv3

The improvement of YOLOv3

The first improvement of YOLOv3 is the network structure, which introduces the idea of ResNet. However, taking the ResNet module over unchanged would make the whole model very large, so the last 1*1*256 layer of the module is removed and the preceding 3*3*64 layer is changed to 3*3*128. The entire network structure is shown in the figure. The input is a 416*416*3 RGB image; the network outputs predictions at three scales and finally outputs the category and bounding box of each target object.

The second improvement of YOLOv3 is multi-scale prediction, which is truly multi-scale: there are three scales, with resolutions 13*13, 26*26 and 52*52, responsible for predicting large, medium and small objects respectively. This makes detection much friendlier to small objects.

The principle of YOLOv3's multi-scale prediction is shown in the figure. An image is input and divided by YOLOv3 into 13*13, 26*26 and 52*52 grids. Each cell at each resolution corresponds to a multidimensional vector containing, for each of its boxes, the box coordinates, the box confidence, and 80 category probabilities. Finally, the category probability and bounding box of each object are output.
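To see what the three output tensors look like in the 80-class COCO setting (a small sketch; each scale uses 3 anchors per cell, giving 3 * (4 + 1 + 80) = 255 channels):

```python
# Each scale predicts 3 anchors per cell; each anchor carries
# 4 box coordinates + 1 objectness confidence + 80 class probabilities.
NUM_ANCHORS_PER_SCALE = 3
NUM_CLASSES = 80
channels = NUM_ANCHORS_PER_SCALE * (4 + 1 + NUM_CLASSES)  # = 255

for grid in (13, 26, 52):  # large, medium and small objects respectively
    print(f"scale {grid}x{grid}: output shape ({grid}, {grid}, {channels})")
```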

YOLOv3 code combat

1. Annotation of data set

To train YOLOv3, the data set is first annotated with LabelImg.

LabelImg is available at github.com/tzutalin/la…

The installation program is shown below:

After installation, the interface will look like the following figure:

First click "Open" to load an image; as shown in the figure, a picture of a dog and a cat is opened, and then the bounding box is drawn for annotation.

After drawing the box, the category of the target object should be entered, as shown in the figure:

Once labeled, a "catdog.xml" file is generated.

The file content is shown in the figure below:

Finally, put the pictures (catdog) into ./VOCdevkit/VOC2007/JPEGImages, and put the XML files produced by LabelImg into "Annotations". As shown in the figure:

2. Data preprocessing

Once the images and XML files are ready, run the voc2yolo3.py program to generate the data set list files, and change the labels in voc_classes.txt to your own category labels; if you have more than one category, put each category on a separate line.

For the sake of demonstration, I temporarily added some image data, which is not actually used in this YOLOv3 run. The data in the following screenshots are the original YOLOv3 data, so some of them do not correspond exactly, but what matters is the workflow discussed next. If you are training your own data set, put your own data in the corresponding locations.

Then, before running voc_annotation.py, first change the categories in the script to your own categories; in my case there is only one category.
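If you want to see what this conversion step does conceptually, here is a minimal sketch that reads one Pascal VOC XML file and emits one annotation line of the form `image_path xmin,ymin,xmax,ymax,class_id ...`. The exact format and script in your copy of the code may differ, so treat this as an illustration only (the class list here stands in for a hypothetical voc_classes.txt with two entries):

```python
import xml.etree.ElementTree as ET

CLASSES = ["cat", "dog"]  # hypothetical contents of voc_classes.txt, one class per line

def voc_xml_to_line(xml_path, image_path):
    """Turn one LabelImg/VOC XML file into a single training-list line."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        if name not in CLASSES:
            continue
        b = obj.find("bndbox")
        coords = [int(float(b.find(k).text)) for k in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append(",".join(map(str, coords + [CLASSES.index(name)])))
    return image_path + " " + " ".join(boxes)

# e.g. voc_xml_to_line("Annotations/catdog.xml", "JPEGImages/catdog.jpg")
```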

Then run kmeans.py; it will generate k anchors, and these numbers are the sizes of the pre-generated anchor boxes. Put these anchor values in the location shown in the image and format them following the original yolo_anchors file.

Then copy these numbers into yolov3.cfg: search for "yolo" and modify the corresponding anchors and classes. Set classes to the number of categories you want to detect; I only have one category here, so I changed it to 1. There are three [yolo] sections, and all of them need to be modified.
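One extra thing worth checking while editing the cfg (this is a general Darknet convention rather than something specific to this article's code): the [convolutional] layer right before each [yolo] layer should have filters = 3 * (classes + 5). A tiny helper to compute it:

```python
def yolo_filters(num_classes, anchors_per_scale=3):
    """Number of output filters for the conv layer feeding each [yolo] layer:
    each anchor predicts 4 box coords + 1 objectness + num_classes scores."""
    return anchors_per_scale * (num_classes + 5)

print(yolo_filters(1))    # 18  -> one custom class, as in this article
print(yolo_filters(20))   # 75  -> Pascal VOC
print(yolo_filters(80))   # 255 -> COCO
```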

3. Training and testing

When everything is ready, you can start training by simply executing train.py. Pay attention to the save path of the weights; some parameters can also be adjusted.

After training is complete, just run yolo_video.py to test. If you downloaded the YOLOv3 code from my public account, you need to modify yolo_video.py as follows:

YOLO series summary

That's all for today. Thank you. If there is any mistake, criticism and corrections are welcome.

If you want the YOLOv3 code, follow the "Yufeng code word" public account and reply "YOLOv3" to get it.

I am Yufeng, public account: Yufeng code word. See you next time.