YOLO is an object detection algorithm proposed in 2016. At that time, strong detectors such as R-CNN and Fast R-CNN already existed, but YOLO still felt novel and exciting.

YOLO stands for “You Only Look Once,” and the idea comes from how we see: when you look at an image, you can tell at a glance where the different kinds of objects in it are located.

YOLO wins for its simplicity and speed.

  • YOLO is a single neural network that makes end-to-end predictions, without a multi-stage pipeline.
  • YOLO is extremely fast: 45 FPS in real time, and even 155 FPS in the simplified version.

This post covers YOLO V1. The latest version has evolved to YOLO V3, but I thought it best to start with the basic idea.

YOLO V1 paper: “You Only Look Once: Unified, Real-Time Object Detection”

YOLO object detection is like fishing

Before YOLO, object detectors generally first generated candidate regions with some algorithm, and then performed classification and bbox position regression on a large number of those candidates.

The older method is to slide a window across the image and predict and judge each position one by one.

But YOLO is very different.

If you think of object detection as fishing, the other algorithms spear fish one at a time with a harpoon, while YOLO is much cruder: it casts a net and catches everything at once.

YOLO’s predictions are based on the entire image, and it outputs all detected objects at once, including their categories and locations.

The idea of the YOLO algorithm

YOLO’s algorithm is actually pretty simple.

  1. Resize the input image
  2. Feed the image into a convolutional neural network for prediction
  3. Threshold the predictions by confidence to obtain the final result
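A minimal sketch of these three steps in Python. The nearest-neighbour resize, the random stand-in for the network, and the layout of the prediction vector are all illustrative assumptions, not YOLO’s actual code:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid cells per side, bboxes per cell, classes

def resize(image, size=448):
    """Step 1: resize the input image to the network's fixed input size
    (nearest-neighbour resampling, for illustration only)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

def predict(image):
    """Step 2: stand-in for the CNN forward pass; a real network would
    return an S x S x (5*B + C) prediction tensor for this image."""
    rng = np.random.default_rng(0)
    return rng.random((S, S, 5 * B + C))

def threshold(pred, conf_thresh=0.5):
    """Step 3: keep only bboxes whose confidence exceeds the threshold."""
    detections = []
    for i in range(S):
        for j in range(S):
            for b in range(B):
                x, y, w, h, conf = pred[i, j, 5 * b:5 * b + 5]
                if conf > conf_thresh:
                    cls = int(np.argmax(pred[i, j, 5 * B:]))
                    detections.append((i, j, cls, float(conf)))
    return detections

detections = threshold(predict(resize(np.zeros((640, 480, 3)))))
```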

Let’s go into the details of the algorithm.

Dividing the image into grids

YOLO divides the input image into an S×S grid. If the center point of an object falls in a grid cell, that cell is responsible for predicting the object’s size and category.
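Finding the responsible cell is a simple index computation; a sketch (a hypothetical helper, assuming pixel-coordinate inputs):

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell containing the object's
    center point (cx, cy), given the image size in pixels."""
    col = int(cx / img_w * S)
    row = int(cy / img_h * S)
    # Clamp in case the center lies exactly on the right/bottom edge.
    return min(row, S - 1), min(col, S - 1)
```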



In the example above, the dog’s center point falls in the blue cell, so the blue cell is responsible for predicting this object.

What each grid cell predicts

Each grid cell predicts B bboxes and a confidence value for each. The bbox predicts the object’s location, while the confidence reflects both whether the cell contains an object and how accurate the bbox is.

C = Pr(obj) * IOU_pred^truth

Note, however, that if a cell contains no object, Pr(obj) = 0; otherwise Pr(obj) = 1, so C equals the IOU between the predicted bbox and the ground truth.

Each bbox consists of five predictions: x, y, w, h and confidence.

x and y are the offsets of the bbox’s center relative to its grid cell, normalized to the cell size, so their values lie between 0 and 1.

Similarly, w and h are normalized: they are the bbox’s width and height as fractions of the whole image, so their values also lie between 0 and 1.
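Putting the two normalizations together, encoding a pixel-space box into YOLO’s (x, y, w, h) might look like this (a hypothetical helper, not the paper’s code):

```python
def encode_bbox(cx, cy, bw, bh, img_w, img_h, S=7):
    """Encode a box with center (cx, cy) and size (bw, bh), all in
    pixels, into YOLO's normalized (x, y, w, h) for its grid cell."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    cell_w, cell_h = img_w / S, img_h / S
    x = cx / cell_w - col   # center offset within the cell, in [0, 1)
    y = cy / cell_h - row
    w = bw / img_w          # width as a fraction of the whole image
    h = bh / img_h
    return row, col, x, y, w, h
```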

Confidence, as mentioned earlier, represents the IOU between the predicted bbox and the ground-truth bbox.
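IOU (intersection over union) itself is easy to compute; a minimal version for boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)  # intersection corners
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)
```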

Besides its bboxes, each grid cell also predicts a conditional class probability Pr(class_i | object).

This conditional probability is the probability distribution over object categories, given that the grid cell contains an object.

The conditional probability belongs to each grid cell, not to each bbox, even though each cell predicts B bboxes.

In the test phase, YOLO multiplies the conditional class probability by the confidence of each bbox: Pr(class_i | object) * Pr(obj) * IOU_pred^truth = Pr(class_i) * IOU_pred^truth.

The result is a class-specific confidence score for each bbox. It expresses two things at once: the probability that the object belongs to a certain category, and how well the predicted bbox fits the real one.
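With NumPy, this multiplication for one grid cell is just an outer product of the B confidences and the C class probabilities (the shapes here are assumptions following the definitions above):

```python
import numpy as np

def class_scores(class_probs, box_confidences):
    """Class-specific confidence for every bbox in one grid cell:
    Pr(class_i | obj) * (Pr(obj) * IOU), for each of the B bboxes.
    class_probs has shape (C,), box_confidences has shape (B,)."""
    return np.outer(box_confidences, class_probs)  # shape (B, C)

scores = class_scores(np.array([0.2, 0.8]), np.array([1.0, 0.5]))
```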



To recap: YOLO divides the input image into S×S grid cells; each cell predicts B bboxes and C conditional class probabilities, and each bbox contains x, y, w, h and confidence.

So YOLO’s final prediction can be assembled into a tensor of S × S × (5B + C).

The values B, S, and C are chosen by the designer of the network. YOLO V1 was evaluated on the PASCAL VOC data set with S = 7 and B = 2. The data set has 20 classes, so C = 20, and plugging into the formula gives a 7 × 7 × 30 tensor.

Network architecture design of YOLO

YOLO is based on a CNN, and the paper’s authors say they were inspired by GoogLeNet’s structure when building the new network.

YOLO replaces GoogLeNet’s Inception modules with 1×1 convolutions for dimensionality reduction, followed by 3×3 convolutions.

YOLO has 24 convolutional layers followed by two fully connected layers.

Here is its network structure.



As you can see, YOLO’s final prediction is a 7×7×30 tensor.
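The 7×7×30 output is just the flat output of the last fully connected layer reshaped into the grid. A sketch of one common way the tensor decomposes (zeros stand in for real network outputs; the exact ordering of the 30 values is an assumption):

```python
import numpy as np

# The last fully connected layer emits 7 * 7 * 30 = 1470 values,
# reshaped into the 7x7x30 prediction grid.
fc_output = np.zeros(7 * 7 * 30)
pred = fc_output.reshape(7, 7, 30)

# Per cell: 2 bboxes * (x, y, w, h, confidence) = 10 values,
# followed by 20 conditional class probabilities.
boxes = pred[..., :10].reshape(7, 7, 2, 5)
class_probs = pred[..., 10:]
```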



This picture also gives you a rough idea of the structure of the final predicted tensor.

YOLO training process

For training, YOLO first takes its first 20 convolutional layers plus a fully connected layer and pre-trains them for general classification on ImageNet, reaching relatively high accuracy.

After that, the authors added four convolutional layers and two fully connected layers to the pre-trained structure for detection training.

To improve detection performance, the input size was 224×224 during ImageNet pre-training, but was increased to 448×448 for the PASCAL VOC detection training.

Except for the last layer, which uses a linear activation function, every layer uses leaky ReLU.
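The leaky ReLU in the paper uses a slope of 0.1 on the negative side:

```python
def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: identity for positive inputs, slope alpha otherwise."""
    return x if x > 0 else alpha * x
```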

Loss setting of YOLO

As we all know, the use of appropriate Loss is very important for training a neural network.

In my opinion, only by understanding the loss design principle of YOLO can we truly understand the core of YOLO algorithm.

As we know, the lower the loss value, generally, the higher the network’s accuracy and the more effective the training, so training essentially aims to drive the loss down.

For training, the YOLO authors used sum-squared error (the sum of squared errors) as the loss to optimize.

Sum-squared error was adopted because it is easy to optimize, but it has a drawback: it does not align perfectly with the goal of maximizing mAP.

Sum-squared error also gives position errors and category errors the same weight in the loss.

In fact, YOLO predicts 7×7×2 = 98 bboxes per image, but only a few of them contain objects. During training, the confidences of the boxes without objects are quickly pushed toward zero, and their gradients can overpower those of the few cells that do contain objects. This imbalance destabilizes the whole prediction system, and the loss may fail to converge.

To correct this, the authors of YOLO re-weighted the bbox coordinate error and the no-object confidence error, which can be seen as adding penalty coefficients of 5 and 0.5 respectively.

Why do you do that?

You can understand it this way: the most effective parts of the whole YOLO system are the two bboxes predicted by the grid cell containing an object’s center. Their location information is vital, so violent swings in their coordinates cannot be tolerated; a coefficient is added to magnify their error and thus punish coordinate mistakes more heavily.

Although cells without objects also predict bboxes and confidences, those predictions are essentially invalid and matter little to the overall loss. Their influence is therefore weakened, to keep them from interfering with the confidence of the bboxes that do contain objects.

This raises a question worth thinking about: if the influence of no-object confidence on the loss is to be weakened, why not go all the way? Wouldn’t changing the penalty factor from 0.5 to 0 be even better?

I don’t have a definitive answer, but my understanding is that a factor of 0 would give the network no signal at all to predict low confidence for empty cells, throwing the YOLO system out of balance and weakening its overall judgment.

Another problem with the sum-squared error is that it makes large bboxes and small bboxes equally sensitive to position error.

For example, when predicting a small bbox, suppose the ground-truth width is 4 and the prediction is 3; the error is 1.

Now predict a big bbox: the ground-truth width is 100 and the prediction is 99, so the error is also 1.

However, it is easy to see that the same absolute error matters far more for a small bbox, so the sum-squared error method needs improvement.

YOLO therefore takes the square roots of the predicted and ground-truth widths and heights first, and then computes the sum-squared error on those.

Repeating the previous example: for the small bbox, prediction = 3 and ground truth = 4, giving an error of (√4 − √3)² ≈ 0.072; for the big bbox, prediction = 99 and ground truth = 100, giving (√100 − √99)² ≈ 0.0025.
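The square-root trick from the example can be checked in a few lines:

```python
import math

def sqrt_error(pred, truth):
    """Squared error computed on square roots, as YOLO does for w and h."""
    return (math.sqrt(pred) - math.sqrt(truth)) ** 2

small = sqrt_error(3, 4)     # small bbox: width 3 predicted vs 4 true
large = sqrt_error(99, 100)  # large bbox: width 99 predicted vs 100 true
```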

The square root does not remove the problem entirely, but it greatly alleviates it.

The final loss is thus a compound loss, with the following formula:
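The formula image has not survived here. For reference, the full loss from the YOLO V1 paper, where 1_ij^obj indicates that the j-th bbox of cell i is responsible for an object, λ_coord = 5 and λ_noobj = 0.5:

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i-\hat{C}_i\right)^2 \\
& + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i-\hat{C}_i\right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
```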

Note that although each YOLO grid cell predicts two bboxes, only one of them ends up responsible for the object: the bboxes’ confidences (i.e., their IOUs with the ground truth) are compared, and the higher scorer wins. Only that bbox’s coordinates participate in the loss calculation.

Other specific training strategies are as follows:

  1. In the first epochs, the learning rate is slowly raised from 0.001 to 0.01.
  2. For the next 75 epochs, the learning rate is fixed at 0.01.
  3. Then 30 epochs are trained at 0.001.
  4. Finally, 30 epochs at 0.0001.
  5. To avoid overfitting, YOLO training uses dropout and data augmentation.
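The schedule above, written as a function of the epoch (the warm-up length is an assumption; the paper only says the rate is raised slowly over the first epochs):

```python
def learning_rate(epoch):
    """YOLO V1's piecewise learning-rate schedule (0-based epochs)."""
    warmup = 1  # assumed warm-up length
    if epoch < warmup:  # raise slowly from 1e-3 toward 1e-2
        return 1e-3 + (1e-2 - 1e-3) * (epoch + 1) / warmup
    if epoch < warmup + 75:
        return 1e-2
    if epoch < warmup + 75 + 30:
        return 1e-3
    return 1e-4
```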

Conclusion

In the paper, YOLO V1 was also compared with the other great detection systems of its day, but that matters less today; what is important is to understand why YOLO is fast.

  • YOLO detects like casting a net: it targets the whole image at once.
  • YOLO is fast because it coarsens the input image into a grid and then predicts on that grid.
  • The essence of the YOLO algorithm shows in its loss design and in how the author improved the loss to address each problem. That way of thinking is what is most worth learning from.