
YOLOv1 is the first one-stage object detection algorithm, appearing shortly after Faster R-CNN, and it created a new school of object detection. Since YOLOv1, object detection algorithms have been broadly divided into two camps: one-stage and two-stage.

The figure below shows where YOLO sits in the timeline of object detection algorithms.

Background

As the timeline shows, YOLOv1 was published after Faster R-CNN, but judging by the dates of the first arXiv uploads (see the figure below), YOLOv1 and Faster R-CNN were proposed at essentially the same time. When the YOLO authors started writing, Fast R-CNN was the state of the art (SOTA) and Faster R-CNN did not yet exist, so their main baseline is Fast R-CNN. The authors later added comparisons against Faster R-CNN for reference.

The following is a brief overview of how Fast R-CNN works.

Fast R-CNN consists of four main parts (shown below). The first is a shared feature extractor; the second is a traditional region-proposal algorithm. Given the shared extracted feature map and the regions of interest (ROIs), the part of the feature map corresponding to each ROI is cropped out, converted by ROI pooling into a feature map of fixed resolution, and fed into the detection head, which completes detection by regressing the object category and bounding box.
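The ROI pooling step mentioned above can be sketched in a few lines. This is a minimal, single-channel, pure-Python illustration (real implementations operate on multi-channel tensors and handle sub-pixel coordinates); the function name and box format are my own:

```python
def roi_pool(feature_map, roi, out_size=(2, 2)):
    """Max-pool the ROI region of `feature_map` into a fixed-size grid.

    feature_map: 2-D list (H x W) of floats
    roi: (row0, col0, row1, col1), end-exclusive, in feature-map coordinates
    out_size: fixed output resolution, regardless of the ROI's size
    """
    r0, c0, r1, c1 = roi
    oh, ow = out_size
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(oh):
        row = []
        for j in range(ow):
            # integer bin boundaries that cover the ROI as evenly as possible
            rs, re = r0 + i * h // oh, r0 + (i + 1) * h // oh
            cs, ce = c0 + j * w // ow, c0 + (j + 1) * w // ow
            row.append(max(feature_map[r][c]
                           for r in range(rs, max(re, rs + 1))
                           for c in range(cs, max(ce, cs + 1))))
        pooled.append(row)
    return pooled
```

Whatever the ROI's size, the output resolution is fixed by `out_size`, which is what lets a fully connected detection head follow it.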

Because Fast R-CNN is divided into two processes, region proposal and per-region detection on the extracted features, it is a two-stage method. This gives Fast R-CNN high accuracy, but its speed is not real-time.

YOLO was proposed to remedy this shortcoming of Fast R-CNN by merging the two stages into one, thereby achieving real-time performance.

Idea

How does YOLO merge the two stages into one?

1) First, divide the input image into a 7×7 grid

2) It can be seen that each cell predicts two boxes but outputs only one prediction: the box with the higher confidence C is generally taken as the cell's prediction, and its category is the class with the largest p_ci.

3) Since each cell has two predicted boxes, how is the Ground Truth computed for each? First, each Ground Truth box is assigned to the cell containing its center. As shown below, the bicycle's green Ground Truth box is assigned to the pink cell.
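The per-cell readout described above can be sketched as a small helper (the function name and box format are my own, not from the paper's code):

```python
def cell_prediction(boxes, class_probs):
    """Read out one cell's single prediction.

    boxes: list of (x, y, w, h, confidence) tuples, one per predicted box
    class_probs: list of conditional class probabilities p_ci for this cell
    Returns the higher-confidence box and the index of the most likely class.
    """
    best_box = max(boxes, key=lambda b: b[4])  # keep the box with higher C
    best_class = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return best_box, best_class
```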

Second, within a cell, the Ground Truth is assigned to the predicted box that has the largest IOU with it, as shown below: the two predicted boxes in the pink cell are the red boxes, and the green GT box is assigned to the wide, short red box.

With the output format and the Ground Truth defined, it remains to design a network structure mapping input to output, and the whole model can then be trained with gradient descent.

Network

Below is a simplified version of the network structure.

The diagram below shows the detailed network structure.

It can be seen that the image first passes through a 24-layer convolutional feature extractor designed by the authors, and then through two fully connected layers to produce the final 7×7×30 output.

Note that the 7×7 spatial size of this output corresponds exactly to the 7×7 grid the authors divide the image into.
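The 30 channels per cell decompose as B×5 + C = 2×5 + 20: two boxes of five numbers (x, y, w, h, confidence) plus 20 class probabilities (the PASCAL VOC classes). A minimal layout sketch (the exact channel ordering here is my assumption for illustration; implementations differ):

```python
S, B, C = 7, 2, 20
depth = B * 5 + C  # 30 channels per cell

def split_cell(cell_vector, B=2, C=20):
    """Split one cell's 30 numbers into its boxes and class probabilities."""
    boxes = [tuple(cell_vector[5 * j:5 * j + 5]) for j in range(B)]
    class_probs = cell_vector[5 * B:5 * B + C]
    return boxes, class_probs
```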

Loss

The loss is mainly divided into three parts: localization loss, confidence loss, and classification loss.
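For reference, the full loss from the YOLOv1 paper combines these three parts as a sum of squared errors, with weights λ_coord = 5 and λ_noobj = 0.5 (𝟙 indicates whether box j of cell i is responsible for an object):

```latex
\begin{aligned}
L ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
       \mathbb{1}_{ij}^{\text{obj}}
       \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
       \mathbb{1}_{ij}^{\text{obj}}
       \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
            + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
       \left(C_i - \hat{C}_i\right)^2
   + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
       \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
  &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
       \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```

The square roots on w and h make errors on small boxes count more than the same absolute errors on large boxes.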

1) Localization loss

2) Confidence loss

C_i predicts Pr(Object)·IOU, which jointly reflects the probability that the predicted box contains an object and the IOU between the predicted box and the truth. The label for a cell's C_i can therefore be calculated as follows:
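Concretely, the confidence target used in the paper is:

```latex
\hat{C}_i = \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} =
\begin{cases}
\text{IOU}^{\text{truth}}_{\text{pred}} & \text{if the predicted box is responsible for an object} \\
0 & \text{otherwise}
\end{cases}
```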

3) Classification loss

The predicted p_ci is a conditional probability, i.e. the probability of Class_i given that an object is present. If the GT category is class i, p_ci is labeled 1 and the other class probabilities are labeled 0.
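At test time, multiplying this conditional class probability by the box confidence yields a class-specific confidence score for each box:

```latex
\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}
= \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}
```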

Training

YOLO model training uses the following techniques:

  1. Data augmentation: random scaling, random cropping, and random adjustment of exposure and saturation
  2. Dropout: applied to the fully connected layers, with a dropout rate of 0.5
  3. Optimizer: SGD with momentum
  4. Weight decay: coefficient 0.0005
  5. Batch size: 64
  6. Learning rate: a staged learning-rate schedule
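Points 3)-5) amount to the classic momentum-SGD-with-weight-decay update. A minimal pure-Python sketch of one such step (the momentum value 0.9 and the learning rate here are placeholders I chose for illustration, not values stated in this text):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.001, momentum=0.9, wd=0.0005):
    """One SGD-with-momentum update over a flat list of parameters.

    Returns (new_weights, new_velocity).
    """
    new_w, new_v = [], []
    for wi, gi, vi in zip(w, grad, velocity):
        g = gi + wd * wi            # weight decay as an L2 gradient term
        v = momentum * vi - lr * g  # accumulate velocity
        new_v.append(v)
        new_w.append(wi + v)
    return new_w, new_v
```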

Experiments

Error Analysis

The authors also analyze YOLO's errors and compare the error sources of Fast R-CNN and YOLO. First, they divide detection results into five categories:

  1. Correct: correct category, IOU > 0.5
  2. Localization error: correct category, 0.1 < IOU < 0.5
  3. Similar error: category confused with a similar category, IOU > 0.1
  4. Other error: wrong category, IOU > 0.1
  5. Background error: IOU < 0.1 with any object
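The five-way taxonomy above can be written as a small classifier (my own sketch, with the thresholds taken from the list):

```python
def classify_detection(iou, class_correct, class_similar):
    """Assign a detection to one of the paper's five error categories."""
    if class_correct and iou > 0.5:
        return "correct"
    if class_correct and 0.1 < iou <= 0.5:
        return "localization"
    if class_similar and iou > 0.1:
        return "similar"
    if iou > 0.1:
        return "other"
    return "background"
```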

The error composition of Fast R-CNN and YOLO is shown as follows:

It can be seen that Fast R-CNN's errors are dominated by background errors, while YOLO's are dominated by localization errors.

Conclusion

YOLO pioneered the one-stage approach, achieving real-time detection while maintaining high accuracy.