What’s New in YOLO V3?

You Only Look Once, or YOLO, is one of the fastest object detection algorithms available. While it is no longer the most accurate object detection algorithm, it is a very good choice when you need real-time detection without losing too much accuracy.

In the first half of 2018, the third version of YOLO was released, and this article aims to explain the changes introduced in YOLO V3. This is not an article that explains what YOLO is; I assume you already know how YOLO V2 works. If that is not the case, I suggest you check out the papers by Joseph Redmon and others to see how YOLO works.

  • YOLO v1
  • YOLO v2
  • A good post about YOLO

YOLO V3: Better, but not faster, stronger

Officially titled “YOLO9000: Better, Faster, Stronger,” the YOLO V2 paper sounds more like a health drink for kids than an object detection algorithm.

YOLO 9000 was one of the fastest and most accurate algorithms of its time. However, with the arrival of algorithms like RetinaNet, it is no longer the most accurate, although it remains one of the fastest.

To improve accuracy, YOLO V3 trades away some speed. While earlier versions ran at 45 FPS on a Titan X, the current version clocks in at about 30 FPS. This is due to the increased complexity of the underlying Darknet architecture.

Darknet-53

YOLO V2 used a custom deep architecture, Darknet-19, an originally 19-layer network extended with 11 more layers for object detection. With its 30-layer architecture, YOLO V2 often struggled to detect small objects. This was because the layers downsample the input, losing fine-grained features along the way. To remedy this, YOLO V2 used an identity mapping, concatenating feature maps from an earlier layer to capture low-level features.

However, YOLO V2’s architecture still lacked some of the most important elements that are now standard in most state-of-the-art algorithms: there were no residual blocks, no skip connections, and no upsampling. YOLO V3 incorporates all of these.

First, YOLO V3 uses a variant of Darknet, which originally has a 53-layer network trained on ImageNet. For the detection task, 53 more layers are stacked on top of it, giving YOLO V3 a 106-layer fully convolutional underlying architecture. This is why YOLO V3 is slower than YOLO V2. Here is what the YOLO architecture looks like now:

Three scales of detection

The new architecture boasts residual skip connections and upsampling. The most salient feature of V3 is that it makes detections at three different scales. YOLO is a fully convolutional network, and its eventual output is generated by applying a 1×1 kernel to a feature map. In YOLO V3, detection is done by applying 1×1 detection kernels to feature maps of three different sizes at three different places in the network.

The shape of the detection kernel is 1 × 1 × (B × (5 + C)). Here B is the number of bounding boxes a cell on the feature map can predict, “5” is for the 4 bounding box attributes plus 1 object confidence, and C is the number of classes. In YOLO V3 trained on COCO, B = 3 and C = 80, so the kernel size is 1 × 1 × 255. The feature map produced by this kernel has the same height and width as the preceding feature map, and carries the detection attributes along its depth as described above.

Image from: blog.paperspace.com/how-to-impl…
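As a quick sanity check on that formula, here is a minimal calculation of the kernel depth for the COCO-trained model; the variable names are purely illustrative, not taken from any repository:

```python
# Depth of the 1x1 detection kernel: B * (5 + C)
B = 3            # bounding boxes predicted per grid cell at each scale
C = 80           # number of classes in COCO
kernel_depth = B * (5 + C)   # 4 box coordinates + 1 objectness score, per box
print(kernel_depth)          # 255
```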

Before going any further, I would like to point out the stride of the network, defined as the factor by which a layer downsamples the input. In the following examples, I assume we have an input image of size 416 x 416.

YOLO V3 makes predictions at three scales, obtained by downsampling the input image by strides of 32, 16, and 8 respectively.

The first detection is made by the 82nd layer. For the first 81 layers, the image is downsampled by the network, so that the 81st layer has a stride of 32. If our image is 416×416, the resulting feature map is 13×13. The 1 x 1 detection kernel is applied here, giving us a detection feature map of 13 x 13 x 255.

Next, the feature map from layer 79 is passed through a few convolution layers and upsampled by 2x to 26×26. This feature map is then depth-concatenated with the feature map from layer 61. The combined feature map again passes through a few 1×1 convolution layers to fuse the features from the earlier layer (61). The second detection is then made at the 94th layer, yielding a detection feature map of 26×26×255.

A similar procedure is repeated: the feature map from layer 91 goes through a few convolution layers before being depth-concatenated with the feature map from layer 36. As before, a few 1×1 convolution layers follow to fuse the information from the earlier layer (36). The third and final detection is made at the 106th layer, producing a feature map of size 52 x 52 x 255.
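To make the three scales concrete, here is a small sketch that derives the grid sizes and detection map shapes from the strides described above, assuming the 416 x 416 input used throughout this article:

```python
# Grid size at each detection layer = input size / stride
input_size = 416
strides = [32, 16, 8]        # strides at detection layers 82, 94 and 106
depth = 3 * (5 + 80)         # 255 channels per detection map for COCO

for stride in strides:
    grid = input_size // stride
    print(f"stride {stride:>2} -> detection map {grid} x {grid} x {depth}")

# stride 32 -> detection map 13 x 13 x 255
# stride 16 -> detection map 26 x 26 x 255
# stride  8 -> detection map 52 x 52 x 255
```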

Better at detecting smaller objects

Detections at different layers help address the problem of detecting small objects, a common complaint with YOLO V2. The upsampled layers, concatenated with earlier layers, help preserve the fine-grained features that make small objects easier to detect.

The 13 x 13 layer is responsible for detecting large objects, the 52 x 52 layer detects smaller objects, and the 26 x 26 layer detects medium-sized objects.

Choice of anchor boxes

YOLO V3 uses a total of nine anchor boxes, three for each scale. If you are training YOLO on your own dataset, you should use k-means clustering to generate these nine anchors.

Then, sort the anchors in descending order of size. Assign the three largest anchors to the first scale, the next three to the second scale, and the last three to the third scale.
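A minimal sketch of that procedure, assuming your ground-truth box widths and heights have already been collected as (w, h) pairs; note that it uses scikit-learn’s Euclidean k-means as a stand-in for the IoU-based clustering the YOLO authors actually describe:

```python
import numpy as np
from sklearn.cluster import KMeans

# box_wh: (N, 2) array of ground-truth box widths and heights from your own
# dataset; random values are used here only as a stand-in
box_wh = np.random.rand(1000, 2) * 416

# The YOLO authors cluster with an IoU-based distance; plain Euclidean
# k-means is used here purely as a simplification
kmeans = KMeans(n_clusters=9, n_init=10, random_state=0).fit(box_wh)
anchors = kmeans.cluster_centers_

# Sort by area, largest first, and hand out three anchors per scale
anchors = anchors[np.argsort(anchors[:, 0] * anchors[:, 1])[::-1]]
scale_13, scale_26, scale_52 = anchors[:3], anchors[3:6], anchors[6:]
```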

More bounding boxes per image

For an input image of the same size, YOLO V3 predicts more bounding boxes than YOLO V2. For example, at its native resolution of 416 x 416, YOLO V2 predicted 13 x 13 x 5 = 845 boxes: at each grid cell, 5 boxes were detected using 5 anchors.

YOLO V3, on the other hand, predicts boxes at three different scales. For the same 416 x 416 image, the number of predicted boxes is 10,647. That means YOLO V3 predicts roughly 10 times as many boxes as YOLO V2, and you can easily see why it is slower. At each scale, every grid cell predicts 3 boxes using 3 anchors; since there are three scales, a total of 9 anchor boxes is used, 3 per scale.
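The roughly 10x figure follows directly from the grid sizes; here is a quick back-of-the-envelope check for a 416 x 416 input:

```python
# YOLO V2: a single 13 x 13 grid with 5 anchor boxes per cell
yolo_v2_boxes = 13 * 13 * 5                   # 845

# YOLO V3: three grids (13, 26, 52) with 3 anchor boxes per cell each
yolo_v3_boxes = (13**2 + 26**2 + 52**2) * 3   # 10647

print(yolo_v3_boxes, yolo_v3_boxes / yolo_v2_boxes)   # 10647, ~12.6x
```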

Change in the loss function

Earlier, YOLO V2’s loss function looked something like this.

Image from: pjreddie.com/media/files…

I know this formula is intimidating, but pay attention to the last three terms. The first penalizes the objectness score predictions of the bounding boxes responsible for predicting objects (these scores should ideally be 1), the second penalizes the objectness scores of bounding boxes that are not responsible for predicting objects (these scores should ideally be zero), and the last penalizes the class predictions of the bounding boxes responsible for predicting objects.

The last three terms in YOLO V2 are squared errors, whereas in YOLO V3 they have been replaced by cross-entropy error terms. In other words, object confidence and class predictions in YOLO V3 are now predicted through logistic regression.
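A minimal PyTorch sketch of that change, using binary cross-entropy with logits for the objectness and class terms; the tensors here are purely illustrative and not taken from the actual training code:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()       # sigmoid + binary cross-entropy in one op

# Raw (pre-sigmoid) network outputs for a handful of boxes
pred_obj = torch.randn(8, 1)       # objectness scores
pred_cls = torch.randn(8, 80)      # per-class scores (COCO)

# Targets: 1 for the box/class combinations that should fire, 0 otherwise
target_obj = torch.randint(0, 2, (8, 1)).float()
target_cls = torch.randint(0, 2, (8, 80)).float()

loss = bce(pred_obj, target_obj) + bce(pred_cls, target_cls)
```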

While training the detector, for each ground-truth box we assign one bounding box, namely the one whose anchor has the maximum overlap with the ground-truth box.

No more softmax classification

YOLO V3 now performs multi-label classification for objects detected in images.

In earlier versions of YOLO, the authors used to softmax the class scores and take the class with the highest score to be the class of the object contained in the bounding box. This has been changed in YOLO V3.

Softmax classification rests on the assumption that classes are mutually exclusive; simply put, if an object belongs to one class, it cannot belong to another. This works fine on the COCO dataset.

However, this assumption fails when we have classes like Person and Woman in a dataset. This is why the YOLO authors moved away from Softmax classification. Instead, each class score is predicted with logistic regression, and a threshold is used to predict multiple labels for an object. Classes with scores higher than this threshold are assigned to the box.
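Here is a small sketch of the difference, with made-up class scores for a single box and an illustrative threshold of 0.5:

```python
import torch

logits = torch.tensor([2.1, 0.3, 1.7, -1.2])      # raw scores for 4 classes

# Softmax: probabilities compete, exactly one class wins
winner = torch.softmax(logits, dim=0).argmax()

# YOLO V3 style: an independent sigmoid per class, and every class whose score
# clears the threshold is assigned to the box ("Person" and "Woman" can both fire)
threshold = 0.5
labels = (torch.sigmoid(logits) > threshold).nonzero().flatten()
```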

The benchmark

YOLO V3 performs on par with other state-of-the-art detectors such as RetinaNet, while being considerably faster, on the COCO mAP 50 benchmark. It is also better than SSD and its variants. The following performance comparison is from the paper.

YOLO and RetinaNet performance in the COCO 50 benchmark

But, but, but, YOLO loses out on the COCO benchmarks that use a higher IoU value for rejecting detections. I am not going to explain how the COCO benchmarks work, since that is beyond the scope of this article, but the “50” in COCO 50 measures how well the predicted bounding boxes align with the ground-truth boxes of objects. Here, 50 corresponds to an IoU of 0.5. If the IoU between the prediction and the ground-truth box is less than 0.5, the prediction is classified as a mislocalisation and marked as a false positive.

In benchmarks with a higher value (for example, COCO 75), the boxes need to be aligned more precisely to avoid being rejected by the evaluation metric. This is where YOLO is surpassed by RetinaNet, as its bounding boxes are not as well aligned as RetinaNet’s. Below is a more detailed benchmark table.

RetinaNet performs better than YOLO on the COCO 75 benchmark
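For reference, the IoU these thresholds refer to can be computed as in the sketch below; the boxes are given as (x1, y1, x2, y2) corners, and the function name and example boxes are just illustrative:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# IoU of ~0.47 -> counted as a false positive under COCO 50 (threshold 0.5)
print(iou([0, 0, 100, 100], [20, 20, 120, 120]))
```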

Trying it out

You can run the detector on images or video using the code provided in this GitHub repository. The code requires PyTorch 0.3+, OpenCV 3, and Python 3.5. Once you download the code, you can experiment with it.

Different scales
python detect.py --scales 1 --images imgs/img3.jpg

With detection at scale 1 only, we see that some large objects are picked up, but several cars are missed.

python detect.py --scales 2 --images imgs/img3.jpg

At scale 2 alone, no objects are detected at all.

python detect.py --scales 3 --images imgs/img3.jpg

At the largest scale, 3, you can see that only small objects are detected, objects that were not picked up at scale 1.

Different input resolutions
python detect.py --reso 320 --images imgs/imgs4.jpg

Input image resolution: 320 x 320

python detect.py --reso 416 --images imgs/imgs4.jpg

Input image resolution: 416 x 416

python detect.py --reso 608 --images imgs/imgs4.jpg

Here, we detected one less chair than before

python detect.py --reso 960 --images imgs/imgs4.jpg 

Here, the detector makes a false detection: the “person” on the right

As you can see above, a larger input resolution does not help much here, but it may help with images containing small objects. On the other hand, a larger input resolution increases inference time. It is a hyperparameter that needs to be tuned for your application.

You can also experiment with other hyperparameters, such as batch size, object confidence threshold, and NMS threshold. Details are provided in the README file.

Further reading

  • YOLO v3: An Incremental Improvement
  • How do I compute a mAP?