Welcome to follow my public account [Jizhi Vision]; reply 001 to get the Google programming specification.

O_o >_< O_o O_o ~_~ O_o

Hello everyone, I am Jizhi Vision. This article gives a detailed introduction to the design and practice of the YOLOv3 algorithm.

This is the third article in my series on implementing object detection algorithms. Two articles came before it, and interested readers can refer to them:

(1) “[Model Training] Object Detection Implementation Sharing 1: A Detailed Explanation of the YOLOv1 Algorithm”;

(2) “[Model Training] Object Detection Implementation Sharing 2: Heard Clay Made a Comeback Today? A Detailed Explanation of the YOLOv2 Algorithm and Clay Detection”;

YOLOv3, the third version of the YOLO series, was proposed in the paper “YOLOv3: An Incremental Improvement” and further improved performance. Detection models built on YOLOv3 are very common in engineering applications. Let’s take a look at the main optimizations of YOLOv3 compared with the previous versions of YOLO.

Again, we’re going to talk about practice as well as principles.

1. Principles of YOLOv3

As usual, let’s start with the experimental data:

[Figure: COCO AP vs. inference time (ms) on an Nvidia Titan X for YOLOv3 and other detectors]

In the figure above, the vertical axis is accuracy (COCO AP) and the horizontal axis is inference time measured on an Nvidia Titan X. We can see that YOLOv3 offers a better speed/accuracy trade-off than the SSD, FRCN (Faster R-CNN), and RetinaNet detectors.

Let’s look at a more detailed set of precision data:

The comparison here divides detection algorithms into two-stage and one-stage, with Faster R-CNN representing the two-stage family. Two-stage detectors are famous for their accuracy, but because the detection pipeline is complicated (two stages: screening candidate regions + classification), their forward efficiency is low and they are rarely used in engineering. The one-stage detectors listed here are YOLOv2, SSD, RetinaNet, and YOLOv3; one-stage detectors are known for efficiency and, as they have developed, have gradually caught up on accuracy.

On the COCO dataset, RetinaNet with a ResNeXt-101-FPN backbone has the highest AP, AP50, AP75, APs, and APm, while Faster R-CNN w/ TDM has the highest APl. To explain these metrics: in the COCO evaluation protocol, every AP is a mean AP by default, i.e. AP = mAP, AP50 = mAP50, and so on. AP is the accuracy averaged over IoU thresholds from 0.5 to 0.95, AP50 is the accuracy at an IoU threshold of 0.5, and AP75 at a threshold of 0.75; APs is the detection accuracy on small objects (area < 32^2), APm on medium objects (32^2 < area < 96^2), and APl on large objects (area > 96^2).

RetinaNet with ResNeXt-101-FPN has very strong feature-extraction ability, and YOLOv3 does not beat it; nor does YOLOv3 beat Faster R-CNN w/ TDM on large-object detection. Still, YOLOv3’s overall detection accuracy is upper-middle of the pack, and it improves on the weak small-object detection that YOLOv1 and YOLOv2 were so often criticized for. Combined with its strong inference efficiency, YOLOv3 is pretty good!
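To pin down the main metric: COCO AP averages the per-threshold mAP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05:

$$\mathrm{AP} = \frac{1}{10}\sum_{t\in\{0.50,\,0.55,\,\dots,\,0.95\}} \mathrm{AP}_t$$

so AP50 and AP75 are simply the single terms of this sum at t = 0.50 and t = 0.75.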

Below are the main improvements in YOLOv3.

1.1 Backbone: Darknet-53

YOLOv2 proposed Darknet-19 as its backbone. Building on Darknet-19, YOLOv3 borrows the residual structure of ResNet to further deepen the network, which greatly improves the backbone’s feature-extraction ability. The structure of Darknet-53 is as follows:

[Figure: Darknet-53 network structure]
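To make the residual idea concrete, here is a minimal PyTorch-style sketch of the Darknet-53 building block: a 1 x 1 channel-reducing convolution followed by a 3 x 3 expanding convolution, wrapped in a skip connection. The class names and framing are my own illustration, not the official Darknet source.

import torch
import torch.nn as nn

class ConvBNLeaky(nn.Module):
    # Darknet-style convolution: Conv2d + BatchNorm + LeakyReLU(0.1)
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResidualBlock(nn.Module):
    # Darknet-53 residual unit: 1x1 reduce -> 3x3 expand, plus skip connection
    def __init__(self, channels):
        super().__init__()
        self.reduce = ConvBNLeaky(channels, channels // 2, 1)
        self.expand = ConvBNLeaky(channels // 2, channels, 3)

    def forward(self, x):
        return x + self.expand(self.reduce(x))

# sanity check: a residual block preserves the tensor shape
block = ResidualBlock(64)
x = torch.randn(1, 64, 208, 208)
assert block(x).shape == x.shape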

Not to blow its own trumpet, but here are the performance statistics of Darknet-53 compared with some other mainstream backbones:

[Table: backbone comparison — Darknet-19, ResNet-101, ResNet-152, and Darknet-53 (accuracy and speed)]

It can be seen that Darknet-53 improves markedly in accuracy over YOLOv2’s Darknet-19; because the network is deeper, its forward efficiency is not as fast as Darknet-19’s. Compared with ResNet’s deep networks, Darknet-53 is comparable in accuracy while being about 2x faster than ResNet-152 and 1.5x faster than ResNet-101 in the forward pass. These numbers show that Darknet-53 combines very strong feature extraction with good forward efficiency.

1.2 Predictions Across Scales

As the experimental data at the beginning showed, YOLOv3 improves on the weak small-object detection that earlier versions were criticized for, and it does so through multi-scale prediction. A cropped view of the YOLOv3 network structure (possibly too small to read clearly) is shown below:

There are two route + upsample structures in the YOLOv3 network, which lead to two side YOLO branches in addition to the YOLO head on the main branch, forming a structure of one backbone plus three YOLO detection branches. The feature maps of the three YOLO branches are 13 x 13, 26 x 26, and 52 x 52. The smallest, 13 x 13, has the largest receptive field and is suited to detecting large objects; the 26 x 26 branch is suited to medium-sized objects; and the largest, 52 x 52, is suited to small objects. Together they cover large, medium, and small objects across the whole scene, making the model more robust.
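For the standard 416 x 416 input, the three heads correspond to strides 32, 16, and 8, and on COCO each grid cell predicts 3 anchors x (4 box coordinates + 1 objectness score + 80 class scores) = 255 channels. A quick sanity-check sketch in Python:

num_anchors_per_scale = 3
num_classes = 80  # COCO
input_size = 416

# channels per cell: 3 anchors x (x, y, w, h, objectness + class scores)
channels = num_anchors_per_scale * (5 + num_classes)
for stride in (32, 16, 8):
    grid = input_size // stride
    print(f"stride {stride:2d}: {grid} x {grid} x {channels}")
# stride 32: 13 x 13 x 255  -> large objects
# stride 16: 26 x 26 x 255  -> medium objects
# stride  8: 52 x 52 x 255  -> small objects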

1.3 Bounding Box Prediction

YOLOv3 keeps the anchor-box k-means clustering and the bounding-box constraint strategy proposed in YOLOv2. Some readers may not be clear on the difference between anchor boxes and bounding boxes, so here is an explanation: anchor boxes carry no location information, only width and height, so the k-means clustering targets only width and height. Bounding boxes carry more information: center position, width and height, confidence, and the class scores.
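For reference, the bounding-box constraint carried over from YOLOv2 decodes the raw network outputs (t_x, t_y, t_w, t_h) of each cell into a box using the cell offset (c_x, c_y) and the anchor prior (p_w, p_h):

$$b_x = \sigma(t_x) + c_x,\quad b_y = \sigma(t_y) + c_y,\quad b_w = p_w\, e^{t_w},\quad b_h = p_h\, e^{t_h}$$

The sigmoid bounds each predicted center within its own grid cell, a constraint YOLOv2 introduced to stabilize training.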

Now for how anchors differ between YOLOv3 and YOLOv2. We know from the k-means clustering analysis in YOLOv2 that k = 5 anchors worked best, so five anchors were built into that network design. YOLOv3, combined with the multi-scale prediction structure described above, gives each scale’s YOLO head 3 anchors, for 9 anchors in total. Clustering on COCO yields: (10 x 13), (16 x 30), (33 x 23), (30 x 61), (62 x 45), (59 x 119), (116 x 90), (156 x 198), (373 x 326). There is also a logic to how they are allocated: the anchors (116 x 90), (156 x 198), and (373 x 326) are used on the 13 x 13 YOLO feature map suited to large objects; the anchors (10 x 13), (16 x 30), and (33 x 23) go to the 52 x 52 YOLO feature map suited to small objects; and the remaining anchors handle medium-sized objects.
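Written out as data (a sketch; the grouping follows the allocation just described):

# YOLOv3 COCO anchors (width, height), grouped by detection scale
anchors_by_grid = {
    13: [(116, 90), (156, 198), (373, 326)],  # stride 32 -> large objects
    26: [(30, 61), (62, 45), (59, 119)],      # stride 16 -> medium objects
    52: [(10, 13), (16, 30), (33, 23)],       # stride 8  -> small objects
}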

1.4 Loss Function

To converge better and improve accuracy, YOLOv3 also changes the loss function: the x, y, w, h losses use MSE (mean squared error), the confidence loss uses binary cross-entropy, and the classification loss uses multi-class cross-entropy. The overall loss function is as follows:

[Figure: YOLOv3 overall loss function]
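A schematic form of that loss, written in YOLOv1-style indicator notation (this is my reconstruction of the description above; the exact weighting in the Darknet implementation differs in details):

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i} \sum_{j} \mathbb{1}_{ij}^{obj} \left[ (x_{ij}-\hat{x}_{ij})^2 + (y_{ij}-\hat{y}_{ij})^2 + (w_{ij}-\hat{w}_{ij})^2 + (h_{ij}-\hat{h}_{ij})^2 \right] \\
&- \sum_{i} \sum_{j} \left[ \mathbb{1}_{ij}^{obj} \log \hat{C}_{ij} + \lambda_{noobj}\, \mathbb{1}_{ij}^{noobj} \log\!\left(1-\hat{C}_{ij}\right) \right] \\
&- \sum_{i} \sum_{j} \mathbb{1}_{ij}^{obj} \sum_{c \in classes} p_{ij}(c) \log \hat{p}_{ij}(c)
\end{aligned}
$$

where i indexes grid cells, j indexes anchors, and the indicator marks the anchor responsible for an object: the first line is the MSE coordinate loss, the second the binary cross-entropy confidence loss, and the third the cross-entropy classification loss.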

That covers the technical improvements of YOLOv3; now let’s put it into practice.

2. YOLOv3 Practice

Here we again use Darknet to train YOLOv3, with COCO as the dataset. Let’s get started.

2.1 COCO Data Set Configuration

git clone https://github.com/pdollar/coco
cd coco
mkdir images
cd images

# download the train/val image sets
wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip

# unzip the images into coco/images/
unzip train2014.zip
unzip val2014.zip

cd ..

# download the COCO metadata (annotations, labels, image lists)
wget https://pjreddie.com/media/files/instances_train-val2014.zip
unzip instances_train-val2014.zip

wget https://pjreddie.com/media/files/coco/labels.tgz
tar zxvf labels.tgz

wget https://pjreddie.com/media/files/coco/5k.part
wget https://pjreddie.com/media/files/coco/trainvalno5k.part

# prepend the absolute path to each image entry to build the final lists
paste <(awk "{print \"$PWD\"}" <5k.part) 5k.part | tr -d '\t' > 5k.txt

paste <(awk "{print \"$PWD\"}" <trainvalno5k.part) trainvalno5k.part | tr -d '\t' > trainvalno5k.txt

The data set is done.
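As a quick sanity check (a sketch in Python; paths assume you run it inside the coco directory), the generated lists should contain roughly 118k training images and exactly 5k validation images:

# count image entries in the generated list files
for split in ("trainvalno5k.txt", "5k.txt"):
    with open(split) as f:
        print(split, sum(1 for line in f if line.strip()))
# expect roughly 118k lines for trainvalno5k.txt and 5,000 for 5k.txt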

2.2 Training

Prepare coco.names, coco.data, yolov3.cfg, and darknet53.conv.74, then create the backup folder, giving the following directory tree:

[Figure: directory tree — cfg/yolov3/ containing backup/, coco.data, coco.names, yolov3.cfg, and darknet53.conv.74]

backup is the directory where intermediate training weights are saved, coco.names holds the category names, and coco.data is the training configuration file.

[Figure: contents of coco.data]
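For reference, a coco.data in the standard Darknet format would look like this (the train/valid paths here are placeholders; point them at the trainvalno5k.txt and 5k.txt generated above):

classes = 80
train   = /path/to/coco/trainvalno5k.txt
valid   = /path/to/coco/5k.txt
names   = cfg/yolov3/coco.names
backup  = cfg/yolov3/backup/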

Execute the training command:

./darknet detector train cfg/yolov3/coco.data cfg/yolov3/yolov3.cfg cfg/yolov3/darknet53.conv.74

darknet53.conv.74 supplies the pre-trained backbone weights. It is optional: run the same command without that last argument and training simply starts from scratch.

darknet53.conv.74 download link: pjreddie.com/media/files…

Training takes a long time, but it is gratifying to watch the loss gradually converge:

2.3 Validation

Once we’ve trained the model, we can test it.

Here we run YOLOv3 detection on a street-view video to verify the model, executing the demo command:

./darknet detector demo cfg/yolov3/coco.data cfg/yolov3/yolov3.cfg cfg/yolov3/backup/yolov3.weights data/street.mp4

The detection results are as follows:

You can see that the detection results are quite good.

OK, that was a detailed share of the YOLOv3 algorithm’s principles and practice. I hope my sharing helps you learn.

