Object detection has become an increasingly popular research direction. It is widely used in industrial product inspection, intelligent navigation, video surveillance, and other fields, helping government agencies and enterprises improve efficiency and reduce manual workloads through technology. As the saying goes, to do a good job one must first sharpen one's tools. PaddleDetection is a powerful object detection development kit that helps users put object detection technology to work.




01 
What is PaddleDetection?



PaddleDetection is an object detection development kit built on the PaddlePaddle core framework; its overall structure is shown in Figure 2. The kit covers mainstream object detection algorithms and provides rich pre-trained models, helping users quickly and easily build a variety of detection pipelines and complete object detection tasks with high quality.


Figure 2 Structure diagram of PaddleDetection


PaddleDetection's engineers are also working to make the kit more accessible to users. With the upgrade accompanying version 1.7, performance has improved even further.


In the previous release, the model library was already enriched and a complete set of detection-model compression schemes was open-sourced. Take the YOLOv3 model released by PaddlePaddle as an example: compared with the best comparable offering at the time, its training speed on the COCO dataset was more than 40% faster, and its validation-set mAP (mean Average Precision) reached 38.9%, over 1% higher. In this upgrade, the engineers, in the spirit of craftsmanship, enhanced the model further: COCO mAP now reaches 43.2%, training speed is up another 40%, and a full range of model compression schemes based on YOLOv3 has been open-sourced, taking YOLOv3 to the next level.


So what exactly does the new version of PaddleDetection bring? The details are broken down below.




02 Comprehensive improvements in model variety and performance


PaddleDetection uses a modular design that decouples common detection components, making it easy for users to combine them and extend new algorithms. This upgrade improves the performance of YOLOv3 and BlazeFace, adds new IoU loss functions, and introduces several powerful object detection models.


YOLOv3 significantly improved: accuracy up 4.3%, training speed up 40%, inference speed up 21%


In tests on the COCO dataset, the authors' original DarkNet53-based YOLOv3 model reaches a validation mAP of 33.0%, while the DarkNet53-based YOLOv3 published in a previous PaddleDetection release reached 38.9%. This release makes the following further improvements, raising validation mAP to 43.2% and inference speed by 21%. Data preprocessing was also continuously optimized, increasing overall training speed by 40%.


  • The backbone network is changed to ResNet50-vd. Compared with the previous DarkNet53 backbone, ResNet50-vd offers advantages in both speed and accuracy, and is easier to extend: users can flexibly choose ResNet18, 34, 101, and other variants as the model's backbone to fit their own business scenarios.


  • Deformable Convolution v2 (DCNv2) is introduced to replace part of the ordinary convolutions. DCNv2 has been extensively validated on several vision tasks. Balancing speed against accuracy, the 3×3 convolutions in stage 5 of the backbone are replaced with DCNv2 in the YOLOv3 model. Experiments show that with ResNet50-vd and DCNv2, model accuracy improves by 0.2% and speed by about 21%.


  • A DropBlock module is added to the FPN to improve generalization. Compared with Dropout, DropBlock drops a contiguous region of the feature map rather than independent units, which suits detection tasks better and improves the network's generalization ability.


Figure 3 Dropout vs. DropBlock


  • As a one-stage network, YOLOv3 is at a natural disadvantage in localization accuracy compared with two-stage structures such as Faster R-CNN and Cascade R-CNN. Adding an IoU loss branch improves bounding-box localization to some extent and narrows the gap between one-stage and two-stage detectors.


  • A model trained on the Objects365 dataset is used as the pre-trained model for COCO. Objects365 contains about 600,000 images, 365 categories, and up to 10 million boxes. Compared with COCO, it has 5 times the images, 4 times the categories, and more than 10 times the annotation boxes, which further improves the accuracy of the YOLOv3 pre-trained model.
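As a loose illustration of the DropBlock idea above (not PaddleDetection's implementation; the sizes and drop rate are made up), here is a NumPy sketch that zeroes contiguous regions instead of independent activations:

```python
import numpy as np

def dropblock_mask(h, w, block_size=3, drop_prob=0.1, rng=None):
    """Binary keep-mask that zeroes whole block_size x block_size regions,
    unlike Dropout, which zeroes activations independently."""
    rng = rng or np.random.default_rng(0)
    # Seed probability chosen so the expected dropped area matches drop_prob.
    valid = max((h - block_size + 1) * (w - block_size + 1), 1)
    gamma = drop_prob * (h * w) / (block_size ** 2) / valid
    seeds = rng.random((h, w)) < gamma
    mask = np.ones((h, w))
    half = block_size // 2
    for i, j in zip(*np.nonzero(seeds)):
        mask[max(i - half, 0):i + half + 1, max(j - half, 0):j + half + 1] = 0.0
    return mask

feat = np.ones((8, 8))
masked = feat * dropblock_mask(8, 8)
```

Because the dropped units come in contiguous patches, DropBlock removes whole semantic regions of the feature map rather than isolated activations, which is why it regularizes detection backbones better than plain Dropout.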


The face detection model BlazeFace: 3.3× compression and a 1.22× speedup



PaddleDetection includes two lightweight face detection algorithms, FaceBoxes and BlazeFace. FaceBoxes is a real-time face detector that runs on the CPU, while BlazeFace is a lightweight, high-performance face detection model from Google Research tailored for mobile GPU inference, giving it an advantage in embedded deployment.


In PaddleDetection, the team has re-implemented and optimized the BlazeFace model, and open-sourced a number of variants that are already being used in real business scenarios.




In this update, PaddlePaddle fully open-sources hardware-latency-aware Neural Architecture Search (NAS) for the BlazeFace model. The process is shown in Figure 4.


(1) First, the latency of each single operator is measured on the target hardware to build a latency table.
(2) During the search, the controller and the defined search space generate candidate model structures; using the latency table, each candidate's latency on the hardware can be estimated quickly.
(3) The model's total latency and accuracy are sent back to the controller as the candidate's score, which then generates the next model structure and starts a new cycle.
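The three steps above can be sketched roughly as a latency-table lookup plus a scored search loop (the operator names, latencies, and scoring rule below are invented for illustration; PaddleSlim's actual NAS interface differs):

```python
# Hypothetical per-operator latency table, measured once on the device (ms).
latency_table = {
    ("conv3x3", 32): 0.40,
    ("conv3x3", 64): 0.85,
    ("dwconv3x3", 64): 0.20,
}

def estimate_latency(arch):
    """Sum per-operator latencies instead of re-benchmarking each candidate."""
    return sum(latency_table[op] for op in arch)

def score(arch, accuracy, latency_budget_ms=2.0):
    """Reward accuracy, penalize candidates that exceed the latency budget."""
    penalty = max(0.0, estimate_latency(arch) - latency_budget_ms)
    return accuracy - penalty

candidate = [("conv3x3", 32), ("dwconv3x3", 64)]
print(score(candidate, accuracy=0.90))
```

The point of the table is step (2): once per-operator latencies are measured, every candidate in the search loop can be scored without touching the hardware again.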


The purpose of hardware-latency-aware search is to speed up the search itself and to make it easier to find networks that are genuinely faster on the target hardware. Experiments show that, with the DiscROC AP on the FDDB evaluation set almost unchanged, the searched model is only 240 KB, a 3.3× compression over the original, and its single-threaded test speed on a Qualcomm Snapdragon 855 (ARMv8) processor is 1.22× the original.



Figure 4 Hardware-latency search process for the NAS version of BlazeFace


IoU loss functions added: accuracy up another 1% with no increase in prediction time



PaddleDetection adds the IoU (Intersection over Union) family of loss functions and corresponding models. IoU is a very common metric in detection tasks and is insensitive to object scale. Earlier detection work usually computed the bounding-box loss with Smooth L1, which is scale-sensitive, so researchers proposed using IoU itself as the regression loss.
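For reference, IoU itself is straightforward to compute. A plain-Python sketch for axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # partial overlap: 1/7
```

Because IoU is a ratio of areas, scaling both boxes by the same factor leaves it unchanged, which is exactly the scale-insensitivity mentioned above.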


In this version 1.7 update, PaddleDetection adds the GIoU (Generalized IoU), DIoU (Distance IoU), and CIoU (Complete IoU) loss functions. With a Faster R-CNN ResNet50-vd-FPN model on the COCO val2017 dataset, replacing the traditional Smooth L1 loss with them raises mAP by 1.1%, 0.9%, and 1.3% respectively, at no extra prediction cost.
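As one example of the family, here is a minimal sketch of the GIoU loss for two axis-aligned boxes (illustrative only, not PaddleDetection's implementation):

```python
def giou_loss(pred, gt):
    """GIoU loss = 1 - GIoU, where GIoU = IoU - |C \\ (A ∪ B)| / |C|
    and C is the smallest box enclosing both. Boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    iou_val = inter / union if union > 0 else 0.0
    # Smallest enclosing box C.
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    giou = iou_val - (c_area - union) / c_area if c_area > 0 else iou_val
    return 1.0 - giou
```

Unlike a plain IoU loss, the enclosing-box term keeps the loss informative even when the predicted and ground-truth boxes do not overlap at all, which is what makes the GIoU family useful for regression.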


New open-source CBNet model with the highest accuracy on COCO: 53.3%



A new CBNet model is added, which cascades existing backbone structures to form a new backbone network. Taking ResNet as an example: with two cascaded backbones it is called Dual-ResNet, and with three, Triple-ResNet. The new basic CBNet model uses the AHLC (Adjacent Higher-Level Composition) connection mode, which performs best among the connection modes in the paper. In addition to the basic model, a single-scale detection model, CascadeRCNN-CBR200vd-FPN-DCNv2-Nonlocal, was also released, reaching 53.3% accuracy on COCO val2017.


Libra R-CNN model added, improving accuracy by 2%


The Libra R-CNN model has been added. Training a detection model typically involves candidate-region generation and selection, feature extraction, classification, and bounding-box regression. In two-stage object detection there are imbalances at the sample level, feature level, and objective level, all of which limit model performance; Libra R-CNN optimizes all three. Compared with Faster R-CNN ResNet50vd-FPN on the COCO two-stage detection task, Libra R-CNN's accuracy is more than 2% higher, a very noticeable gain.


Added the best single model from the Open Images V5 object detection competition



This upgrade open-sources the best single model from the Open Images V5 object detection competition, developed by Baidu using a combination of current leading detection methods; its structure is shown in Figure 5. ResNet200-vd serves as the detection model's backbone, combined with Cascade Cls-Aware R-CNN, Feature Pyramid Networks, Non-local blocks, and Deformable Convolution v2.



Figure 5 Structure of the best Open Images V5 single model


For training, the model uses Libra Loss as the bounding-box regression loss. To counter dataset imbalance, a dynamic sampling strategy selects samples during training. In addition, since about 189 categories overlap between the two label sets, the Objects365 dataset was used to augment the Open Images V5 training data. Soft-NMS was used for post-processing at test time. The final Public/Private scores were 0.62690/0.59459 for single-scale testing and 0.6492/0.6182 for multi-scale testing.
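The Soft-NMS step mentioned above can be sketched in a few lines (Gaussian variant; the sigma value is illustrative, not the model's actual setting):

```python
import math

def soft_nms_decay(score, overlap, sigma=0.5):
    """Gaussian Soft-NMS: rather than discarding a box that overlaps a
    higher-scoring detection, decay its score in proportion to the IoU."""
    return score * math.exp(-(overlap ** 2) / sigma)

# The more a box overlaps an already-kept detection, the more its score drops.
decayed = [soft_nms_decay(0.9, o) for o in (0.0, 0.4, 0.8)]
```

Compared with hard NMS, which deletes every overlapping box above a threshold, this decay keeps heavily-overlapping boxes in play with reduced scores, which helps in crowded scenes.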

PaddleSlim powers PaddleDetection, and the chemistry is unstoppable!



Object detection models remain challenging to deploy in practice because of their runtime and memory consumption. Model compression is often an effective answer to both speed and memory footprint. PaddleSlim, the PaddlePaddle model compression tool, provides a variety of highly effective compression methods that push PaddleDetection's performance to new heights.


A distillation-based compression scheme improves validation accuracy by over 2%


Model distillation extracts useful information from a complex network and transfers it to a smaller one, saving computing resources. Distillation experiments show that the same distillation method may not suit every dataset: because Pascal VOC and COCO differ in task difficulty, PaddleDetection adopts different distillation schemes for YOLOv3 on each. The MobileNet-YOLOv3 model's validation mAP improved by 2.8% on Pascal VOC and by 2.1% on COCO.
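One common formulation of distillation, sketched with NumPy (soft-label cross-entropy at a temperature; PaddleDetection's actual per-dataset distillation losses differ):

```python
import numpy as np

def softmax(z, t=1.0):
    """Softmax with temperature t; higher t softens the distribution."""
    z = np.asarray(z, dtype=float) / t
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened distribution and the
    student's: the small network is pushed toward the big network's outputs."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

# A student that roughly agrees with the teacher incurs a smaller loss.
loss_far = distill_loss([0.1, 0.2, 5.0], [5.0, 0.1, 0.2])
loss_near = distill_loss([4.8, 0.1, 0.3], [5.0, 0.1, 0.2])
```

The temperature is the knob that carries the "useful information": softened teacher outputs expose the relative similarity between classes, which hard labels discard.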




Dramatically reduce FLOPs with a pruning-based compression scheme


The pruning strategy analyzes the sensitivity of each convolutional layer to derive an appropriate pruning ratio per layer, then prunes channels to reduce the number of convolution kernels, shrinking both model size and computational cost. Experiments show that for ResNet50vd-DCN-YOLOv3 on the COCO dataset, FLOPs drop by 8.4% while mAP rises by 0.7%. For MobileNet-YOLOv3, FLOPs drop by 28.54% on COCO with mAP up 0.9%, and by 69.57% on Pascal VOC with mAP up 1.4%.
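Why channel pruning cuts FLOPs is simple arithmetic: pruning a layer's output channels also shrinks the next layer's input channels. The layer sizes below are hypothetical, purely to show the bookkeeping:

```python
def conv_flops(h, w, c_in, c_out, k=3):
    """Multiply-accumulate count of a k x k convolution over an h x w map."""
    return h * w * c_in * c_out * k * k

# Hypothetical pair of layers: pruning ~30% of the first layer's output
# channels (128 -> 90) also shrinks the second layer's input channels.
base = conv_flops(56, 56, 128, 128) + conv_flops(56, 56, 128, 256)
pruned = conv_flops(56, 56, 128, 90) + conv_flops(56, 56, 90, 256)
reduction = 1 - pruned / base
print(f"FLOPs reduced by {reduction:.1%}")  # → FLOPs reduced by 29.7%
```

Because every pruned channel is removed from two layers' cost at once, the FLOPs saving compounds across the network, which is how per-layer ratios of a few tens of percent add up to the large reductions quoted above.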




Distillation + pruning: up to 2.3× faster in COCO-based tests


Pruning and distillation can be combined to good effect. Testing with 608×608 input images, the following table shows some latency measurements: when pruning reduces FLOPs by more than 50%, latency on the phone drops by 57%, i.e. a 2.3× speedup. The pruned model also brings some benefit on GPU.
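The 57%-to-2.3× conversion is just the reciprocal of the remaining time:

```python
def speedup(time_reduction):
    """A latency cut of r converts to a throughput multiplier of 1 / (1 - r)."""
    return 1.0 / (1.0 - time_reduction)

print(round(speedup(0.57), 1))  # 57% less time ≈ 2.3x faster
```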




03 
A seamless deployment process




PaddleDetection provides users with an end-to-end process from training to deployment, including a cross-platform C++ inference deployment solution for image detection models. After training, users get ready-made C++ prediction code that can run inference directly; with a little configuration and a small amount of code, the model can be integrated into their own services to complete image detection tasks. The deployment solution has four characteristics:


1. Cross-platform. Supports compilation, development, and deployment on Windows and Linux.
2. Extensible. Users can develop their own data preprocessing logic for new models.
3. High-performance. Beyond the performance advantages of PaddlePaddle itself, key steps are optimized for the characteristics of image detection.
4. Support for common detection models, such as YOLOv3, SSD, and Faster R-CNN; users can load a model and complete common detection tasks with only a little configuration.




Join the official QQ group to meet a large number of like-minded deep learning practitioners. Official QQ group: 703252161.



If you want to learn more about PaddlePaddle, see the following resources.


Official website address:
https://www.paddlepaddle.org.cn


PaddleDetection project address:
https://github.com/PaddlePaddle/PaddleDetection


PaddlePaddle core framework project addresses:


GitHub:
https://github.com/PaddlePaddle/Paddle


Gitee:
https://gitee.com/paddlepaddle/Paddle