Follow Xiao Song's public account "Minimalist AI", which shares deep learning theory and application development techniques. The author regularly posts practical deep learning content. If you have questions while learning or applying deep learning, you can reach me through this account.

1. Overview

Update: Added a speed test on the Nvidia TX2, which averages 42ms, 7 times slower than the RTX 2080Ti (42ms/6ms).

Added a speed test on the Nvidia Nano, which averages about 120ms, 20 times slower than the RTX 2080Ti (120ms/6ms).

Added a speed test on Nvidia's NGX platform, which averages about 15ms, 2.5 times slower than the RTX 2080Ti (15ms/6ms).
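The slowdown figures in these update notes are simple ratios against the 6ms RTX 2080Ti TensorRT result. A minimal sketch of that arithmetic follows; the variable and function names are illustrative and do not come from the original test code.

```python
# Average latencies in milliseconds, taken from the update notes above.
BASELINE_MS = 6.0  # RTX 2080Ti with TensorRT
latency_ms = {"TX2": 42.0, "Nano": 120.0, "NGX": 15.0}


def slowdown(device_ms, baseline_ms=BASELINE_MS):
    """Return how many times slower a device is than the baseline."""
    return device_ms / baseline_ms


slowdowns = {name: slowdown(ms) for name, ms in latency_ms.items()}

for name, ratio in slowdowns.items():
    print(f"{name}: {ratio:.1f}x slower than the RTX 2080Ti")
```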

This experiment explores how much TensorRT accelerates YoloV5 model inference on the RTX 2080Ti, and also compares the RTX 2080Ti GPU against the i7-8700 CPU.

As usual, the experimental environment comes first:

  • System: Ubuntu 18.04.3 LTS
  • CPU: Intel® Core™ i7-8700 CPU @ 3.20GHz x 12
  • GPU: GeForce RTX 2080Ti
  • CUDA: 10.1
  • PyTorch: 1.5.0
  • TensorRT: 7.1.0

2. Experiment

The experimental reference code is as follows:

PyTorch model training and inference code:

Github.com/ultralytics…

TensorRT7 model conversion and inference code:

Github.com/wang-xinyu/…

1. i7-8700 CPU & PyTorch inference experiment

Input size: 576×960

CUDA_VISIBLE_DEVICES=-1 python detect.py --weights  runs/hm960_945/weights/best.pt  --img 960 --conf 0.15 --source  data/hels/testimgs/

Average inference time: 420ms; GPU usage: 0MB

image 1/8 /home/song/code/yolov5/data/hels/testimgs/lADPD26eMep_iEjNBDjNB4A_1920_1080.jpg: 576x960 1 hs, Done. (0.511s)
image 2/8 /home/song/code/yolov5/data/hels/testimgs/lADPD26eMep_iEnNBDjNB4A_1920_1080.jpg: 576x960 1 ns, 1 hs, Done. (0.481s)
image 3/8 /home/song/code/yolov5/data/hels/testimgs/lADPD2eDNKD3Ng7NBDjNB4A_1920_1080.jpg: 576x960 2 ns, 2 hs, Done. (0.416s)
image 4/8 /home/song/code/yolov5/data/hels/testimgs/lADPD2sQs0W5CEfNBDjNB4A_1920_1080.jpg: 576x960 2 ns, Done. (0.422s)
image 5/8 /home/song/code/yolov5/data/hels/testimgs/lADPD3lGrdjbTqnNBDjNB4A_1920_1080.jpg: 576x960 1 hs, Done. (0.441s)
image 6/8 /home/song/code/yolov5/data/hels/testimgs/lADPD3zULH2hzqjNBDjNB4A_1920_1080.jpg: 576x960 1 ns, 2 hs, Done. (0.963s)
image 7/8 /home/song/code/yolov5/data/hels/testimgs/lADPD4PvKccrNg_NBDjNB4A_1920_1080.jpg: 576x960 1 ns, Done. (0.448s)
image 8/8 /home/song/code/yolov5/data/hels/testimgs/lADPD4d8qGv1TqvNBDjNB4A_1920_1080.jpg: 576x960 2 ns, 3 hs, Done. (0.417s)
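Per-image times can be pulled out of this console output and summarized with a short script. The sketch below assumes each line ends in the "Done. (…s)" form shown in the log; the regular expression and variable names are illustrative and are not part of detect.py. One slow image (0.963s) raises the mean noticeably, so the median is printed as well.

```python
import re
import statistics

# The eight CPU log lines from the run above.
LOG = """\
image 1/8 /home/song/code/yolov5/data/hels/testimgs/lADPD26eMep_iEjNBDjNB4A_1920_1080.jpg: 576x960 1 hs, Done. (0.511s)
image 2/8 /home/song/code/yolov5/data/hels/testimgs/lADPD26eMep_iEnNBDjNB4A_1920_1080.jpg: 576x960 1 ns, 1 hs, Done. (0.481s)
image 3/8 /home/song/code/yolov5/data/hels/testimgs/lADPD2eDNKD3Ng7NBDjNB4A_1920_1080.jpg: 576x960 2 ns, 2 hs, Done. (0.416s)
image 4/8 /home/song/code/yolov5/data/hels/testimgs/lADPD2sQs0W5CEfNBDjNB4A_1920_1080.jpg: 576x960 2 ns, Done. (0.422s)
image 5/8 /home/song/code/yolov5/data/hels/testimgs/lADPD3lGrdjbTqnNBDjNB4A_1920_1080.jpg: 576x960 1 hs, Done. (0.441s)
image 6/8 /home/song/code/yolov5/data/hels/testimgs/lADPD3zULH2hzqjNBDjNB4A_1920_1080.jpg: 576x960 1 ns, 2 hs, Done. (0.963s)
image 7/8 /home/song/code/yolov5/data/hels/testimgs/lADPD4PvKccrNg_NBDjNB4A_1920_1080.jpg: 576x960 1 ns, Done. (0.448s)
image 8/8 /home/song/code/yolov5/data/hels/testimgs/lADPD4d8qGv1TqvNBDjNB4A_1920_1080.jpg: 576x960 2 ns, 3 hs, Done. (0.417s)
"""

# Match the trailing "Done. (0.511s)" part; tolerate a space before the "s".
TIME_RE = re.compile(r"Done\. \((\d+(?:\.\d+)?)\s*s\)")

times_s = [float(m.group(1)) for m in TIME_RE.finditer(LOG)]
mean_s = statistics.mean(times_s)
median_s = statistics.median(times_s)

print(f"images: {len(times_s)}")
print(f"mean:   {mean_s * 1000:.1f} ms")
print(f"median: {median_s * 1000:.1f} ms")
```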

2. RTX 2080Ti GPU & PyTorch inference experiment

Input size: 576×960

CUDA_VISIBLE_DEVICES=0 python detect.py --weights  runs/hm960_945/weights/best.pt  --img 960 --conf 0.15 --source  data/hels/testimgs/

Average inference time: 12ms; GPU usage: 1000MB

image 1/8 /home/song/code/yolov5/data/hels/testimgs/lADPD26eMep_iEjNBDjNB4A_1920_1080.jpg: 576x960 1 hs, Done. (0.012s)
image 2/8 /home/song/code/yolov5/data/hels/testimgs/lADPD26eMep_iEnNBDjNB4A_1920_1080.jpg: 576x960 1 ns, 1 hs, Done. (0.014s)
image 3/8 /home/song/code/yolov5/data/hels/testimgs/lADPD2eDNKD3Ng7NBDjNB4A_1920_1080.jpg: 576x960 2 ns, 2 hs, Done. (0.015s)
image 4/8 /home/song/code/yolov5/data/hels/testimgs/lADPD2sQs0W5CEfNBDjNB4A_1920_1080.jpg: 576x960 2 ns, Done. (0.011s)
image 5/8 /home/song/code/yolov5/data/hels/testimgs/lADPD3lGrdjbTqnNBDjNB4A_1920_1080.jpg: 576x960 1 hs, Done. (0.011s)
image 6/8 /home/song/code/yolov5/data/hels/testimgs/lADPD3zULH2hzqjNBDjNB4A_1920_1080.jpg: 576x960 1 ns, 2 hs, Done. (0.012s)
image 7/8 /home/song/code/yolov5/data/hels/testimgs/lADPD4PvKccrNg_NBDjNB4A_1920_1080.jpg: 576x960 1 ns, Done. (0.014s)
image 8/8 /home/song/code/yolov5/data/hels/testimgs/lADPD4d8qGv1TqvNBDjNB4A_1920_1080.jpg: 576x960 2 ns, 3 hs, Done. (0.011s)

3. RTX 2080Ti GPU & TensorRT7 inference experiment

Input size: 576×960

./yolov5m -d ../testimgs/
503ms
6ms
6ms
5ms
6ms
6ms
6ms
5ms

Average inference time: 6ms (excluding the 503ms first run); GPU usage: 700MB
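The 6ms figure only matches the timings if the first 503ms run is left out. A minimal sketch of that calculation follows, assuming the first run is a one-time start-up cost; the original post does not say this, and the names below are illustrative.

```python
import statistics

# Per-image TensorRT timings in milliseconds, as printed above.
timings_ms = [503, 6, 6, 5, 6, 6, 6, 5]

# Assumption: the first run includes one-time initialization and is
# therefore not representative of steady-state inference.
WARMUP_RUNS = 1

steady_ms = timings_ms[WARMUP_RUNS:]
mean_all_ms = statistics.mean(timings_ms)
mean_steady_ms = statistics.mean(steady_ms)

print(f"mean including first run: {mean_all_ms:.1f} ms")
print(f"mean excluding first run: {mean_steady_ms:.1f} ms")
```

With the first run included, the mean is dominated by that single 503ms value, so the steady-state average is the fairer number to compare against the PyTorch result.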

3. Summary

Through this comparative experiment, we can find that:

1. Compared with the i7-8700 CPU, the RTX 2080Ti GPU is significantly faster (420ms -> 12ms), a 35x speedup.

2. TensorRT7 also gives a good speedup over PyTorch in the same environment (12ms -> 6ms): the speed doubles while GPU memory usage drops by 30% (1000MB -> 700MB). Inference accuracy stays essentially unchanged (about 1% fluctuation, within the acceptable range).

3. The TX2 and Nano are several times faster than the i7-8700 CPU (42ms and 120ms versus 420ms), but still much slower than the 2080Ti. The NGX platform is only about 2.5 times slower than the 2080Ti, so it is an option when real-time requirements are high.
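The summary figures can be rechecked with a few lines of arithmetic, and latencies can be converted to frames per second to judge real-time suitability. In the sketch below, the 30 FPS target is an assumption chosen for illustration and does not come from the original experiment; the helper names are also illustrative.

```python
def speedup(before_ms, after_ms):
    """How many times faster the 'after' latency is."""
    return before_ms / after_ms


def percent_reduction(before, after):
    """Relative reduction in percent."""
    return 100.0 * (before - after) / before


def fps(latency_ms):
    """Frames per second for a given per-image latency."""
    return 1000.0 / latency_ms


cpu_to_gpu = speedup(420, 12)           # i7-8700 -> RTX 2080Ti (PyTorch)
pytorch_to_trt = speedup(12, 6)         # PyTorch -> TensorRT7 on the 2080Ti
mem_saved = percent_reduction(1000, 700)  # GPU memory, MB

# Assumption: 30 FPS is used here as an example real-time target.
TARGET_FPS = 30.0
platform_ms = {"RTX 2080Ti": 6, "NGX": 15, "TX2": 42, "Nano": 120}
meets_target = {name: fps(ms) >= TARGET_FPS for name, ms in platform_ms.items()}

print(f"CPU -> GPU speedup: {cpu_to_gpu:.0f}x")
print(f"PyTorch -> TensorRT speedup: {pytorch_to_trt:.0f}x")
print(f"GPU memory saved: {mem_saved:.0f}%")
for name, ms in platform_ms.items():
    print(f"{name}: {fps(ms):.1f} FPS, meets {TARGET_FPS:.0f} FPS: {meets_target[name]}")
```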

4. References

Comparative experiment of classification network:

Blog.csdn.net/herr_kun/ar…

Open-source code used in the experiment:

YoloV5: github.com/ultralytics…

TensorRTX: github.com/wang-xinyu/…