Seven days and seven nights: we finally deployed a real-time instance segmentation algorithm with TensorRT, at 40 FPS!
We have formed a WeChat group with front-line AI algorithm engineers from Tencent, Alibaba, and other companies. If you would like to join the discussion, add WeChat: JintianandMerry and note "communication group" to be pulled in.
For the first article of 2021, my team and I spent seven days and seven nights on it: we finally deployed PanopticFCN to TensorRT. I explained this algorithm in a previous article; it is a very recent paper, reporting an mAP as high as 32, on par with MaskRCNN. It is a panoptic segmentation algorithm, so what it produces is a full panoptic segmentation output.
As we all know, in the field of instance segmentation there are very few algorithms that can run in real time, and even fewer that can run beyond real time. The familiar MaskRCNN in detectron2, even at a 512 input size, manages only 8-9 FPS, not even 10 FPS. Some real-time algorithms such as YOLACT are very fast, but their accuracy is underwhelming; in practice they are almost unusable. Now we have an algorithm that is as accurate as MaskRCNN but runs at 40 FPS! Of course, this speed comes from TensorRT on top of the algorithm itself, and since PanopticFCN is not a two-stage algorithm, it holds a big speed advantage over MaskRCNN.
Let’s take a look at the result:
We measured about 19 ms for the network forward pass, and 24 ms including visualization post-processing, for an overall 40 FPS! As you can see, this is very fast, and even small-object segmentation is not noticeably affected.
The Python code for this model is available for reference here:
Github.com/yanwei-li/P…
Of course, converting this model to TensorRT was very troublesome and full of challenges (otherwise it would not have cost our team seven days and seven nights). First of all, we hit many problems converting to ONNX. For example, the model mixes the instance (thing) and background (stuff) predictions together into a single segmentation output. We initially decided to carve out just the instance part for deployment, and separating the two required modifying a lot of model code. In addition, the model has outputs from 5 FPN levels, similar to FCOS and other models; merging the operations across these 5 levels and squeezing all the outputs into ONNX takes some skill. Then there is the post-processing, which involves a lot of TopK operations. Anyone familiar with model deployment knows that TopK is a headache: when converting ONNX to TensorRT, TopK is either unsupported or only partially supported. Finally, the biggest difficulty is the same one as in SOLOv2's TensorRT deployment: the masks are generated from predicted convolution weights, which means part of the network's output is itself a set of convolution weights.
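As an illustration of the FPN-merging trick, here is a minimal sketch (our own simplification with made-up names, not the released code) of flattening the per-level predictions so that a single TopK covers all 5 levels, which traces much more cleanly into ONNX than five separate ones:

```python
import torch

def select_topk_across_fpn(cls_scores, k=100):
    # cls_scores: list of 5 tensors (N, C, Hi, Wi), one per FPN level
    n, c = cls_scores[0].shape[:2]
    # flatten every level to (N, C, Hi*Wi) and concatenate along locations
    flat = torch.cat([s.reshape(n, c, -1) for s in cls_scores], dim=2)
    loc = flat.shape[2]
    merged = flat.reshape(n, c * loc)        # (N, C*L)
    scores, idx = merged.topk(k, dim=1)      # a single TopK over all levels
    classes = idx // loc                     # recover the class index
    positions = idx % loc                    # recover the flattened location
    return scores, classes, positions
```

Recovering the class and location indices via integer division and modulo keeps the whole selection stage expressible with ops that ONNX supports.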
In the end we cracked all of these difficulties. Exporting to ONNX was only about 30% of the total work; the rest was debugging CUDA code…
In this project, we mainly accomplished the following:
- ONNX export of PanopticFCN, where we tried to cram as many operations as possible into the ONNX graph;
- Solved support for the instance segmentation ops in TensorRT;
- Worked around GatherElements, which is not supported in TensorRT;
- Encapsulated all post-processing operations as a single TensorRT plugin, so the final output is the masks directly;
- Devised a feasible scheme for handling outputs that are themselves convolution weights (see the sketch after this list).
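To make the last point concrete, here is a minimal sketch (our own illustration, not the exact model code) of the "output as convolution weights" mechanism: the kernel generator predicts one weight vector per instance, which is applied to the shared encoded feature map as a 1x1 convolution:

```python
import torch
import torch.nn.functional as F

def dynamic_conv_masks(kernels, feat):
    # kernels: (K, C) -- one predicted weight vector per instance
    # feat:    (1, C, H, W) -- shared feature map from the feature encoder
    w = kernels.view(kernels.shape[0], kernels.shape[1], 1, 1)  # (K, C, 1, 1)
    masks = F.conv2d(feat, w)    # each instance gets its own 1x1 convolution
    return masks.sigmoid()       # (1, K, H, W) soft instance masks
```

The catch for deployment is that TensorRT's convolution layer expects its weights to be constants fixed at engine build time, not a runtime tensor, which is exactly why this output needed a dedicated scheme.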
In the end we completed a full end-to-end solution. To our knowledge, this is the only instance segmentation TensorRT deployment in the industry that runs this fast. Incidentally, the same algorithm and approach can also be deployed to other platforms through ONNX, including FPGAs, CPUs, and other inference libraries. Business cooperation inquiries are welcome.
The algorithm
The approximate processing flow of this algorithm is as follows:
You can also check out the paper for the details. The main difficulties in handling it are:
- How to handle the KernelGenerator;
- How to split Thing and Stuff. Some people asked why we split them: processing them together is very complicated, and we could not guarantee correct results. Now that the instance part is verified, we will try to add the panoptic output;
- How to generate masks from the feature encoder's output features and the predicted convolution weights (a TRT-friendly rewrite is sketched below).
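One workable trick (our assumption of a feasible scheme, not necessarily the exact one used here) is to observe that a 1x1 convolution with dynamic weights is just a matrix multiply, and MatMul accepts two runtime tensors in both ONNX and TensorRT:

```python
import torch

def masks_as_matmul(kernels, feat):
    # kernels: (K, C), feat: (1, C, H, W) -- same inputs as a dynamic 1x1 conv
    n, c, h, w = feat.shape
    flat = feat.reshape(c, h * w)      # (C, H*W), assuming batch size 1
    masks = kernels @ flat             # (K, H*W) -- a plain MatMul, TRT-friendly
    return masks.reshape(1, -1, h, w).sigmoid()
```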
The solution is too complex to cover in detail here, but you can get the Python code from our MANA platform to see the changes:
manaai.cn
TensorRT deployment
In the end, our deployment runs at up to 40 FPS on a GTX 1080 Ti, and as low as 10 ms per frame on an RTX 2080 Ti with FP16. With further model optimization it could be even faster. The version we currently deploy uses ResNet-50.
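For reference, here is a generic sketch of turning the exported ONNX model into an FP16 engine with the TensorRT Python API (TensorRT 7.x style; the file names are placeholders, and the real pipeline additionally needs our custom plugins registered):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# parse the exported ONNX graph (placeholder file name)
with open("panoptic_fcn.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30          # 1 GB build workspace
config.set_flag(trt.BuilderFlag.FP16)        # the FP16 mode mentioned above

engine = builder.build_engine(network, config)
with open("panoptic_fcn.engine", "wb") as f:
    f.write(engine.serialize())
```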
While deploying with TensorRT in C++, we found that the post-processing can be made very fast as long as the cut point between the ONNX/TensorRT engine and the host-side post-processing is chosen sensibly. Thanks to our export strategy, most of the complex operations are folded into a single fixed engine, which reduces post-processing time significantly. Compared with deploying a two-stage instance segmentation algorithm such as MaskRCNN, this approach gives a faster end-to-end time and is not limited by the number of instances.
The next step
You can get our modified Python code (with ONNX export and inference support) on the MANA platform, along with the corresponding TensorRT acceleration scheme. But please, no freeloading; hair transplants are getting more and more expensive these days, ah…
More
If you want to learn artificial intelligence and are interested in cutting-edge AI technology, you can join our Knowledge Planet to get first-hand news, the latest academic trends, industry news, and more! Your support will encourage us to create more often, and we will help you go deeper on your deep learning journey!
Previous articles
zhuanlan.zhihu.com/p/165009477
zhuanlan.zhihu.com/p/149398749
zhuanlan.zhihu.com/p/147622974
zhuanlan.zhihu.com/p/144727162