Deploy face detection and keypoints at 250fps with ONNX+TensorRT

This article was originally written by Jin Tian and first published at jinfagang.github.io. Re-posting is welcome, but please keep this copyright notice. Questions can be sent via WeChat: jintianiloveu

I tried to condense the core of this article into its title, in as few words as possible. Some time ago I covered deployment on the Jetson Nano, where we talked about deploying YOLOv3 on the Nano with Caffe. The question of how to deploy an ONNX model really splits into two parts: first, why use ONNX at all, and second, how to deploy an ONNX model. This article fills that gap.

TL;DR, the core ideas of this article are:

  • How to deploy an ONNX model as quickly as possible
  • What is currently the fastest model for detecting faces and their keypoints
  • How to use ONNX + TensorRT to make your model up to 7 times faster
  • An introduction to our new-generation face detection + recognition engine, which is expected to run above 200 fps on GPU and will of course be open source
  • How to deploy an ONNX model on TensorRT using C++

Above is a 250fps face detection model, thanks to TensorRT’s acceleration. The input size is 1280×960.

Why ONNX

Training models in PyTorch is popular nowadays, but once a PyTorch model is saved as a .pth file it is hard to get real acceleration from C++ inference, because the forward pass, which is where most of the time goes, is not meaningfully accelerated. And running PyTorch models from libtorch in C++ is not an easy task unless your model can be traced.

In this case, it makes more sense to introduce ONNX, which brings the following benefits in the current DL ecosystem:

  • Its op-based model format is more fine-grained than older layer-based formats;
  • It has a uniform definition and can be consumed for inference by any framework;
  • It enables conversion between different frameworks.

Some time ago we released a PyTorch RetinaFace project, and with a few modifications we managed to export it to an ONNX model; code without overly complex model structures can usually be exported to ONNX with little or no modification. This part of the code can be found on our platform:

manaai.cn

What we are going to do today is run inference on the ONNX model above with TensorRT. Let's start with a simple speed comparison:

Framework     Language   Time (s)        FPS
PyTorch       Python     0.012 + 0.022   29
ONNXRuntime   Python     0.008 + 0.022   34
TensorRT      C++        0.004 + 0.001   250

As you can see, accelerating the ONNX model with TensorRT makes a night-and-day difference in speed. Moreover, pure C++ inference with TensorRT adds further speedup at the language level. The TensorRT-accelerated RetinaFace we implemented should be the fastest GPU-oriented face detection scheme available right now, and it produces both bboxes and landmarks. Compared to MTCNN, the model is simpler, inference is faster, and accuracy is higher.

For real-world deployment, if your target is a GPU there is no doubt that ONNX + TensorRT is the most mature and optimal solution. If your target is an embedded chip, then going through ONNX to MNN makes fast inference on embedded CPUs easy as well.

If ONNX and TensorRT are so good, why isn't everyone using them? Why are people still writing ugly inference code in Python? The reasons are simple:

  • The barrier to entry is high; most people are not comfortable working in C++;
  • You need to understand each of your model's inputs and outputs, and be at least somewhat familiar with the TensorRT API.

This tutorial will show you, step by step, how to implement TensorRT inference for the fastest results. Let's take a look at the actual TensorRT acceleration:

Not much to see from the pictures, but watch the video:

The effect is still very good.

A brief introduction to the RetinaFace model

RetinaFace is a model from InsightFace (DeepInsight), but the original implementation is MXNet-only. The network is small and accurate, and it has a landmark branch, which lets the model output facial landmarks as well.

The network is called Retina because it adopts the structure and ideas of FPN, which makes the model more robust at small scales.

To explore the exported ONNX model, you can install onnxexplorer:

  sudo pip3 install onnxexplorer

We made some changes so that the PyTorch model can be exported to ONNX, and we did some extra work so that the ONNX model can be converted to a TensorRT engine via onnx2trt.

TensorRT inference in C++

The following is the core of this article. The onnx2trt tool mentioned above can be built from the https://github.com/onnx/onnx-tensorrt repository; running it converts an ONNX model into a TRT engine.

Here are a few things to note if you are new to this:

  • Not all ONNX models can be converted to a TRT engine; the conversion succeeds only if every op in your ONNX model is supported (the sketch after this list shows how to find the offending op);
  • You will need TensorRT 6.0 installed on your machine, because only TensorRT 6.0 supports dynamic input.
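When a conversion fails, the ONNX parser can usually tell you which node is the problem. Below is a minimal sketch (not code from our repo) that parses an ONNX file and prints the parser errors; it assumes TensorRT 6-era APIs, and the model path and logger class names are illustrative:

  #include <iostream>
  #include <NvInfer.h>
  #include <NvOnnxParser.h>

  // Minimal logger; in our engine code this role is played by gLogger.
  class SimpleLogger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
      if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
  };

  int main() {
    SimpleLogger logger;
    auto builder = nvinfer1::createInferBuilder(logger);
    auto network = builder->createNetwork();
    auto parser  = nvonnxparser::createParser(*network, logger);

    // "retinaface.onnx" is an illustrative path.
    if (!parser->parseFromFile("retinaface.onnx",
                               static_cast<int>(nvinfer1::ILogger::Severity::kWARNING))) {
      // Each parser error typically names the unsupported or malformed node.
      for (int i = 0; i < parser->getNbErrors(); ++i) {
        std::cerr << "ONNX parse error: " << parser->getError(i)->desc() << std::endl;
      }
    }

    parser->destroy();
    network->destroy();
    builder->destroy();
    return 0;
  }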

Without further ado, how do we run inference once we have the TRT engine? In general, there are three steps:

  1. First load your engine to get an ICudaEngine, the core object of TensorRT inference;
  2. Locate your model's inputs and outputs: how many of each there are and how large they are;
  3. Run the forward pass, grab the outputs, and post-process them.

At its core this really comes down to two things: how to load the engine and get an ICudaEngine, and the rather tedious post-processing. A minimal sketch of the loading step is below, followed by the engine-building code from our own code base.
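Here is a sketch of step 1, loading an already-serialized engine file from disk (for example one produced by onnx2trt). The function name and file-reading details are ours, and it assumes TensorRT 6-era APIs:

  #include <fstream>
  #include <string>
  #include <vector>
  #include <NvInfer.h>

  // Load a serialized TensorRT engine (e.g. produced by onnx2trt, or by
  // serializing the engine built in the code below) and prepare a context.
  void runFromSerializedEngine(const std::string& enginePath, nvinfer1::ILogger& logger) {
    // Read the whole .trt file into memory.
    std::ifstream file(enginePath, std::ios::binary | std::ios::ate);
    std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<char> blob(size);
    file.read(blob.data(), size);

    // Deserialize into an ICudaEngine; the last argument is a plugin factory
    // (nullptr when no custom plugins are used).
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    // ... locate the bindings and run inference here, as shown further down ...

    // Clean up in reverse order of creation.
    context->destroy();
    engine->destroy();
    runtime->destroy();
  }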

  IBuilder* builder = createInferBuilder(gLogger);
  assert(builder != nullptr);
  nvinfer1::INetworkDefinition* network = builder->createNetwork();
  auto parser = nvonnxparser::createParser(*network, gLogger);

  // Parse the ONNX model into the TensorRT network definition.
  if (!parser->parseFromFile(modelFile.c_str(), static_cast<int>(gLogger.reportableSeverity)))
  {
    cerr << "Failure while parsing ONNX file" << std::endl;
  }

  IHostMemory* trtModelStream{nullptr};
  // Build the engine
  builder->setMaxBatchSize(maxBatchSize);
  builder->setMaxWorkspaceSize(1 << 30);

  if (mTrtRunMode == RUN_MODE::INT8) {
    std::cout << "setInt8Mode" << std::endl;
    if (!builder->platformHasFastInt8())
      std::cout << "Notice: the platform does not have fast int8" << std::endl;
    // builder->setInt8Mode(true);
    // builder->setInt8Calibrator(calibrator);
    cerr << "int8 mode not supported for now.\n";
  } else if (mTrtRunMode == RUN_MODE::FLOAT16) {
    std::cout << "setFp16Mode" << std::endl;
    if (!builder->platformHasFastFp16())
      std::cout << "Notice: the platform does not have fast fp16" << std::endl;
    builder->setFp16Mode(true);
  }

  ICudaEngine* engine = builder->buildCudaEngine(*network);
  assert(engine);
  // we can destroy the parser
  parser->destroy();
  // serialize the engine, then close everything down
  trtModelStream = engine->serialize();
  trtModelStream->destroy();
  InitEngine();

This is part of the onnx_trt_engine code that we maintain. Its purpose is to parse the ONNX model and build it into an ICudaEngine. If you need the full code, you can download it from our MANA platform:

manaai.cn

As you can see, if you want to push the model even faster, this is where it happens. Once you have your ICudaEngine, all that remains is to fetch the outputs by their binding names. The whole process is fairly mechanical; the only slightly fiddly part is that you need to allocate buffers of the right size dynamically, roughly as sketched below.
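The buffers and bufferSize arrays used in the snippet below are not shown in this excerpt. Here is a minimal sketch of how they can be prepared from the engine bindings (the function name is ours; it assumes float32 bindings and batch size 1):

  #include <vector>
  #include <NvInfer.h>
  #include <cuda_runtime_api.h>

  // One device allocation per engine binding (the input plus each output),
  // sized from the binding dimensions.
  std::vector<void*> allocateBindings(const nvinfer1::ICudaEngine& engine,
                                      std::vector<size_t>& bufferSize) {
    int nb = engine.getNbBindings();          // e.g. 1 input + 3 outputs here
    std::vector<void*> buffers(nb, nullptr);
    bufferSize.resize(nb);
    for (int i = 0; i < nb; ++i) {
      nvinfer1::Dims dims = engine.getBindingDimensions(i);
      size_t count = 1;
      for (int d = 0; d < dims.nbDims; ++d) count *= dims.d[d];
      bufferSize[i] = count * sizeof(float);  // assumes float32 bindings
      cudaMalloc(&buffers[i], bufferSize[i]); // device memory for this binding
    }
    return buffers;
  }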

  // Host-side output buffers, sized from the engine bindings.
  auto out1 = new float[bufferSize[1] / sizeof(float)];
  auto out2 = new float[bufferSize[2] / sizeof(float)];
  auto out3 = new float[bufferSize[3] / sizeof(float)];

  cudaStream_t stream;
  CHECK(cudaStreamCreate(&stream));
  // Copy the input to the device and run inference on the stream.
  CHECK(cudaMemcpyAsync(buffers[0], input, bufferSize[0], cudaMemcpyHostToDevice, stream));
  // context.enqueue(batchSize, buffers, stream, nullptr);
  context.enqueue(1, buffers, stream, nullptr);

  // Copy the three output bindings back to the host.
  CHECK(cudaMemcpyAsync(out1, buffers[1], bufferSize[1], cudaMemcpyDeviceToHost, stream));
  CHECK(cudaMemcpyAsync(out2, buffers[2], bufferSize[2], cudaMemcpyDeviceToHost, stream));
  CHECK(cudaMemcpyAsync(out3, buffers[3], bufferSize[3], cudaMemcpyDeviceToHost, stream));
  cudaStreamSynchronize(stream);

  // release the stream and the buffers
  cudaStreamDestroy(stream);
  CHECK(cudaFree(buffers[0]));
  CHECK(cudaFree(buffers[1]));
  CHECK(cudaFree(buffers[2]));
  CHECK(cudaFree(buffers[3]));

This is how the TensorRT inference results are copied back to the CPU: the copies are issued asynchronously on a CUDA stream and then synchronized with cudaStreamSynchronize. After that, the data sits in your pre-allocated host buffers, ready for post-processing.
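One small aside: out1, out2, and out3 above are raw new[] allocations, so remember to delete[] them once post-processing is done. A purely stylistic alternative (not the code from our repo) is to use std::vector for the host buffers, reusing the same buffers, bufferSize, and stream variables from the snippet above, so nothing has to be freed by hand:

  // Host-side output buffers that free themselves when they go out of scope.
  std::vector<float> out1(bufferSize[1] / sizeof(float));
  std::vector<float> out2(bufferSize[2] / sizeof(float));
  std::vector<float> out3(bufferSize[3] / sizeof(float));

  CHECK(cudaMemcpyAsync(out1.data(), buffers[1], bufferSize[1], cudaMemcpyDeviceToHost, stream));
  CHECK(cudaMemcpyAsync(out2.data(), buffers[2], bufferSize[2], cudaMemcpyDeviceToHost, stream));
  CHECK(cudaMemcpyAsync(out3.data(), buffers[3], bufferSize[3], cudaMemcpyDeviceToHost, stream));
  cudaStreamSynchronize(stream);
  // out1/out2/out3 now hold the raw network outputs, ready for decoding and NMS.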

The full C++ code is fairly long and involved, so it will be open-sourced on our MANA AI platform. We have put a lot of effort into writing tutorials and providing source code; if you are interested in AI but lack a good learning group and mentor, consider joining our membership program. We are a group of AI learners committed to building cutting-edge industrial technology.

Our Training code for PyTorch can be found here:

manaai.cn/aicodes_det…

The complete code for TensorRT deployment can be found here:

manaai.cn/aicodes_det…

Future plans

We're seeing that as AI technology matures, people are moving beyond simply writing everything in Python and are looking for more advanced deployment solutions; TensorRT is one of them. By thinking carefully about the network model itself, about the computation framework, and about the language the inference is written in, you can build deep technical moats. In the future you may see why your Mask R-CNN only runs at 10 fps while someone else's runs at 35 fps at full resolution (1280p).

Within a small space, true skill shows.

Going forward, we will continue to provide higher-quality code along the ONNX-TensorRT route. Our next goal is to run Mask R-CNN through ONNX and accelerate it with TensorRT. Detectron2 has been released; can this be far behind? Real-time inference for instance segmentation and panoptic segmentation is our ultimate goal!

After reading this article, here are some things you could try:

  1. Deploy our TensorRT version of RetinaFace to the Jetson Nano; you should get a face detection model running at a minimum of 30 fps;
  2. Try retraining RetinaFace on hands, with keypoint detection, to implement hand pose estimation.

Of course, comments and re-posts are welcome. Once we have worked through the remaining pitfalls, we will have the chance to open-source what we learn.

We have spent a lot of valuable time maintaining, writing, and creating this code, and maintaining it also requires a good deal of cloud infrastructure. We are committed to providing a solid AI code deployment platform for beginners, intermediate learners, and experienced practitioners alike. If you are interested in AI, you can reach us through our forum.

In addition, we have opened a Slack group; everyone is welcome to join and chat:

join.slack.com/t/manaai/sh…