Preface

I won’t go into details about how awesome TensorRT is, because it works.

As the inference library for NVIDIA GPUs, it has been promoted vigorously over the years, with frequent updates, timely issue feedback, and an active community team, which is nothing short of nice.

It’s just that TensorRT has a somewhat higher barrier to entry, which discourages some people. Part of the problem is that the official documentation, while fairly detailed, is a bit cluttered and missing some information. Another part is that TensorRT sits at a relatively low level and requires a bit of C++ and hardware knowledge, which makes it harder to pick up than Python.

In addition, TensorRT fully supports Python. If you are not comfortable with C++, you will have no problem calling TensorRT from Python.

I have always wanted to write a series of tutorials on TensorRT, and also to sort out my previous notes. The TensorRT primer is finally finished.

This tutorial is based on the latest version at the time of writing (2021-04-26), TensorRT 7.2.3.4. TensorRT is updated frequently, and TensorRT 8 is likely to be released soon, but TensorRT does a good job of keeping its APIs backward compatible, so don’t worry too much about migration.

TensorRT 8 has just been released, but it is still an Early Access (EA) build, not the official (GA) release. Some features of TensorRT 8 EA are not very stable, so there is no rush to adopt it. Early adopters can try it out, but it is not recommended for production.

Lao Pan has written about TensorRT before; some of that material will be folded into this series for easy reference:

  • Using TensorRT to accelerate deep learning
  • Speeding up neural networks with TensorRT (reading an ONNX model and running it)
  • Achieving TensorRT custom-plugin freedom!
  • TensorRT FP16 precision problems? What to do? Asking online, it’s urgent!

Let’s just get started

Attached is a link to the official TensorRT documentation.

What is TensorRT

TensorRT is a C++ inference framework that runs on various NVIDIA GPU hardware platforms. We can convert a model trained in Pytorch, TF, or other frameworks into the TensorRT format, and then use the TensorRT inference engine to run it, speeding the model up on NVIDIA GPUs. The speedup is usually considerable.

In the official words:

TensorRT is a C++ library that facilitates high-performance inference on NVIDIA graphics processing units (GPUs). TensorRT takes a trained network, which consists of a network definition and a set of trained parameters, and produces a highly optimized runtime engine that performs inference for that network.

TensorRT provides APIs via C++ and Python that help to express deep learning models via the Network Definition API or load a pre-defined model via the parsers that allow TensorRT to optimize and run them on an NVIDIA GPU. TensorRT applies graph optimizations, layer fusion, among other optimizations, while also finding the fastest implementation of that model leveraging a diverse collection of highly optimized kernels. TensorRT also supplies a runtime that you can use to execute this network on all of NVIDIA’s GPUs from the Kepler generation onwards. TensorRT also includes optional high speed mixed precision capabilities introduced in the Tegra™ X1, and extended with the Pascal™, Volta™, Turing™, and NVIDIA® Ampere GPU architectures.

TensorRT supports the following platforms:

It supports graphics cards with compute capability 5.0 or above (these can be desktop-class cards or embedded cards). For the cards we commonly see, the RTX 30 series has compute capability 8.6, the RTX 20 series 7.5, and the GTX 10 series 6.1. If we want to use TensorRT, we first need to confirm that our card is supported.

Compute capability is not an absolute measure of how fast a GPU is; rather, it is a version number for the architecture. Generally speaking, the newer the architecture, the higher the leading digit (for example, the 8 in the 3080’s 8.6), while the digit after the decimal point (the 6 in 8.6) indicates incremental features within that architecture. Details can be found here:

  • Docs.nvidia.com/cuda/cuda-c…

You can check the compute capability of your graphics card there.
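If you'd rather query the compute capability locally instead of looking it up in a table, here is a minimal sketch; it assumes PyTorch with CUDA support is installed (pycuda or nvidia-smi work just as well):

# A quick local check of compute capability, assuming PyTorch with CUDA
# support is installed; this is only one of several ways to query it.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"-> compute capability {major}.{minor}")
# e.g. an RTX 3080 would print "... -> compute capability 8.6"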

Back to TensorRT itself: TensorRT is a library written in C++, CUDA, and Python. The core code is C++ and CUDA, with the Python side acting as a front end for interacting with users. Of course, TensorRT also has a C++ front end; if we are after maximum performance, calling TensorRT from C++ is essential.

Scenarios for using TensorRT

TensorRT can be used in many scenarios: server side, embedded side, and home desktop side are all common use cases.

  • The graphics card models corresponding to the server can be A100, T4, or V100
  • The graphics cards corresponding to the embedded end are AGX Xavier, TX2, Nano, etc
  • The corresponding graphics card of home computer is 3080, 2080TI, 1080TI and so on

Of course, this is not fixed; as long as our graphics card meets TensorRT’s prerequisites, it can be used.

How much acceleration can TensorRT deliver

The acceleration effect depends on the type and size of the model, as well as the type of graphics card we are using.

Because of its underlying hardware design, the GPU is well suited to parallel computation and prefers dense computation. TensorRT’s optimizations are built on top of the GPU’s strengths, and it likewise favors large, contiguous chunks of matrix computation, ideally convolving straight through from start to finish. Therefore, for convolution and deconvolution layers with many channels, the optimization gains are relatively large; for lots of small, fiddly op manipulations (e.g., Gather, Split, etc.), the gains from TensorRT are not as pronounced.

To make full use of the GPU, we can lean toward GPU-friendly parallelism when designing models: for the same amount of computation, a “large and consolidated” workload runs far more efficiently on the GPU than a “small and fragmented” one.

Industry prefers simple, direct models and backbones. RepVGG (RepVGG: minimalist architecture, SOTA performance, making VGG great again, CVPR 2021) is an efficient model designed for GPUs and dedicated hardware: it pursues high speed and memory savings and pays less attention to parameter count and theoretical FLOPs. Compared with the ResNet series, it is better suited as the backbone of detection or recognition models.

In practice, Lao Pan has also roughly summarized TensorRT’s acceleration effects:

  • SSD detection model: about 3x faster (Caffe)
  • CenterNet detection model: 3-5x faster (Pytorch)
  • LSTM, Transformer (fine-grained ops): 0.5x-1x faster (TensorFlow)
  • ResNet-series classification models: up to 3x faster (Keras)
  • GAN and segmentation models, which are relatively large: about 7-20x faster (Pytorch)

TensorRT has some “black technology”

Why can TensorRT speed up our models on NVIDIA GPUs? Because, of course, of a pile of optimizations:

  • Operator fusion (layer and tensor fusion): simply put, it speeds things up by fusing some ops or removing redundant ones, reducing the number of data transfers and frequent uses of GPU memory
  • Quantization: using INT8 quantization, or precisions such as FP16 and TF32 instead of the usual FP32; these can significantly improve execution speed while largely preserving the accuracy of the original model
  • Automatic kernel tuning: based on the graphics card’s architecture, number of SMs, kernel frequency, etc. (e.g., 1080TI vs 2080TI), different optimization strategies and computation methods are chosen to find the ones best suited to the current hardware
  • Dynamic tensor memory: as we all know, allocating and freeing GPU memory is time-consuming; by adjusting some strategies, the number of these operations can be reduced, cutting model run time
  • Multi-stream execution: using CUDA streams to maximize parallelism

TensorRT’s optimization code is closed source, but we can roughly guess at some of the strategies, including those officially described by TensorRT:

In the official diagram, the upper left is the original network (GoogLeNet); the upper right shows vertical fusion relative to the original layers, combining Conv + Bias (BN) + ReLU; the lower right shows horizontal fusion, merging all the 1x1 CBRs into one large CBR; and in the lower left, the concat layer is removed entirely, with its inputs fed directly into the following operation, so no separate computation for concat is performed, which saves one round of data transfer.

And so on and so on.

TensorRT does all of these operations for you: operator fusion, dynamic GPU memory allocation, precision calibration, multi-stream execution, auto-tuning, and so on. Once TensorRT has tuned the model for you, its speed naturally goes up.

Of course, there are other inference optimization libraries for NVIDIA GPUs, such as TVM, which in some cases does better than TensorRT. But TensorRT is NVIDIA’s own product and still has an edge on NVIDIA’s own GPUs; it works out of the box and is not hard to learn.

Install TensorRT!

There are many ways to install TensorRT.

You can choose among the following installation options: Debian or RPM packages, a pip wheel file, a tar file, or a zip file.

These packages can be downloaded from the official page at developer.nvidia.com/zh-cn/tenso… Note that you need to register and log in before downloading. The way Lao Pan has always used is to download the tar package and unpack it; as long as our environment meets the requirements, it can be run directly, similar to a portable, installation-free setup.

For example, download TensorRT-7.2.3.4.Ubuntu-18.04.x86_64-gnu.cuda-11.1.cudnn8.1.tar.gz and unpack it with tar -zxvf.

After unpacking, we need to add environment variables so that our programs can find TensorRT’s libs.

vim ~/.bashrc
# Add the following
export LD_LIBRARY_PATH=/path/to/TensorRT-7.2.3.4/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/path/to/TensorRT-7.2.3.4/lib:$LIBRARY_PATH

And that’s it, TensorRT is installed. Quick, right?
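As a quick sanity check that the environment is wired up, here is a minimal sketch; it assumes you also install the Python wheel shipped inside the tar package (covered again later in this article), and the exact wheel path may differ on your setup:

# A minimal sanity check, assuming the Python wheel from the tar package's
# python/ directory has been installed, e.g.:
#   pip install TensorRT-7.2.3.4/python/tensorrt-7.2.3.4-cp37-none-linux_x86_64.whl
import tensorrt as trt

print(trt.__version__)  # should print 7.2.3.4

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
print("fast FP16 support:", builder.platform_has_fast_fp16)
print("fast INT8 support:", builder.platform_has_fast_int8)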

Common TensorRT FAQ

There are a few caveats when using TensorRT. Lao Pan has summarized some questions you may be interested in or are likely to run into later.

About TensorRT versions

The version of TensorRT is tightly coupled to the CUDA and cuDNN versions, which we should pay attention to when downloading TensorRT from the official site:

Mismatched versions of CUDA and CUDNN cannot be used with TensorRT.

So pay attention when downloading, don’t get the version wrong.

As for how to choose the right TensorRT version for you: look first at your driver, and second at your CUDA version.

Check them with the nvidia-smi command:

The driver can be changed with root privileges, and as long as the driver meets the requirements, the CUDA version can be switched freely (root is not required for that).

For detailed environment configuration, see Lao Pan’s previous article: The host is back, plus a simple environment setup (RTX3070 + CUDA11.1 + CUDNN + TensorRT).

That said, the version constraints are not absolutely strict: as long as the functionality you need exists in the older version, it can still work.

Let me give you an example.

For example, the official TensorRT-7.0.0.11.Ubuntu-16.04.x86_64-gnu.cuda-10.2.cudnn7.6.tar package depends on libcudnn.so.7.6.0. If we instead use libcudnn.so.7.3.0 when building something against this TensorRT, the version mismatch produces errors like:

TensorRT-7.0.0.11/lib/libmyelin.so.1: undefined reference to `[email protected]'
TensorRT-7.0.0.11/lib/libmyelin.so.1: undefined reference to `cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize@libcudnn.so.7'
TensorRT-7.0.0.11/lib/libmyelin.so.1: undefined reference to `cudnnGetBatchNormalizationTrainingExReserveSpaceSize@libcudnn.so.7'
TensorRT-7.0.0.11/lib/libmyelin.so.1: undefined reference to `[email protected]'
TensorRT-7.0.0.11/lib/libmyelin.so.1: undefined reference to `[email protected]'

Clearly, the symbols that libmyelin.so.1 requires were not found in libcudnn.so.7.3.0 at link time, so the build did not succeed.

On the other hand, if we link against libcudnn.so.7.6.0 to produce the executable, but then run it in an environment that only provides libcudnn.so.7.3.0, it still runs; we just get a warning: TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.3.0.

The reason is simple: although libcudnn.so.7.6.0 has cudnnGetBatchNormalizationBackwardExWorkspaceSize and libcudnn.so.7.3.0 does not, our program never actually calls it, so it runs without errors. But if some code path did need to call that function, there would be nothing we could do.

Using the strings command, we can confirm that libcudnn.so.7.3.0 really does not contain cudnnGetBatchNormalizationBackwardExWorkspaceSize:

strings /usr/local/cuda/lib64/libcudnn.so.7.3.0 | grep cudnnGetBatchNormalizationBackwardExWorkspaceSize
# (no output)

# whereas 7.6.5 does have it
strings /usr/local/cuda/lib64/libcudnn.so.7.6.5 | grep cudnnGetBatchNormalizationBackwardExWorkspaceSize
cudnnGetBatchNormalizationBackwardExWorkspaceSize
cudnnGetBatchNormalizationBackwardExWorkspaceSize
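If you prefer Python, roughly the same check can be sketched with ctypes (this inspects whichever libcudnn.so.7 the dynamic loader finds, so it is only a rough equivalent of the strings check above):

# A rough Python counterpart to the strings | grep check: looking up a symbol
# on a loaded library raises AttributeError if the symbol is absent.
import ctypes

lib = ctypes.CDLL("libcudnn.so.7")  # whichever libcudnn.so.7.x is on the loader path
try:
    lib.cudnnGetBatchNormalizationBackwardExWorkspaceSize
    print("symbol found in this cuDNN build")
except AttributeError:
    print("symbol missing in this cuDNN build")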

For more details about dynamic link libraries, see Lao Pan’s earlier article on the lesser-known details of dynamic link libraries.

What models can be converted to TensorRT

TensorRT supports conversion from Caffe, TensorFlow, Pytorch, ONNX, and so on (though the Caffe and TensorFlow converters, caffe-parser and uff-parser, are somewhat behind the times). There are three main ways to convert a model:

  • Use TF-TRT, which integrates TensorRT into TensorFlow
  • Use onnx2tensorrt, the ONNX-to-TRT conversion tool
  • Build the network structure by hand with the TensorRT API and then copy the weights over yourself; this is very flexible but costs a bit more time. One expert has already done a lot of work this way: tensorrtx

The ONNX converter in the latest TensorRT 8 supports even more ops. Besides the Pytorch->ONNX->TensorRT path, there are also:

  • torch2trt
  • torch2trt_dynamic
  • TRTorch

All in all, in theory 95% of models can be converted to TensorRT; all roads lead to Rome. Some models, though, may be harder to convert than others. If you hit one that won’t convert, don’t despair; think it over again and see whether you can get around the problem another way.
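For reference, the Pytorch->ONNX leg of that path is just a torch.onnx.export call. A minimal sketch, where the model, shapes, and file name are made up for illustration:

# A minimal sketch of the first leg of the Pytorch -> ONNX -> TensorRT path.
# The model, input shape, and output file name are made up for illustration.
import torch
import torchvision

model = torchvision.models.resnet34(pretrained=False).eval()
dummy_input = torch.randn(1, 3, 448, 448)

torch.onnx.export(
    model,
    dummy_input,
    "resnet34_demo.onnx",
    opset_version=11,
    input_names=["input"],
    output_names=["output"],
)
# The resulting .onnx file can then be handed to trtexec / onnx-tensorrt,
# as shown in the example later in this article.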

Does TensorRT support dynamic shapes

Yes, and it is very convenient to use. If some op isn’t supported, you can also write your own dynamic-shape plugin.

Dynamic shapes support varying N, H, and W in NCHW, i.e. batch, height, and width.

For a dynamic model, we need to specify three extra sets of dimensions (minimum, optimal, maximum) when converting it.

Here’s a command to transform a dynamic model:

./trtexec --explicitBatch --onnx=demo.onnx --minShapes=input:1x1x256x256 --optShapes=input:1x1x2048x2048 --maxShapes=input:1x1x2560x2560 --shapes=input:1x1x2048x2048 --saveEngine=demo.trt --workspace=6000

Very simple ~
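If you’d rather do the same thing from the Python API, the min/opt/max shapes become an optimization profile attached to the builder config. A rough sketch under the same assumptions as the trtexec command above (the input tensor is assumed to be named "input"):

# Roughly the same conversion via the TensorRT 7 Python API instead of trtexec:
# a sketch of attaching an optimization profile (min/opt/max shapes) to the
# builder config. The input name and shapes mirror the trtexec command above.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("demo.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.max_workspace_size = 6000 << 20  # roughly --workspace=6000 (MiB)
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 1, 256, 256), (1, 1, 2048, 2048), (1, 1, 2560, 2560))
config.add_optimization_profile(profile)

engine = builder.build_engine(network, config)
with open("demo.trt", "wb") as f:
    f.write(engine.serialize())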

TensorRT is hardware dependent

This is easy to understand: different graphics cards (GPUs) differ in core count, frequency, architecture, design (and price..). TensorRT needs to optimize for the specific hardware, and those optimizations cannot be shared across different hardware.

The generated plan files are not portable across platforms or TensorRT versions. Plans are specific to the exact GPU model they were built on (in addition to platforms and the TensorRT version) and must be re-targeted to the specific GPU in case you want to run them on a different GPU

Is TensorRT open source

TensorRT is semi-open-source: everything except the core is open source. And what lies at the heart of TensorRT is, of course, exactly those officially touted features. As follows:

The core advantages above, namely the black technology inside TensorRT that helps us optimize models and speed up inference, are of course not open source.

The open source stuff is basically in this repository:

  • TensorRT Open Source Software

Plugin-related, tool-related, and documentation-related material; the resources inside are quite rich.

Can TensorRT be used in Python

Of course you can. There is an official Python package (described below), and after installing it you can simply import tensorrt and use it.

Use the ldd command to see what tensorrt.so links against.

ldd tensorrt.so
    linux-vdso.so.1 =>  (0x00007ffe477d4000)
    libnvinfer.so.7 => /TensorRT-7.0.0.11/lib/libnvinfer.so.7 (0x00007f2a76f6b000)
    libnvonnxparser.so.7 => /TensorRT-7.0.0.11/lib/libnvonnxparser.so.7 (0x00007f2a76ca5000)
    libnvparsers.so.7 => /TensorRT-7.0.0.11/lib/libnvparsers.so.7 (0x00007f2a76776000)
    libnvinfer_plugin.so.7 => /TensorRT-7.0.0.11/lib/libnvinfer_plugin.so.7 (0x00007f2a758e2000)
    libstdc++.so.6 => /TensorRT-7.0.0.11/lib/libstdc++.so.6 (0x00007f2a75555000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f2a7532f000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f2a74f61000)
    libcudnn.so.7 => /usr/local/cuda/lib64/libcudnn.so.7 (0x00007f2a60768000)
    libcublas.so.10.0 => /usr/local/cuda/lib64/libcublas.so.10.0 (0x00007f2a5c1d2000)
    libcudart.so.10.0 => /usr/local/cuda/lib64/libcudart.so.10.0 (0x00007f2a5bf57000)
    libmyelin.so.1 => /TensorRT-7.0.0.11/lib/libmyelin.so.1 (0x00007f2a5b746000)
    libnvrtc.so.10.0 => /usr/local/cuda/lib64/libnvrtc.so.10.0 (0x00007f2a5a12a000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f2a59f25000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f2a59c23000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f2a84f5c000)
    libprotobuf.so.16 => /onnx-tensorrt/protobuf-3.6.0/lib/libprotobuf.so.16 (0x00007f2a597ad000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f2a59591000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f2a59389000)
    libz.so.1 => /lib64/libz.so.1 (0x00007f2a59172000)

About TensorRT deployment

There are three ways to deploy TensorRT:

  • Integrated into TensorFlow, e.g. TF-TRT; this is convenient, but the acceleration is not great;
  • Run the model in the TensorRT Runtime environment, i.e. use TensorRT directly;
  • Use it together with a serving framework; the best match is the official Triton Inference Server, which supports TensorRT perfectly and is rock-solid for production use!

Which weight precisions does TensorRT support

It supports FP32, FP16, INT8, TF32, etc., which are the commonly used precisions (a sketch of enabling them in the builder config follows the list below).

  • FP32: single-precision floating point; nothing special to say, the most common data format in deep learning, used for both training and inference;
  • FP16: half-precision floating point; compared with FP32 it halves memory usage, has corresponding hardware instructions, and is much faster than FP32;
  • TF32: a data type supported by third-generation Tensor Cores; a truncated Float32 that keeps FP32’s 8 exponent bits but truncates the 23 mantissa bits down to 10, for a total length of 19 (= 1 + 8 + 10). It has the same precision as FP16 (both have 10 mantissa bits) and the same dynamic range as FP32 (both have 8 exponent bits);
  • INT8: integer; takes less than half the memory of FP16, has a corresponding instruction set, and quantized models can use INT8 for acceleration.
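As a quick illustration, here is a sketch of how these precisions are requested when building an engine with the TensorRT 7 Python API (FP32 is the default; the calibrator object mentioned in the comments is hypothetical):

# A sketch of selecting precisions via builder flags; FP32 needs no flag.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
config = builder.create_builder_config()

if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

# INT8 additionally needs a calibrator (or a quantization-aware-trained model):
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = my_calibrator  # hypothetical calibrator object

# On Ampere cards, recent TensorRT versions also expose trt.BuilderFlag.TF32.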

Here is a brief illustration of the differences between these precisions:

Take a quick example.

With all that theory out of the way, it would be a shame not to walk through an example.

The purpose of this example is simply to show a typical scenario and the basic workflow of using TensorRT: say Lao Pan has an ONNX model that he wants to run fast on a 3070 card; that’s where TensorRT comes in.

(What is ONNX? ONNX is a model interchange format that makes it easy to move models between frameworks, e.g. Pytorch->ONNX->TRT.)

Lao Pan doesn’t have a well-trained model at hand, so let’s just grab one from an open-source project. A while ago I found a rather interesting project that detects 3D human keypoints from images, commonly known as human pose estimation. The project is here:

  • Github.com/digital-sta…

The results look like this. The model’s accuracy is okay, but only one person can appear in the image. As for speed, the project homepage gives some numbers:

  • GeForce RTX 2070 SUPER: about 30 FPS
  • GeForce GTX 1070: about 20 FPS

Let’s see how much faster TensorRT can make it on a 3070 card.

Take a look at the model structure

The core model is available on the project’s GitHub homepage: the ONNX model Resnet34_3inputs_448x448_20200609.onnx. The author’s demo uses Unity and Barracuda, loading the ONNX model with Barracuda and then running inference.

Let’s look at the model structure with Netron: three inputs and four outputs. Why three inputs? These three inputs are actually fed at different stages of the model; during training the input data were probably three consecutive frames extracted from a video, which helps improve the model’s accuracy (it also means inference needs a short warm-up, since the three input frames are temporally consecutive):

Verify the model using onnxruntime

In general, after converting a model from another framework (Pytorch, TF) to ONNX, we need to verify the ONNX model’s accuracy; otherwise, a wrong ONNX model will be 100% wrong after conversion to the TensorRT model.

Here we use onnxruntime to run Resnet34_3inputs_448x448_20200609.onnx and check its output:

import onnx
import numpy as np
import onnxruntime as rt
import cv2

model_path = '/home/oldpan/code/models/Resnet34_3inputs_448x448_20200609.onnx'

# Verify model validity
onnx_model = onnx.load(model_path)
onnx.checker.check_model(onnx_model)

# Read in the image and adjust to the input dimension
image = cv2.imread("data/images/person.png")
image = cv2.resize(image, (448, 448))
image = image.transpose(2, 0, 1)
image = np.array(image)[np.newaxis, :, :, :].astype(np.float32)

# Set up the model session and input information
sess = rt.InferenceSession(model_path)
input_name1 = sess.get_inputs()[0].name
input_name2 = sess.get_inputs()[1].name
input_name3 = sess.get_inputs()[2].name

output = sess.run(None, {input_name1: image, input_name2: image, input_name3: image})
print(output)

Print the result to see what it looks like. There is actually a lot of output information; it’s too much to paste in full (this model really does output a lot), so here is a partial excerpt.

2021-05-05 10:44:08.696562083 [W:onnxruntime:, graph.cc:3106 CleanUnusedInitializers] Removing initializer 'offset.1.num_batches_tracked'. It is not used by any node and should be removed from the model.
...
[array([[[[ 0.16470502,  0.9578098 , -0.82495296, ..., -0.59656703,  0.26985374,  0.5808018 ],
          [-0.6096473 ,  0.9780458 , -0.9723106 , ..., -0.90165156, -0.8959699 ,  0.91829604],
          [-0.03562748,  0.3730615 , -0.9816262 , ...,  0.4543069 ,  0.5840921 ],
          ...,
          [ 0.        ,  0.        ,  0.        , ...,  0.        ,  0.        ,  0.        ],
          [ 0.        ,  0.        ,  0.        , ...,  0.        ,  0.        ,  0.        ],
          [ 0.        ,  0.        ,  0.        , ...,  0.        ,  0.        ,  0.        ]]]], dtype=float32)]

That warning, Removing initializer 'offset.1.num_batches_tracked'. It is not used by any node and should be removed from the model, made me smile: num_batches_tracked is a Pytorch artifact, so it seems the author trained and exported this model with Pytorch.

How do we check whether the ONNX output matches the original Pytorch output? We could look at the output values directly, but this model’s output is quite large, so comparing before and after by eye is not realistic. A simple comparison can be made with code like the following:

y = model(x)            # Pytorch model output
y_onnx = model_onnx(x)  # ONNX model output (e.g. via onnxruntime)

# check the ONNX output against PyTorch
print(torch.max(torch.abs(y - y_onnx)))

Convert the ONNX model to a TensorRT model

Converting an ONNX model to a TensorRT model is relatively easy. At present, ONNX is the model format TensorRT officially supports best, and the focus going forward will also be on ONNX (the UFF and Caffe conversion tools may not be updated as actively).

The official conversion tool, TensorRT Backend For ONNX (onnx-tensorrt for short), is quite mature and is actively developed. Let’s use it to convert the model.

We don’t even need to clone TensorRT Backend For ONNX ourselves: the TensorRT tar package we downloaded already contains a pre-built executable for this tool. As long as our environment meets the requirements, it can be used directly.

That executable is trtexec, located in TensorRT-7.2.3.4/bin; its source code can be found in the official repository:

  • Github.com/NVIDIA/Tens…

If we convert the model with the command below, we can see output like this:

&&&& RUNNING TensorRT.trtexec # ./trtexec --onnx=Resnet34_3inputs_448x448_20200609.onnx --saveEngine=Resnet34_3inputs_448x448_20200609.trt --workspace=6000
[05/09/2021-17:00:50] [I] === Model Options ===
[05/09/2021-17:00:50] [I] Format: ONNX
[05/09/2021-17:00:50] [I] Model: Resnet34_3inputs_448x448_20200609.onnx
[05/09/2021-17:00:50] [I] Output:
[05/09/2021-17:00:50] [I] === Build Options ===
[05/09/2021-17:00:50] [I] Max batch: explicit
[05/09/2021-17:00:50] [I] Workspace: 6000 MiB
[05/09/2021-17:00:50] [I] minTiming: 1
[05/09/2021-17:00:50] [I] avgTiming: 8
[05/09/2021-17:00:50] [I] Precision: FP32
[05/09/2021-17:00:50] [I] Calibration:
[05/09/2021-17:00:50] [I] Refit: Disabled
[05/09/2021-17:00:50] [I] Safe mode: Disabled
[05/09/2021-17:00:50] [I] Save engine: Resnet34_3inputs_448x448_20200609.trt
[05/09/2021-17:00:50] [I] Load engine:
[05/09/2021-17:00:50] [I] Builder Cache: Enabled
[05/09/2021-17:00:50] [I] NVTX verbosity: 0
[05/09/2021-17:00:50] [I] Tactic sources: Using default tactic sources
[05/09/2021-17:00:50] [I] Input(s)s format: fp32:CHW
[05/09/2021-17:00:50] [I] Output(s)s format: fp32:CHW
...
[05/09/2021-17:02:32] [I] Timing trace has 0 queries over 3.16903 s
[05/09/2021-17:02:32] [I] Trace averages of 10 runs:
[05/09/2021-17:02:32] [I] Average on 10 runs - GPU latency: ... ms
[05/09/2021-17:02:32] [I] Average on 10 runs - GPU latency: ... ms
[05/09/2021-17:02:32] [I] Average on 10 runs - GPU latency: 4.6537 ms
[05/09/2021-17:02:32] [I] Average on 10 runs - GPU latency: ... ms
[05/09/2021-17:02:32] [I] Average on 10 runs - GPU latency: 4.6333 ms
[05/09/2021-17:02:32] [I] Host Latency
[05/09/2021-17:02:32] [I] min: 4.9716 ms (end to end 108.17 ms)
[05/09/2021-17:02:32] [I] max: 4.4915 ms (end to end 110.732 ms)
[05/09/2021-17:02:32] [I] mean: 4.0049 ms (end to end 109.226 ms)
[05/09/2021-17:02:32] [I] median: 4.9646 ms (end to end 109.241 ms)
[05/09/2021-17:02:32] [I] percentile: ... ms at 99%
[05/09/2021-17:02:32] [I] throughput: 0 qps
[05/09/2021-17:02:32] [I] walltime: 3.1693 s
[05/09/2021-17:02:32] [I] Enqueue Time
[05/09/2021-17:02:32] [I] min: 0.776004 ms
[05/09/2021-17:02:32] [I] max: 2.37964 ms
[05/09/2021-17:02:32] [I] median: ... ms
[05/09/2021-17:02:32] [I] GPU Compute
[05/09/2021-17:02:32] [I] min: ... ms
[05/09/2021-17:02:32] [I] max: ... ms
[05/09/2021-17:02:32] [I] mean: 4.6307 ms
[05/09/2021-17:02:32] [I] median: 4.5915 ms
[05/09/2021-17:02:32] [I] percentile: 4.1133 ms at 99%

FP32 inference takes about 4-5 ms, while FP16 takes only 1.6 ms.

PS: the onnx-tensorrt tool itself is written in C++; its overall structure is quite compact and worth reading. Later, Lao Pan will also talk about compiling and using onnx-tensorrt.

Run the TensorRT model

Here we use TensorRT’s Python API to load the converted resnet34_3dpose.trt model. The Python TensorRT package is installed from the wheel shipped inside the tar package, e.g. tensorrt-7.2.3.4-cp37-none-linux_x86_64.whl, installed with pip (note that the wheel must match your Python version; TensorRT-8 EA already supports Python 3.9).

After installing python-tensorrt, first import tensorrt as trt.

Then load the Trt model:

logger = trt.Logger(trt.Logger.INFO)
with open("resnet34_3dpose.trt", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

After loading, we print the model’s inputs and outputs to see whether they are consistent with the ONNX model:

model_all_names = []
for idx in range(engine.num_bindings):
    is_input = engine.binding_is_input(idx)
    name = engine.get_binding_name(idx)
    op_type = engine.get_binding_dtype(idx)
    model_all_names.append(name)
    shape = engine.get_binding_shape(idx)

    print('input id:',idx,' is input: ', is_input,' binding name:', name, ' shape:', shape, 'type: ', op_type)

You can see:

engine bindings message: 
input id: 0    is input:  True   binding name: input.1   shape: (1, 3, 448, 448) type:  DataType.FLOAT
input id: 1    is input:  True   binding name: input.4   shape: (1, 3, 448, 448) type:  DataType.FLOAT
input id: 2    is input:  True   binding name: input.7   shape: (1, 3, 448, 448) type:  DataType.FLOAT
input id: 3    is input:  False   binding name: 499   shape: (1, 24, 28, 28) type:  DataType.FLOAT
input id: 4    is input:  False   binding name: 504   shape: (1, 48, 28, 28) type:  DataType.FLOAT
input id: 5    is input:  False   binding name: 516   shape: (1, 672, 28, 28) type:  DataType.FLOAT
input id: 6    is input:  False   binding name: 530   shape: (1, 2016, 28, 28) type:  DataType.FLOAT

Three inputs and four outputs, exactly the same no problem!

Then load the image and run the model:

image = cv2.imread(image_path)
image = cv2.resize(image, (200, 64))
image = image.transpose(2, 0, 1)
img_input = image.astype(np.float32)
img_input = torch.from_numpy(img_input)
img_input = img_input.unsqueeze(0)
img_input = img_input.to(device)
# Run model
result_trt = trt_model(img_input)

Huh? Is it really that simple? Where did the TRT engine go?

Of course not. trt_model is an instance of a class, which we construct like this:

# engine is the deserialized engine loaded above
trt_model = TRTModule(engine, ["input.1", "input.4", "input.7"])

So what is this TRTModule? To load the TRT model and create the runtime conveniently, we borrow an implementation class from the torch2trt library, which combines Pytorch and TensorRT nicely. The concrete implementation is as follows:

class TRTModule(torch.nn.Module) :
    def __init__(self, engine=None, input_names=None, output_names=None) :
        super(TRTModule, self).__init__()
        self.engine = engine
        if self.engine is not None:
            # engine creates the execution context
            self.context = self.engine.create_execution_context()

        self.input_names = input_names
        self.output_names = output_names

    def forward(self, *inputs) :
        batch_size = inputs[0].shape[0]
        bindings = [None] * (len(self.input_names) + len(self.output_names))

        for i, input_name in enumerate(self.input_names):
            idx = self.engine.get_binding_index(input_name)
            # set shape
            self.context.set_binding_shape(idx, tuple(inputs[i].shape))
            bindings[idx] = inputs[i].contiguous().data_ptr()

        # create output tensors
        outputs = [None] * len(self.output_names)
        for i, output_name in enumerate(self.output_names):
            idx = self.engine.get_binding_index(output_name)
            # torch_dtype_from_trt / torch_device_from_trt are helper functions from torch2trt
            dtype = torch_dtype_from_trt(self.engine.get_binding_dtype(idx))
            shape = tuple(self.context.get_binding_shape(idx))
            device = torch_device_from_trt(self.engine.get_location(idx))
            output = torch.empty(size=shape, dtype=dtype, device=device)
            outputs[i] = output
            bindings[idx] = output.data_ptr()

        self.context.execute_async_v2(bindings,
                                      torch.cuda.current_stream().cuda_stream)

        outputs = tuple(outputs)
        if len(outputs) == 1:
            outputs = outputs[0]

        return outputs

Focus on the __init__() and forward() member methods: the engine creates an execution context, the input and output bindings are set up, and execute_async_v2 is executed to get the results.

At this point, the serialized TRT model is successfully called using TensorRT.

Use Polygraphy to examine output differences between the ONNX and TRT models

Polygraphy is a collection of small tools officially provided with TensorRT. Here we use it to check whether the Resnet34_3inputs_448x448_20200609 ONNX model loses accuracy when converted to TRT:

First look at the FP32 precision:

Take a look at the accuracy of FP16:

Here the absolute and relative error tolerances are both set to 1e-3, i.e. accurate to three decimal places. We can see that converting this ONNX model to TRT at FP32 causes no major problems, while FP16 shows quite a bit of precision loss. Whether that seriously affects the final results requires case-by-case or batch analysis; space is limited, so I’ll come back to it later.

Here’s an example
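For intuition, such a tolerance check boils down to something like the following numpy sketch (not Polygraphy’s exact criterion; out_onnx and out_trt stand for assumed outputs of the two models for the same input):

# A hand-rolled sketch of an abs/rel tolerance comparison between two outputs.
import numpy as np

def outputs_match(out_onnx, out_trt, atol=1e-3, rtol=1e-3):
    # report the largest absolute difference, then apply the tolerance check
    print("max abs diff:", np.max(np.abs(out_onnx - out_trt)))
    return np.allclose(out_onnx, out_trt, atol=atol, rtol=rtol)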

The disadvantages of TensorRT

TensorRT is not without its “faults”; there are a few small points worth grumbling about:

  • The optimized engine is bound to a specific GPU; for example, an engine generated on a 1080TI cannot be used on a 2080TI;
  • A newer TensorRT depends on a newer CUDA, and a newer CUDA depends on a newer driver; if you want to use a new TensorRT version, changing your environment is unavoidable;
  • Although TensorRT is easy to use, its inference optimization is closed-source, a black box, which can feel a bit awkward. Fortunately, TensorRT provides plenty of tools to help with debugging.

As the saying goes, the deeper the love, the sharper the criticism. Lao Pan knows these “shortcomings” are unavoidable, and since they can’t be helped, we can only smile and tease gently.

TensorRT’s supporting tools

After all, TensorRT has been under development for many years, and NVIDIA knows the pain points of using it, so they have built some practical little tools for us. These are currently available on TensorRT’s open-source homepage, which means they are also open source:

The basic functions of these three tools are briefly introduced:

  • ONNX GraphSurgeon can modify our exported ONNX model, adding or removing nodes, changing names or dimensions, and so on
  • Polygraphy is a collection of small tools for things like comparing the accuracy of ONNX and TRT models or inspecting the output of each layer of a TRT model; it is mainly used for debugging model information and is quite useful
  • PyTorch-Quantization can insert simulated quantization operations during Pytorch training or inference to improve the accuracy and speed of quantized models, and it supports exporting quantization-aware-trained models to ONNX and TRT

Lao Pan has used ONNX GraphSurgeon and Polygraphy; both are practical tools for deployment and conversion, and they do solve real problems. The only drawback is that their tutorials are not very detailed, which makes them a bit hard to get started with. Later, Lao Pan will also cover how to use these tools in detail.

Reference links

  • TensorRT-Index:docs.nvidia.com/deeplearnin…
  • API:docs.nvidia.com/deeplearnin…
  • ONNX GraphSurgeon: docs.nvidia.com/deeplearnin…
  • Docs.nvidia.com/deeplearnin…

Afterword

I can’t write any more here. There is a lot more I want to write about TensorRT; if you don’t want to miss it, you can follow me. I hope this article helps you learn TensorRT.

Personally speaking, the best way to learn TensorRT is hands-on practice: only by stepping into a few pits, modifying some code, and getting programs to run end to end can you really learn it. All those documents, in the end, just help you avoid some of the pits; most pits still have to be stepped in yourself. Don’t wait until you feel fully prepared.

If you have any questions or related problems, feel free to leave a comment.

Get in touch

  • If you are like-minded with me, Lao Pan is willing to communicate with you;
  • If you like Lao Pan’s content, welcome to follow and support.
  • If you like my article, please like 👍 collect 📁 comment 💬 three times ~

If you want to know how Lao Pan learned to step through these pits, or want to chat with me about your own problems, follow the public account “Oldpan blog”. Lao Pan will also share some of his private stash there, hoping it helps; click the mysterious portal to get it.