In recent years, AI models have been widely used in image and video processing, with good results in applications such as super-resolution, denoising, and frame interpolation. However, because image AI models are computationally heavy, they sometimes cannot reach the desired running speed even when deployed on a GPU. To address this, NVIDIA offers TensorRT, which greatly improves the inference efficiency of AI models. For this LiveVideoStack online sharing session, we invited Ji Guang, technical lead of the NVIDIA DevTech team, to discuss a simple way to get a model running on TensorRT and help GPU programming beginners speed up their AI models.

LiveVideoStack

#01 About NVIDIA GPUs

First, a brief introduction to NVIDIA GPUs. The previous GPU architecture generation was Turing; the current one is Ampere. Ampere consumer models start with 30, including the 3090, 3080, and 3070. Enterprise-class models are used in data centers and include the A100, A30, A10, and A16. Since there are many enterprise models, here is a brief overview of what each is suited for.

  • The A100 is the GPU with the largest chip area and is well suited for training. The A30 has roughly half the capability of the A100. A strong point of both GPUs is that they support the new TF32 data format and have very high matrix-multiplication throughput on the Tensor Cores (highlighted in green in the table above). TF32 is very useful for training and can partially replace FP32. In addition, the A100/A30 support MIG, which lets a single GPU be dynamically partitioned into multiple GPU instances under one operating system; this is also useful for inference.
  • The A10 is the successor to the T4. Its FP32/FP16 throughput is very high, which makes it well suited for inference.
  • The A16 is unique in that it carries four GPUs, each with one NVENC and two NVDEC engines, which makes it well suited for transcoding.
  • The GeForce RTX 3090 is a consumer model whose GPU differs from the enterprise-class GPUs in compute capability. For example, its FP16 Tensor Core matrix-multiply throughput is 142 TFLOPS with FP16 accumulation (which has limited precision) but 71 TFLOPS with FP32 accumulation. By comparison, the A10 reaches 125 TFLOPS for FP16 matrix multiplication with FP32 accumulation, which is much higher. So for training and inference, the GeForce 3090 is no match for the enterprise models in many situations.

#02 GPU programming basics

The GPU's computing power is only realized by programs running on it, so we need to write GPU programs. GPU programming, also called heterogeneous programming, differs from CPU programming. For a CPU program, both the program and its data live in main memory (host memory), in the familiar way. The diagram on the left shows how a GPU program works: the GPU has its own memory, known as video memory. To run a program on the GPU, we copy the data from main memory to video memory and then launch the GPU program to do the computation; when the computation finishes, the data is copied from video memory back to main memory. That is the essence of heterogeneous programming: copy the data to the heterogeneous processor, start the program, and finally copy the data back.

The more complete program on the right illustrates this idea. cudaMalloc allocates variables A and B in video memory (pointed to by the device pointers dp_A and dp_B), cudaMemcpy copies A from main memory to video memory, and then the GPU program is launched. The GPU program highlighted in yellow, called a CUDA kernel, works on the data in video memory. After the computation, cudaMemcpy copies the result B back to main memory, and finally cudaFree releases the video memory that was allocated.

It is very important to internalize this pattern of "copy data to video memory, launch the GPU program, copy data back to main memory". For those familiar with C++ programming, calling the related functions is relatively easy, but writing a CUDA kernel takes extra effort. In particular, we want to reduce the programming burden of using the GPU and get programs running on the GPU through API calls alone. This is where TensorRT, a GPU-accelerated library, comes in.
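To make the pattern concrete, here is a minimal sketch of the same copy-launch-copy flow, written in Python with PyCUDA rather than the C++ API described above; the kernel and array names are illustrative, not taken from the original slide.

```python
# Minimal sketch of "copy to video memory - launch kernel - copy back",
# using PyCUDA instead of the C++ API described above (illustrative only).
import numpy as np
import pycuda.autoinit            # creates a CUDA context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# The CUDA kernel: the part that actually runs on the GPU.
mod = SourceModule("""
__global__ void add_one(const float *a, float *b)
{
    int i = threadIdx.x;
    b[i] = a[i] + 1.0f;
}
""")
add_one = mod.get_function("add_one")

a = np.arange(32, dtype=np.float32)
b = np.empty_like(a)

d_a = cuda.mem_alloc(a.nbytes)    # like cudaMalloc: allocate video memory
d_b = cuda.mem_alloc(b.nbytes)
cuda.memcpy_htod(d_a, a)          # copy input from main memory to video memory
add_one(d_a, d_b, block=(32, 1, 1), grid=(1, 1))   # launch the GPU program
cuda.memcpy_dtoh(b, d_b)          # copy the result back to main memory
# PyCUDA frees d_a / d_b when they go out of scope (the cudaFree step).
```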

#03 TensorRT in GPU transcoding pipeline

The data in the previous sample code was a single floating-point number, which is a simple scenario. In more complex scenarios, the data being copied can be a single image or a sequence of images. Copying data between main memory and video memory has a cost, however, and with large data volumes the copying can become a bottleneck, so it should be minimized or avoided.

Take video transcoding as an example. If the input is an encoded video bitstream, the hardware decoder (NVDEC) on the GPU can decode it and keep the decoded frames in video memory, where they are handed to the GPU program for processing. The GPU also has a hardware encoder (NVENC), which can encode the processed frames and output a video bitstream. Throughout this process, decoding, processing, and the final encoding can all keep the data in video memory, which yields much higher efficiency.
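The pipeline can be summarized in a short sketch. The decode, process, and encode callables below are hypothetical placeholders rather than a real SDK API; the point is only that the frame stays in video memory from decode through encode.

```python
def transcode(packets, nvdec_decode, process_on_gpu, nvenc_encode):
    """Sketch of the GPU transcoding pipeline described above.

    The three callables are hypothetical placeholders (not a real API):
    nvdec_decode writes the decoded frame to video memory, process_on_gpu
    reads and writes video memory directly, and nvenc_encode consumes the
    GPU frame; only compressed packets ever cross back to main memory.
    """
    for packet in packets:                 # compressed input packets (host side)
        gpu_frame = nvdec_decode(packet)   # frame lands in video memory
        gpu_frame = process_on_gpu(gpu_frame)
        yield nvenc_encode(gpu_frame)      # compressed output packet (host side)
```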

#04 Accelerating AI model inference with TensorRT

The development of a deep learning application has two stages: training and inference. TensorRT is used to accelerate inference.

TensorRT’s acceleration comes mainly from the following aspects:

  • TensorRT can automatically select the optimal kernel. For matrix multiplication, the optimal GPU kernel implementation differs across GPU architectures and matrix sizes, and TensorRT picks the best one.
  • TensorRT can optimize the computation graph, producing an optimized graph for the network through kernel fusion and reduced data copying.
  • TensorRT supports FP16 / INT8 precision conversion of the data, taking advantage of the hardware's high-throughput, low-precision compute capability.

#05 Acceleration effect of TensorRT

We use some examples to illustrate the acceleration effect of TensorRT.

  • For the common ResNet50 running on a T4, TensorRT gives a 1.4x speedup at FP32 precision and a 6.4x speedup at FP16 precision. FP16 is clearly very useful: enabling it gives a further roughly 4.5x speedup over FP32.
  • For EDVR, a well-known video super-resolution network, running on a T4, the speedup at FP32 precision is 1.1x, which is not very significant; at FP16 precision it is 2.7x, so enabling FP16 gives a further roughly 2.4x speedup over FP32.

As you can see, different models benefit to different degrees. In general, convolution-heavy models see the most significant acceleration, while models dominated by data copying see only modest gains, and FP16 does not help them much.

#06 Quick ways to get a model into TensorRT

How do you actually use TensorRT? Essentially, you migrate the trained model from the training framework to TensorRT. There are three options:

1) Use the framework's built-in integration: TensorFlow integrates TF-TRT, and PyTorch integrates TRTorch (now Torch-TensorRT). Calling these APIs runs the model, at least partially, on TensorRT. They are easy to use because everything stays inside the framework's APIs, but in many cases they are not as efficient as they could be. A minimal TF-TRT conversion sketch follows.
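As a rough illustration of option 1, here is a minimal TF-TRT conversion using TensorFlow 2's converter API; the SavedModel paths are placeholders.

```python
# Minimal TF-TRT sketch (TensorFlow 2): convert a SavedModel so that supported
# subgraphs run on TensorRT. Paths are placeholders.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(input_saved_model_dir="resnet50_saved_model")
converter.convert()                          # replace supported subgraphs with TensorRT-backed ops
converter.save("resnet50_trt_saved_model")   # the converted model runs (partially) on TensorRT
```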

2) The more hardcore approach is to use the TensorRT C++/Python API to build the network yourself, reconstructing the framework's computation graph with the TensorRT API. This approach has the best compatibility and efficiency, but it is also the most difficult. We gave two GTC China talks on this method, covering accelerating deep learning inference deployment with TensorRT and freely building high-performance inference models with the TensorRT API (on-demand-gtc.gputechconf.com/gtcnew/sess…).

3) The approach recommended today is to export the model from the training framework as ONNX and import the ONNX into TensorRT. Its advantages are moderate difficulty and acceptable efficiency, so it can be considered a shortcut. The problems to solve are how to export ONNX from the training framework and how to import the ONNX into TensorRT.

#07 How to export and how to import

Step 0: Understand the basic framework of TensorRT programming

The code shown above is the most basic use of TensorRT.

1. As preparation, build a logger, build a builder from it, and create a network from the builder. This boilerplate is the same for every TensorRT program.

2. The next, highlighted part reconstructs the computation graph through the TensorRT API, so that the computation done by TensorRT exactly matches the original model in the training framework. This code can be very long, often hundreds of lines.

3. Once the network is complete, you can build the TensorRT engine (build_cuda_engine). Depending on the size of the network, building takes anywhere from a few seconds to a few minutes or even hours.

4. After the engine is built, run it. The engine can also be saved to disk and loaded on later runs without rebuilding. The d_input and d_output in the figure above are the video memory addresses mentioned earlier in heterogeneous programming.

The highlighted network-building part can be quite complex, so to save effort we let the ONNX parser automate it. The basic process is therefore: first export ONNX from the training framework, then use TensorRT's bundled tool trtexec to import the ONNX into TensorRT and build the engine, and finally write a small program that loads and runs the engine. A minimal sketch of this boilerplate is shown below.
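The sketch below uses the older build_cuda_engine-style Python API that the slide appears to use (newer TensorRT versions use a builder config and build_serialized_network instead); a single identity layer stands in for the hundreds of lines of step 2, and the input shape and file name are placeholders.

```python
# Minimal sketch of the fixed TensorRT boilerplate (older Python API).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)                      # 1. logger
builder = trt.Builder(logger)                                #    builder
network = builder.create_network(                            #    network
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# 2. Rebuild the model's computation graph here with network.add_* calls.
#    A real model needs hundreds of lines; a single identity layer stands in
#    for it, and this is exactly the step the ONNX parser automates.
inp = network.add_input("input", trt.float32, (1, 3, 224, 224))
out = network.add_identity(inp).get_output(0)
network.mark_output(out)

# 3. Build the engine (seconds to hours, depending on the network).
engine = builder.build_cuda_engine(network)

# 4. Save the engine so later runs can load it without rebuilding.
with open("model.plan", "wb") as f:
    f.write(engine.serialize())
```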

Step 1: Export ONNX from the framework

ONNX is a neutral representation of a computation graph. PyTorch has TorchScript and TensorFlow has the frozen graph, which are framework-specific ways of persisting computation graphs; ONNX, by contrast, is a platform-neutral representation that in theory every framework can support.

In most cases the exported ONNX can be run directly, but sometimes it cannot and ONNX Runtime has to be extended instead. For example, the exported ONNX may contain special operators such as Deformable Convolution, which is not a standard ONNX op; such an ONNX can still be run by extending ONNX Runtime. However, being runnable is not a prerequisite for a successful import into TensorRT: even if the exported ONNX cannot run, there are still ways to make TensorRT import it, as the examples below illustrate. The slide above shows sample code for exporting ONNX from PyTorch: resnet50 is a PyTorch nn.Module object; setting verbose to True prints the ONNX graph as text, which is useful for debugging; and opset_version can be set to 12, with higher opset versions supporting more operators. A sketch of such an export call follows.
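Here torchvision's resnet50 and a dummy input stand in for the model used on the slide.

```python
# Exporting a PyTorch model to ONNX.
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    verbose=True,        # print the exported graph as text, useful for debugging
    opset_version=12,    # higher opsets support more operators
    input_names=["input"],
    output_names=["output"],
)
```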

Step 2: Import ONNX to TensorRT using Parser

The official TensorRT development kit ships with an executable called trtexec. It can take ONNX as input, build the TensorRT network from the ONNX, build the engine, and save it to a file. This whole sequence can be done with the commands shown on the slide; a typical invocation is shown below. The same steps can also be done programmatically, but that is usually unnecessary. If trtexec succeeds, it means TensorRT has reconstructed a computation graph equivalent to the ONNX using its own layers and has successfully built that graph into an engine. The saved engine can be reused later.
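A typical invocation looks like this (the exact command on the slide is not reproduced here; file names are placeholders, and --verbose can be added for richer logs):

```
trtexec --onnx=resnet50.onnx --saveEngine=resnet50.plan
```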

Step 3: Run Engine

The last step is simple: load the engine file, provide the input data, and run. Sample code for C++ and Python can be found here (github.com/NVIDIA/trt-…). CUDA events measure the GPU running time, but be careful to set the stream correctly. There is also a rough-and-ready method: do a GPU synchronization and record time T0, launch the GPU work, synchronize again and record T1; T1 - T0 is the GPU running time (see sample code here: github.com/NVIDIA/trt-…). Events remain the most rigorous solution. A Python sketch of loading, running, and timing an engine follows.
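The sketch below loads a saved engine with the TensorRT Python API and uses PyCUDA for the video-memory buffers, timing the run with the rough synchronize-and-time method; the binding shapes and file name are placeholders matching the earlier sketches, not the linked sample.

```python
# Load a saved engine, run it, and time it with "sync - t0 - run - sync - t1".
import time
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.plan", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

h_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
h_output = np.empty((1, 3, 224, 224), dtype=np.float32)
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)

cuda.memcpy_htod(d_input, h_input)                 # copy input to video memory
cuda.Context.synchronize()
t0 = time.time()
context.execute_v2([int(d_input), int(d_output)])  # run the engine
cuda.Context.synchronize()
print("GPU time: %.3f ms" % ((time.time() - t0) * 1000))
cuda.memcpy_dtoh(h_output, d_output)               # copy result back to main memory
```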

#08 Exporting ONNX: problem cases

That was the best-case scenario. If you encounter operators that ONNX does not support, the first remedy is to upgrade the framework and the ONNX export tool so that you can use the highest opset currently available. That may still not be enough, because some of PyTorch's official ops have no ONNX definition (or are not registered). In that case, the ONNX_FALLTHROUGH option lets the export proceed even for undefined ops. If you encounter a developer-defined op, be sure to add a symbolic function to the custom Function subclass so that the custom op gets an ONNX node name. (See examples here: github.com/shining365/…) A sketch of both options follows.
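The sketch below is only illustrative: the op and module names are made up, the placeholder forward is not a real deformable convolution, and the exact behavior of symbolic functions and custom domains varies across PyTorch versions.

```python
# Sketch of the two export options described above (names are illustrative).
import torch

class MyDeformConv(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, offset):
        return x + offset                     # placeholder for the real custom computation

    @staticmethod
    def symbolic(g, x, offset):
        # The symbolic() method tells the exporter which ONNX node to emit
        # for this custom Function (domain/name chosen here for illustration).
        return g.op("custom_ops::MyDeformConv", x, offset)

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return MyDeformConv.apply(x, torch.ones_like(x))

# ONNX_FALLTHROUGH keeps the export going even for ops ONNX does not define.
torch.onnx.export(
    TinyModel(), torch.randn(1, 3, 8, 8), "custom_op.onnx",
    opset_version=12,
    operator_export_type=torch.onnx.OperatorExportTypes.ONNX_FALLTHROUGH,
)
```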

You may also encounter errors when importing the ONNX into TensorRT with trtexec. A common case is an unsupported op, which we discuss later. Another case is that the TensorRT parser has special requirements for the structure of the ONNX network. Let's look at a concrete example. The highlighted error message reads "Resize scales must be an initializer!" For richer debugging output, turn on the --verbose option when running trtexec. As the figure shows, the Resize node has three inputs: 385, 402, and 401. These numbers are not input values but the names of the input variables, so we need to look more closely at how these three variables are produced.

Make sure verbose=True was set when exporting the ONNX so that you get a text description of it, as shown above. You can see the Resize node at the bottom of the figure, with its three input variables highlighted, each produced by its own computation. Since ONNX is itself a computation graph, we can draw a diagram that shows this part more clearly.

#09 Operating on ONNX: Graph Surgeon

Here is the ONNX subgraph around this Resize. Its third input, 401, comes from a Concat operation that concatenates three variables: one is a Constant, and the other two are Constants passed through Unsqueeze and Cast. The error message "Resize scales must be an initializer!" means that the third input of Resize cannot be a variable; it has to be a Constant. So we need to fold this blue subgraph into a single Constant, which is what the right-hand side of the figure shows. Once this is done, the TensorRT parser runs normally. The conversion is possible in principle because the leaf nodes of the subgraph are Constants whose concrete values are stored in the graph: we can perform the computation described by the subgraph by hand and store the result in a newly created Constant node. The tool for doing this is Graph Surgeon.

Graph Surgeon modifies ONNX computation graphs like a scalpel. The slide above shows the code that transforms the computation graph with Graph Surgeon.

1. First, find the Resize nodes that meet the condition, the filter being that the third input variable must come from a Concat node.
2. Then loop over all the inputs of this Concat, walking upward until a Constant is found and collecting the values inside each Constant. By the end of the loop, all the values to be merged have been gathered.
3. Create a Constant node, use NumPy's concatenate to merge the collected values into it, and connect its output in place of the original input.

A complete Graph Surgeon example is available here (github.com/NVIDIA/trt-…). As you handle more special cases this way, you accumulate more node-processing techniques, and more models can be parsed correctly by the TensorRT parser. A hedged sketch using the onnx-graphsurgeon package follows.
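The sketch below uses the onnx-graphsurgeon package and is simplified relative to the full example linked above; the file names are placeholders, and the upward walk assumes the Unsqueeze/Cast/Constant structure described in the figure.

```python
# Fold the Concat feeding Resize's "scales" input into a single Constant.
import numpy as np
import onnx
import onnx_graphsurgeon as gs

def constant_value(tensor):
    """Walk upward through Cast/Unsqueeze/Constant until concrete values are found."""
    while True:
        if isinstance(tensor, gs.Constant):
            return tensor.values
        node = tensor.inputs[0]                  # producer node of this variable
        if node.op == "Constant":
            return node.attrs["value"].values    # Constant op stores its tensor in attrs
        tensor = node.inputs[0]                  # step through Unsqueeze / Cast

graph = gs.import_onnx(onnx.load("edvr.onnx"))

for node in graph.nodes:
    # 1. Find Resize nodes whose "scales" input (input 2) comes from a Concat.
    if node.op != "Resize" or len(node.inputs) < 3:
        continue
    scales = node.inputs[2]
    if not scales.inputs or scales.inputs[0].op != "Concat":
        continue
    concat = scales.inputs[0]

    # 2. Collect the concrete values behind each Concat input.
    values = [np.atleast_1d(constant_value(t)) for t in concat.inputs]

    # 3. Replace the Concat output with a single Constant holding the merged values.
    node.inputs[2] = gs.Constant(name=scales.name + "_folded",
                                 values=np.concatenate(values).astype(np.float32))

graph.cleanup()   # drop the now-unused Concat/Unsqueeze/Cast/Constant nodes
onnx.save(gs.export_onnx(graph), "edvr_fixed.onnx")
```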

#10 When an unsupported op is encountered

When trtexec reports an unsupported op, we have to write a TensorRT plugin. A TensorRT plugin is an extension mechanism that lets you add whatever functionality you need. Writing a plugin is essentially "filling in the blanks" in a template, and the crucial blank is the compute code that runs on the GPU. Users who lack CUDA programming experience should reuse existing code as much as possible and avoid writing a new CUDA kernel. Here we demonstrated how to wrap EDVR's Deformable Convolution as a TensorRT plugin (the code is here: github.com/shining365/…).

#11 Using FP16 / INT8 to speed up computation

If the model already runs successfully on TensorRT, consider using FP16 / INT8 for further acceleration. TensorRT runs in FP32 by default and supports FP16/INT8 acceleration on Volta, Turing, and Ampere GPUs. Using FP16 is as simple as setting a flag when building the engine; in trtexec this is the --fp16 option, which sets that flag (see the sketch below). To illustrate how much FP16 matters: for EDVR, the ONNX-derived model run directly in FP32 has a speedup ratio of only 0.9, i.e. slower than the original model, but turning on FP16 yields a 1.8x speedup. The impact of FP16 on accuracy is small. INT8 quantization requires a calibration data set, and such post-training quantization generally loses some accuracy; if that loss is a concern, consider Quantization Aware Training.
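In code this looks roughly as follows, using the older builder property to stay consistent with the earlier sketch; newer TensorRT versions set the flag on a builder config instead.

```python
# Enabling FP16 when building the engine; equivalent to trtexec's --fp16 flag.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
builder.fp16_mode = True          # older API: allow TensorRT to pick FP16 kernels

# Newer API, via the builder config:
# config = builder.create_builder_config()
# config.set_flag(trt.BuilderFlag.FP16)
```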

#12 Getting the ultimate performance out of TensorRT

The above covers general TensorRT usage; once you become a TensorRT expert, you can think about how to squeeze out its maximum performance.

1) Build the network with the API. For EDVR I have run the model on TensorRT in two ways: exported via ONNX, the speedup ratios are 0.9 and 1.8 at FP32 and FP16; built with the TensorRT API, they are 1.1 and 2.7. Building with the API clearly pays off, so if the model is particularly important, consider constructing it with the API.
2) Optimize hotspots. With Nsight Systems you can find the most time-consuming operations and focus your optimization effort there.
3) Use plugins to manually fuse all the layers that can be fused.

If you do all of these, you can get close to the best possible performance on TensorRT.

#13 Summary and suggestions

The development method recommended today is to import the model with the ONNX parser. You need to be familiar with Graph Surgeon and use it to handle all kinds of special cases. Custom plugins may be required, wrapping existing CUDA code where possible. We recommend mixed precision: FP16 in particular is easy to use and works well, while INT8 offers even higher compute performance but generally reduces accuracy. If you want to go further, try building the network with the API and writing and optimizing your own CUDA kernels.

#14 Sample code

That’s all I have to share, thank you.

Live playback: www.livevideostack.cn/video/gary-…