TensorRT is NVIDIA’s own high-performance inference library. Its Getting Started page lists the various documentation entry points.

This article, based on the current TensorRT version 8.2, walks you through an ONNX model from installation to accelerated inference.

Installation

Go to the TensorRT download page and select the version to download.

Here TensorRT-8.2.2.1.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz is selected. Note that the CUDA and cuDNN versions must match your environment. Alternatively, you can prepare NVIDIA Docker, pull the nvidia/cuda image of the corresponding version, and add TensorRT into it.

# Extract into $HOME (so the samples can be compiled by the current user without sudo)
tar -xzvf TensorRT-*.tar.gz -C $HOME/
# Symlink to /usr/local/TensorRT (a fixed path)
sudo ln -s $HOME/TensorRT-8.2.2.1 /usr/local/TensorRT

After that, compile and run the sample to make sure TensorRT is installed correctly.

Compile the sample

The samples are located in TensorRT/samples; they are described in the Sample Support Guide and in the README.md of each sample directory.

cd /usr/local/TensorRT/samples/

# Set environment variables (see Makefile.config)
export CUDA_INSTALL_DIR=/usr/local/cuda
export CUDNN_INSTALL_DIR=/usr/local/cuda
export ENABLE_DLA=
export TRT_LIB_DIR=../lib
export PROTOBUF_INSTALL_DIR=

# compile
make -j`nproc`

# run
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
cd /usr/local/TensorRT/
./bin/trtexec -h
./bin/sample_mnist -d data/mnist/ --fp16

Reference output:

$ ./bin/sample_mnist -d data/mnist/ --fp16
&&&& RUNNING TensorRT.sample_mnist [TensorRT v8202] # ./bin/sample_mnist -d data/mnist/ --fp16
[12/23/2021-20:20:16] [I] Building and running a GPU inference engine for MNIST
[12/23/2021-20:20:16] [I] [TRT] [MemUsageChange] Init CUDA: CPU +322, GPU +0, now: CPU 333, GPU 600 (MiB)
[12/23/2021-20:20:16] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 333 MiB, GPU 600 MiB
[12/23/2021-20:20:16] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 468 MiB, GPU 634 MiB
[12/23/2021-20:20:17] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +518, GPU +224, now: CPU 988, GPU 858 (MiB)
[12/23/2021-20:20:17] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +114, GPU +52, now: CPU 1102, GPU 910 (MiB)
[12/23/2021-20:20:17] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[12/23/2021-20:20:33] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[12/23/2021-20:20:34] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[12/23/2021-20:20:34] [I] [TRT] Total Host Persistent Memory: 8448
[12/23/2021-20:20:34] [I] [TRT] Total Device Persistent Memory: 1626624
[12/23/2021-20:20:34] [I] [TRT] Total Scratch Memory: 0
[12/23/2021-20:20:34] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 13 MiB
[12/23/2021-20:20:34] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.01595ms to assign 3 blocks to 8 nodes requiring 57857 bytes.
[12/23/2021-20:20:34] [I] [TRT] Total Activation Memory: 57857
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1621, GPU 1116 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1621, GPU 1124 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +4, now: CPU 0, GPU 4 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1622, GPU 1086 (MiB)
[12/23/2021-20:20:34] [I] [TRT] Loaded engine size: 1 MiB
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1622, GPU 1096 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 1623, GPU 1104 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1, now: CPU 0, GPU 1 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1485, GPU 1080 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1485, GPU 1088 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 0, GPU 3 (MiB)
[12/23/2021-20:20:34] [I] Input:
(ASCII art of the MNIST input digit omitted)
[12/23/2021-20:20:34] [I] Output:
(per-digit output scores omitted)
&&&& PASSED TensorRT.sample_mnist [TensorRT v8202] # ./bin/sample_mnist -d data/mnist/ --fp16

Quick start

Quick Start Guide / Using The TensorRT Runtime API

Get the tutorial code and compile it:

git clone --depth 1 https://github.com/NVIDIA/TensorRT.git

export CUDA_INSTALL_DIR=/usr/local/cuda
export CUDNN_INSTALL_DIR=/usr/local/cuda
export TRT_LIB_DIR=/usr/local/TensorRT/lib

# compile quickstart
cd TensorRT/quickstart
# Modify Makefile.config to add the TensorRT include path:
#   INCPATHS += -I"/usr/local/TensorRT/include"
# Modify common/logging.h so the log() override matches the TensorRT 8 API:
#   void log(Severity severity, const char* msg) noexcept override
make

# Runtime environment
export PATH=/usr/local/TensorRT/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
cd SemanticSegmentation

Get the pre-trained FCN-ResNet-101 model and convert it to ONNX:

# Option 1: create a local environment
# conda create -n torch python=3.9 -y
# conda activate torch
# conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y
# Option 2: use a container environment
# docker run --rm -it --gpus all -p 8888:8888 -v `pwd`:/workspace/SemanticSegmentation -w /workspace nvcr.io/nvidia/pytorch:20.12-py3 bash
$ python export.py
Exporting ppm image input.ppm
Downloading: "https://github.com/pytorch/vision/archive/v0.6.0.zip"To/home/John/cache/torch/hub/v0.6.0. Zip Downloading:"https://download.pytorch.org/models/resnet101-5d3b4d8f.pth"To/home/John/cache/torch/hub/checkpoints/resnet101-5 d3b4d8f. PTH 100% | █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ | 170 m / 170 m [00:27 < 00:00, 6.57 MB/s] Downloading:"https://download.pytorch.org/models/fcn_resnet101_coco-7ecb50ca.pth"to /home/john/.cache/torch/hub/checkpoints/fcn_resnet101_coco-7ecb50ca.pth 100% | █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ | 208 m / 208 m [02:26 < 00:00, 1.49 MB/s] Exporting ONNX model FCN - resnet101. ONNXCopy the code

Use trtexec to convert the ONNX model to a TensorRT engine:

$ trtexec --onnx=fcn-resnet101.onnx --fp16 --workspace=64 --minShapes=input:1x3x256x256 --optShapes=input:1x3x1026x1282 --maxShapes=input:1x3x1440x2560 --buildOnly --saveEngine=fcn-resnet101.engine
...
[01/07/2022-20:20:00] [I] Engine built in 406.011 sec.
&&&& PASSED TensorRT.trtexec [TensorRT v8202] ...
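
As a side note (not part of the original walkthrough), the same dynamic-shape build can be expressed with the TensorRT Python API, which makes the meaning of --minShapes/--optShapes/--maxShapes explicit as an optimization profile. A hedged sketch against the TensorRT 8.2-era API (requires the tensorrt Python wheel, installed later in this article):

# Rough Python-API equivalent of the trtexec command above (a sketch, not the article's method).
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open('fcn-resnet101.onnx', 'rb') as f:
    assert parser.parse(f.read())                     # parse the ONNX model

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)                 # --fp16
config.max_workspace_size = 64 << 20                  # --workspace=64 (MiB)

profile = builder.create_optimization_profile()       # --minShapes / --optShapes / --maxShapes
profile.set_shape('input', (1, 3, 256, 256), (1, 3, 1026, 1282), (1, 3, 1440, 2560))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)   # --buildOnly
with open('fcn-resnet101.engine', 'wb') as f:         # --saveEngine
    f.write(engine_bytes)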

Test the engine with random input:

$ trtexec --shapes=input:1x3x1026x1282 --loadEngine=fcn-resnet101.engine
...
[01/07/2022-20:20:00] [I] === Performance summary ===
[01/07/2022-20:20:00] [I] Throughput: ... qps
[01/07/2022-20:20:00] [I] Latency: min = 76.9746 ms, max = 98.8354 ms, mean = 79.5844 ms, median = 78.0542 ms, percentile(99%) = 98.8354 ms
[01/07/2022-20:20:00] [I] End-to-End Host Latency: min = 150.9442 ms, max = 188.431 ms, mean = 155.434 ms, median = 152.444 ms, percentile(99%) = 188.431 ms
...
[01/07/2022-20:20:00] [I] GPU Compute Time: min = 75.2869 ms, max = 97.1318 ms, mean = 77.8847 ms, median = 76.3599 ms, percentile(99%) = 97.1318 ms
...
[01/07/2022-20:20:00] [I] Total Host Walltime: 3.2866 s
[01/07/2022-20:20:00] [I] Total GPU Compute Time: 3.19312 s
[01/07/2022-20:20:00] [I] Explanations of the performance metrics are printed in the verbose logs.
[01/07/2022-20:20:00] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8202] ...

Run the tutorial using the engine:

$ ./bin/segmentation_tutorial
[01/07/2022-20:20:34] [I] [TRT] [MemUsageChange] Init CUDA: CPU +322, GPU +0, now: CPU 463, GPU 707 (MiB)
[01/07/2022-20:20:34] [I] [TRT] Loaded engine size: 132 MiB
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +520, GPU +224, now: CPU 984, GPU 1065 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +115, GPU +52, now: CPU 1099, GPU 1117 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +131, now: CPU 0, GPU 131 (MiB)
[01/07/2022-20:20:35] [I] Running TensorRT inference for FCN-ResNet101
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 966, GPU 1109 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 966, GPU 1117 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +722, now: CPU 0, GPU 853 (MiB)

Practice

The sections above covered compiling and using the official samples and the tutorial. Here we take another model, RVM, and go through the whole process from scratch.

Prepare the model

Robust Video Matting (RVM) performs real-time, high-definition video matting on arbitrary videos. There is a webcam demo that can be tried in the browser.

Prepare the ONNX model rvm_mobilenetv3_fp32.onnx; its inference documentation describes the model inputs and outputs:

  • Input: [src, r1i, r2i, r3i, r4i, downsample_ratio]
    • src: input frame, RGB channels, shape [B, C, H, W], range 0~1
    • rXi: memory (recurrent state) inputs; the initial value is a zero tensor of shape [1, 1, 1, 1]
    • downsample_ratio: downsampling ratio, a tensor of shape [1]
    • only downsample_ratio must be FP32; the other inputs must have the same dtype as the loaded model
  • Output: [fgr, pha, r1o, r2o, r3o, r4o]
    • fgr, pha: foreground and alpha (transparency) outputs, range 0~1
    • rXo: memory (recurrent state) outputs
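
To double-check this I/O spec against the actual file, here is a small sketch (not from the RVM docs) that lists the model's inputs and outputs with onnxruntime:

# Print the input/output names, shapes and dtypes of the ONNX model (a sketch).
import onnxruntime as ort

sess = ort.InferenceSession('rvm_mobilenetv3_fp32.onnx', providers=['CPUExecutionProvider'])
for i in sess.get_inputs():
    print('input :', i.name, i.shape, i.type)
for o in sess.get_outputs():
    print('output:', o.name, o.shape, o.type)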

Prepare an input image, input.jpg. We skip video to keep the code simple.

Prepare the environment

  • Anaconda
  • PyTorch
conda create -n torch python=3.9 -y
conda activate torch
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y

# Requirements
# https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements
pip install onnx onnxruntime-gpu==1.10

Run the ONNX model

rvm_onnx_infer.py:

import onnxruntime as ort
import numpy as np
from PIL import Image

# fetch image
with Image.open('input.jpg') as img:
    img.load()
# HWC [0,255] -> BCHW [0,1]
src = np.array(img)
src = np.moveaxis(src, -1, 0).astype(np.float32)
src = src[np.newaxis, :] / 255.

# Load model
sess = ort.InferenceSession('rvm_mobilenetv3_fp32.onnx', providers=['CUDAExecutionProvider'])

# Create an IO binding
io = sess.io_binding()

# Create tensors on CUDA
rec = [ort.OrtValue.ortvalue_from_numpy(np.zeros([1, 1, 1, 1], dtype=np.float32), 'cuda')] * 4
downsample_ratio = ort.OrtValue.ortvalue_from_numpy(np.asarray([0.25], dtype=np.float32), 'cuda')

# Set the outputs
for name in ['fgr', 'pha', 'r1o', 'r2o', 'r3o', 'r4o']:
    io.bind_output(name, 'cuda')

# inference
io.bind_cpu_input('src', src)
io.bind_ortvalue_input('r1i', rec[0])
io.bind_ortvalue_input('r2i', rec[1])
io.bind_ortvalue_input('r3i', rec[2])
io.bind_ortvalue_input('r4i', rec[3])
io.bind_ortvalue_input('downsample_ratio', downsample_ratio)

sess.run_with_iobinding(io)

fgr, pha, *rec = io.get_outputs()

# Only fgr and pha are copied back to the CPU
fgr = fgr.numpy()
pha = pha.numpy()

# Compose RGBA
com = np.where(pha > 0, fgr, pha)
com = np.concatenate([com, pha], axis=1) # + alpha
# BCHW [0,1] -> HWC [0,255]
com = np.squeeze(com, axis=0)
com = np.moveaxis(com, 0, -1) * 255

img = Image.fromarray(com.astype(np.uint8))
img.show()

Run:

python rvm_onnx_infer.py --model "rvm_mobilenetv3_fp32.onnx" --input-image "input.jpg" --precision float32 --show

Result (transparent background):

Convert ONNX to a TRT model

Use trtexec to convert the ONNX model into a TensorRT engine:

export PATH=/usr/local/TensorRT/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH

trtexec --onnx=rvm_mobilenetv3_fp32.onnx --workspace=64 --saveEngine=rvm_mobilenetv3_fp32.engine --verbose

A problem occurs:

[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:773: While parsing node number 3 [Resize -> "389"]:
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:774: --- Begin node ---
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:775: input: "src"
input: "386"
input: "388"
output: "389"
name: "Resize_3"
op_type: "Resize"
attribute {
  name: "coordinate_transformation_mode"
  s: "pytorch_half_pixel"
  type: STRING
}
attribute {
  name: "cubic_coeff_a"F: 0.75type: FLOAT
}
attribute {
  name: "mode"
  s: "linear"
  type: STRING
}
attribute {
  name: "nearest_mode"
  s: "floor"
  type: STRING
}

[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:776: --- End node ---
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:779: ERROR: builtin_op_importers.cpp:3608 In function importResize:
[8] Assertion failed: scales.is_weights() && "Resize scales must be an initializer!"

At this point, the model needs to be modified.

First, install the necessary tools:

snap install netron
pip install onnx-simplifier
pip install onnx_graphsurgeon --index-url https://pypi.ngc.nvidia.com

Then view the model's Resize_3 node with Netron:

We find that the scales input is computed from downsample_ratio, i.e. [1, 1, downsample_ratio, downsample_ratio], so it can be replaced with a constant using ONNX GraphSurgeon.
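
The Netron screenshot is not reproduced here; the same node can also be inspected programmatically. A minimal sketch with ONNX GraphSurgeon (an alternative to Netron, not part of the original steps):

# Print the Resize_3 node and its inputs instead of viewing them in Netron (a sketch).
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load('rvm_mobilenetv3_fp32.onnx'))
resize_3 = [n for n in graph.nodes if n.name == 'Resize_3'][0]
print(resize_3)   # inputs include tensor '388', the scales computed from downsample_ratio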

Finally, the model modification steps are as follows:

# Simplify the ONNX model and switch it to a static input size
python -m onnxsim rvm_mobilenetv3_fp32.onnx rvm_mobilenetv3_fp32_sim.onnx \
  --input-shape src:1,3,1080,1920 r1i:1,1,1,1 r2i:1,1,1,1 r3i:1,1,1,1 r4i:1,1,1,1

# Modify the model with ONNX GraphSurgeon
python rvm_onnx_modify.py -i rvm_mobilenetv3_fp32_sim.onnx --input-size 1920 1280

# trtexec transforms ONNX into TensorRT engine
trtexec --onnx=rvm_mobilenetv3_fp32_sim_modified.onnx --workspace=64 --saveEngine=rvm_mobilenetv3_fp32_sim_modified.engine

rvm_onnx_modify.py:

import onnx
import onnx_graphsurgeon as gs
import numpy as np


def modify(input: str, output: str, downsample_ratio: float = 0.25) -> None:
    print(f'\nonnx load: {input}')
    graph = gs.import_onnx(onnx.load(input))

    _print_graph(graph)  # helper defined elsewhere in the full rvm_onnx_modify.py

    # update node Resize_3: scales
    resize_3 = [n for n in graph.nodes if n.name == 'Resize_3'][0]
    print()
    print(resize_3)

    scales = gs.Constant('388',
        np.asarray([1, 1, downsample_ratio, downsample_ratio], dtype=np.float32))

    resize_3.inputs = [i if i.name != '388' else scales for i in resize_3.inputs]
    print()
    print(resize_3)

    # remove input downsample_ratio
    graph.inputs = [i for i in graph.inputs if i.name != 'downsample_ratio']

    # remove node Concat_2
    concat_2 = [n for n in graph.nodes if n.name == 'Concat_2'][0]
    concat_2.outputs.clear()

    # remove unused nodes/tensors
    graph.cleanup()

    onnx.save(gs.export_onnx(graph), output)
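
After the modification, a quick sanity check (a minimal sketch, not part of the original steps) confirms that the model is valid and that downsample_ratio is no longer a graph input:

# Sanity-check the modified model before converting it (a sketch).
import onnx

m = onnx.load('rvm_mobilenetv3_fp32_sim_modified.onnx')
onnx.checker.check_model(m)
print([i.name for i in m.graph.input])   # expect: ['src', 'r1i', 'r2i', 'r3i', 'r4i']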

Output difference between ONNX and TRT models

Polygraphy can be used to see the output difference between the ONNX and TRT models.

First, install it:

# Install the TensorRT Python API
cd /usr/local/TensorRT/python/
pip install tensorrt-8.2.2.1-cp39-none-linux_x86_64.whl
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
python -c "import tensorrt; print(tensorrt.__version__)"

# Install Polygraphy (or install it from source under TensorRT/tools/Polygraphy)
python -m pip install colored polygraphy --extra-index-url https://pypi.ngc.nvidia.com

Run ONNX and TRT models and compare the output errors:

# Run the ONNX model and save its inputs and outputs
polygraphy run rvm_mobilenetv3_fp32_sim_modified.onnx --onnxrt --val-range [0,1] --save-inputs onnx_inputs.json --save-outputs onnx_outputs.json

# Run the TRT model, load the ONNX inputs and outputs, and compare the relative and absolute errors of the outputs
polygraphy run rvm_mobilenetv3_fp32_sim_modified.engine --model-type engine --trt --load-inputs onnx_inputs.json --load-outputs onnx_outputs.json --rtol 1e-3 --atol 1e-3

We can see that the FP32 error is within 1e-3, PASSED:

[I]     PASSED | All outputs matched | Outputs: ['r4o', 'r3o', 'r2o', 'r1o', 'fgr', 'pha']
[I] PASSED | Command: /home/john/anaconda3/envs/torch/bin/polygraphy run rvm_mobilenetv3_fp32_sim_modified.engine --model-type engine --trt --load-inputs onnx_inputs.json --load-outputs onnx_outputs.json --rtol 1e-3 --atol 1e-3

FP16 was also tried; its accuracy loss is relatively large, FAILED:

[E]     FAILED | Mismatched outputs: ['r4o', 'r3o', 'r2o', 'r1o', 'fgr', 'pha']
[!] FAILED | Command: /home/john/anaconda3/envs/torch/bin/polygraphy run rvm_mobilenetv3_fp16_sim_modified.engine --model-type engine --trt --load-inputs onnx_inputs.json --load-outputs onnx_outputs.json --rtol 1e-3 --atol 1e-3
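
For intuition, the --rtol/--atol check is essentially an elementwise tolerance comparison per output; a rough numpy sketch of the idea (not Polygraphy's actual implementation):

# Elementwise tolerance check in the spirit of `--rtol 1e-3 --atol 1e-3` (a sketch).
import numpy as np

def outputs_match(onnx_out: np.ndarray, trt_out: np.ndarray,
                  rtol: float = 1e-3, atol: float = 1e-3) -> bool:
    return bool(np.all(np.abs(trt_out - onnx_out) <= atol + rtol * np.abs(onnx_out)))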

Run the TRT model

Here we use the TensorRT C++ runtime API as an example to run the converted RVM TRT model. See rvm_infer.cc for the complete code.

1. Load the model: create a runtime and deserialize the TRT engine file data

static Logger logger{Logger::Severity::kINFO};
auto runtime = std::unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(logger));
auto engine = runtime->deserializeCudaEngine(engine_data.data(), fsize, nullptr);

Traverse all input and output bindings:

auto nb = engine->getNbBindings();
for (int32_t i = 0; i < nb; i++) {
  auto is_input = engine->bindingIsInput(i);
  auto name = engine->getBindingName(i);
  auto dims = engine->getBindingDimensions(i);
  auto datatype = engine->getBindingDataType(i);
  // ...
}
Engine
  Name=Unnamed Network 0
  DeviceMemorySize=148 MiB
  MaxBatchSize=1
Bindings
  Input[0]  name=src dims=[1,3,1080,1920] datatype=FLOAT
  Input[1]  name=r1i dims=[1,1,1,1] datatype=FLOAT
  Input[2]  name=r2i dims=[1,1,1,1] datatype=FLOAT
  Input[3]  name=r3i dims=[1,1,1,1] datatype=FLOAT
  Input[4]  name=r4i dims=[1,1,1,1] datatype=FLOAT
  Output[5] name=r4o dims=[1,64,18,32] datatype=FLOAT
  Output[6] name=r3o dims=[1,40,36,64] datatype=FLOAT
  Output[7] name=r2o dims=[1,20,72,128] datatype=FLOAT
  Output[8] name=r1o dims=[1,16,144,256] datatype=FLOAT
  Output[9] name=fgr dims=[1,3,1080,1920] datatype=FLOAT
  Output[10] name=pha dims=[1,1,1080,1920] datatype=FLOAT

After that, allocate device memory for all bindings:

auto nb = engine->getNbBindings();
std::vector<void*> bindings(nb, nullptr);
std::vector<int32_t> bindings_size(nb, 0);
for (int32_t i = 0; i < nb; i++) {
  auto dims = engine->getBindingDimensions(i);
  auto size = GetMemorySize(dims, sizeof(float));
  if (cudaMalloc(&bindings[i], size) != cudaSuccess) {
    std::cerr << "ERROR: cuda memory allocation failed, size = " << size
        << " bytes" << std::endl;
    return false;
  }
  bindings_size[i] = size;
}

That’s it. We’re ready.

2. Pre-processing: convert the input data into the required input format and store it in the input bindings

Read the image with OpenCV and scale it to the src input size. Then convert the data from HWC BGR [0,255] to CHW RGB [0,1]; since batch=1, the batch dimension can be ignored.

// img: HWC BGR [0,255] u8
auto img = cv::imread(input_filename, cv::IMREAD_COLOR);
if (src_h != img.rows || src_w != img.cols) {
  cv::resize(img, img, cv::Size(src_w, src_h));
}

// src: BCHW RGB [0,1] fp32
auto src = cv::Mat(img.rows, img.cols, CV_32FC3);
{
  auto src_data = (float*)(src.data);
  for (int y = 0; y < src_h; ++y) {
    for (int x = 0; x < src_w; ++x) {
      auto &&bgr = img.at<cv::Vec3b>(y, x);
      /*r*/ *(src_data + y*src_w + x) = bgr[2] / 255.;
      /*g*/ *(src_data + src_n + y*src_w + x) = bgr[1] / 255.;
      /*b*/ *(src_data + src_n*2 + y*src_w + x) = bgr[0] / 255.;
    }
  }
}

if (cudaMemcpyAsync(bindings[0], src.data, bindings_size[0], cudaMemcpyHostToDevice, stream) != cudaSuccess) {
  std::cerr << "ERROR: CUDA memory copy of src failed, size = "
      << bindings_size[0] << " bytes" << std::endl;
  return false;
}

3. Inference: pass the bindings to the engine execution context for inference

auto context = std::unique_ptr<nvinfer1::IExecutionContext>(
    engine->createExecutionContext());
if (!context) {
  return false;
}

bool status = context->enqueueV2(bindings.data(), stream, nullptr);
if (!status) {
  std::cout << "ERROR: TensorRT inference failed" << std::endl;
  return false;
}

4. Post-processing: extract the data from the output bindings and process it according to the output format

Receive the output foreground fgr and alpha channel pha into cv::Mat:

auto fgr = cv::Mat(src_h, src_w, CV_32FC3);  // BCHW [0,1] fp32
if (cudaMemcpyAsync(fgr.data, bindings[9], bindings_size[9], cudaMemcpyDeviceToHost, stream) != cudaSuccess) {
  std::cerr << "ERROR: CUDA memory copy of output failed, size = "
      << bindings_size[9] << " bytes" << std::endl;
  return false;
}
auto pha = cv::Mat(src_h, src_w, CV_32FC1);  // BCHW A [0,1] fp32
if (cudaMemcpyAsync(pha.data, bindings[10], bindings_size[10], cudaMemcpyDeviceToHost, stream) != cudaSuccess) {
  std::cerr << "ERROR: CUDA memory copy of output failed, size = "
      << bindings_size[10] << " bytes" << std::endl;
  return false;
}
cudaStreamSynchronize(stream);

Then compose fgr and pha into RGBA data and restore the original size:

// Compose `fgr` and `pha`
auto com = cv::Mat(src_h, src_w, CV_8UC4);  // HWC BGRA [0,255] u8
{
  auto fgr_data = (float*)(fgr.data);
  auto pha_data = (float*)(pha.data);
  for (int y = 0; y < com.rows; ++y) {
    for (int x = 0; x < com.cols; ++x) {
      auto &&elem = com.at<cv::Vec4b>(y, x);
      auto alpha = *(pha_data + y*src_w + x);
      if (alpha > 0) {
        /*r*/ elem[2] = *(fgr_data + y*src_w + x) * 255;
        /*g*/ elem[1] = *(fgr_data + src_n + y*src_w + x) * 255;
        /*b*/ elem[0] = *(fgr_data + src_n*2 + y*src_w + x) * 255;
      } else {
        /*r*/ elem[2] = 0;
        /*g*/ elem[1] = 0;
        /*b*/ elem[0] = 0;
      }
      /*a*/ elem[3] = alpha * 255;
    }
  }
}

if (dst_h != com.rows || dst_w != com.cols) {
  cv::resize(com, com, cv::Size(dst_w, dst_h));
}

5. The matting result obtained at runtime (transparent background):

Finally

If you want to get started with TensorRT, get your hands dirty!

GoCoding shares hands-on practical experience; follow the official account!