
TensorRT is NVIDIA's high-performance inference SDK. Its Getting Started page lists the relevant documentation entry points.

This article is based on the current TensorRT 8.2 release and walks through, step by step, everything from installation to accelerated inference of your own ONNX model.

Install

On the TensorRT download page, select the version to download; registration and login are required.

This article uses TensorRT-8.2.2.1.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz ; note the CUDA and cuDNN versions it is built against. Alternatively, with NVIDIA Docker you can pull the corresponding nvidia/cuda image and ADD TensorRT into it.

# Extract into $HOME (so the samples can be built by the current user without sudo)
tar -xzvf TensorRT-*.tar.gz -C $HOME/
# Symlink to /usr/local/TensorRT (to have a fixed path)
sudo ln -s $HOME/TensorRT-8.2.2.1 /usr/local/TensorRT

After that, compile and run the sample to ensure that TensorRT is installed correctly.

Compile the sample

The samples are in TensorRT/samples ; for descriptions, see the Sample Support Guide or the README.md in each sample directory.

cd /usr/local/TensorRT/samples/

# Set environment variables (see Makefile.config)
export CUDA_INSTALL_DIR=/usr/local/cuda
export CUDNN_INSTALL_DIR=/usr/local/cuda
export ENABLE_DLA=
export TRT_LIB_DIR=../lib
export PROTOBUF_INSTALL_DIR=

# Build
make -j`nproc`

# Run
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
cd /usr/local/TensorRT/
./bin/trtexec -h
./bin/sample_mnist -d data/mnist/ --fp16

The output should look like the following:

$ ./bin/sample_mnist -d data/mnist/ --fp16
&&&& RUNNING TensorRT.sample_mnist [TensorRT v8202] # ./bin/sample_mnist -d data/mnist/ --fp16
[12/23/2021-20:20:16] [I] Building and running a GPU inference engine for MNIST
[12/23/2021-20:20:16] [I] [TRT] [MemUsageChange] Init CUDA: CPU +322, GPU +0, now: CPU 333, GPU 600 (MiB)
[12/23/2021-20:20:16] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 333 MiB, GPU 600 MiB
[12/23/2021-20:20:16] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 468 MiB, GPU 634 MiB
[12/23/2021-20:20:17] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +518, GPU +224, now: CPU 988, GPU 858 (MiB)
[12/23/2021-20:20:17] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +114, GPU +52, now: CPU 1102, GPU 910 (MiB)
[12/23/2021-20:20:17] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[12/23/2021-20:20:33] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[12/23/2021-20:20:34] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[12/23/2021-20:20:34] [I] [TRT] Total Host Persistent Memory: 8448
[12/23/2021-20:20:34] [I] [TRT] Total Device Persistent Memory: 1626624
[12/23/2021-20:20:34] [I] [TRT] Total Scratch Memory: 0
[12/23/2021-20:20:34] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 13 MiB
[12/23/2021-20:20:34] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.01595ms to assign 3 blocks to 8 nodes requiring 57857 bytes.
[12/23/2021-20:20:34] [I] [TRT] Total Activation Memory: 57857
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1621, GPU 1116 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1621, GPU 1124 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +4, now: CPU 0, GPU 4 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1622, GPU 1086 (MiB)
[12/23/2021-20:20:34] [I] [TRT] Loaded engine size: 1 MiB
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1622, GPU 1096 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 1623, GPU 1104 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1, now: CPU 0, GPU 1 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1485, GPU 1080 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1485, GPU 1088 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 0, GPU 3 (MiB)
[12/23/2021-20:20:34] [I] Input:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@%+-:  =@@@@@@@@@@@@
@@@@@@@%=      -@@@**@@@@@@@
@@@@@@@   :%#@-#@@@. #@@@@@@
@@@@@@*  +@@@@:*@@@  *@@@@@@
@@@@@@#  +@@@@ @@@%  @@@@@@@
@@@@@@@.  :%@@.@@@. *@@@@@@@
@@@@@@@@-   =@@@@. -@@@@@@@@
@@@@@@@@@%:   +@- :@@@@@@@@@
@@@@@@@@@@@%.  : -@@@@@@@@@@
@@@@@@@@@@@@@+   #@@@@@@@@@@
@@@@@@@@@@@@@@+  :@@@@@@@@@@
@@@@@@@@@@@@@@+   *@@@@@@@@@
@@@@@@@@@@@@@@: =  @@@@@@@@@
@@@@@@@@@@@@@@ :@  @@@@@@@@@
@@@@@@@@@@@@@@ -@  @@@@@@@@@
@@@@@@@@@@@@@# +@  @@@@@@@@@
@@@@@@@@@@@@@* ++  @@@@@@@@@
@@@@@@@@@@@@@*    *@@@@@@@@@
@@@@@@@@@@@@@#   =@@@@@@@@@@
@@@@@@@@@@@@@@. +@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

[12/23/2021-20:20:34] [I] Output:
0:
1:
2:
3:
4:
5:
6:
7:
8: **********
9:

&&&& PASSED TensorRT.sample_mnist [TensorRT v8202] # ./bin/sample_mnist -d data/mnist/ --fp16

Quick start

Quick Start Guide / Using The TensorRT Runtime API

Prepare the tutorial code and build it:

git clone --depth 1 https://github.com/NVIDIA/TensorRT.git

export CUDA_INSTALL_DIR=/usr/local/cuda
export CUDNN_INSTALL_DIR=/usr/local/cuda
export TRT_LIB_DIR=/usr/local/TensorRT/lib

# Build quickstart
cd TensorRT/quickstart
# If needed, adjust Makefile.config:
#  INCPATHS += -I"/usr/local/TensorRT/include"
# and common/logging.h:
#  void log(Severity severity, const char* msg) noexcept override
make

# Runtime environment
export PATH=/usr/local/TensorRT/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
cd SemanticSegmentation

Get the pre-trained FCN-ResNet-101 model and convert it to ONNX:

# Create a local environment
#  conda create -n torch python=3.9 -y
#  conda activate torch
#  conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y
# Or, use a container environment instead
#  docker run --rm -it --gpus all -p 8888:8888 -v `pwd`:/workspace/SemanticSegmentation -w /workspace nvcr.io/nvidia/pytorch:20.12-py3 bash
$ python export.py
Exporting ppm image input.ppm
Downloading: "https://github.com/pytorch/vision/archive/v0.6.0.zip" to /home/john/.cache/torch/hub/v0.6.0.zip
Downloading: "https://download.pytorch.org/models/resnet101-5d3b4d8f.pth" to /home/john/.cache/torch/hub/checkpoints/resnet101-5d3b4d8f.pth
100%|████████████████████████████████████████| 170M/170M [00:27<00:00, 6.57MB/s]
Downloading: "https://download.pytorch.org/models/fcn_resnet101_coco-7ecb50ca.pth" to /home/john/.cache/torch/hub/checkpoints/fcn_resnet101_coco-7ecb50ca.pth
100%|████████████████████████████████████████| 208M/208M [02:26<00:00, 1.49MB/s]
Exporting ONNX model fcn-resnet101.onnx

Then use trtexec to convert the ONNX model into a TensorRT engine:

$ trtexec --onnx=fcn-resnet101.onnx --fp16 --workspace=64 --minShapes=input:1x3x256x256 --optShapes=input:1x3x1026x1282 --maxShapes=input:1x3x1440x2560 --buildOnly --saveEngine=fcn-resnet101.engine
...
[01/07/2022-20:20:00] [I] Engine built in 406.011 sec.
&&&& PASSED TensorRT.trtexec [TensorRT v8202] ...

Test the engine with random input:

$ trtexec --shapes=input:1x3x1026x1282 --loadEngine=fcn-resnet101.engine
...
[01/07/2022-20:20:00] [I] === Performance summary ===
[01/07/2022-20:20:00] [I] Throughput: 12.4749 qps
[01/07/2022-20:20:00] [I] Latency: min = 76.9746 ms, max = 98.8354 ms, mean = 79.5844 ms, median = 78.0542 ms, percentile(99%) = 98.8354 ms
[01/07/2022-20:20:00] [I] End-to-End Host Latency: min = 150.942 ms, max = 188.431 ms, mean = 155.834 ms, median = 152.444 ms, percentile(99%) = 188.431 ms
[01/07/2022-20:20:00] [I] Enqueue Time: min = 0.390625 ms, max = 1.61279 ms, mean = 1.41182 ms, median = 1.46136 ms, percentile(99%) = 1.61279 ms
[01/07/2022-20:20:00] [I] H2D Latency: min = 1.25977 ms, max = 1.53467 ms, mean = 1.27415 ms, median = 1.26514 ms, percentile(99%) = 1.53467 ms
[01/07/2022-20:20:00] [I] GPU Compute Time: min = 75.2869 ms, max = 97.1318 ms, mean = 77.8847 ms, median = 76.3599 ms, percentile(99%) = 97.1318 ms
[01/07/2022-20:20:00] [I] D2H Latency: min = 0.408447 ms, max = 0.454346 ms, mean = 0.425577 ms, median = 0.423004 ms, percentile(99%) = 0.454346 ms
[01/07/2022-20:20:00] [I] Total Host Walltime: 3.2866 s
[01/07/2022-20:20:00] [I] Total GPU Compute Time: 3.19327 s
[01/07/2022-20:20:00] [I] Explanations of the performance metrics are printed in the verbose logs.
[01/07/2022-20:20:00] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8202] ...

Run the tutorial program using the engine:

$ ./bin/segmentation_tutorial
[01/07/2022-20:20:34] [I] [TRT] [MemUsageChange] Init CUDA: CPU +322, GPU +0, now: CPU 463, GPU 707 (MiB)
[01/07/2022-20:20:34] [I] [TRT] Loaded engine size: 132 MiB
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +520, GPU +224, now: CPU 984, GPU 1065 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +115, GPU +52, now: CPU 1099, GPU 1117 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +131, now: CPU 0, GPU 131 (MiB)
[01/07/2022-20:20:35] [I] Running TensorRT inference for FCN-ResNet101
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 966, GPU 1109 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 966, GPU 1117 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +722, now: CPU 0, GPU 853 (MiB)

Practice

The sections above covered building and running the official samples and tutorial. Here, I picked another model, RVM, and tried it from scratch.

Prepare the model

Robust Video Matting (RVM) is a robust video matting model that can perform real-time HD matting on any video. There is a Webcam Demo that can be tried in the browser.

Prepare the ONNX model rvm_mobilenetv3_fp32.onnx ; its inference documentation describes the model inputs and outputs:

  • Input: [ src , r1i , r2i , r3i , r4i , downsample_ratio ]

    • src : input frame, RGB channels, shape [B, C, H, W] , range 0~1
    • rXi : recurrent memory inputs; the initial value is a zero tensor of shape [1, 1, 1, 1]
    • downsample_ratio : downsampling ratio, a tensor of shape [1]
    • Only downsample_ratio must be FP32 ; all the other inputs must share the same dtype
  • Output: [ fgr , pha , r1o , r2o , r3o , r4o ]

    • fgr , pha : foreground and alpha channel outputs, range 0~1
    • rXo : recurrent memory outputs

Prepare an input image input.jpg . Using a single image instead of a video keeps the code simple.

Prepare the environment

conda create -n torch python=3.9 -y
conda activate torch

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y

# Requirements
#  https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements
pip install onnx onnxruntime-gpu==1.10

Running the ONNX model

rvm_onnx_infer.py:

import onnxruntime as ort
import numpy as np
from PIL import Image

# Read the image
with Image.open('input.jpg') as img:
    img.load()
#  HWC [0,255] > BCHW [0,1]
src = np.array(img)
src = np.moveaxis(src, -1, 0).astype(np.float32)
src = src[np.newaxis, :] / 255.

# Load the model
sess = ort.InferenceSession('rvm_mobilenetv3_fp32.onnx', providers=['CUDAExecutionProvider'])

# Create the IO binding
io = sess.io_binding()

# Create tensors on CUDA (note: the same zero tensor object is reused for all four rXi inputs)
rec = [ ort.OrtValue.ortvalue_from_numpy(np.zeros([1, 1, 1, 1], dtype=np.float32), 'cuda') ] * 4
downsample_ratio = ort.OrtValue.ortvalue_from_numpy(np.asarray([0.25], dtype=np.float32), 'cuda')

# Set up the outputs (bound on CUDA)
for name in ['fgr', 'pha', 'r1o', 'r2o', 'r3o', 'r4o']:
    io.bind_output(name, 'cuda')

# Inference
io.bind_cpu_input('src', src)
io.bind_ortvalue_input('r1i', rec[0])
io.bind_ortvalue_input('r2i', rec[1])
io.bind_ortvalue_input('r3i', rec[2])
io.bind_ortvalue_input('r4i', rec[3])
io.bind_ortvalue_input('downsample_ratio', downsample_ratio)

sess.run_with_iobinding(io)

fgr, pha, *rec = io.get_outputs()

# Copy only `fgr` and `pha` back to the CPU
fgr = fgr.numpy()
pha = pha.numpy()

# Compose RGBA
com = np.where(pha > 0, fgr, pha)
com = np.concatenate([com, pha], axis=1) # + alpha
#  BCHW [0,1] > HWC [0,255]
com = np.squeeze(com, axis=0)
com = np.moveaxis(com, 0, -1) * 255

img = Image.fromarray(com.astype(np.uint8))
img.show()

Run:

python rvm_onnx_infer.py --model "rvm_mobilenetv3_fp32.onnx" --input-image "input.jpg" --precision float32 --show

Result (background transparent):

Convert ONNX to TRT model

Use trtexec to convert the ONNX model into a TensorRT engine:

export PATH=/usr/local/TensorRT/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH

trtexec --onnx=rvm_mobilenetv3_fp32.onnx --workspace=64 --saveEngine=rvm_mobilenetv3_fp32.engine --verbose

A problem occurred:

[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:773: While parsing node number 3 [Resize -> "389"]:
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:774: --- Begin node ---
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:775: input: "src"
input: "386"
input: "388"
output: "389"
name: "Resize_3"
op_type: "Resize"
attribute {
  name: "coordinate_transformation_mode"
  s: "pytorch_half_pixel"
  type: STRING
}
attribute {
  name: "cubic_coeff_a"
  f: -0.75
  type: FLOAT
}
attribute {
  name: "mode"
  s: "linear"
  type: STRING
}
attribute {
  name: "nearest_mode"
  s: "floor"
  type: STRING
}

[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:776: --- End node ---
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:779: ERROR: builtin_op_importers.cpp:3608 In function importResize:
[8] Assertion failed: scales.is_weights() && "Resize scales must be an initializer!"

At this point, the model needs to be modified manually.

First, install the necessary tools:

snap install netron
pip install onnx-simplifier
pip install onnx_graphsurgeon --index-url https://pypi.ngc.nvidia.com

After that, view the Resize_3 node of the model in Netron:

It turns out that the scales input of this node is computed from downsample_ratio , i.e. [1,1,downsample_ratio,downsample_ratio] , which can be replaced with a constant using ONNX GraphSurgeon.

Finally, the model modification steps are as follows:

# Simplify the ONNX model and switch to static input shapes
python -m onnxsim rvm_mobilenetv3_fp32.onnx rvm_mobilenetv3_fp32_sim.onnx \
--input-shape src:1,3,1080,1920 r1i:1,1,1,1 r2i:1,1,1,1 r3i:1,1,1,1 r4i:1,1,1,1

# Modify the model with ONNX GraphSurgeon
python rvm_onnx_modify.py -i rvm_mobilenetv3_fp32_sim.onnx --input-size 1920 1280

# Convert the ONNX model into a TensorRT engine with trtexec
trtexec --onnx=rvm_mobilenetv3_fp32_sim_modified.onnx --workspace=64 --saveEngine=rvm_mobilenetv3_fp32_sim_modified.engine

rvm_onnx_modify.py:

import numpy as np
import onnx
import onnx_graphsurgeon as gs


def _print_graph(graph: gs.Graph) -> None:
    # Print the graph inputs/outputs for inspection (simplified)
    print('inputs:', [f'{i.name} {i.shape}' for i in graph.inputs])
    print('outputs:', [f'{o.name} {o.shape}' for o in graph.outputs])


def modify(input: str, output: str, downsample_ratio: float = 0.25) -> None:
    print(f'\nonnx load: {input}')
    graph = gs.import_onnx(onnx.load(input))

    _print_graph(graph)

    # update node Resize_3: scales
    resize_3 = [n for n in graph.nodes if n.name == 'Resize_3'][0]
    print()
    print(resize_3)

    scales = gs.Constant('388',
        np.asarray([1, 1, downsample_ratio, downsample_ratio], dtype=np.float32))

    resize_3.inputs = [i if i.name != '388' else scales for i in resize_3.inputs]
    print()
    print(resize_3)

    # remove input downsample_ratio
    graph.inputs = [i for i in graph.inputs if i.name != 'downsample_ratio']

    # remove node Concat_2
    concat_2 = [n for n in graph.nodes if n.name == 'Concat_2'][0]
    concat_2.outputs.clear()

    # remove unused nodes/tensors
    graph.cleanup()

    onnx.save(gs.export_onnx(graph), output)

Output differences between the ONNX and TRT models

Use Polygraphy to inspect the output differences between the ONNX and TRT models.

First, install the tools:

# Install the TensorRT Python API
cd /usr/local/TensorRT/python/
pip install tensorrt-8.2.2.1-cp39-none-linux_x86_64.whl

export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
python -c "import tensorrt; print(tensorrt.__version__)"

# Install Polygraphy (or install it from source in TensorRT/tools/Polygraphy)
python -m pip install colored polygraphy --extra-index-url https://pypi.ngc.nvidia.com

Run the ONNX and TRT models and compare the output errors:

# Run the ONNX model and save its inputs and outputs
polygraphy run rvm_mobilenetv3_fp32_sim_modified.onnx --onnxrt --val-range [0,1] --save-inputs onnx_inputs.json --save-outputs onnx_outputs.json
# Run the TRT model with the saved ONNX inputs and compare its outputs against the ONNX outputs, using relative and absolute error tolerances
polygraphy run rvm_mobilenetv3_fp32_sim_modified.engine --model-type engine --trt --load-inputs onnx_inputs.json --load-outputs onnx_outputs.json --rtol 1e-3 --atol 1e-3

It can be seen that with fp32 precision the outputs match within an error tolerance of 1e-3 , and the comparison PASSED :

[I]     PASSED | All outputs matched | Outputs: ['r4o', 'r3o', 'r2o', 'r1o', 'fgr', 'pha']
[I] PASSED | Command: /home/john/anaconda3/envs/torch/bin/polygraphy run rvm_mobilenetv3_fp32_sim_modified.engine --model-type engine --trt --load-inputs onnx_inputs.json --load-outputs onnx_outputs.json --rtol 1e-3 --atol 1e-3

I also tried fp16 ; its precision loss is relatively large and the comparison FAILED :

[E]     FAILED | Mismatched outputs: ['r4o', 'r3o', 'r2o', 'r1o', 'fgr', 'pha']
[!] FAILED | Command: /home/john/anaconda3/envs/torch/bin/polygraphy run rvm_mobilenetv3_fp16_sim_modified.engine --model-type engine --trt --load-inputs onnx_inputs.json --load-outputs onnx_outputs.json --rtol 1e-3 --atol 1e-3

Running the TRT model

Here we use the TensorRT C++ runtime API as an example to run the exported RVM TRT model. See rvm_infer.cc for the complete code.

1. Load the model: create the runtime and deserialize the data of the TRT engine file

static Logger logger{Logger::Severity::kINFO};
auto runtime = std::unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(logger));
auto engine = runtime->deserializeCudaEngine(engine_data.data(), fsize, nullptr);
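
The Logger class and the engine_data / fsize variables above are not shown in the snippet. A minimal sketch of what they might look like, assuming the logger derives from nvinfer1::ILogger (as in the quickstart samples) and the engine file produced earlier; the full rvm_infer.cc may differ:

#include <NvInfer.h>
#include <fstream>
#include <iostream>
#include <vector>

// Minimal ILogger implementation: print messages at or above the configured severity.
class Logger : public nvinfer1::ILogger {
 public:
  explicit Logger(Severity severity = Severity::kINFO) : severity_(severity) {}
  void log(Severity severity, const char* msg) noexcept override {
    if (static_cast<int32_t>(severity) <= static_cast<int32_t>(severity_)) {
      std::cout << msg << std::endl;
    }
  }
 private:
  Severity severity_;
};

// Read the serialized engine file into host memory.
std::ifstream engine_file("rvm_mobilenetv3_fp32_sim_modified.engine",
                          std::ios::binary | std::ios::ate);
const auto fsize = static_cast<size_t>(engine_file.tellg());
engine_file.seekg(0, std::ios::beg);
std::vector<char> engine_data(fsize);
engine_file.read(engine_data.data(), fsize);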

Traverse all the input and output bindings:

auto nb = engine->getNbBindings();
for (int32_t i = 0; i < nb; i++) {
  auto is_input = engine->bindingIsInput(i);
  auto name = engine->getBindingName(i);
  auto dims = engine->getBindingDimensions(i);
  auto datatype = engine->getBindingDataType(i);
  // ...
}
Engine
 Name=Unnamed Network 0
 DeviceMemorySize=148 MiB
 MaxBatchSize=1
Bindings
 Input[0] name=src dims=[1,3,1080,1920] datatype=FLOAT
 Input[1] name=r1i dims=[1,1,1,1] datatype=FLOAT
 Input[2] name=r2i dims=[1,1,1,1] datatype=FLOAT
 Input[3] name=r3i dims=[1,1,1,1] datatype=FLOAT
 Input[4] name=r4i dims=[1,1,1,1] datatype=FLOAT
 Output[5] name=r4o dims=[1,64,18,32] datatype=FLOAT
 Output[6] name=r3o dims=[1,40,36,64] datatype=FLOAT
 Output[7] name=r2o dims=[1,20,72,128] datatype=FLOAT
 Output[8] name=r1o dims=[1,16,144,256] datatype=FLOAT
 Output[9] name=fgr dims=[1,3,1080,1920] datatype=FLOAT
 Output[10] name=pha dims=[1,1,1080,1920] datatype=FLOAT

After that, allocate device memory for all bindings:

auto nb = engine->getNbBindings();
std::vector<void *> bindings(nb, nullptr);
std::vector<int32_t> bindings_size(nb, 0);
for (int32_t i = 0; i < nb; i++) {
  auto dims = engine->getBindingDimensions(i);
  auto size = GetMemorySize(dims, sizeof(float));
  if (cudaMalloc(&bindings[i], size) != cudaSuccess) {
    std::cerr << "ERROR: cuda memory allocation failed, size = " << size
        << " bytes" << std::endl;
    return false;
  }
  bindings_size[i] = size;
}
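
The GetMemorySize helper used above is also not shown. A minimal sketch, which multiplies all the binding dimensions by the element size (here sizeof(float), since every binding of this engine is FLOAT):

#include <NvInfer.h>
#include <cstdint>
#include <functional>
#include <numeric>

// Bytes needed for a binding: product of all dimensions times the element size.
static int64_t GetMemorySize(const nvinfer1::Dims& dims, int32_t elem_size) {
  return std::accumulate(dims.d, dims.d + dims.nbDims, static_cast<int64_t>(1),
                         std::multiplies<int64_t>()) * elem_size;
}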

At this point, the preparations are done.
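
The asynchronous copies and the inference below are issued on a CUDA stream, whose creation is not shown in the snippets. A minimal sketch, assuming a stream variable of type cudaStream_t as used by the following code:

#include <cuda_runtime_api.h>

// Create the CUDA stream used for the host/device copies and enqueueV2.
cudaStream_t stream;
if (cudaStreamCreate(&stream) != cudaSuccess) {
  std::cerr << "ERROR: cuda stream creation failed" << std::endl;
  return false;
}
// ... when all work has finished:
// cudaStreamDestroy(stream);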

2. Pre-processing: process the input data into the model's input format and store it in the input bindings

Read the image with OpenCV and resize it to the input size of src . Then convert the data from HWC BGR [0,255] to CHW RGB [0,1] . Since batch=1 , the batch dimension can be ignored during processing.

// img: HWC BGR [0,255] u8
auto img = cv::imread(input_filename, cv::IMREAD_COLOR);
if (src_h != img.rows || src_w != img.cols) {
  cv::resize(img, img, cv::Size(src_w, src_h));
}

// src: BCHW RGB [0,1] fp32
auto src_n = src_h * src_w;  // number of pixels per channel plane
auto src = cv::Mat(img.rows, img.cols, CV_32FC3);
{
  auto src_data = (float*)(src.data);
  for (int y = 0; y < src_h; ++y) {
    for (int x = 0; x < src_w; ++x) {
      auto &&bgr = img.at<cv::Vec3b>(y, x);
      /*r*/ *(src_data + y*src_w + x) = bgr[2] / 255.;
      /*g*/ *(src_data + src_n + y*src_w + x) = bgr[1] / 255.;
      /*b*/ *(src_data + src_n*2 + y*src_w + x) = bgr[0] / 255.;
    }
  }
}
if (cudaMemcpyAsync(bindings[0], src.data, bindings_size[0],
    cudaMemcpyHostToDevice, stream) != cudaSuccess) {
  std::cerr << "ERROR: CUDA memory copy of src failed, size = "
      << bindings_size[0] << " bytes" << std::endl;
  return false;
}

3. Inference: pass the bindings to the engine execution context and run inference

auto context = std::unique_ptr<nvinfer1::IExecutionContext>(
    engine->createExecutionContext());
if (!context) {
  return false;
}

bool status = context->enqueueV2(bindings.data(), stream, nullptr);
if (!status) {
  std::cout << "ERROR: TensorRT inference failed" << std::endl;
  return false;
}

4. Post-processing: copy the output bindings back and process the data according to the output format

Use cv::Mat to receive the outputs of the foreground fgr and the alpha channel pha :

auto fgr = cv::Mat(src_h, src_w, CV_32FC3);  // BCHW RGB [0,1] fp32
if (cudaMemcpyAsync(fgr.data, bindings[9], bindings_size[9],
    cudaMemcpyDeviceToHost, stream) != cudaSuccess) {
  std::cerr << "ERROR: CUDA memory copy of output failed, size = "
      << bindings_size[9] << " bytes" << std::endl;
  return false;
}
auto pha = cv::Mat(src_h, src_w, CV_32FC1);  // BCHW A [0,1] fp32
if (cudaMemcpyAsync(pha.data, bindings[10], bindings_size[10],
    cudaMemcpyDeviceToHost, stream) != cudaSuccess) {
  std::cerr << "ERROR: CUDA memory copy of output failed, size = "
      << bindings_size[10] << " bytes" << std::endl;
  return false;
}
cudaStreamSynchronize(stream);

Then compose fgr and pha into RGBA data, and resize it back to the original size:

// Compose `fgr` and `pha`
auto com = cv::Mat(src_h, src_w, CV_8UC4);  // HWC BGRA [0,255] u8
{
  auto fgr_data = (float*)(fgr.data);
  auto pha_data = (float*)(pha.data);
  for (int y = 0; y < com.rows; ++y) {
    for (int x = 0; x < com.cols; ++x) {
      auto &&elem = com.at<cv::Vec4b>(y, x);
      auto alpha = *(pha_data + y*src_w + x);
      if (alpha > 0) {
        /*r*/ elem[2] = *(fgr_data + y*src_w + x) * 255;
        /*g*/ elem[1] = *(fgr_data + src_n + y*src_w + x) * 255;
        /*b*/ elem[0] = *(fgr_data + src_n*2 + y*src_w + x) * 255;
      } else {
        /*r*/ elem[2] = 0;
        /*g*/ elem[1] = 0;
        /*b*/ elem[0] = 0;
      }
      /*a*/ elem[3] = alpha * 255;
    }
  }
}
if (dst_h != com.rows || dst_w != com.cols) {
  cv::resize(com, com, cv::Size(dst_w, dst_h));
}

5. Run it to get the matting result (with transparent background):

Finally

If you want to get started with TensorRT, try it out!

GoCoding shares personal hands-on experience; feel free to follow the official account!
