TensorRT is NVIDIA's high-performance inference library. Its Getting Started page collects the entry points to the documentation and related resources.
This article is based on TensorRT 8.2, the current version at the time of writing, and walks step by step from installation to accelerated inference with your own ONNX model.
Install
Go to the TensorRT download page and select a version to download; you need to register and log in first.
This article uses TensorRT-8.2.2.1.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz. Note that the file name states the CUDA and cuDNN versions the package matches. You can also use NVIDIA Docker: pull the corresponding nvidia/cuda image, then ADD TensorRT to it.
# Extract into $HOME (so the samples can be built as the current user, without sudo)
tar -xzvf TensorRT-*.tar.gz -C $HOME/
# Symlink to /usr/local/TensorRT (to have one fixed path)
sudo ln -s $HOME/TensorRT-8.2.2.1 /usr/local/TensorRT
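Since the archive name carries the TensorRT, CUDA, and cuDNN versions, it can be checked against the local setup before extracting. The following is a minimal sketch; the regular expression is inferred only from the single file name quoted in this article, and `parse_tensorrt_archive` is a hypothetical helper, not part of any NVIDIA tool.

```python
import re

# Pattern inferred from the one archive name used in this article (an assumption);
# other TensorRT releases may follow a different naming scheme.
_ARCHIVE = re.compile(
    r"TensorRT-(?P<tensorrt>\d+(?:\.\d+)*)\."
    r"(?P<os>\w+)\.(?P<arch>[\w-]+)\."
    r"cuda-(?P<cuda>\d+(?:\.\d+)*)\."
    r"cudnn(?P<cudnn>\d+(?:\.\d+)*)\.tar\.gz$"
)


def parse_tensorrt_archive(name: str) -> dict:
    """Split a TensorRT tarball name into its version components."""
    match = _ARCHIVE.match(name)
    if match is None:
        raise ValueError(f"unrecognized archive name: {name}")
    return match.groupdict()


info = parse_tensorrt_archive(
    "TensorRT-8.2.2.1.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz"
)
print(info)
```

Comparing `info["cuda"]` and `info["cudnn"]` with the installed versions is then a plain string comparison.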
After that, compile and run the sample to ensure that TensorRT is installed correctly.
Compile the sample
The samples are in TensorRT/samples. For descriptions, see the Sample Support Guide or the README.md in each sample directory.
cd /usr/local/TensorRT/samples/
# Set the environment variables; see Makefile.config
export CUDA_INSTALL_DIR=/usr/local/cuda
export CUDNN_INSTALL_DIR=/usr/local/cuda
export ENABLE_DLA=
export TRT_LIB_DIR=../lib
export PROTOBUF_INSTALL_DIR=
# Build
make -j`nproc`
# Run
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
cd /usr/local/TensorRT/
./bin/trtexec -h
./bin/sample_mnist -d data/mnist/ --fp16
Reference output:
$ ./bin/sample_mnist -d data/mnist/ --fp16
&&&& RUNNING TensorRT.sample_mnist [TensorRT v8202] # ./bin/sample_mnist -d data/mnist/ --fp16
[12/23/2021-20:20:16] [I] Building and running a GPU inference engine for MNIST
[12/23/2021-20:20:16] [I] [TRT] [MemUsageChange] Init CUDA: CPU +322, GPU +0, now: CPU 333, GPU 600 (MiB)
[12/23/2021-20:20:16] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 333 MiB, GPU 600 MiB
[12/23/2021-20:20:16] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 468 MiB, GPU 634 MiB
[12/23/2021-20:20:17] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +518, GPU +224, now: CPU 988, GPU 858 (MiB)
[12/23/2021-20:20:17] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +114, GPU +52, now: CPU 1102, GPU 910 (MiB)
[12/23/2021-20:20:17] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[12/23/2021-20:20:33] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[12/23/2021-20:20:34] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[12/23/2021-20:20:34] [I] [TRT] Total Host Persistent Memory: 8448
[12/23/2021-20:20:34] [I] [TRT] Total Device Persistent Memory: 1626624
[12/23/2021-20:20:34] [I] [TRT] Total Scratch Memory: 0
[12/23/2021-20:20:34] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 13 MiB
[12/23/2021-20:20:34] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.01595ms to assign 3 blocks to 8 nodes requiring 57857 bytes.
[12/23/2021-20:20:34] [I] [TRT] Total Activation Memory: 57857
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1621, GPU 1116 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1621, GPU 1124 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +4, now: CPU 0, GPU 4 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1622, GPU 1086 (MiB)
[12/23/2021-20:20:34] [I] [TRT] Loaded engine size: 1 MiB
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1622, GPU 1096 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 1623, GPU 1104 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1, now: CPU 0, GPU 1 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1485, GPU 1080 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1485, GPU 1088 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 0, GPU 3 (MiB)
[12/23/2021-20:20:34] [I] Input:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@%+-: =@@@@@@@@@@@@
@@@@@@@%= -@@@**@@@@@@@
@@@@@@@ :%#@-#@@@. #@@@@@@
@@@@@@* +@@@@:*@@@ *@@@@@@
@@@@@@# +@@@@ @@@% @@@@@@@
@@@@@@@. :%@@.@@@. *@@@@@@@
@@@@@@@@- =@@@@. -@@@@@@@@
@@@@@@@@@%: +@- :@@@@@@@@@
@@@@@@@@@@@%. : -@@@@@@@@@@
@@@@@@@@@@@@@+ #@@@@@@@@@@
@@@@@@@@@@@@@@+ :@@@@@@@@@@
@@@@@@@@@@@@@@+ *@@@@@@@@@
@@@@@@@@@@@@@@: = @@@@@@@@@
@@@@@@@@@@@@@@ :@ @@@@@@@@@
@@@@@@@@@@@@@@ -@ @@@@@@@@@
@@@@@@@@@@@@@# +@ @@@@@@@@@
@@@@@@@@@@@@@* ++ @@@@@@@@@
@@@@@@@@@@@@@* *@@@@@@@@@
@@@@@@@@@@@@@# =@@@@@@@@@@
@@@@@@@@@@@@@@. +@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[12/23/2021-20:20:34] [I] Output:
0:
1:
2:
3:
4:
5:
6:
7:
8: **********
9:
&&&& PASSED TensorRT.sample_mnist [TensorRT v8202] # ./bin/sample_mnist -d data/mnist/ --fp16
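The log above starts with a `&&&& RUNNING` line and ends with a `&&&& PASSED` line, which makes the result easy to check in a script. Below is a small sketch of such a check; `sample_status` is a hypothetical helper, and the `FAILED` marker is assumed by analogy with `PASSED` rather than taken from the log shown here.

```python
def sample_status(log_text: str) -> str:
    """Return 'PASSED', 'FAILED', or 'UNKNOWN' based on the last '&&&&' marker line."""
    status = "UNKNOWN"
    for line in log_text.splitlines():
        parts = line.split()
        # Marker lines look like: "&&&& PASSED TensorRT.sample_mnist [TensorRT v8202] # ..."
        if len(parts) >= 2 and parts[0] == "&&&&" and parts[1] in ("PASSED", "FAILED"):
            status = parts[1]
    return status


log = """&&&& RUNNING TensorRT.sample_mnist [TensorRT v8202] # ./bin/sample_mnist -d data/mnist/ --fp16
[12/23/2021-20:20:16] [I] Building and running a GPU inference engine for MNIST
&&&& PASSED TensorRT.sample_mnist [TensorRT v8202] # ./bin/sample_mnist -d data/mnist/ --fp16"""

print(sample_status(log))
```

In practice the text would come from the captured output of `./bin/sample_mnist`.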
Quick start
Quick Start Guide / Using The TensorRT Runtime API
Get the tutorial code and compile it:
git clone --depth 1 https://github.com/NVIDIA/TensorRT.git
export CUDA_INSTALL_DIR=/usr/local/cuda
export CUDNN_INSTALL_DIR=/usr/local/cuda
export TRT_LIB_DIR=/usr/local/TensorRT/lib
# Build quickstart
cd TensorRT/quickstart
# Makefile.config
# INCPATHS += -I"/usr/local/TensorRT/include"
# common/logging.h
# void log(Severity severity, const char* msg) noexcept override
make
# Runtime environment
export PATH=/usr/local/TensorRT/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
cd SemanticSegmentation
Get the pre-trained FCN-ResNet-101 model and convert it to ONNX:
# Create a local environment
# conda create -n torch python=3.9 -y
# conda activate torch
# conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y
# Or use a container environment instead
# docker run --rm -it --gpus all -p 8888:8888 -v `pwd`:/workspace/SemanticSegmentation -w /workspace nvcr.io/nvidia/pytorch:20.12-py3 bash
$ python export.py
Exporting ppm image input.ppm
Downloading: "https://github.com/pytorch/vision/archive/v0.6.0.zip" to /home/john/.cache/torch/hub/v0.6.0.zip
Downloading: "https://download.pytorch.org/models/resnet101-5d3b4d8f.pth" to /home/john/.cache/torch/hub/checkpoints/resnet101-5d3b4d8f.pth
100%|████████████████████████████████████████| 170M/170M [00:27<00:00, 6.57MB/s]
Downloading: "https://download.pytorch.org/models/fcn_resnet101_coco-7ecb50ca.pth" to /home/john/.cache/torch/hub/checkpoints/fcn_resnet101_coco-7ecb50ca.pth
100%|████████████████████████████████████████| 208M/208M [02:26<00:00, 1.49MB/s]
Exporting ONNX model fcn-resnet101.onnx
Then use trtexec to convert the ONNX model to a TensorRT engine:
$ trtexec --onnx=fcn-resnet101.onnx --fp16 --workspace=64 --minShapes=input:1x3x256x256 --optShapes=input:1x3x1026x1282 --maxShapes=input:1x3x1440x2560 --buildOnly --saveEngine=fcn-resnet101.engine
...
[01/07/2022-20:20:00] [I] Engine built in 406.011 sec.
&&&& PASSED TensorRT.trtexec [TensorRT v8202] ...
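The shape options in the command above follow the pattern `name:d0xd1x...`. The sketch below assembles the same command from Python data, which can help when the shapes are computed elsewhere. Only flags that appear in the command above are used; `shape_arg` and `build_engine_cmd` are hypothetical helper names, and joining several inputs with commas is an assumption about trtexec's syntax that this single-input model does not exercise.

```python
import shlex


def shape_arg(shapes: dict) -> str:
    """{'input': (1, 3, 256, 256)} -> 'input:1x3x256x256'."""
    return ",".join(
        f"{name}:{'x'.join(str(d) for d in dims)}" for name, dims in shapes.items()
    )


def build_engine_cmd(onnx_path, engine_path, min_shapes, opt_shapes, max_shapes,
                     fp16=True, workspace_mib=64):
    """Assemble the trtexec argument list for building a dynamic-shape engine."""
    cmd = ["trtexec", f"--onnx={onnx_path}"]
    if fp16:
        cmd.append("--fp16")
    cmd += [
        f"--workspace={workspace_mib}",
        f"--minShapes={shape_arg(min_shapes)}",
        f"--optShapes={shape_arg(opt_shapes)}",
        f"--maxShapes={shape_arg(max_shapes)}",
        "--buildOnly",
        f"--saveEngine={engine_path}",
    ]
    return cmd


cmd = build_engine_cmd(
    "fcn-resnet101.onnx",
    "fcn-resnet101.engine",
    min_shapes={"input": (1, 3, 256, 256)},
    opt_shapes={"input": (1, 3, 1026, 1282)},
    max_shapes={"input": (1, 3, 1440, 2560)},
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once trtexec is on `PATH`.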
Test the engine with random input:
$ trtexec --shapes=input:1x3x1026x1282 --loadEngine=fcn-resnet101.engine
...
[01/07/2022-20:20:00] [I] === Performance summary ===
[01/07/2022-20:20:00] [I] Throughput: 12.4749 qps
[01/07/2022-20:20:00] [I] Latency: min = 76.9746 ms, max = 98.8354 ms, mean = 79.5844 ms, median = 78.0542 ms, percentile(99%) = 98.8354 ms
[01/07/2022-20:20:00] [I] End-to-End Host Latency: min = 150.942 ms, max = 188.431 ms, mean = 155.834 ms, median = 152.444 ms, percentile(99%) = 188.431 ms
[01/07/2022-20:20:00] [I] Enqueue Time: min = 0.390625 ms, max = 1.61279 ms, mean = 1.41182 ms, median = 1.46136 ms, percentile(99%) = 1.61279 ms
[01/07/2022-20:20:00] [I] H2D Latency: min = 1.25977 ms, max = 1.53467 ms, mean = 1.27415 ms, median = 1.26514 ms, percentile(99%) = 1.53467 ms
[01/07/2022-20:20:00] [I] GPU Compute Time: min = 75.2869 ms, max = 97.1318 ms, mean = 77.8847 ms, median = 76.3599 ms, percentile(99%) = 97.1318 ms
[01/07/2022-20:20:00] [I] D2H Latency: min = 0.408447 ms, max = 0.454346 ms, mean = 0.425577 ms, median = 0.423004 ms, percentile(99%) = 0.454346 ms
[01/07/2022-20:20:00] [I] Total Host Walltime: 3.2866 s
[01/07/2022-20:20:00] [I] Total GPU Compute Time: 3.19327 s
[01/07/2022-20:20:00] [I] Explanations of the performance metrics are printed in the verbose logs.
[01/07/2022-20:20:00] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8202] ...
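The summary lines share one layout, so they are easy to parse when comparing engines. The sketch below parses one line and cross-checks the numbers in this log: the total GPU compute time divided by the mean gives the number of timed inferences, and that count divided by the host walltime lands on the reported throughput. `parse_stats` is a hypothetical helper, and the relation between these figures is an observation about this particular log, not a documented trtexec formula.

```python
import re

# Matches "min = 75.2869 ms" as well as "percentile(99%) = 97.1318 ms"
_STAT = re.compile(r"(\w+(?:\(\d+%\))?) = ([\d.]+) ms")


def parse_stats(line: str) -> dict:
    """Turn one trtexec summary line into {'min': ..., 'mean': ..., ...} in milliseconds."""
    return {key: float(value) for key, value in _STAT.findall(line)}


gpu_line = (
    "[01/07/2022-20:20:00] [I] GPU Compute Time: min = 75.2869 ms, "
    "max = 97.1318 ms, mean = 77.8847 ms, median = 76.3599 ms, "
    "percentile(99%) = 97.1318 ms"
)
stats = parse_stats(gpu_line)

total_gpu_s = 3.19327     # "Total GPU Compute Time" from the log
host_walltime_s = 3.2866  # "Total Host Walltime" from the log

runs = round(total_gpu_s * 1000 / stats["mean"])  # number of timed inferences
qps = runs / host_walltime_s                      # compare with "Throughput"
print(runs, qps)
```

For this log the count comes out at 41 inferences and roughly 12.47 qps, in line with the reported throughput.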
Run the tutorial, using the engine:
$ ./bin/segmentation_tutorial
[01/07/2022-20:20:34] [I] [TRT] [MemUsageChange] Init CUDA: CPU +322, GPU +0, now: CPU 463, GPU 707 (MiB)
[01/07/2022-20:20:34] [I] [TRT] Loaded engine size: 132 MiB
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +520, GPU +224, now: CPU 984, GPU 1065 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +115, GPU +52, now: CPU 1099, GPU 1117 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +131, now: CPU 0, GPU 131 (MiB)
[01/07/2022-20:20:35] [I] Running TensorRT inference for FCN-ResNet101
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 966, GPU 1109 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 966, GPU 1117 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +722, now: CPU 0, GPU 853 (MiB)
Practice
The sections above covered building and running the official samples and tutorial. Here I take a different model, RVM, and go through the whole process from scratch.
Prepare the model
Robust Video Matting (RVM) is a temporally stable video matting model that can do real-time HD matting on any video. There is a Webcam Demo that you can try in the browser.
Prepare the ONNX model rvm_mobilenetv3_fp32.onnx. Its inference documentation describes the model inputs and outputs:
Input: [src, r1i, r2i, r3i, r4i, downsample_ratio]
- src: input frame, RGB channels, shape [B, C, H, W], range 0~1
- rXi: recurrent memory input; the initial value is a zero tensor of shape [1, 1, 1, 1]
- downsample_ratio: downsampling ratio, a tensor of shape [1]
- Only downsample_ratio must be FP32; the other inputs must all use the same dtype
Output: [fgr, pha, r1o, r2o, r3o, r4o]
- fgr, pha: foreground and alpha channel output, range 0~1
- rXo: recurrent memory output
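The memory inputs and outputs pair up: the rXo values produced for one frame are fed back as rXi for the next. This article only processes a single image, but the data flow can be sketched with a stand-in model. Everything below is illustrative: `run_frames` and `fake_infer` are hypothetical names, and `infer` represents whatever callable actually runs the model.

```python
REC_IN = ["r1i", "r2i", "r3i", "r4i"]
REC_OUT = ["r1o", "r2o", "r3o", "r4o"]


def run_frames(frames, infer, zero_state, downsample_ratio=0.25):
    """Run `infer` on each frame, carrying the recurrent memory from frame to frame."""
    rec = [zero_state] * 4  # initial memory: zero tensors of shape [1, 1, 1, 1]
    results = []
    for src in frames:
        feed = {"src": src, "downsample_ratio": downsample_ratio}
        feed.update(zip(REC_IN, rec))
        out = infer(feed)
        results.append((out["fgr"], out["pha"]))
        rec = [out[name] for name in REC_OUT]  # rXo becomes the next frame's rXi
    return results


def fake_infer(feed):
    """Stand-in for the real model: the 'memory' just counts the frames seen so far."""
    seen = feed["r1i"] + 1
    return {"fgr": feed["src"], "pha": seen,
            "r1o": seen, "r2o": seen, "r3o": seen, "r4o": seen}


results = run_frames(["frame0", "frame1", "frame2"], fake_infer, zero_state=0)
print(results)
```

With a real session, `infer` would wrap the actual inference call and `zero_state` would be a zero tensor instead of the integer used here.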
Prepare an input image, input.jpg. A single image is used instead of a video to keep the code simple.
Prepare the environment
conda create -n torch python=3.9 -y
conda activate torch
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y
# Requirements
# https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements
pip install onnx onnxruntime-gpu==1.10
Running the ONNX model
import onnxruntime as ort
import numpy as np
from PIL import Image
# Read the image
with Image.open('input.jpg') as img:
    img.load()
# HWC [0,255] > BCHW [0,1]
src = np.array(img)
src = np.moveaxis(src, -1, 0).astype(np.float32)
src = src[np.newaxis, :] / 255.
# Load the model
sess = ort.InferenceSession('rvm_mobilenetv3_fp32.onnx', providers=['CUDAExecutionProvider'])
# Create the io binding
io = sess.io_binding()
# Create the tensors on CUDA
rec = [ ort.OrtValue.ortvalue_from_numpy(np.zeros([1, 1, 1, 1], dtype=np.float32), 'cuda') ] * 4
downsample_ratio = ort.OrtValue.ortvalue_from_numpy(np.asarray([0.25], dtype=np.float32), 'cuda')
# Bind the outputs
for name in ['fgr', 'pha', 'r1o', 'r2o', 'r3o', 'r4o']:
    io.bind_output(name, 'cuda')
# Run inference
io.bind_cpu_input('src', src)
io.bind_ortvalue_input('r1i', rec[0])
io.bind_ortvalue_input('r2i', rec[1])
io.bind_ortvalue_input('r3i', rec[2])
io.bind_ortvalue_input('r4i', rec[3])
io.bind_ortvalue_input('downsample_ratio', downsample_ratio)
sess.run_with_iobinding(io)
fgr, pha, *rec = io.get_outputs()
# Copy only `fgr` and `pha` back to the CPU
fgr = fgr.numpy()
pha = pha.numpy()
# Compose RGBA
com = np.where(pha > 0, fgr, pha)
com = np.concatenate([com, pha], axis=1) # + alpha
# BCHW [0,1] > HWC [0,255]
com = np.squeeze(com, axis=0)
com = np.moveaxis(com, 0, -1) * 255
img = Image.fromarray(com.astype(np.uint8))
img.show()
Run:
python rvm_onnx_infer.py --model "rvm_mobilenetv3_fp32.onnx" --input-image "input.jpg" --precision float32 --show
Result (background transparent):
Convert ONNX to TRT model
Use trtexec to convert the ONNX model to a TensorRT engine:
export PATH=/usr/local/TensorRT/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
trtexec --onnx=rvm_mobilenetv3_fp32.onnx --workspace=64 --saveEngine=rvm_mobilenetv3_fp32.engine --verbose
The conversion fails with an error:
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:773: While parsing node number 3 [Resize -> "389"]:
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:774: --- Begin node ---
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:775: input: "src"
input: "386"
input: "388"
output: "389"
name: "Resize_3"
op_type: "Resize"
attribute {
name: "coordinate_transformation_mode"
s: "pytorch_half_pixel"
type: STRING
}
attribute {
name: "cubic_coeff_a"
f: -0.75
type: FLOAT
}
attribute {
name: "mode"
s: "linear"
type: STRING
}
attribute {
name: "nearest_mode"
s: "floor"
type: STRING
}
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:776: --- End node ---
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:779: ERROR: builtin_op_importers.cpp:3608 In function importResize:
[8] Assertion failed: scales.is_weights() && "Resize scales must be an initializer!"
The parser requires the Resize scales to be an initializer, so the model has to be modified manually.
First, install the necessary tools:
snap install netron
pip install onnx-simplifier
pip install onnx_graphsurgeon --index-url https://pypi.ngc.nvidia.com
Then open the model in Netron and look at the Resize_3 node:
Its scales input is built from downsample_ratio, that is, [1, 1, downsample_ratio, downsample_ratio]. It can be replaced with a constant using ONNX GraphSurgeon.
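To see what a constant scales vector does, the Resize output shape can be worked out by hand: with scales given, each output dimension is the floor of the input dimension times its scale. The sketch below applies that rule to the 1080p input used later, with downsample_ratio set to 0.25 as an example value; `resize_output_shape` is a hypothetical helper, and the ratio actually baked into the modified model may differ.

```python
import math


def resize_output_shape(input_shape, scales):
    """ONNX Resize with explicit scales: output_dim = floor(input_dim * scale)."""
    return [math.floor(dim * scale) for dim, scale in zip(input_shape, scales)]


downsample_ratio = 0.25  # example value (an assumption for this illustration)
scales = [1, 1, downsample_ratio, downsample_ratio]

print(resize_output_shape([1, 3, 1080, 1920], scales))  # [1, 3, 270, 480]
```

Batch and channel dimensions are left untouched by the two leading ones, and only height and width are scaled.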
Finally, the model modification steps are as follows:
# Simplify the ONNX model and switch to static input shapes
python -m onnxsim rvm_mobilenetv3_fp32.onnx rvm_mobilenetv3_fp32_sim.onnx \
--input-shape src:1,3,1080,1920 r1i:1,1,1,1 r2i:1,1,1,1 r3i:1,1,1,1 r4i:1,1,1,1
# Modify the model with ONNX GraphSurgeon
python rvm_onnx_modify.py -i rvm_mobilenetv3_fp32_sim.onnx --input-size 1920 1280
# Convert the ONNX model to a TensorRT engine with trtexec
trtexec --onnx=rvm_mobilenetv3_fp32_sim_modified.onnx --workspace=64 --saveEngine=rvm_mobilenetv3_fp32_sim_modified.engine
import numpy as np
import onnx
import onnx_graphsurgeon as gs

def modify(input: str, output: str, downsample_ratio: float = 0.25) -> None:
    print(f'\nonnx load: {input}')
    graph = gs.import_onnx(onnx.load(input))
    _print_graph(graph)  # logging helper from the full script, not shown here
    # update node Resize_3: scales
    resize_3 = [n for n in graph.nodes if n.name == 'Resize_3'][0]
    print()
    print(resize_3)
    scales = gs.Constant('388',
        np.asarray([1, 1, downsample_ratio, downsample_ratio], dtype=np.float32))
    resize_3.inputs = [i if i.name != '388' else scales for i in resize_3.inputs]
    print()
    print(resize_3)
    # remove input downsample_ratio
    graph.inputs = [i for i in graph.inputs if i.name != 'downsample_ratio']
    # remove node Concat_2
    concat_2 = [n for n in graph.nodes if n.name == 'Concat_2'][0]
    concat_2.outputs.clear()
    # remove unused nodes/tensors
    graph.cleanup()
    onnx.save(gs.export_onnx(graph), output)
Output difference between ONNX and TRT model
Use Polygraphy to compare the outputs of the ONNX and TRT models.
First, install the tools:
# Install the TensorRT Python API
cd /usr/local/TensorRT/python/
pip install tensorrt-8.2.2.1-cp39-none-linux_x86_64.whl
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
python -c "import tensorrt; print(tensorrt.__version__)"
# Install Polygraphy, or install it from source in TensorRT/tools/Polygraphy
python -m pip install colored polygraphy --extra-index-url https://pypi.ngc.nvidia.com
Run the ONNX and TRT models and compare the output errors:
# Run the ONNX model and save its inputs and outputs
polygraphy run rvm_mobilenetv3_fp32_sim_modified.onnx --onnxrt --val-range [0,1] --save-inputs onnx_inputs.json --save-outputs onnx_outputs.json
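# Run the TRT model with the saved ONNX inputs and compare its outputs against the saved ONNX outputs, using relative and absolute tolerances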
polygraphy run rvm_mobilenetv3_fp32_sim_modified.engine --model-type engine --trt --load-inputs onnx_inputs.json --load-outputs onnx_outputs.json --rtol 1e-3 --atol 1e-3
With fp32, the comparison PASSED at a tolerance of 1e-3:
[I] PASSED | All outputs matched | Outputs: ['r4o', 'r3o', 'r2o', 'r1o', 'fgr', 'pha']
[I] PASSED | Command: /home/john/anaconda3/envs/torch/bin/polygraphy run rvm_mobilenetv3_fp32_sim_modified.engine --model-type engine --trt --load-inputs onnx_inputs.json --load-outputs onnx_outputs.json --rtol 1e-3 --atol 1e-3
I also tried fp16. Its precision loss is considerably larger, and the comparison FAILED:
[E] FAILED | Mismatched outputs: ['r4o', 'r3o', 'r2o', 'r1o', 'fgr', 'pha']
[!] FAILED | Command: /home/john/anaconda3/envs/torch/bin/polygraphy run rvm_mobilenetv3_fp16_sim_modified.engine --model-type engine --trt --load-inputs onnx_inputs.json --load-outputs onnx_outputs.json --rtol 1e-3 --atol 1e-3
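To get a feel for what the two tolerances mean, here is a minimal sketch of an element-wise check in the NumPy `isclose` style, |actual - expected| <= atol + rtol * |expected|. This rule is an assumption for illustration; Polygraphy's exact comparison may combine the tolerances differently. The sketch also rounds a value through half precision with the standard `struct` module to show how much of a 1e-3 budget a single fp16 rounding already uses.

```python
import struct


def within_tolerance(actual: float, expected: float,
                     rtol: float = 1e-3, atol: float = 1e-3) -> bool:
    """NumPy-style closeness rule (assumed here, not taken from Polygraphy)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


x = 0.7
print("fp16 rounding error for 0.7:", abs(to_fp16(x) - x))
print(within_tolerance(0.5004, 0.5))  # deviation of 4e-4
print(within_tolerance(0.503, 0.5))   # deviation of 3e-3
```

Near 0.7 the spacing between adjacent fp16 values is 2**-11, about 4.9e-4, so errors accumulated over many layers can plausibly exceed a 1e-3 tolerance.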
Running the TRT model
Here we use the TensorRT C++ runtime API to run the exported RVM TRT model. The complete code is in rvm_infer.cc.
1. Load the model: create the runtime and deserialize the data of the TRT model file
static Logger logger{Logger::Severity::kINFO};
auto runtime = std::unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(logger));
auto engine = runtime->deserializeCudaEngine(engine_data.data(), fsize, nullptr);
Iterate over all input and output bindings:
auto nb = engine->getNbBindings();
for (int32_t i = 0; i < nb; i++) {
  auto is_input = engine->bindingIsInput(i);
  auto name = engine->getBindingName(i);
  auto dims = engine->getBindingDimensions(i);
  auto datatype = engine->getBindingDataType(i);
  // ...
}
Engine
Name=Unnamed Network 0
DeviceMemorySize=148 MiB
MaxBatchSize=1
Bindings
Input[0] name=src dims=[1,3,1080,1920] datatype=FLOAT
Input[1] name=r1i dims=[1,1,1,1] datatype=FLOAT
Input[2] name=r2i dims=[1,1,1,1] datatype=FLOAT
Input[3] name=r3i dims=[1,1,1,1] datatype=FLOAT
Input[4] name=r4i dims=[1,1,1,1] datatype=FLOAT
Output[5] name=r4o dims=[1,64,18,32] datatype=FLOAT
Output[6] name=r3o dims=[1,40,36,64] datatype=FLOAT
Output[7] name=r2o dims=[1,20,72,128] datatype=FLOAT
Output[8] name=r1o dims=[1,16,144,256] datatype=FLOAT
Output[9] name=fgr dims=[1,3,1080,1920] datatype=FLOAT
Output[10] name=pha dims=[1,1,1080,1920] datatype=FLOAT
Next, allocate device memory for all bindings:
auto nb = engine->getNbBindings();
std::vector<void *> bindings(nb, nullptr);
std::vector<int32_t> bindings_size(nb, 0);
for (int32_t i = 0; i < nb; i++) {
  auto dims = engine->getBindingDimensions(i);
  auto size = GetMemorySize(dims, sizeof(float));
  if (cudaMalloc(&bindings[i], size) != cudaSuccess) {
    std::cerr << "ERROR: cuda memory allocation failed, size = " << size
        << " bytes" << std::endl;
    return false;
  }
  bindings_size[i] = size;
}
At this point, the preparations are done.
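The buffer size for each binding is the number of elements times the element size. The sketch below reproduces that arithmetic for a few of the bindings listed above; `memory_size` is a hypothetical stand-in for the `GetMemorySize` helper, whose implementation is not shown in this article, and it assumes 4-byte FLOAT elements as in the binding list.

```python
from math import prod


def memory_size(dims, elem_size=4):
    """Bytes needed for a dense tensor: product of the dimensions times the element size."""
    return prod(dims) * elem_size


bindings = {
    "src": [1, 3, 1080, 1920],
    "r4o": [1, 64, 18, 32],
    "pha": [1, 1, 1080, 1920],
}
sizes = {name: memory_size(dims) for name, dims in bindings.items()}
print(sizes)
```

At 1080p the src buffer alone comes to about 24.9 MB, which is worth keeping in mind when several buffers are allocated at once.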
2. Pre-processing: convert the input data into the input format and copy it into the input bindings
Read the image with OpenCV and resize it to the input size of src. Then convert the data from HWC BGR [0,255] to CHW RGB [0,1]. Because batch=1, the batch dimension can be ignored during processing.
// img: HWC BGR [0,255] u8
auto img = cv::imread(input_filename, cv::IMREAD_COLOR);
if (src_h != img.rows || src_w != img.cols) {
  cv::resize(img, img, cv::Size(src_w, src_h));
}
// src: BCHW RGB [0,1] fp32
auto src = cv::Mat(img.rows, img.cols, CV_32FC3);
{
  // src_n is the size of one channel plane (src_h * src_w)
  auto src_data = (float*)(src.data);
  for (int y = 0; y < src_h; ++y) {
    for (int x = 0; x < src_w; ++x) {
      auto &&bgr = img.at<cv::Vec3b>(y, x);
      /*r*/ *(src_data + y*src_w + x) = bgr[2] / 255.;
      /*g*/ *(src_data + src_n + y*src_w + x) = bgr[1] / 255.;
      /*b*/ *(src_data + src_n*2 + y*src_w + x) = bgr[0] / 255.;
    }
  }
}
if (cudaMemcpyAsync(bindings[0], src.data, bindings_size[0],
    cudaMemcpyHostToDevice, stream) != cudaSuccess) {
  std::cerr << "ERROR: CUDA memory copy of src failed, size = "
      << bindings_size[0] << " bytes" << std::endl;
  return false;
}
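The index arithmetic in the loop above is easier to follow on a tiny example. The sketch below mirrors it in plain Python on flat lists: interleaved HWC BGR bytes go in, planar CHW RGB floats come out. `hwc_bgr_to_chw_rgb` is a hypothetical name, and `plane` plays the role of `src_n`.

```python
def hwc_bgr_to_chw_rgb(pixels, height, width):
    """Flat HWC BGR uint8 values -> flat CHW RGB floats in [0, 1]."""
    plane = height * width  # elements per channel plane (src_n in the C++ code)
    out = [0.0] * (3 * plane)
    for y in range(height):
        for x in range(width):
            i = (y * width + x) * 3  # start of this pixel in the interleaved input
            b, g, r = pixels[i], pixels[i + 1], pixels[i + 2]
            out[y * width + x] = r / 255.0              # R plane
            out[plane + y * width + x] = g / 255.0      # G plane
            out[2 * plane + y * width + x] = b / 255.0  # B plane
    return out


# A 1x2 image in BGR order: a pure blue pixel followed by a pure red pixel
img = [255, 0, 0, 0, 0, 255]
print(hwc_bgr_to_chw_rgb(img, 1, 2))  # [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
```

The output lists the R plane first, then G, then B, which is the layout the src binding expects.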
3. Inference: pass the bindings to the engine's execution context to run inference
auto context = std::unique_ptr<nvinfer1::IExecutionContext>(
    engine->createExecutionContext());
if (!context) {
  return false;
}
bool status = context->enqueueV2(bindings.data(), stream, nullptr);
if (!status) {
  std::cout << "ERROR: TensorRT inference failed" << std::endl;
  return false;
}
4. Post-processing: copy the results out of the output bindings and process the data according to the output format
Use cv::Mat to receive the foreground fgr and the alpha channel pha:
auto fgr = cv::Mat(src_h, src_w, CV_32FC3);  // BCHW RGB [0,1] fp32
if (cudaMemcpyAsync(fgr.data, bindings[9], bindings_size[9],
    cudaMemcpyDeviceToHost, stream) != cudaSuccess) {
  std::cerr << "ERROR: CUDA memory copy of output failed, size = "
      << bindings_size[9] << " bytes" << std::endl;
  return false;
}
auto pha = cv::Mat(src_h, src_w, CV_32FC1);  // BCHW A [0,1] fp32
if (cudaMemcpyAsync(pha.data, bindings[10], bindings_size[10],
    cudaMemcpyDeviceToHost, stream) != cudaSuccess) {
  std::cerr << "ERROR: CUDA memory copy of output failed, size = "
      << bindings_size[10] << " bytes" << std::endl;
  return false;
}
cudaStreamSynchronize(stream);
Then compose fgr and pha into RGBA data, and resize it back to the original size:
// Compose `fgr` and `pha`
auto com = cv::Mat(src_h, src_w, CV_8UC4);  // HWC BGRA [0,255] u8
{
  auto fgr_data = (float*)(fgr.data);
  auto pha_data = (float*)(pha.data);
  for (int y = 0; y < com.rows; ++y) {
    for (int x = 0; x < com.cols; ++x) {
      auto &&elem = com.at<cv::Vec4b>(y, x);
      auto alpha = *(pha_data + y*src_w + x);
      if (alpha > 0) {
        /*r*/ elem[2] = *(fgr_data + y*src_w + x) * 255;
        /*g*/ elem[1] = *(fgr_data + src_n + y*src_w + x) * 255;
        /*b*/ elem[0] = *(fgr_data + src_n*2 + y*src_w + x) * 255;
      } else {
        /*r*/ elem[2] = 0;
        /*g*/ elem[1] = 0;
        /*b*/ elem[0] = 0;
      }
      /*a*/ elem[3] = alpha * 255;
    }
  }
}
if (dst_h != com.rows || dst_w != com.cols) {
  cv::resize(com, com, cv::Size(dst_w, dst_h));
}
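The composition step is the reverse of the pre-processing: planar CHW RGB floats plus one alpha plane become interleaved BGRA bytes. The sketch below mirrors it in plain Python on flat lists; `compose_bgra` and `to_u8` are hypothetical names, and clamping to [0, 1] before scaling is an addition of this sketch that the C++ code above does not perform.

```python
def to_u8(value: float) -> int:
    """Scale a [0, 1] float to 0..255, clamping first (the clamp is this sketch's addition)."""
    return int(min(max(value, 0.0), 1.0) * 255)


def compose_bgra(fgr, pha, height, width):
    """Planar CHW RGB floats + alpha plane -> list of (b, g, r, a) uint8 tuples."""
    plane = height * width
    out = []
    for idx in range(plane):
        alpha = pha[idx]
        if alpha > 0:
            r, g, b = fgr[idx], fgr[plane + idx], fgr[2 * plane + idx]
        else:
            r = g = b = 0.0  # fully transparent pixels carry no colour
        out.append((to_u8(b), to_u8(g), to_u8(r), to_u8(alpha)))
    return out


# 1x2 image: an opaque red pixel and a fully transparent grey pixel
fgr = [1.0, 0.5,   # R plane
       0.0, 0.5,   # G plane
       0.0, 0.5]   # B plane
pha = [1.0, 0.0]
print(compose_bgra(fgr, pha, 1, 2))  # [(0, 0, 255, 255), (0, 0, 0, 0)]
```

Zeroing the colour where alpha is 0 keeps leftover foreground values from showing up when the image is later blended or resized.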
5. Run it to get the matting result (with a transparent background):
Finally
If you want to get started with TensorRT, try it out!
GoCoding shares personal hands-on experience; follow the official account for more!