Course Notes

Course goal: explain the PyTorch -> ONNX -> TensorRT pipeline, and how to use the official TensorRT plugins when exporting a model.

Software versions:
TensorRT: 8.6.1.6
Python: 3.8

Input and output data used in the exercises below:
Link: https://pan.baidu.com/s/14NQaxeTIXRi9YAbdSWNNtQ?pwd=y0jm
Extraction code: y0jm

Export principle

[1] https://blog.csdn.net/blanokvaffy/article/details/128046413 (Accelerating Deformable DETR with TensorRT, CSDN blog)
[2] https://github.com/talebolano/Tensorrt-Deformable-Detr
[3] https://github.com/NVIDIA/TensorRT/tree/main/plugin/multiscaleDeformableAttnPlugin
[4] https://zhuanlan.zhihu.com/p/513387413 (Model Deployment Tutorial (4): Supporting More ONNX Operators in PyTorch, Zhihu)

Under reference [4] there is a particularly good comment:

Questioner: So ONNX only defines a standard, and to actually implement an operator like deform_conv, the inference engine (e.g. onnxruntime, TensorRT) has to implement the operator according to that standard. Is that the right way to understand it?
Author: Yes.

An example: if our model contains a resize operation, the exported ONNX will contain a Resize node, but how the Resize node is implemented internally is no longer ONNX's concern.
That is, an ONNX node can be coarse- or fine-grained: at the fine end, a node can spell out the implementation details of a single function; at the coarse end, a single ONNX operator can stand for an entire function module.
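To see this concretely, here is a minimal sketch (the model and file names are made up for illustration) that exports an interpolate call and prints the resulting node types:

import torch
import torch.nn.functional as F
import onnx

class UpsampleModel(torch.nn.Module):
    def forward(self, x):
        # F.interpolate is traced to a single ONNX Resize node; ONNX records
        # the node and its attributes, not the kernel that implements it.
        return F.interpolate(x, scale_factor=2.0, mode='bilinear', align_corners=False)

torch.onnx.export(UpsampleModel(), torch.randn(1, 3, 8, 8), 'upsample.onnx', opset_version=11)

model = onnx.load('upsample.onnx')
print([node.op_type for node in model.graph.node])  # a 'Resize' node should appear here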
With that in mind, let's look at the following example (from references [1] and [2]).

class Etmpy_MultiScaleDeformableAttnFunction(torch.autograd.Function):
    @staticmethod
    def symbolic(g, value, value_spatial_shapes, value_level_start_index,
                 sampling_locations, attention_weights, im2col_step):
        # im2col_step is dropped here: the TensorRT plugin takes only the first five inputs
        return g.op('com.microsoft::MultiscaleDeformableAttnPlugin_TRT', value, value_spatial_shapes, value_level_start_index,
                    sampling_locations, attention_weights)

    @staticmethod
    def forward(ctx, value, value_spatial_shapes, value_level_start_index,
                sampling_locations, attention_weights, im2col_step):
        '''
        No real meaning; a placeholder output so that tracing can run.
        '''
        bs, _, num_heads, embed_dims_num_heads = value.shape
        bs, num_queries, _, _, _, _ = sampling_locations.shape
        return value.new_zeros(bs, num_queries, num_heads, embed_dims_num_heads)

    @staticmethod
    def backward(ctx, grad_output):
        pass

The code above is what using the deformable attention plugin requires on the PyTorch side:
1 Replace the original MultiScaleDeformableAttnFunction with Etmpy_MultiScaleDeformableAttnFunction.
2 Define a symbolic function that maps the PyTorch MultiScaleDeformableAttnFunction to an ONNX operator.
Inside the symbolic function, the first argument of g.op is the name of the ONNX operator.
In this operator name, com.microsoft::MultiscaleDeformableAttnPlugin_TRT, the part before :: is the namespace, which can be chosen freely and has no effect; the part after :: is the plugin name. Since we use the official NVIDIA plugin here, it must match the official NVIDIA plugin's name.
To keep it consistent with the official NVIDIA operator name, look at the plugin's cpp file, which contains the following code; "MultiscaleDeformableAttnPlugin_TRT" is the plugin name.

// https://github.com/NVIDIA/TensorRT/blob/main/plugin/multiscaleDeformableAttnPlugin/multiscaleDeformableAttnPlugin.cpp
namespace
{
static char const* DMHA_VERSION{"1"};
static char const* DMHA_NAME{"MultiscaleDeformableAttnPlugin_TRT"};
} // namespace

When building the engine with trtexec, running with trtexec --verbose also prints, in the log, the names of the official custom plugins it loads, which is a quick way to confirm the name.
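Besides trtexec --verbose, the registered plugin names can also be listed from Python; a minimal sketch against the TensorRT 8.x Python API:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")  # register the official plugins

# each creator in the registry corresponds to one loadable plugin
for creator in trt.get_plugin_registry().plugin_creator_list:
    print(creator.name, creator.plugin_version)
# 'MultiscaleDeformableAttnPlugin_TRT' should appear in the list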
With these two steps done, the module can be exported to ONNX via tracing.

Testing the GPU multi-scale deformable attention plugin

1 Write some PyTorch code defining a model that maps to the TensorRT plugin when exported to ONNX.
2 Export this model to ONNX.
3 Convert the ONNX file to a TensorRT engine.
4 Run inference with the TensorRT engine file.

PyTorch-to-ONNX hello world demo

import torch
import torch.onnx

# Define a simple PyTorch model
class SimpleModel(torch.nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

# Create a model instance
model = SimpleModel()

# Create an example input
example_input = torch.randn(1, 10)

# Export the model to ONNX format
output_path = "simple_model.onnx"
torch.onnx.export(model, example_input, output_path)

print("Model exported to:", output_path)

"""
在这个示例中,我们首先定义了一个简单的 PyTorch 模型 SimpleModel,该模型包含一个线性层。然后,我们创建了一个模型实例,并准备了一个示例输入 example_input。最后,我们使用 torch.onnx.export 函数将模型导出为 ONNX 格式,并指定输出路径。
"""

PyTorch code

The MSDeformAttnFunction call in the PyTorch model source:

try:
    output = MSDeformAttnFunction.apply(
        value, input_spatial_shapes, input_level_start_index, sampling_locations, attention_weights, self.im2col_step)
except:
    # CPU
    output = ms_deform_attn_core_pytorch(value, input_spatial_shapes, sampling_locations, attention_weights)

Saving the input and output variables

torch.save(value, 'res/value.pt')
torch.save(input_spatial_shapes, 'res/input_spatial_shapes.pt')
torch.save(input_level_start_index, 'res/input_level_start_index.pt')
torch.save(sampling_locations, 'res/sampling_locations.pt')
torch.save(attention_weights, 'res/attention_weights.pt')
torch.save(im2col_step, 'res/im2col_step.pt')
torch.save(output, '/home/demo/assets/layer_05/output.pt')

PyTorch -> ONNX

Below is a demo program for testing this plugin:

import torch
from torch import nn
import numpy as np
import onnx

class Etmpy_MultiScaleDeformableAttnFunction(torch.autograd.Function):

    @staticmethod
    def symbolic(g, value, spatial_shapes, level_start_index, sampling_locations,
                 attention_weights, im2col_step):
        # return g.op('com.microsoft::MultiscaleDeformableAttnPlugin_TRT',value, value_spatial_shapes, value_level_start_index,
        #             sampling_locations, attention_weights)

        # multiscaleDeformableAttnPlugin takes 5 inputs, in this order:
        # value, spatial_shapes, level_start_index, sampling_locations, attention_weights.
        return g.op('nvinfer1.plugin::MultiscaleDeformableAttnPlugin_TRT', value, spatial_shapes, level_start_index, sampling_locations,
                    attention_weights)

    @staticmethod
    def forward(ctx, value, value_spatial_shapes, value_level_start_index,
                sampling_locations, attention_weights, im2col_step):
        '''
        No real meaning; a placeholder output so that tracing can run.
        '''
        bs, _, num_heads, embed_dims_num_heads = value.shape
        bs, num_queries, _, _, _, _ = sampling_locations.shape

        return value.new_ones((bs, num_queries, num_heads, embed_dims_num_heads))

    @staticmethod
    def backward(ctx, grad_output):
        pass

class MyMSDeformAttnModel(nn.Module):

    def __init__(self):
        super().__init__()

    def forward(self, value, input_spatial_shapes, input_level_start_index, sampling_locations, attention_weights, im2col_step):
        output = Etmpy_MultiScaleDeformableAttnFunction.apply(value, input_spatial_shapes,
             input_level_start_index, sampling_locations, attention_weights, im2col_step)        
        return output
    

def export_py_model_to_onnx():
    my_ms_deform_attn_model = MyMSDeformAttnModel()
    value = torch.load('/home/demo/assets/layer_07/value.pt')
    input_spatial_shapes = torch.load('/home/demo/assets/layer_07/input_spatial_shapes.pt')
    input_level_start_index = torch.load('/home/demo/assets/layer_07/input_level_start_index.pt')
    sampling_locations = torch.load('/home/demo/assets/layer_07/sampling_locations.pt')
    attention_weights = torch.load('/home/demo/assets/layer_07/attention_weights.pt')
    im2col_step = torch.load('/home/demo/assets/layer_07/im2col_step.pt')

    # The multiscaleDeformableAttnPlugin takes 5 inputs in the following order:
    # value, spatial_shapes, level_start_index, sampling_locations, and attention_weights.

    with torch.no_grad():
        torch.onnx.export(my_ms_deform_attn_model, (value, input_spatial_shapes, input_level_start_index,
                                                     sampling_locations, attention_weights, im2col_step, ),
                        "res/my_ms_deform_attn_model_v03.onnx",
                        opset_version=11,
                        input_names=['value', 'spatial_shapes', 'level_start_index',
                                     'sampling_locations', 'attention_weights', 'im2col_step'],
                        output_names=['output'])


def tensor_to_numpy():
    value = torch.load('/home/demo/assets/layer_07/value.pt')
    value_np = value.cpu().numpy()
    np.save('/home/demo/assets/layer_07/value_np.npy', value_np)
    # value_np_load = np.load('/home/demo/assets/layer_07/value_np.npy')
    # print(value_np_load)

    input_spatial_shapes = torch.load('/home/demo/assets/layer_07/input_spatial_shapes.pt')
    input_spatial_shapes_np = input_spatial_shapes.cpu().numpy()
    np.save('/home/demo/assets/layer_07/input_spatial_shapes.npy', input_spatial_shapes_np)

    input_level_start_index = torch.load('/home/demo/assets/layer_07/input_level_start_index.pt')
    input_level_start_index_np = input_level_start_index.cpu().numpy()
    np.save('/home/demo/assets/layer_07/input_level_start_index.npy', input_level_start_index_np)

    sampling_locations = torch.load('/home/demo/assets/layer_07/sampling_locations.pt')
    sampling_locations_np = sampling_locations.cpu().numpy()
    np.save('/home/demo/assets/layer_07/sampling_locations.npy', sampling_locations_np)

    attention_weights = torch.load('/home/demo/assets/layer_07/attention_weights.pt')
    attention_weights_np = attention_weights.cpu().numpy()
    np.save('/home/demo/assets/layer_07/attention_weights.npy', attention_weights_np)

    # im2col_step = torch.load('/home/demo/assets/layer_07/im2col_step.pt')
    # im2col_step_np = im2col_step.cpu().numpy()
    # np.save('/home/demo/assets/layer_07/im2col_step.npy', im2col_step_np)


"""
使用空的模块名与pytorch到onnx的映射(符号函数),符号函数return的是我们自定义的onnx算子名称,
测试这个导出的onnx文件,转成tensorrt需要的engine格式后,能否正常推理;
"""
if __name__ == "__main__":
    pass    
    export_py_model_to_onnx()
    # tensor_to_numpy()

Given: Etmpy_MultiScaleDeformableAttnFunction, which carries the symbolic mapping from the MultiScaleDeformableAttnFunction module to the ONNX operator (the official NVIDIA plugin).
1 I define a MyMSDeformAttnModel whose only job is to call Etmpy_MultiScaleDeformableAttnFunction.
2 Using the vscode debugger, break at the input/output site of MultiScaleDeformableAttnFunction, record the input and output variables with torch.save, and write them to disk.
3 With 1 and 2 in place, call torch.onnx.export to export this test model to ONNX, producing the model below:
my_ms_deform_attn_model_v03.onnx
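To confirm the custom node actually landed in the graph, the file can be inspected with the onnx package (onnx.checker is deliberately skipped, since a node in a custom domain would not validate); a small sketch:

import onnx

m = onnx.load('res/my_ms_deform_attn_model_v03.onnx')
for node in m.graph.node:
    print(node.domain, node.op_type)
# expect a node with domain 'nvinfer1.plugin' and op_type 'MultiscaleDeformableAttnPlugin_TRT'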

ONNX to TensorRT

trtexec --onnx=my_ms_deform_attn_model_v03.onnx --explicitBatch --workspace=4096 --saveEngine=my_ms_deform_attn_model_v03.engine
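The same conversion can also be done from Python instead of trtexec; a minimal sketch against the TensorRT 8.6 builder API (file names match the command above):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")  # make the official plugins visible to the ONNX parser

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("my_ms_deform_attn_model_v03.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse the ONNX file")

config = builder.create_builder_config()
# 4096 MiB of workspace, matching --workspace=4096
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4096 * (1 << 20))

engine_bytes = builder.build_serialized_network(network, config)
with open("my_ms_deform_attn_model_v03.engine", "wb") as f:
    f.write(engine_bytes)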

TensorRT inference test

4 With the ONNX model and the saved input/output variable data, we can write a TensorRT inference script to check whether the plugin's inference result matches the saved output variable. The script is shown below:

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt
from my_debug_tools import *  # my own debug helpers


def load_engine(engine_path):
    with open(engine_path, "rb") as f, trt.Runtime(trt.Logger(trt.Logger.WARNING)) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
    return engine


def alloc_and_binding_input(np_data, bindings, inputs, inputs_d_mem):
    """
    Allocate device memory for one input and append it to the binding list.
    """
    # allocate device-side memory
    input_memory = cuda.mem_alloc(np_data.nbytes)

    # append the device-side input memory to the binding list
    bindings.append(int(input_memory))

    inputs.append(np_data)             # keep the host-side (CPU) input
    inputs_d_mem.append(input_memory)  # keep the device-side (GPU) input allocation


def infer_entry():
    # Register the official plugins before deserializing the engine; otherwise
    # the MultiscaleDeformableAttnPlugin_TRT layer cannot be created.
    trt.init_libnvinfer_plugins(None, "")
    model_path = 'model/my_ms_deform_attn_model_v04.engine'
    engine = load_engine(model_path)
    context = engine.create_execution_context()
    print(context)

    # Prepare input data
    value = np.load('input/value_np.npy')
    input_spatial_shapes = np.load('input/input_spatial_shapes.npy')
    input_spatial_shapes = input_spatial_shapes.astype(np.int32)

    input_level_start_index = np.load('input/input_level_start_index.npy')
    input_level_start_index = input_level_start_index.astype(np.int32)

    sampling_locations = np.load('input/sampling_locations.npy')
    attention_weights = np.load('input/attention_weights.npy')
    im2col_step = 128

    # Allocate device memory for the inputs and outputs, then append it to the binding list
    bindings = []
    inputs = []        # host-side (CPU) input arrays, recorded in order
    inputs_d_mem = []  # device-side (GPU) input allocations, recorded in order

    output_buffer = None  # host-side output buffer
    output_memory = None  # device-side output allocation
    for binding in engine:
        # index of the current binding
        binding_idx = engine.get_binding_index(binding)

        # number of elements this binding requires
        size = trt.volume(context.get_binding_shape(binding_idx))

        # data type of this binding, converted to the corresponding NumPy type
        dtype = trt.nptype(engine.get_binding_dtype(binding))

        if engine.binding_is_input(binding):
            # this binding is a model input
            # value input_spatial_shapes input_level_start_index sampling_locations attention_weights im2col_step
            if binding == 'value':
                alloc_and_binding_input(value, bindings, inputs, inputs_d_mem)
            elif binding == 'spatial_shapes':
                alloc_and_binding_input(input_spatial_shapes, bindings, inputs, inputs_d_mem)
            elif binding == 'level_start_index':
                alloc_and_binding_input(input_level_start_index, bindings, inputs, inputs_d_mem)
            elif binding == 'sampling_locations':
                alloc_and_binding_input(sampling_locations, bindings, inputs, inputs_d_mem)
            elif binding == 'attention_weights':
                alloc_and_binding_input(attention_weights, bindings, inputs, inputs_d_mem)
        else:
            # this binding is the model output
            output_buffer = cuda.pagelocked_empty(size, dtype)     # page-locked host array
            output_memory = cuda.mem_alloc(output_buffer.nbytes)   # device memory for the inference result
            bindings.append(int(output_memory))                    # append the device-side output memory to the binding list
        
    stream = cuda.Stream()  

    for i in range(len(inputs)):
        # make the input contiguous in memory
        input_buffer = np.ascontiguousarray(inputs[i])

        # copy the input from host (CPU) to device (GPU)
        cuda.memcpy_htod_async(inputs_d_mem[i], input_buffer, stream)

    stream.synchronize()

    # run inference
    res = context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    print(res)
            
    # Transfer prediction output from the GPU.
    cuda.memcpy_dtoh_async(output_buffer, output_memory, stream)
    
    # Synchronize the stream
    stream.synchronize()

    output_buffer = output_buffer.view(np.float32).reshape(1, 21000, 256)

    # free the device memory
    for input_d_mem in inputs_d_mem:
        input_d_mem.free()
    
    output_memory.free()

    # context.destroy()
    engine.destroy()

    print('hi ~~~~~~~~~')


if __name__ == "__main__":
    infer_entry()

A note about this script: at first, its inference output was all zeros. After digging through a lot of material, I guessed the problem might be data types, so I converted the integer inputs from int64 to int32, i.e. the code below.

 input_level_start_index = input_level_start_index.astype(np.int32)
 input_spatial_shapes = input_spatial_shapes.astype(np.int32)

With that change, the inference output has real values, and it matches the PyTorch output almost exactly. This verifies that the plugin implements the same functionality as the interface function used by the official Mask DINO repository, and that the way we defined and used the plugin here is correct.
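To quantify the match, the two outputs can be compared directly; a sketch assuming the TensorRT result was saved at the end of the inference script with np.save (output/trt_output.npy is a hypothetical path):

import numpy as np
import torch

ref = torch.load('/home/demo/assets/layer_05/output.pt').cpu().numpy()  # PyTorch reference output
trt_out = np.load('output/trt_output.npy')                              # hypothetical save of output_buffer

print('max abs diff:', np.abs(ref - trt_out).max())
print('allclose(atol=1e-3):', np.allclose(ref, trt_out, atol=1e-3))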

Adjusting the PyTorch code for actual use

The code I actually use looks like this:

    # At the call site, comment out the original code that used MSDeformAttnFunction.apply
    # and replace it with Empty_MultiScaleDeformableAttnFunction

    # try:
    #     output = MSDeformAttnFunction.apply(
    #         value, input_spatial_shapes, input_level_start_index, sampling_locations, attention_weights, self.im2col_step)
    # except:
    #     # CPU
    #     output = ms_deform_attn_core_pytorch(value, input_spatial_shapes, sampling_locations, attention_weights)

    # my: cuda version, 
    output = Empty_MultiScaleDeformableAttnFunction.apply(value, input_spatial_shapes, input_level_start_index, 
                                                          sampling_locations,  attention_weights, self.im2col_step)
    # multiscaleDeformableAttnPlugin produces the attention output with shape [N, Lq, M, D].
    N, Lq, M, D = output.shape
    output = output.view(N, Lq, M * D)

The contents of Empty_MultiScaleDeformableAttnFunction:

import torch
# from torch import nn
# import numpy as np
# import onnx
from .ms_deform_attn_func import MSDeformAttnFunction


class Empty_MultiScaleDeformableAttnFunction(torch.autograd.Function):

    @staticmethod
    def symbolic(g, value, spatial_shapes, level_start_index, sampling_locations,
                 attention_weights, im2col_step):
        # multiscaleDeformableAttnPlugin takes 5 inputs, in this order:
        # value, spatial_shapes, level_start_index, sampling_locations, attention_weights.
        return g.op('nvinfer1.plugin::MultiscaleDeformableAttnPlugin_TRT', value, spatial_shapes, level_start_index, sampling_locations,
                    attention_weights)

    @staticmethod
    def forward(ctx, value, value_spatial_shapes, value_level_start_index,
                sampling_locations, attention_weights, im2col_step):
        '''
        Runs the real CUDA kernel so that the traced graph sees meaningful values.
        '''
        # bs, _, num_heads, embed_dims_num_heads = value.shape
        bs = value.shape[0]
        num_heads = value.shape[2]
        embed_dims_num_heads = value.shape[3]

        # bs, num_queries, _, _, _, _ = sampling_locations.shape
        num_queries = sampling_locations.shape[1]

        # return value.new_zeros(size=(bs, num_queries, num_heads, embed_dims_num_heads), device=value.device, dtype=value.dtype)

        # forward(ctx, value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, im2col_step)
        output = MSDeformAttnFunction.apply(value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, im2col_step)
        output = output.view(bs, num_queries, num_heads, embed_dims_num_heads)

        return output

    @staticmethod
    def backward(ctx, grad_output):
        pass

Since the ONNX is generated by tracing, it is preferable for the traced outputs to be meaningful. That is why the forward function of Empty_MultiScaleDeformableAttnFunction calls MSDeformAttnFunction.apply and reshapes that function's output to serve as the output of Empty_MultiScaleDeformableAttnFunction.

