python - NVIDIA AI Enterprise 科普 | Triton 推理服务器 & TensorRT-LLM 两大组件介绍及实践 - 个人文章

2023年⼤语⾔ AI 模型的蓬勃发展，在⽣产环境中部署应⽤⼤模型的需求也与⽇俱增，然⽽开发⼀个推理解决⽅案来部署这些模型是⼀项艰巨的任务。时延、吞吐量、AI 框架的⽀持度、模型并⾏和负载均衡、GPU优化等因素都是要考虑到的重点，如何进⾏快速部署及管理是⽐较迫切的需求。

NVIDIA作为全球领先的 GPU 芯⽚制造商，为 AI 训练领域提升算⼒赋能⼤模型突破的同时，致⼒于软件⽣态开发，推出了NVIDIA AI Enterprise (NAIE) 的全栈软件套件，其中包含丰富的SDK⼯具库。

本⽂将简单介绍 NAIE 的组件：Triton inference server 和 TensorRT-LLM，并使⽤容器化⽅式部署和测试了 LlaMa2 ⼤模型的推理应⽤。

Triton inference server

Triton 推理服务器是英伟达 NVIDIA AIE 的组成部分，同时也是一个开源的推理服务软件，用于简化 AI 模型的部署和推理过程，并提供高性能的推理服务。

NVIDIA AI Enterprise 平台（图片源于NVIDIA）

Triton 推理服务器提供了标准化的 AI 推理流程，支持部署各种深度学习和机器学习框架的AI模型，包括 TensorRT、TensorFlow、PyTorch、ONNX、OpenVINO、Python、RAPIDS FIL等。Triton 推理服务器可以在 NVIDIA GPU、x86 和 ARM CPU 以及 AWS Inferentia 等设备上进行云端、数据中心、边缘和嵌入式设备的推理。

Triton 推理服务器框架（图片源于NVIDIA）

Triton的主要特性包括：

支持多种机器学习/深度学习框架
并发模型执行
动态批处理
序列批处理和隐式状态管理用于有状态模型
提供后端API，允许添加自定义后端和前/后处理操作
使用集成（ Ensembles）和业务逻辑脚本（ BLS）构建模型Pipeline
基于社区开发的KServe协议的HTTP/REST和GRPC推理协议
支持C API和Java API直接链接到应用程序
指示GPU利用率、服务器吞吐量、服务器延迟等指标

Triton 推理服务器对多种查询类型提供高效的推理，支持实时查询、批处理查询、集成模型查询和音视频流查询等。下图示意了使用 Triton 推理服务器在云端、数据中心上部署 AI 生产服务的解决方案。

（图片源于NVIDIA）

TensorRT-LLM

2023年10月中旬 NVIDIA 发布了第一版的 TensorRT-LLM，目前更新频繁已经发布了三个版本。它是针对大型语言模型构建最优化的 TensorRT 引擎，以在 NVIDIA GPU 上高效执行推理。

TensorRT-LLM 包含用于创建执行这些 TensorRT 引擎的 Python 和 C++ 运行时的组件，还包括与 NVIDIA Triton 推理服务器集成的后端，用于提供大模型服务的生产级系统。TensorRT-LLM 支持单个 GPU 到多节点多 GPU 的各种配置环境的使用，同时支持近30余种国内外流行大模型的优化。

TensorRT-LLM 的具体性能可以查看官方性能页面，其优势在一些测试和报道中也已经得到体现：NVIDIA TensorRT-LLM 在 NVIDIA H*GPU （80GB）上大幅提升大型语言模型的推理速度。

TensorRT-LLM 优化特性覆盖了以下几个方面：

1. 注意力优化（Attention Optimizations）

Multi-head Attention (MHA)：将注意力计算分解为多个头，提高并行性，并允许模型关注输入的不同维度语义空间的信息，然后再进行拼接。
Multi-query Attention (MQA)：与MHA不同的，MQA 让所有的头之间共享同一份 Key 和 Value 矩阵，每个头只单独保留了一份 Query 参数，从而大大减少 Key 和 Value 矩阵的参数量，提高吞吐量并降低延迟。
Group-query Attention (GQA)：介于MHA和MQA，将查询分组以减少内存访问和计算，提高效率。
In-flight Batching：重叠计算和数据传输以隐藏延迟并提高性能。
Paged KV Cache for the Attention ：在注意力层中缓存键值对，减少内存访问并加快计算速度。

2. 并行性（ Parallelism）

Tensor Parallelism ：将模型层分布在多个 GPU 上，使其能够扩展到大型模型。
Pipeline Parallelism ：重叠不同层的计算，降低整体延迟。

3. 量化（ Quantization）

INT4/INT8 weight-only (W4A16 和 W8A16)：将权重存储为 4 位或 8 位整型减少模型大小和内存占用，同时保持激活在 16 位浮点精度。
SmoothQuant：为注意力层提供平滑量化，保留准确性。
GPTQ：一次性权重量化方法，针对 GPT 类似模型架构量身定制的量化技术，同时保持精度。
AWQ：自适应权重量化，动态调整不同部分模型的量化精度，确保高精度和效率。
FP8：在支持的 GPU（如 NVIDIA Hopper）上利用 8 位浮点精度进行计算，进一步减少内存占用并加速处理。

4. 解码优化（ Decoding Optimizations）

Greedy-search：贪婪搜索，一次生成一个文本令牌，通过选择最可能的下一个令牌，快速但可能不太准确。
Beam-search：束搜索，跟踪多个可能的令牌序列，提高准确性但增加计算成本。

5. 其他

RoPE (相对位置编码)：高效地嵌入令牌的相对位置信息，增强模型对上下文的理解。

能否使用特定优化取决于模型架构、硬件配置和所需的性能权衡，目前最新版本中，并非所有模型都支持上述优化。TensorRT-LLM 提供了一个灵活的框架，可用于尝试不同的优化策略，以实现特定用例的最佳结果。通过一系列的优化技术，能显著提高大语言模型在 NVIDIA GPU 上的性能和效率。

部署实践

1、系统环境

CPU: Intel® Xeon® Gold 6326 CPU @ 2.90GHz
GPU: NVIDIA GPU （80GB PCIe）
Memory: 512GB
Host OS：Rocky Linux 8.9
GPU Driver：535.129.03
Docker：25.0.0
Docker Image：nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3

本次部署主机使⽤NVIDIA 80G GPU单卡，系统为Rocky Linux 8.9，是类似CentOS, RHEL下游的⼀个新发⾏版本，系统的docker容器环境可以参考此链接（Install Docker Engine on CentOS | Docker Docs）配置，使⽤当前最新的NVIDIA官⽅提供的镜像tritonserver:23.12-trtllm-python-py3，此版本镜像部分配置如下，⼏乎包含了运⾏TensorRT-LLM的所有环境，详情请参考此链接：（Release Notes :: NVIDIA Deep Learning Triton Inference Server Documentation）

Image Version: 23.12
Triton Inference Server: 2.41
Ubuntu: 22.04
Python: 3.10
CUDA Toolkit: 12.3.2
TensorRT: 8.6.1.6
TensorRT-LLM: release/0.7.0

2、拉取镜像

# host主机上拉取镜像 

docker pull nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3

3、启动容器

# host主机上启动容器 

docker run -it -d --cap-add=SYS_PTRACE --cap-add=SYS_ADMIN --security-opt 

seccomp=unconfined --gpus=all --shm-size=16g --privileged --ulimit 

memlock=-1 --name=test -v /home/docker/mnt:/home 

nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3

注意：

容器挂载模型⽂件所在的主机⽬录，先安装NVIDIA Container Toolkit，容器才能使⽤NVIDIA GPU（安装地址：https://docs.nvidia.com/datacenter/cloud-native/container-too...）

4、模型准备

本⽂使⽤ Meta 的 llama2-7b-hf ⼤模型作为测试，TensorRT-LLM 对于 LlaMA 的⽀持详情可以查看链接：https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama。

镜像容器中只有运⾏依赖库，类似地TensorRT使⽤时将模型转为 tensorrt 引擎，需要⾃⾏构建不同⼤模型的引擎。我们可以使⽤TensorRT-LLM仓库的LLaMA示例，代码位于：https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama，以模型HF权重作为输⼊，并构建相应的TensorRT引擎，TensorRT引擎的数量取决于⽤于推理的GPU数量。

# 进⼊容器交互式 bash 

docker exec -it test bash 



# 以下命令在容器内操作 



# 安装 git 和 git-lfs 

apt-get update && apt-get -y install cmake git git-lfs 



# 拉取 TensorRT-LLM git 仓库 

cd /home/ 

git clone https://github.com/NVIDIA/TensorRT-LLM.git 

cd TensorRT-LLM 

git submodule update --init --recursive 

git lfs install 

git lfs pull 



# 下载 Huggingface 上的 LlaMA2-7b-hf 模型 

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/meta-llama/Llama-2- 

7b-hf 

cd Llama-2-7b-hf 

git lfs install 

git lfs pull 



# 编译⽣成 TensorRT-LLM python wheel 

cd TensorRT-LLM/ 

python ./scripts/build_wheel.py --trt_root /usr/local/tensorrt 



# 安装 TensorRT-LLM python wheel 

pip install build/tensorrt_llm-0.7.0-cp310-cp310-linux_x86_64.whl 



# 安装 Python libs 依赖 

pip3 install torch torchvision torchaudio 



# 以下编译构建 engine 

mkdir trt_engines 



# 单卡 + INT8 weight-only 量化 

python TensorRT-LLM/examples/llama/build.py \ 

--model_dir ./models/Llama-2-7b-hf/ \ --dtype float16 \ 

--use_gpt_attention_plugin float16 \ 

--use_weight_only \ 

--output_dir ./trt_engines/Llama-2-7b-hf-WINT8-1gpu/ 



# trt engines ⽂件列表 

tree trt_engines/ -lh 

[ 116] trt_engines/ 

|-- [ 98] Llama-2-7b-hf-WINT8-1gpu 

| |-- [2.1K] config.json 

| |-- [6.5G] llama_float16_tp1_rank0.engine 

| `-- [213K] model.cache

5、Benchmark结果

这⾥顺便跑了 TensorRT-LLM 的 llama2-7b benchmark（python和cpp），结果请看下⾯表格。

# python benchmark 

python3 TensorRT-LLM/benchmarks/python/benchmark.py \ 

-m llama_7b \ 

--mode plugin \ 

--output_dir "./trt_engines/benchmark/" \ 

--batch_size "1;8;16;32;64;128" \ 

--input_output_len "60,20;128,20;512,20"

# cpp benchmark 

./TensorRT-LLM/cpp/build/benchmarks/gptSessionBenchmark \ 

--model llama_7b \ 

--engine_dir "trt_engines/benchmark/" \ 

--batch_size "1;8;16;32;64;128" \ 

--input_output_len "60,20;128,20;512,20"

python

cpp

我们构建了⼀种 Llama-2-7b-hf 模型的针对单卡GPU和仅权重INT8量化的优化引擎。接下来根据 tensorrtllm_backend 模型库模板⽂件夹配置模型：


git clone https://github.com/triton-inference-server/tensorrtllm_backend 

mkdir triton_model_repo/Llama-2-7b-hf-WINT8-1gpu 

cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/* 

triton_model_repo/Llama-2-7b-hf-WINT8-1gpu

# 以 Llama-2-7b-hf-WINT8-1gpu 为例 

# 拷⻉ engine ⽂件到 triton_model_repo 

cp ./trt_engines/Llama-2-7b-hf-WINT8-1gpu/* triton_model_repo/Llama-2-7b-hf

WINT8-1gpu/tensorrt_llm/1

模型库中的每个模型都必须包含⼀个模型配置，该配置提供有关模型的必需和可选信息。通常，这个配置以ModelConfig protobuf指定的config.pbtxt⽂件形式提供。在某些情况下，可以通过Triton⾃动⽣成模型配置，不需要显式提供。不同的配置参数可能会对模型的性能有较⼤差异，可以借助https://github.com/triton-inference-server/model_analyzer搜索到最佳的参数，有兴趣的可以⾃⾏深⼊学习。

此外 triton server 部署中还有很多可调的细节设置来优化性能和便利性，⽐如：全局或模型的响应缓存(global or model specific response cache)，模型轮询间隔（model poll interval），模型预热（model warmup），模型实例（model instances）等。

编辑相关的 config.pbtxt ，主要参数保持默认，参数调整请参考⽂档（https://github.com/triton-inference-server/tensorrtllm_backend）：

triton_model_repo/Llama-2-7b-hf- WINT8-1gpu/preprocessing/config.pbtxt

triton_model_repo/Llama-2-7b-hf-WINT8-1gpu/tensorrt_llm/config.pbtxt

triton_model_repo/Llama-2-7b-hf-WINT8-1gpu/postprocessing/config.pbtxt

6、启动 tritonserver

tritonserver --model_repo /home/triton_model_repo/Llama-2-7b-hf-WINT8-1gpu - 

-allow-metrics true --allow-grpc true --allow-http true & 

# 开启服务后，命令⾏会显示相关后端、模型、配置和硬件信息 

I0125 09:01:53.748648 73097 server.cc:633] 

+-------------+------------------------------------------------------------- 

---+----------------------------------------------------------------+ 

| Backend | Path 

| Config | 

+-------------+------------------------------------------------------------- 

---+----------------------------------------------------------------+ 

| python | /opt/tritonserver/backends/python/libtriton_python.so 

| {"cmdline":{"auto-complete-config":"true","backend-directory": | 

| | 

| "/opt/tritonserver/backends","min-compute-capability":"6.00000 | 

| | 

| 0","default-max-batch-size":"4"}} | 

| tensorrtllm | 

/opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.s | {"cmdline": 

{"auto-complete-config":"true","backend-directory": | 

| | o 

| "/opt/tritonserver/backends","min-compute-capability":"6.00000 | 

| | 

| 0","default-max-batch-size":"4"}} | 

| | 

| |+-------------+------------------------------------------------------------- 

---+----------------------------------------------------------------+ 

I0125 09:01:53.748679 73097 server.cc:676] 

+------------------+---------+--------+ 

| Model | Version | Status | 

+------------------+---------+--------+ 

| ensemble | 1 | READY | 

| postprocessing | 1 | READY | 

| preprocessing | 1 | READY | 

| tensorrt_llm | 1 | READY | 

| tensorrt_llm_bls | 1 | READY | 

+------------------+---------+--------+ 

I0125 09:01:53.949752 73097 metrics.cc:817] Collecting metrics for GPU 0: 

NVIDIA A100 80GB PCIe 

I0125 09:01:54.000065 73097 metrics.cc:710] Collecting CPU metrics 

I0125 09:01:54.000192 73097 tritonserver.cc:2483] 

+----------------------------------+---------------------------------------- 

--------------------------------------------------------------------+ 

| Option | Value 

| 

+----------------------------------+---------------------------------------- 

--------------------------------------------------------------------+ 

| server_id | triton 

| 

| server_version | 2.41.0 

| 

| server_extensions | classification sequence 

model_repository model_repository(unload_dependents) schedule_policy 

model_configu | 

| | ration system_shared_memory 

cuda_shared_memory binary_tensor_data parameters statistics trace logging 

| 

| model_repository_path[0] | /home/triton_model_repo/Llama-2-7b-hf

WINT8-1gpu 

| 

| model_control_mode | MODE_NONE 

| 

| strict_model_config | 0 

| 

| rate_limit | OFF 

|| pinned_memory_pool_byte_size | 268435456 

| 

| cuda_memory_pool_byte_size{5} | 67108864 

| min_supported_compute_capability | 6.0 

| 

| strict_readiness | 1 

| 

| exit_timeout | 30 

| 

| cache_enabled | 0 

| 

+----------------------------------+---------------------------------------- 

--------------------------------------------------------------------+ 

I0125 09:01:54.001568 73097 grpc_server.cc:2495] Started 

GRPCInferenceService at 0.0.0.0:8001 

I0125 09:01:54.001754 73097 http_server.cc:4619] Started HTTPService at 

0.0.0.0:8000 

I0125 09:01:54.042876 73097 http_server.cc:282] Started Metrics Service at 

0.0.0.0:8002 

# 显存加载情况 

nvidia-smi 

+--------------------------------------------------------------------------- 

------------+ 

| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA 

Version: 12.2 | 

|-----------------------------------------+----------------------+---------- 

------------+ 

| GPU Name Persistence-M | Bus-Id Disp.A | Volatile 

Uncorr. ECC | 

| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util 

Compute M. | 

| | | 

MIG M. | 

|=========================================+======================+========== 

============| 

| 0 NVIDIA A100 80GB PCIe Off | 00000000:4F:00.0 Off | 

0 | 

| N/A 30C P0 63W / 300W | 18739MiB / 81920MiB | 0% 

Default | 

| | | 

Disabled |+-----------------------------------------+----------------------+---------- 

------------+ 

+--------------------------------------------------------------------------- 

------------+ 

| Processes: 

| 

| GPU GI CI PID Type Process name 

GPU Memory | 

| ID ID 

Usage | 

|=========================================================================== 

============| 

| 0 N/A N/A 1446460 C tritonserver 

18726MiB | 

+--------------------------------------------------------------------------- 

------------+

注意：

构建引擎时使⽤的TensorRT-LLM版本必须与tritonserver的TensorRT-LLM版本保持⼀致。

7、客户端测试

HTTP客户端测试：

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": 

"What is machine learning?", "max_tokens": 20, "bad_words": "", 

"stop_words": "", "pad_id": 2, "end_id": 2}' 

{"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log 

_probs": 

[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 

,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_outp 

ut":"What is machine learning? | Machine Learning\nMachine learning is a 

type of artificial intelligence that involves training algorithms on data 

in"}

GRPC 客户端测试：

如果使⽤客户端的Streaming选项，则必须先更改模型 tensorrt_llm/config.pbtxt 设置将“Decoupled” 设置为 “True”，然后重启tritonserver服务。

cd /home/tensorrtllm_backend/inflight_batcher_llm/client 

python3 end_to_end_grpc_client.py --streaming --output-len 10 --prompt "What 

is machine learning?""What is machine learning? | Machine Learning\nMachine learning is a type of 

artificial intelligence that involves training algorithms on data in"

另外对于语⾔⼤模型的推理官⽅也推出了⼀个集成了vllm的triton server镜像，⼤家有兴趣可以尝试⽐较。

到这⾥完成了使⽤ triton server 以及 tensorRT-LLM 作为推理后端的服务部署和客户端利⽤ LlaMA2⼤语⾔模型的推理应⽤，这类推理应⽤可以扩展到其他领域的模型⽐如⽬标检测、图像识别等。

NVIDIA AI Enterprise 科普 | Triton 推理服务器 & TensorRT-LLM 两大组件介绍及实践

Triton inference server

TensorRT-LLM

1. 注意力优化（Attention Optimizations）

2. 并行性（ Parallelism）

3. 量化（ Quantization）

4. 解码优化（ Decoding Optimizations）

5. 其他

部署实践

1、系统环境

2、拉取镜像

3、启动容器

4、模型准备

6、启动 tritonserver

7、客户端测试

老IT人

引用和评论

早鸟价预定！您随身的 AI 超级计算机：NVIDIA DGX Spark

Open WebUI：开源AI交互平台的全面解析

大模型中的Token究竟是什么？从原理到作用深度解析

一文掌握 MCP 上下文协议：从理论到实践

MySQL × 向量数据库：大模型时代的黄金组合实战指南

AdventureX 2025 正式启动：五天四夜，120小时极限创造！一起在杭州点燃青年创新之火！

大模型时代，后端程序员如何避免被AI卷死？