Dynamo is an open-source inference framework from NVIDIA targeting high throughput and low latency in multi-node distributed environments. It can integrate with several mainstream inference engines (vLLM, TRT-LLM, SGLang, etc.) and supports the following capabilities:
- Disaggregated prefill/decode (PD) inference: maximizes inference throughput while finding a balance between throughput and latency
- Dynamic GPU scheduling: optimizes performance based on load fluctuations
- LLM-aware request routing: reuses the LLM's KV cache to avoid redundant computation
- Accelerated data transfer: built on the NIXL high-speed transfer library to reduce inference time
- KV cache offloading: offloads the KV cache across different memory tiers for higher system throughput
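To make the KV-aware routing idea concrete, here is a toy sketch (entirely hypothetical, not Dynamo's actual implementation): requests whose prompts share a prefix are routed to the same worker, so that worker can reuse the KV cache it already computed for the shared prefix instead of recomputing it.

```python
import hashlib

# Toy illustration of KV-aware routing (NOT Dynamo's actual code):
# requests that share a prompt prefix land on the same worker.
WORKERS = ["worker-0", "worker-1", "worker-2"]

def route(prompt: str, prefix_len: int = 16) -> str:
    # Hash only the prompt prefix: requests sharing a system prompt
    # or few-shot header hash identically, so they hit the same worker
    # and can reuse its cached KV blocks for the prefix.
    digest = hashlib.sha256(prompt[:prefix_len].encode()).digest()
    return WORKERS[digest[0] % len(WORKERS)]

system = "You are a helpful assistant. "
print(route(system + "What is 2+2?") == route(system + "Tell me a joke."))  # True
```

Dynamo's real router also tracks which worker actually holds which KV blocks; the sketch only shows the prefix-affinity idea.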
Building Dynamo

Prerequisites:
- Docker 23.0 or later
- The docker buildx plugin (required)
- A proxy configured for pulling images from external registries and for downloads during the build
```bash
cd dynamo
./container/build.sh

# If the build fails on external network access, append proxy build args
./container/build.sh --build-arg HTTP_PROXY=http://10.98.26.187:7897 --build-arg HTTPS_PROXY=http://10.98.26.187:7897

# After the build completes, push the image to your registry
docker tag dynamo:latest-vllm my.harbor.com/bingomatrix/dynamo:latest-vllm
docker push my.harbor.com/bingomatrix/dynamo:latest-vllm
```
The build takes a long time and is memory-hungry, so a reasonably well-equipped machine is recommended; configure a proxy when necessary.
Start Demo

Let's run the demo to test things out. First, start the Docker container:
```bash
docker run -it --rm --gpus '"device=0"' \
  --shm-size 32g \
  -v /data/:/data \
  --network=host \
  my.harbor.com/bingomatrix/dynamo:latest-vllm bash
```
Testing dynamo run

```bash
# Normally, startup looks like this
dynamo run out=vllm DeepSeek-R1-Distill-Llama-8B
```
Dynamo is clearly in a very unstable state: an earlier version ran fine, but the main branch at the time of writing does not:
```bash
dynamo run out=vllm /data/models/DeepSeek-R1-Distill-Llama-8B
```

```
2025-05-27T07:58:44.074Z INFO dynamo_run::input::common: Waiting for remote model..
```
It seems the local model path was not recognized; Dynamo instead tried to pull the model from the Hugging Face Hub. I tried adjusting the command as the CLI hint suggests:
```bash
dynamo-run out=vllm --model-path=DeepSeek-R1-Distill-Llama-8B/ --model-name=DeepSeek-R1-Distill-Llama-8B/
```

```
2025-05-27T08:04:28.813Z ERROR dynamo_runtime::worker: Application shutdown with error: Invalid argument, must start with 'in' or 'out. USAGE: dynamo-run in=[http|text|dyn://<path>|batch:<folder>] out=ENGINE_LIST|dyn://<path> [--http-port 8080] [--model-path <path>] [--model-name <served-model-name>] [--model-config <hf-repo>] [--tensor-parallel-size=1] [--context-length=N] [--kv-cache-block-size=16] [--num-nodes=1] [--node-rank=0] [--leader-addr=127.0.0.1:9876] [--base-gpu-id=0] [--extra-engine-args=args.json] [--router-mode random|round-robin|kv]
Error: Invalid argument, must start with 'in' or 'out. USAGE: dynamo-run in=[http|text|dyn://<path>|batch:<folder>] out=ENGINE_LIST|dyn://<path> [--http-port 8080] [--model-path <path>] [--model-name <served-model-name>] [--model-config <hf-repo>] [--tensor-parallel-size=1] [--context-length=N] [--kv-cache-block-size=16] [--num-nodes=1] [--node-rank=0] [--leader-addr=127.0.0.1:9876] [--base-gpu-id=0] [--extra-engine-args=args.json] [--router-mode random|round-robin|kv]
```
This error message left me speechless, so I gave up on this demo.
Testing dynamo serve
```bash
cd examples/vllm_v1
# Run dynamo from here
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
```
Note that before running with configs/agg.yaml, three changes must be made to it, otherwise it will not start.

Set the model path

Update every parameter that references the model:

```yaml
model: /data/models/DeepSeek-R1-Distill-Llama-8B
served_model_name: /data/models/DeepSeek-R1-Distill-Llama-8B
```
Set max_model_len

The RTX 4090 has only 24 GB of VRAM. After the 0.90 gpu_memory_utilization cap (21.28 GiB usable), the 8B model's weights (14.99 GiB), non-torch memory (0.08 GiB), and PyTorch's peak activation memory (1.19 GiB), only 5.03 GiB remains for the KV cache. By default vLLM starts with max_seq_len=131072, whose KV cache would not fit in those 5.03 GiB, so it fails with the error below: the KV cache can hold at most 41168 tokens.

```
Loading weights took 4.57 seconds
2025-05-27T02:59:30.591Z INFO model_runner.load_model: Model loading took 14.9889 GiB and 4.752349 seconds
2025-05-27T02:59:32.079Z INFO worker.determine_num_available_blocks: Memory profiling takes 0.75 seconds the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.28GiB model weights take 14.99GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 5.03GiB.
2025-05-27T02:59:32.331Z INFO executor_base.initialize_cache: # cuda blocks: 2573, # CPU blocks: 2048
2025-05-27T02:59:32.331Z INFO executor_base.initialize_cache: Maximum concurrency for 131072 tokens per request: 0.31x
2025-05-27T02:59:32.333Z ERROR engine.run_mp_engine: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (41168). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
```
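The 41168-token limit in the error can be reproduced from numbers in the log itself: vLLM allocates the KV cache in fixed-size blocks of 16 tokens, and 2573 CUDA blocks fit in the remaining memory.

```python
# Reproduce vLLM's KV-cache capacity math from the log above.
num_gpu_blocks = 2573    # "# cuda blocks" reported by vLLM
block_size = 16          # default kv-cache block size (tokens per block)
max_cache_tokens = num_gpu_blocks * block_size

print(max_cache_tokens)                     # 41168, as in the error message
print(round(max_cache_tokens / 131072, 2))  # 0.31, the reported concurrency
```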
Dynamo's official documentation is sparse; after some digging I found that a max_model_len parameter must be appended in configs/agg.yaml. It is translated into max_seq_len and passed to vLLM. We set it as follows:

```yaml
VllmDecodeWorker:
  enforce-eager: true
  ServiceArgs:
    workers: 1
    resources:
      gpu: 1
  common-configs: [model, served_model_name]
  max_model_len: 40960
```
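The value 40960 here is my own choice rather than anything prescribed by the docs: a round number (40 × 1024) just below the 41168-token cache capacity and aligned to the 16-token KV block size. A quick sanity check:

```python
# Sanity-check the chosen max_model_len against the capacity from the vLLM log.
kv_cache_capacity = 2573 * 16  # tokens: cuda blocks x tokens per block = 41168
max_model_len = 40960          # value set in configs/agg.yaml

assert max_model_len <= kv_cache_capacity
assert max_model_len % 16 == 0  # aligned to the KV block size
print("fits:", max_model_len, "<=", kv_cache_capacity)
```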
Modify components/worker.py

At this point Dynamo starts normally, but sending a request triggers a runtime error:
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "/data/models/DeepSeek-R1-Distill-Llama-8B",
  "messages": [
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ],
  "stream": false,
  "max_tokens": 300
}' | jq
```
```
# Error output
2025-05-27T07:19:07.924Z INFO engine._handle_process_request: Added request 9d0476501ee14feca5131f4150e92d88.
Traceback (most recent call last):
  File "/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/runtime/__init__.py", line 85, in wrapper
    async for item in func(*args_list, **kwargs):
  File "/data/huangjch/dynamo/examples/vllm_v1/components/worker.py", line 85, in generate
    kv_transfer_params=response.kv_transfer_params,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'RequestOutput' object has no attribute 'kv_transfer_params'
Traceback (most recent call last):
  File "/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/runtime/__init__.py", line 85, in wrapper
    async for item in func(*args_list, **kwargs):
  File "/data/huangjch/dynamo/examples/vllm_v1/components/simple_load_balancer.py", line 150, in generate
    async for res in self._stream_response(gen):
  File "/data/huangjch/dynamo/examples/vllm_v1/components/simple_load_balancer.py", line 177, in _stream_response
    async for res in gen:
  File "/data/huangjch/dynamo/examples/vllm_v1/components/simple_load_balancer.py", line 127, in send_request_to_decode
    async for decode_response in await self.decode_worker_client.round_robin(
ValueError: a python exception was caught while processing the async generator: AttributeError: 'RequestOutput' object has no attribute 'kv_transfer_params'
2025-05-27T07:19:07.991Z ERROR chat_completions: dynamo_llm::http::service::openai: Failed to fold chat completions stream for: "a python exception was caught while processing the async generator: ValueError: a python exception was caught while processing the async generator: AttributeError: 'RequestOutput' object has no attribute 'kv_transfer_params'" request_id="67e96d16-9af2-42c0-a38a-b377bba71622"
```
This is clearly a code bug: the response returned by the model has no kv_transfer_params attribute. Strangely, the open-source example also uses DeepSeek-R1-Distill-Llama-8B; if the example code cannot run with the very same model, the quality is really poor.

Since the attribute does not exist and kv_transfer_params is never used anyway, we simply comment out that line. On the next run, Dynamo happily returned a result.
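Rather than deleting the line outright, a more defensive patch (my own sketch, not an upstream fix) is to fall back to None when the attribute is missing, which would keep the worker compatible with vLLM builds that do expose the field:

```python
# Hypothetical defensive variant of the failing line in
# examples/vllm_v1/components/worker.py. FakeRequestOutput stands in for
# vLLM's RequestOutput, which in this build lacks `kv_transfer_params`.
class FakeRequestOutput:
    pass

def kv_params_of(response):
    # getattr with a default tolerates both vLLM versions with and
    # without the kv_transfer_params field.
    return getattr(response, "kv_transfer_params", None)

print(kv_params_of(FakeRequestOutput()))  # None
```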
At this point the Dynamo demo runs normally. Neither the documentation nor the demos look stable or complete yet, but Dynamo is still a project worth watching, and I look forward to updates from the community.