Dynamo is an open-source inference framework from NVIDIA targeting high throughput and low latency in multi-node distributed environments. It can integrate with several mainstream inference engines (vLLM, TRT-LLM, SGLang, etc.) and supports the following capabilities:
- Disaggregated prefill/decode (PD) inference: maximizes inference throughput while finding a balance between throughput and latency
- Dynamic GPU scheduling: optimizes performance based on load fluctuations
- LLM-aware request routing: reuses the LLM's KV cache to avoid redundant computation
- Accelerated data transfer: built on the NIXL high-speed transfer library to reduce inference time
- KV cache offloading: offloads the KV cache across different memory tiers for higher system throughput
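To make the KV-aware routing idea concrete, here is a toy sketch (entirely hypothetical, not Dynamo's actual implementation): requests whose prompts share a prefix are routed to the same worker, so that worker can reuse the KV cache it already computed for the shared prefix instead of recomputing it.

```python
import hashlib

# Toy illustration of KV-aware routing (NOT Dynamo's actual code):
# requests that share a prompt prefix land on the same worker.
WORKERS = ["worker-0", "worker-1", "worker-2"]

def route(prompt: str, prefix_len: int = 16) -> str:
    # Hash only the prompt prefix: requests sharing a system prompt
    # or few-shot header hash identically, so they hit the same worker
    # and can reuse its cached KV blocks for the prefix.
    digest = hashlib.sha256(prompt[:prefix_len].encode()).digest()
    return WORKERS[digest[0] % len(WORKERS)]

system = "You are a helpful assistant. "
print(route(system + "What is 2+2?") == route(system + "Tell me a joke."))  # True
```

Dynamo's real router also tracks which worker actually holds which KV blocks; the sketch only shows the prefix-affinity idea.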
Building Dynamo

Prerequisites:
- Docker 23.0 or later
- The docker buildx plugin (required)
- A proxy configured for pulling images from external registries and for downloads during the build
```bash
cd dynamo
./container/build.sh

# If the build fails on external network access, append proxy build args
./container/build.sh --build-arg HTTP_PROXY=http://10.98.26.187:7897 --build-arg HTTPS_PROXY=http://10.98.26.187:7897

# After the build completes, push the image to your registry
docker tag dynamo:latest-vllm my.harbor.com/bingomatrix/dynamo:latest-vllm
docker push my.harbor.com/bingomatrix/dynamo:latest-vllm
```
The build takes a long time and is memory-hungry, so a reasonably well-equipped machine is recommended; configure a proxy when necessary.
Start Demo

Let's run the demo to test things out. First, start the Docker container:
```bash
docker run -it --rm --gpus '"device=0"' \
  --shm-size 32g \
  -v /data/:/data \
  --network=host \
  my.harbor.com/bingomatrix/dynamo:latest-vllm bash
```
Testing dynamo run

```bash
# Normally, startup looks like this
dynamo run out=vllm DeepSeek-R1-Distill-Llama-8B
```
Dynamo is clearly in a very unstable state: an earlier version ran fine, but the main branch at the time of writing does not:
```bash
dynamo run out=vllm /data/models/DeepSeek-R1-Distill-Llama-8B
```

```
2025-05-27T07:58:44.074Z INFO dynamo_run::input::common: Waiting for remote model..
```
It seems the local model path was not recognized; Dynamo instead tried to pull the model from the Hugging Face Hub. I tried adjusting the command as the CLI hint suggests:
```bash
dynamo-run out=vllm --model-path=DeepSeek-R1-Distill-Llama-8B/ --model-name=DeepSeek-R1-Distill-Llama-8B/
```

```
2025-05-27T08:04:28.813Z ERROR dynamo_runtime::worker: Application shutdown with error: Invalid argument, must start with 'in' or 'out. USAGE: dynamo-run in=[http|text|dyn://<path>|batch:<folder>] out=ENGINE_LIST|dyn://<path> [--http-port 8080] [--model-path <path>] [--model-name <served-model-name>] [--model-config <hf-repo>] [--tensor-parallel-size=1] [--context-length=N] [--kv-cache-block-size=16] [--num-nodes=1] [--node-rank=0] [--leader-addr=127.0.0.1:9876] [--base-gpu-id=0] [--extra-engine-args=args.json] [--router-mode random|round-robin|kv]
Error: Invalid argument, must start with 'in' or 'out. USAGE: dynamo-run in=[http|text|dyn://<path>|batch:<folder>] out=ENGINE_LIST|dyn://<path> [--http-port 8080] [--model-path <path>] [--model-name <served-model-name>] [--model-config <hf-repo>] [--tensor-parallel-size=1] [--context-length=N] [--kv-cache-block-size=16] [--num-nodes=1] [--node-rank=0] [--leader-addr=127.0.0.1:9876] [--base-gpu-id=0] [--extra-engine-args=args.json] [--router-mode random|round-robin|kv]
```
This error message left me speechless, so I gave up on this demo.
Testing dynamo serve
```bash
cd examples/vllm_v1
# Run dynamo from here
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
```
Note that before running with configs/agg.yaml, three changes must be made to it, otherwise it will not start.

Set the model path

Update every parameter that references the model:

```yaml
model: /data/models/DeepSeek-R1-Distill-Llama-8B
served_model_name: /data/models/DeepSeek-R1-Distill-Llama-8B
```
Set max_model_len

The RTX 4090 has only 24 GB of VRAM. After the 0.90 gpu_memory_utilization cap (21.28 GiB usable), the 8B model's weights (14.99 GiB), non-torch memory (0.08 GiB), and PyTorch's peak activation memory (1.19 GiB), only 5.03 GiB remains for the KV cache. By default vLLM starts with max_seq_len=131072, whose KV cache would not fit in those 5.03 GiB, so it fails with the error below: the KV cache can hold at most 41168 tokens.

```
Loading weights took 4.57 seconds
2025-05-27T02:59:30.591Z INFO model_runner.load_model: Model loading took 14.9889 GiB and 4.752349 seconds
2025-05-27T02:59:32.079Z INFO worker.determine_num_available_blocks: Memory profiling takes 0.75 seconds the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.90) = 21.28GiB model weights take 14.99GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 5.03GiB.
2025-05-27T02:59:32.331Z INFO executor_base.initialize_cache: # cuda blocks: 2573, # CPU blocks: 2048
2025-05-27T02:59:32.331Z INFO executor_base.initialize_cache: Maximum concurrency for 131072 tokens per request: 0.31x
2025-05-27T02:59:32.333Z ERROR engine.run_mp_engine: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (41168). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
```
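The 41168-token limit in the error can be reproduced from numbers in the log itself: vLLM allocates the KV cache in fixed-size blocks of 16 tokens, and 2573 CUDA blocks fit in the remaining memory.

```python
# Reproduce vLLM's KV-cache capacity math from the log above.
num_gpu_blocks = 2573    # "# cuda blocks" reported by vLLM
block_size = 16          # default kv-cache block size (tokens per block)
max_cache_tokens = num_gpu_blocks * block_size

print(max_cache_tokens)                     # 41168, as in the error message
print(round(max_cache_tokens / 131072, 2))  # 0.31, the reported concurrency
```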
Dynamo's official documentation is sparse; after some digging I found that a max_model_len parameter must be appended in configs/agg.yaml. It is translated into max_seq_len and passed to vLLM. We set it as follows:

```yaml
VllmDecodeWorker:
  enforce-eager: true
  ServiceArgs:
    workers: 1
    resources:
      gpu: 1
  common-configs: [model, served_model_name]
  max_model_len: 40960
```
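The value 40960 here is my own choice rather than anything prescribed by the docs: a round number (40 × 1024) just below the 41168-token cache capacity and aligned to the 16-token KV block size. A quick sanity check:

```python
# Sanity-check the chosen max_model_len against the capacity from the vLLM log.
kv_cache_capacity = 2573 * 16  # tokens: cuda blocks x tokens per block = 41168
max_model_len = 40960          # value set in configs/agg.yaml

assert max_model_len <= kv_cache_capacity
assert max_model_len % 16 == 0  # aligned to the KV block size
print("fits:", max_model_len, "<=", kv_cache_capacity)
```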
Modify components/worker.py

At this point Dynamo starts normally, but sending a request triggers a runtime error:
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "/data/models/DeepSeek-R1-Distill-Llama-8B",
  "messages": [
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ],
  "stream": false,
  "max_tokens": 300
}' | jq
```
```
# Error output
2025-05-27T07:19:07.924Z INFO engine._handle_process_request: Added request 9d0476501ee14feca5131f4150e92d88.
Traceback (most recent call last):
  File "/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/runtime/__init__.py", line 85, in wrapper
    async for item in func(*args_list, **kwargs):
  File "/data/huangjch/dynamo/examples/vllm_v1/components/worker.py", line 85, in generate
    kv_transfer_params=response.kv_transfer_params,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'RequestOutput' object has no attribute 'kv_transfer_params'
Traceback (most recent call last):
  File "/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/runtime/__init__.py", line 85, in wrapper
    async for item in func(*args_list, **kwargs):
  File "/data/huangjch/dynamo/examples/vllm_v1/components/simple_load_balancer.py", line 150, in generate
    async for res in self._stream_response(gen):
  File "/data/huangjch/dynamo/examples/vllm_v1/components/simple_load_balancer.py", line 177, in _stream_response
    async for res in gen:
  File "/data/huangjch/dynamo/examples/vllm_v1/components/simple_load_balancer.py", line 127, in send_request_to_decode
    async for decode_response in await self.decode_worker_client.round_robin(
ValueError: a python exception was caught while processing the async generator: AttributeError: 'RequestOutput' object has no attribute 'kv_transfer_params'
2025-05-27T07:19:07.991Z ERROR chat_completions: dynamo_llm::http::service::openai: Failed to fold chat completions stream for: "a python exception was caught while processing the async generator: ValueError: a python exception was caught while processing the async generator: AttributeError: 'RequestOutput' object has no attribute 'kv_transfer_params'" request_id="67e96d16-9af2-42c0-a38a-b377bba71622"
```
This is clearly a code bug: the response returned by the model has no kv_transfer_params attribute. Strangely, the open-source example also uses DeepSeek-R1-Distill-Llama-8B; if the example code cannot run with the very same model, the quality is really poor.

Since the attribute does not exist and kv_transfer_params is never used anyway, we simply comment out that line. On the next run, Dynamo happily returned a result.
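Rather than deleting the line outright, a more defensive patch (my own sketch, not an upstream fix) is to fall back to None when the attribute is missing, which would keep the worker compatible with vLLM builds that do expose the field:

```python
# Hypothetical defensive variant of the failing line in
# examples/vllm_v1/components/worker.py. FakeRequestOutput stands in for
# vLLM's RequestOutput, which in this build lacks `kv_transfer_params`.
class FakeRequestOutput:
    pass

def kv_params_of(response):
    # getattr with a default tolerates both vLLM versions with and
    # without the kv_transfer_params field.
    return getattr(response, "kv_transfer_params", None)

print(kv_params_of(FakeRequestOutput()))  # None
```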
At this point the Dynamo demo runs normally. Neither the documentation nor the demos look stable or complete yet, but Dynamo is still a project worth watching, and I look forward to updates from the community.