NVIDIA's GB200 NVL72 Supercomputer Achieves 2.7× Faster Inference on DeepSeek-V3

  • Research Collaboration and Publication: Researchers from SGLang, in collaboration with NVIDIA, published early benchmarks of the GB200 (Grace Blackwell) NVL72 system, showing a 2.7× increase in LLM inference throughput over the H100 on the 671B-parameter DeepSeek-V3 model.
  • Software Optimizations: The uplift is attributed to software optimizations for the Blackwell architecture, such as FP8-optimized matrix multiplication, accelerated attention kernels, and high-speed token routing over NVLink, all integrated into the SGLang runtime (see the FP8 sketch after this list).
  • GB200 NVL72 Platform: The GB200 NVL72 is positioned as a general-purpose platform for large-scale AI; this benchmark covers inference only, showing early performance under realistic load before broader workloads are tested.
  • Decoding Benchmarks: In decoding benchmarks with 2,000-token prompts, SGLang achieved 7,583 tokens per second per GPU, a 2.7× improvement over H100 HGX systems, enabling faster responses for large-context inputs under high concurrency (a back-of-envelope extrapolation follows this list).
  • DeepSeek-V3 Model: The benchmark used DeepSeek-V3, a 671-billion-parameter decoder-only large language model with a Mixture-of-Experts (MoE) design that activates only ~37B parameters per token (a minimal routing sketch follows this list).
  • Optimization Components: The SGLang team integrated Blackwell-specific components such as DeepGEMM, FlashInfer FMHA, DeepEP, CUTLASS MLA, and Mooncake into the runtime to minimize overhead during multi-GPU inference (see the launch sketch after this list).
  • Future Work: The authors note several under-optimized areas, such as the prefill stage and kernels that do not yet saturate memory bandwidth or compute capacity; future work will focus on these.
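
To illustrate the idea behind FP8-optimized matrix multiplication mentioned in the software-optimizations bullet, here is a minimal NumPy sketch that simulates per-tensor FP8 (E4M3) scaling: operands are scaled into the representable range, rounded to low precision, multiplied, and rescaled. This is an illustrative simulation of the numeric pattern, not the actual DeepGEMM or Blackwell tensor-core kernel; the constant 448 is the E4M3 maximum normal value.

```python
import numpy as np

E4M3_MAX = 448.0  # largest normal value representable in FP8 E4M3

def quantize_fp8_sim(x: np.ndarray):
    """Simulate per-tensor FP8 quantization: scale into the E4M3 range,
    then round to roughly 3 mantissa bits. Illustrative only."""
    scale = E4M3_MAX / np.max(np.abs(x))
    scaled = x * scale
    exp = np.floor(np.log2(np.abs(scaled) + 1e-30))
    step = 2.0 ** (exp - 3)  # 3 mantissa bits in E4M3
    q = np.round(scaled / step) * step
    return q, scale

def fp8_matmul_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Quantize both operands, multiply, then undo the scales:
    the pattern real FP8 GEMM kernels follow on tensor cores."""
    qa, sa = quantize_fp8_sim(a)
    qb, sb = quantize_fp8_sim(b)
    return (qa @ qb) / (sa * sb)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 128)).astype(np.float32)
b = rng.standard_normal((128, 32)).astype(np.float32)
err = np.abs(fp8_matmul_sim(a, b) - a @ b).mean()
print(f"mean abs error vs FP32 matmul: {err:.4f}")
```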
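
The decode numbers extrapolate straightforwardly. The back-of-envelope below uses only figures stated in the benchmark (7,583 tokens/sec/GPU, the 2.7× speedup, and the 72 Blackwell GPUs in one NVL72 rack) to derive the implied H100 baseline and the aggregate per-rack throughput.

```python
# Back-of-envelope from the published decode numbers (2,000-token prompts).
GB200_TOKS_PER_GPU = 7583   # tokens/sec/GPU on GB200 NVL72 (SGLang)
SPEEDUP = 2.7               # reported uplift over H100 HGX
GPUS_PER_RACK = 72          # Blackwell GPUs in one NVL72 rack

h100_baseline = GB200_TOKS_PER_GPU / SPEEDUP
rack_aggregate = GB200_TOKS_PER_GPU * GPUS_PER_RACK

print(f"implied H100 decode rate: ~{h100_baseline:,.0f} tok/s/GPU")
print(f"aggregate NVL72 decode:   ~{rack_aggregate:,.0f} tok/s per rack")
# Roughly 2.8k tok/s/GPU implied baseline and about 546k tok/s per rack.
```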
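
The MoE point (only ~37B of 671B parameters active per token) comes down to top-k expert routing: a small gating network scores all experts and each token is dispatched to only the top few, whose outputs are mixed by the renormalized gate weights. Below is a minimal NumPy sketch of that routing step using made-up sizes (8 experts, top-2) rather than DeepSeek-V3's actual 256-routed-expert, top-8 configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(tokens, gate_w, experts, top_k=2):
    """Route each token to its top_k experts and mix their outputs by
    the renormalized gate probabilities. Toy sizes, not DeepSeek-V3's
    real configuration."""
    scores = softmax(tokens @ gate_w)              # (n_tokens, n_experts)
    top = np.argsort(scores, axis=-1)[:, -top_k:]  # chosen expert indices
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        chosen = top[i]
        weights = scores[i, chosen] / scores[i, chosen].sum()
        for w, e in zip(weights, chosen):
            out[i] += w * np.tanh(tok @ experts[e])  # tiny stand-in expert
    return out

rng = np.random.default_rng(0)
d, n_experts, n_tokens = 16, 8, 4
tokens = rng.standard_normal((n_tokens, d))
gate_w = rng.standard_normal((d, n_experts))
experts = rng.standard_normal((n_experts, d, d))
print(moe_forward(tokens, gate_w, experts).shape)  # (4, 16)
```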
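
The components in the optimization bullet ship as pluggable backends inside the SGLang runtime. As a rough illustration of how a user drives such a stack, here is a sketch using SGLang's offline Engine API. The keyword arguments mirror the server's CLI flags and may differ across SGLang versions, and the model path, parallelism degree, and backend choices here are placeholder assumptions, not the benchmark's actual GB200 NVL72 configuration.

```python
import sglang as sgl

# Sketch only: argument names track SGLang's server flags and may vary by
# version; the model path and parallelism below are placeholders, not the
# GB200 NVL72 benchmark's actual setup.
llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V3",
    tp_size=8,                       # tensor parallelism across GPUs
    quantization="fp8",              # FP8 weights/activations on Blackwell
    attention_backend="flashinfer",  # e.g. FlashInfer FMHA kernels
)

outputs = llm.generate(
    ["Explain Mixture-of-Experts routing in one paragraph."],
    {"temperature": 0.0, "max_new_tokens": 128},
)
print(outputs[0]["text"])
```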