- Research Collaboration and Publication: Researchers from SGLang collaborated with NVIDIA and published early benchmarks of the GB200 (Grace Blackwell) NVL72 system, showing a 2.7× increase in LLM inference throughput over the H100 on the 671B-parameter DeepSeek-R1 model.
- Software Optimizations: The uplift is attributed to software optimizations targeting the Blackwell architecture, such as FP8-optimized matrix multiplication (sketched after this list), accelerated attention kernels, and high-speed token routing over NVLink, all integrated into the SGLang runtime.
- GB200 NVL72 Platform: Although positioned as a general-purpose platform for large-scale AI, the GB200 NVL72 is evaluated here only on inference, giving an early view of performance under realistic load before broader workloads are tested.
- Decoding Benchmarks: In decoding benchmarks with a 2,000-token prompt, SGLang achieved 7,583 tokens per second per GPU, a 2.7× improvement over H100 HGX systems; this enables faster responses for large-context inputs under high concurrency (the throughput arithmetic is sketched below).
- DeepSeek Model: The benchmark used DeepSeek-R1, a 671-billion-parameter decoder-only large language model built on the Mixture-of-Experts (MoE) DeepSeek-V3 architecture, activating ~37B parameters per token (a minimal top-k routing sketch follows the list).
- Optimization Components: The SGLang team integrated Blackwell-specific components such as DeepGEMM, FlashInfer FMHA, DeepEP, CUTLASS MLA, and Mooncake into the runtime to minimize overhead during multi-GPU inference (the expert-parallel dispatch pattern is sketched below).
- Future Work: The authors note several under-optimized areas, such as the prefill stage and kernels that do not yet saturate memory bandwidth or compute capacity; future work will focus on these (a roofline-style check is sketched below).
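
To make the FP8 point concrete, below is a minimal NumPy sketch of quantized matrix multiplication with per-row scale factors. It only simulates the E4M3 dynamic range; the real Blackwell kernels (e.g., DeepGEMM) run on hardware FP8 tensor cores, and the per-row scale granularity here is an illustrative assumption.

```python
import numpy as np

E4M3_MAX = 448.0  # largest normal magnitude representable in FP8 E4M3

def quantize_rowwise(x):
    """Map each row onto an FP8-like range with a per-row scale factor.

    This simulates only the E4M3 dynamic range (rounding to a uniform grid);
    real kernels such as DeepGEMM use hardware FP8 formats and tensor cores.
    """
    scale = np.abs(x).max(axis=-1, keepdims=True) / E4M3_MAX
    scale = np.where(scale == 0.0, 1.0, scale)            # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -E4M3_MAX, E4M3_MAX)
    return q, scale

def fp8_matmul(a, b):
    """C = A @ B with quantized operands, accumulated in full precision."""
    qa, sa = quantize_rowwise(a)       # per-row scales for A
    qb, sb = quantize_rowwise(b.T)     # per-column scales for B
    return (qa @ qb.T) * sa * sb.T     # matmul on the grid, then rescale

a = np.random.randn(4, 64)
b = np.random.randn(64, 8)
print(np.max(np.abs(fp8_matmul(a, b) - a @ b)))  # small quantization error
```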
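
The per-GPU decode throughput figure can be understood as concurrency divided by step latency and GPU count, since each decode step emits one token per in-flight sequence. The concurrency and latency below are hypothetical values chosen only to land near the reported number; the benchmark's actual settings were not stated here.

```python
def decode_tokens_per_sec_per_gpu(concurrent_seqs, step_latency_s, num_gpus):
    """Per-GPU decode throughput = concurrency / (step latency x GPU count)."""
    return concurrent_seqs / (step_latency_s * num_gpus)

# Hypothetical settings that land near the reported 7,583 tok/s/GPU.
print(decode_tokens_per_sec_per_gpu(concurrent_seqs=8192,
                                    step_latency_s=0.015,
                                    num_gpus=72))  # ~7585
```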
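
The gap between 671B total and ~37B active parameters comes from top-k expert routing: each token runs through only its top-scoring experts. The toy sketch below shows the gating pattern; the dimensions, expert count, and k are illustrative, not DeepSeek's actual configuration.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=8):
    """Route each token to its top-k experts and mix outputs by gate weight.

    x: (tokens, d); gate_w: (d, n_experts); experts: list of (d, d) matrices.
    Only k of n_experts run per token, so the active expert parameters per
    token are roughly k/n_experts of the total expert weights.
    """
    logits = x @ gate_w                             # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]      # top-k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        e_logits = logits[t, topk[t]]
        w = np.exp(e_logits - e_logits.max())
        w /= w.sum()                                # softmax over selected experts
        for w_i, e_i in zip(w, topk[t]):
            out[t] += w_i * (x[t] @ experts[e_i])
    return out

d, n_experts = 16, 64
x = np.random.randn(4, d)
gate_w = np.random.randn(d, n_experts)
experts = [np.random.randn(d, d) for _ in range(n_experts)]
print(moe_forward(x, gate_w, experts).shape)  # (4, 16)
```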
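
DeepEP's role is to accelerate the all-to-all token exchange that expert parallelism requires: after gating, each token must reach the GPU that hosts its target expert. The sketch below simulates only the grouping step in-process; the sizes are hypothetical, and the real library implements this exchange as NVLink collectives.

```python
from collections import defaultdict

def dispatch_tokens(token_expert_ids, num_ranks, experts_per_rank):
    """Group tokens by the GPU rank that hosts their target expert.

    In production this grouping is realized as an all-to-all collective
    (the step DeepEP accelerates over NVLink); here it runs in-process.
    """
    assert max(token_expert_ids) < num_ranks * experts_per_rank
    buckets = defaultdict(list)        # dest rank -> [(token_idx, expert_id)]
    for token_idx, expert_id in enumerate(token_expert_ids):
        buckets[expert_id // experts_per_rank].append((token_idx, expert_id))
    return buckets

# 8 tokens routed among 16 experts sharded over 4 ranks (illustrative sizes).
routing = [3, 12, 7, 0, 15, 9, 4, 11]
for rank, bucket in sorted(dispatch_tokens(routing, 4, 4).items()):
    print(f"rank {rank}: {bucket}")
```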
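
Whether a kernel "saturates memory bandwidth" is typically judged with a roofline comparison of its arithmetic intensity against the machine balance. The peak numbers below are rough, hypothetical figures for illustration, not official GB200 specifications.

```python
def is_memory_bound(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline test: a kernel is memory-bound when its arithmetic intensity
    (FLOPs per byte) falls below machine balance (peak FLOPs / peak bytes/s)."""
    return (flops / bytes_moved) < (peak_flops / peak_bw)

# Decode-style GEMV over an n x n FP8 weight: reads ~n*n bytes for ~2*n*n
# FLOPs, i.e. intensity ~2 FLOPs/byte -- far below typical machine balance.
n = 8192
peak_flops, peak_bw = 5e15, 8e12   # hypothetical FP8 peak and HBM bandwidth
print(is_memory_bound(2 * n * n, n * n, peak_flops, peak_bw))  # True
```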