On February 25, DeepSeek-AI open-sourced FlashMLA (https://github.com/deepseek-ai/FlashMLA), an efficient Multi-head Latent Attention (MLA) decoding kernel optimized for inference. By optimizing the multi-head latent attention mechanism together with a paged KV cache system, it significantly improves the long-sequence processing capability and inference efficiency of large language models.

We took it for a hands-on test on the Platform for AI (PAI) right away. This article walks you through installing the FlashMLA library, running the built-in benchmark to compare the forward performance of FlashMLA against other MLA implementations, and deploying the DeepSeek-V2-Lite-Chat model locally on a vLLM build that uses FlashMLA.

This experiment uses PAI-DSW for development; the tutorial and code below have also been published to the PAI-Notebook Gallery.

● PAI-DSW is the interactive modeling module of the Platform for AI (PAI). It integrates cloud development environments such as JupyterLab, WebIDE, and Terminal, and provides heterogeneous compute resources along with prebuilt open-source framework images.

● PAI-Notebook Gallery offers a large collection of cutting-edge Notebook examples. Developers can run these tutorials directly in PAI-DSW, or build on them to create their own scenarios.

I. Preparation

1. Open the [Experience FlashMLA-Accelerated DeepSeek-V2-Lite Deployment] Notebook tutorial

https://gallery.pai-ml.com/#/preview/deepLearning/nlp/deepsee...

2. Click [Open in DSW] and choose the appropriate environment and resource configuration

The FlashMLA optimizations target the Hopper architecture, and we have chosen a model with a relatively small parameter count, so you can experience DeepSeek's latest open-source technology with just a single GPU.

● Recommended instance types: ecs.gn8v.4xlarge / ecs.gn8v.6xlarge

● Recommended image: modelscope:1.23.1-pytorch2.5.1-gpu-py310-cu124-ubuntu22.04

3. Click [Open Notebook]

II. Procedure

1. Environment setup

Run the following script to clone the latest source code from GitHub and install FlashMLA.

!git clone https://github.com/deepseek-ai/FlashMLA.git
!cd FlashMLA && python setup.py develop

If the clone fails because of network issues, you can instead run the following script to install from a cached copy of FlashMLA.

# If the previous cell succeeded, skip this cell.
!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/FlashMLA.tgz && tar -xvzf FlashMLA.tgz
!cd FlashMLA && python setup.py develop

After FlashMLA is installed, run the following commands to set up the additional dependencies needed for model deployment and the performance comparison (an optional check to verify the installation is sketched right after these commands).

# Install a recent vLLM build containing fixes that FlashMLA needs (built at commit 4a8cfc75516f3df2ead816c0270b63470eb8e4ee)
!pip install https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
# Install a recent FlashInfer build for the MLA performance comparison (built at commit 56e56ea4e22fc7cf5ca3b41f8143e8847eb5f3fa)
!pip install https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/flashinfer_python-0.2.2-py3-none-any.whl
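
Before moving on, you may want to confirm that the freshly built extension is importable and that the GPU meets FlashMLA's architecture requirement. The snippet below is an optional sanity check, not part of the original tutorial; it only assumes the flash_mla Python module installed by the repository's setup.py and a CUDA-enabled PyTorch.

# Optional sanity check (not part of the original tutorial): confirm the FlashMLA
# extension imports and that the GPU is Hopper-class (SM 9.0), which FlashMLA targets.
import torch
import flash_mla  # module installed by `python setup.py develop` in the FlashMLA repo

print("flash_mla loaded from:", flash_mla.__file__)
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}")
if (major, minor) < (9, 0):
    print("Warning: FlashMLA's optimized kernels target Hopper (SM90) GPUs.")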

2. Quickly enabling FlashMLA in vLLM

The current vLLM release (v0.7.3) does not yet support FlashMLA as a selectable attention backend, so on top of the latest vLLM we implemented an API that hooks FlashMLA into vLLM's existing interfaces. To make this easy to try, we have packaged all of the changes into a single archive: simply extract it and overwrite the corresponding files inside the framework.

# Download and extract the patch, then replace the corresponding files in the vLLM library
!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/vllm_patch.tar && tar -xvf vllm_patch.tar
!cp -r vllm-patch/vllm/* /usr/local/lib/python3.10/site-packages/vllm/

Concretely, adding a new backend option to vLLM requires creating a module under vllm/attention/backends/ that implements two key classes: FlashMLAImpl, which calls flash_mla_with_kvcache from the underlying FlashMLA library to run the forward pass, and FlashMLAMetadataBuilder, which calls get_mla_metadata to prepare the metadata the forward pass needs. Then, in vllm/platforms/cuda.py, the MLA dispatch logic is extended so that when the device meets the requirements and the user sets the VLLM_ATTENTION_BACKEND environment variable to FLASHMLA, the integrated FlashMLA components are used. To dig further into how FlashMLA is wired into vLLM, inspect the source under the vllm-patch directory.
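
For reference, the sketch below mirrors the usage pattern of these two functions as documented in the FlashMLA repository. It is a standalone illustration rather than the vllm-patch code itself, and the tensor shapes (batch size, head counts, head dims, block size) are illustrative values chosen to match the benchmark later in this tutorial.

# Standalone illustration of the FlashMLA decode API used by the patch (shapes are illustrative).
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

torch.set_default_device("cuda")
torch.set_default_dtype(torch.bfloat16)

b, s_q, h_q, h_kv = 4, 1, 128, 1          # decode step: one query token per sequence
d, dv, block_size = 576, 512, 64          # MLA head dims and paged-KV block size
cache_seqlens = torch.full((b,), 1024, dtype=torch.int32)
max_blocks = 1024 // block_size

q = torch.randn(b, s_q, h_q, d)
blocked_kv = torch.randn(b * max_blocks, block_size, h_kv, d)
block_table = torch.arange(b * max_blocks, dtype=torch.int32).view(b, max_blocks)

# Metadata describing how work is split over the KV cache (what FlashMLAMetadataBuilder prepares).
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

# Forward pass over the paged KV cache (what FlashMLAImpl invokes).
out, lse = flash_mla_with_kvcache(
    q, blocked_kv, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True,
)
print(out.shape)  # (b, s_q, h_q, dv)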

3. Model download

FlashMLA is an inference-side optimization of MLA, so it only benefits the small set of models that actually use MLA (DeepSeek-V2/V3, etc.). In this article we deploy DeepSeek-V2-Lite-Chat, which has a smaller parameter count, on a single GPU. We recommend running the following commands to download the corresponding model weights (a quick check of the extracted files is sketched after the download commands). Alternatively, you can download the model from ModelScope (link: https://www.modelscope.cn/deepseek-ai/DeepSeek-V2-Lite-Chat).

import os
dsw_region = os.environ.get("dsw_region")
url_link = {
"cn-shanghai": "https://atp-modelzoo-sh.oss-cn-shanghai-internal.aliyuncs.com/release/tutorials/flashmla/DeepSeek_v2_lite_chat.tar"
}
path = url_link[dsw_region] if dsw_region in url_link else "https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/flashmla/DeepSeek_v2_lite_chat.tar"
os.environ['LINK_CHAT'] = path
!wget $LINK_CHAT
!tar -xvf DeepSeek_v2_lite_chat.tar
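
Optionally, you can confirm that the archive extracted where the deployment code expects it. The check below is not part of the original tutorial and assumes the archive unpacks into a DeepSeek-V2-Lite-Chat directory, the path used as model_name later on.

# Optional check (assumes the archive extracts to "DeepSeek-V2-Lite-Chat", the path used below).
import os
from transformers import AutoConfig

model_dir = "DeepSeek-V2-Lite-Chat"
print(sorted(os.listdir(model_dir))[:10])   # expect config.json, tokenizer files, safetensors shards
cfg = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
print(cfg.model_type)                       # should report the DeepSeek-V2 model type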

4. A quick performance comparison of MLA implementations

In this section we use the benchmark bundled with FlashMLA to compare the forward performance of several existing MLA implementations: a pure PyTorch implementation (torch), FlashMLA (flash_mla), FlashInfer (flash_infer), and a Triton-based MLA implementation (flash_mla_triton). Running the benchmark gives a direct view of how each implementation's performance varies with sequence length. On our test instance we observed that FlashMLA is roughly 16% faster than the FlashInfer MLA implementation. (An optional snippet for tabulating the same numbers follows the benchmark code.)

import sys
import os
sys.path.append(os.path.join(os.getcwd(), 'FlashMLA'))
from benchmark.bench_flash_mla import *
import matplotlib.pyplot as plt
import pandas as pd
from collections import defaultdict

# Benchmark helper: run one MLA implementation and return its effective bandwidth (GB/s)
def compare_a(target, b, s_q, cache_seqlens, h_q, h_kv, d, dv, causal, dtype):
    # print(f"{target}: {b=}, {s_q=}, mean_seqlens={cache_seqlens.float().mean()}, {h_q=}, {h_kv=}, {d=}, {dv=}, {causal=}, {dtype=}")
    torch.set_default_dtype(dtype)
    device = torch.device("cuda:0")
    torch.set_default_device(device)
    torch.cuda.set_device(device)
    torch.manual_seed(0)
    random.seed(0)
    assert target in FUNC_TABLE
    target_func = FUNC_TABLE[target]
    total_seqlens = cache_seqlens.sum().item()
    mean_seqlens = cache_seqlens.float().mean().int().item()
    max_seqlen = cache_seqlens.max().item()
    max_seqlen_pad = triton.cdiv(max_seqlen, 256) * 256
    # print(f"{total_seqlens=}, {mean_seqlens=}, {max_seqlen=}")
    q = torch.randn(b, s_q, h_q, d)
    block_size = 64
    block_table = torch.arange(b * max_seqlen_pad // block_size, dtype=torch.int32).view(b, max_seqlen_pad // block_size)
    blocked_k = torch.randn(block_table.numel(), block_size, h_kv, d)
    out_b, lse_b, perf_b = target_func(q, block_table, blocked_k, max_seqlen_pad, block_size, b, s_q, cache_seqlens, h_q, h_kv, d, dv, causal, dtype)
    FLOPS = s_q * total_seqlens * h_q * (d + dv) * 2
    bytes = (total_seqlens * h_kv * d + b * s_q * h_q * d + b * s_q * h_q * dv) * (torch.finfo(dtype).bits // 8)
    # print(f"perf {target}: {perf_b:.3f} ms, {FLOPS / 10 ** 9 / perf_b:.0f} TFLOPS, {bytes / 10 ** 6 / perf_b:.0f} GB/s")
    return bytes / 10 ** 6 / perf_b

data = defaultdict(lambda: defaultdict(list))
for shape in shape_configs:
    for target in available_targets:
        perf = compare_a(target, shape["b"], shape["s_q"], shape["cache_seqlens"], shape["h_q"], shape["h_kv"], shape["d"], shape["dv"], shape["causal"], shape["dtype"])
        # batch size 128 & 128 heads
        data[target][shape["cache_seqlens"].float().mean().cpu().item()] = perf

for name, dt in data.items():
    k, v = list(dt.keys()), list(dt.values())
    plt.plot(k, v, label=name)
plt.title('bandwidth')
plt.xlabel('seqlen')
plt.ylabel('bw (GB/s)')
plt.legend()
plt.show()
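
As an optional follow-up (not part of the original benchmark script), the collected data dictionary can also be rendered as a table with pandas, which is already imported above:

# Optional: show the same bandwidth measurements as a table (columns = MLA implementations).
df = pd.DataFrame(data).sort_index()
df.index.name = "mean seqlen"
print(df.round(1))  # bandwidth in GB/s per implementation and sequence length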

5. Local deployment experience

Finally, let's try local deployment backed by FlashMLA. Run the script below; by default it uses FlashMLA to have DeepSeek-V2-Lite-Chat generate a quicksort code example. You can also modify the invocation to use a different MLA implementation, and confirm which one is actually in use by looking for the following lines in the output:

[cuda.py:176] Using Triton MLA backend.
[cuda.py:173] Using FlashMLA backend.

Testing on an ecs.gn8v.4xlarge instance yields results similar to the following (roughly 29.3 tokens/s with Triton MLA versus 30.9 tokens/s with FlashMLA):

(Triton MLA) Generate 527 tokens in 17.97 secs
(FlashMLA) Generate 515 tokens in 16.64 secs
import os
import time
import logging
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_name = "DeepSeek-V2-Lite-Chat"
max_model_len, tp_size = 8192, 1
def build_triton_mla_dskv2():
    os.environ['VLLM_ATTENTION_BACKEND'] = 'TRITON_MLA'
    return LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True, block_size=64)

def build_flash_mla_dskv2():
    os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHMLA'
    return LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True, block_size=64)

def warmup_and_infer(llm, messages_list):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    sampling_params = SamplingParams(temperature=0.3, max_tokens=2048, stop_token_ids=[tokenizer.eos_token_id])
    warmup_token_ids = [tokenizer.apply_chat_template([{"role": "user", "content": "What's your name?"}], add_generation_prompt=True)]
    llm.generate(prompt_token_ids=warmup_token_ids, sampling_params=sampling_params)
    prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]
    start_time = time.time()
    outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)
    end_time = time.time()
    return [output.outputs[0].text for output in outputs], sum(len(output.outputs[0].token_ids) for output in outputs), (end_time - start_time)

messages_list = [
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

# Run inference with FlashMLA
generated_text, n_tokens, generation_time = warmup_and_infer(build_flash_mla_dskv2(), messages_list)
# Run inference with the default (Triton) MLA backend. If the FlashMLA run has already finished,
# restart the kernel before switching to the default backend and re-running.
# generated_text, n_tokens, generation_time = warmup_and_infer(build_triton_mla_dskv2(), messages_list)
print(f'Generate {n_tokens} tokens in {generation_time:.2f} secs')
print(generated_text[0])

The returned inference output is as follows:

INFO 02-25 15:57:54 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 02-25 15:58:00 [config.py:569] This model supports multiple tasks: {'classify', 'reward', 'score', 'embed', 'generate'}. Defaulting to 'generate'.
WARNING 02-25 15:58:00 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 02-25 15:58:00 [llm_engine.py:234] Initializing a V0 LLM engine (v0.7.4.dev75+g4a8cfc75) with config: model='DeepSeek-V2-Lite-Chat', speculative_config=None, tokenizer='DeepSeek-V2-Lite-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=DeepSeek-V2-Lite-Chat, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
INFO 02-25 15:58:01 [cuda.py:173] Using FlashMLA backend.
INFO 02-25 15:58:01 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-25 15:58:01 [model_runner.py:1110] Starting to load model DeepSeek-V2-Lite-Chat...
INFO 02-25 15:58:01 [cuda.py:173] Using FlashMLA backend.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:04,  1.37s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:02,  1.45s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:03<00:01,  1.26s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.33s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.34s/it]

INFO 02-25 15:58:07 [model_runner.py:1117] Loading model weights took 31.1253 GB and 5.736558 seconds
WARNING 02-25 15:58:09 [fused_moe.py:849] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=1408,device_name=****.json
INFO 02-25 15:58:09 [worker.py:267] Memory profiling takes 1.76 seconds
INFO 02-25 15:58:09 [worker.py:267] the current vLLM instance can use total_gpu_memory (95.00GiB) x gpu_memory_utilization (0.90) = 85.50GiB
INFO 02-25 15:58:09 [worker.py:267] model weights take 31.13GiB; non_torch_memory takes 0.12GiB; PyTorch activation peak memory takes 0.96GiB; the rest of the memory reserved for KV Cache is 53.30GiB.
INFO 02-25 15:58:09 [executor_base.py:111] # cuda blocks: 25874, # CPU blocks: 1941
INFO 02-25 15:58:09 [executor_base.py:116] Maximum concurrency for 8192 tokens per request: 202.14x
INFO 02-25 15:58:11 [llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 4.16 seconds
/tmp/ipykernel_3769/871633.py:7: DeprecationWarning: The keyword arguments {'prompt_token_ids'} are deprecated and will be removed in a future update. Please use the 'prompts' parameter instead.
  generated_text, n_tokens, generation_time = warmup_and_infer(build_flash_mla_dskv2(), messages_list)
Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.44s/it, est. speed input: 9.02 toks/s, output: 23.59 toks/s]
/tmp/ipykernel_3769/871633.py:7: DeprecationWarning: The keyword arguments {'prompt_token_ids'} are deprecated and will be removed in a future update. Please use the 'prompts' parameter instead.
  generated_text, n_tokens, generation_time = warmup_and_infer(build_flash_mla_dskv2(), messages_list)
Processed prompts: 100%|██████████| 1/1 [00:17<00:00, 17.13s/it, est. speed input: 0.99 toks/s, output: 30.06 toks/s]
Generate 515 tokens in 17.13 secs
 Here is a simple implementation of the QuickSort algorithm in C++:

#include <iostream>
#include <vector>

void swap(int* a, int* b) {
    int t = *a;
    *a = *b;
    *b = t;
}

int partition(std::vector<int>& arr, int low, int high) {
    int pivot = arr[high];
    int i = (low - 1);

    for (int j = low; j <= high - 1; j++) {
        if (arr[j] < pivot) {
            i++;
            swap(&arr[i], &arr[j]);
        }
    }
    swap(&arr[i + 1], &arr[high]);
    return (i + 1);
}

void quickSort(std::vector<int>& arr, int low, int high) {
    if (low < high) {
        int pi = partition(arr, low, high);
        quickSort(arr, low, pi - 1);
        quickSort(arr, pi + 1, high);
    }
}

void printArray(std::vector<int>& arr) {
    for (int i = 0; i < arr.size(); ++i)
        std::cout << arr[i] << " ";
    std::cout << "\n";
}

int main() {
    std::vector<int> arr = {10, 7, 8, 9, 1, 5};
    int n = arr.size();
    quickSort(arr, 0, n - 1);
    std::cout << "Sorted array: \n";
    printArray(arr);
    return 0;
}
This code sorts an array in ascending order using the QuickSort algorithm. The quickSort function is a recursive function that sorts the sub-array to the left of pi and the sub-array to the right of pi. The partition function rearranges the elements in the array so that all elements less than the pivot are to its left and all elements greater are to its right. The pivot is always the last element of the sub-array.

