An "Accelerator" for LLM Inference: A Hands-On Tutorial on Running Qwen2.5 Inference with vLLM

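This tutorial uses vLLM to load the Qwen2.5-3B-Instruct-AWQ model for zero-shot inference (the prompt contains no worked examples). It covers loading the model, preparing the data, speeding up inference, and extracting and evaluating the results.
The key steps are:

Use vLLM (for speed)
Use AWQ 4-bit quantization (to avoid running out of GPU VRAM)
Limit the input size to 1024 tokens (for speed). The loading code in step 2 sets max_model_len=512, which caps prompt plus output at 512 tokens, so raise that value if you want the 1024-token limit described here.
Limit the output size to 1 token (for speed)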
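
To make the VRAM argument concrete, the following back-of-the-envelope sketch compares the approximate weight footprint at 16-bit and 4-bit precision. The parameter count of 3 billion is a rounded assumption taken from the "3B" in the model name, and the estimate ignores quantization scales and zero points, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / (1024 ** 3)


# Assumption: roughly 3e9 parameters, rounded from the "3B" in the model name.
NUM_PARAMS = 3e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"fp16 weights:  ~{fp16_gib:.2f} GiB")
print(f"4-bit weights: ~{int4_gib:.2f} GiB")
```

Under these assumptions the weights shrink from roughly 5.6 GiB to roughly 1.4 GiB, which leaves more room for the KV cache on a single GPU.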

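Tutorial link: https://go.openbayes.com/vSLNi
Cloud platform: OpenBayes
Sign-up (invitation) link: http://openbayes.com/console/signup?r=sony_0m6v
Log in to http://OpenBayes.com and, on the "Public Tutorials" page, select this tutorial (vLLM inference with Qwen2.5).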

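After the page loads, click "Clone" in the upper-right corner to clone the tutorial into your own container.

Select "NVIDIA GeForce RTX 4090" and the "vLLM" image. OpenBayes has introduced new billing options, so you can choose "pay-as-you-go" or a "daily/weekly/monthly" plan as needed. Then click "Continue". You can use the invitation link at the beginning of this article to get RTX 4090 usage time!

Wait a moment for the system to allocate resources. Once the status changes to "Running", click "Open Workspace".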

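The steps below walk through the run:
1. After entering the workspace, open a new terminal. vLLM is already installed in this tutorial's image, so no installation is needed.
If you do need to install vLLM yourself, uncomment and run the following command in a Jupyter notebook.

#!pip install -U vllm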
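
Before reinstalling, it can help to check whether vLLM is already present in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("vllm")
if version is None:
    print("vLLM is not installed; run: pip install -U vllm")
else:
    print(f"vLLM {version} is already installed")
```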

2. Load the quantized Qwen model with vLLM

import os, math
import numpy as np

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import vllm

llm = vllm.LLM(
    "/input0/Qwen2.5-3B-Instruct-AWQ",
    quantization="awq",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.95,
    trust_remote_code=True,
    dtype="half",
    enforce_eager=True,
    max_model_len=512,
    #distributed_executor_backend="ray",
)
tokenizer = llm.get_tokenizer()

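3. Load the test data
During a commit run (when the visible test.csv has only 3 rows), we load the first 128 rows of train.csv so that we can compute a CV score. During submission, we load the real test data.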

import pandas as pd

VALIDATE = 128

test = pd.read_csv("./lmsys-chatbot-arena/test.csv")
if len(test) == 3:
    test = pd.read_csv("./lmsys-chatbot-arena/train.csv")
    test = test.iloc[:VALIDATE]
print(test.shape)
test.head(1)

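4. Prompt engineering
If we want to submit a zero-shot LLM, we need to try different system prompts to improve the CV score. If we fine-tune the model, the system prompt matters less, because the model learns what to do from the targets regardless of which system prompt we use.
We use a logits processor to force the model to output one of the 3 tokens we care about.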

from typing import List
from transformers import LogitsProcessor
import torch

choices = ["A", "B", "tie"]

KEEP = []
for x in choices:
    # Keep only the first token id of each choice.
    c = tokenizer.encode(x, add_special_tokens=False)[0]
    KEEP.append(c)
print(f"Force predictions to be tokens {KEEP} which are {choices}.")

class DigitLogitsProcessor(LogitsProcessor):
    def __init__(self, tokenizer):
        self.allowed_ids = KEEP

    def __call__(self, input_ids: List[int], scores: torch.Tensor) -> torch.Tensor:
        # Add a large bias to the allowed tokens so one of them is always chosen.
        scores[self.allowed_ids] += 100
        return scores

Output:
Force predictions to be tokens [32, 33, 48731] which are ['A', 'B', 'tie'].
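
To see why adding a constant of 100 to a few logits is enough to force the prediction, the sketch below applies the same bias to a toy logit vector and compares the softmax probability mass before and after. The six-token vocabulary, the logit values, and the allowed indices are all made up for illustration; no model is involved.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of 6 tokens; indices 1, 2 and 4 stand in for "A", "B", "tie".
logits = [5.0, 1.0, 0.5, 4.0, -1.0, 3.0]
allowed = [1, 2, 4]

boosted = list(logits)
for idx in allowed:
    boosted[idx] += 100  # the same bias the logits processor applies

before = sum(softmax(logits)[i] for i in allowed)
after = sum(softmax(boosted)[i] for i in allowed)
print(f"probability mass on allowed tokens: before={before:.4f}, after={after:.6f}")
```

Because the same constant is added to every allowed token, their relative probabilities are unchanged; only the competing tokens are pushed to a negligible share.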
import json

sys_prompt = """Please read the following prompt and two responses. Determine which response is better. If the responses are relatively the same, respond with 'tie'. Otherwise respond with 'A' or 'B' to indicate which is better."""

SS = "#"*25 + "\n"

def join_turns(raw):
    # Each field is a JSON list of strings; parse it with json.loads rather
    # than eval, and treat a null turn as an empty string.
    return " ".join("" if turn is None else turn for turn in json.loads(raw))

all_prompts = []
for index, row in test.iterrows():
    a = join_turns(row.prompt)
    b = join_turns(row.response_a)
    c = join_turns(row.response_b)

    prompt = f"{SS}PROMPT: " + a + f"\n\n{SS}RESPONSE A: " + b + f"\n\n{SS}RESPONSE B: " + c + "\n\n"

    formatted_sample = sys_prompt + "\n\n" + prompt

    all_prompts.append(formatted_sample)
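
Since the model is loaded with a fixed context length, very long conversations may not fit. One way to handle this is to truncate each prompt to a token budget before inference. The sketch below is an assumption, not part of the original notebook: `truncate_prompt` is a hypothetical helper, and a toy whitespace tokenizer stands in for the real one so the example runs without a model.

```python
from typing import Callable, List


def truncate_prompt(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    max_tokens: int,
) -> str:
    """Keep at most `max_tokens` tokens from the start of `text`."""
    ids = encode(text)
    if len(ids) <= max_tokens:
        return text
    return decode(ids[:max_tokens])


# Toy whitespace "tokenizer" so the sketch runs without a model.
vocab: List[str] = []


def toy_encode(text: str) -> List[int]:
    ids = []
    for word in text.split():
        if word not in vocab:
            vocab.append(word)
        ids.append(vocab.index(word))
    return ids


def toy_decode(ids: List[int]) -> str:
    return " ".join(vocab[i] for i in ids)


short = truncate_prompt("which response is better", toy_encode, toy_decode, 3)
print(short)
```

With the real tokenizer you would pass something like `lambda t: tokenizer.encode(t, add_special_tokens=False)` and `tokenizer.decode`, and choose a budget that leaves room for the one generated token. Cutting from the end removes the tail of response B first, so truncating the prompt and the two responses separately may be a fairer design.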

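5. Run inference on the test data
We now run inference on the test data with vLLM. We ask vLLM to return the log probabilities of the top 5 candidate tokens for the first generated token, and we limit generation to 1 token to speed up inference.
From the time needed to run inference on the 128 training samples, we can extrapolate how long the 25,000 test samples will take.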

from time import time

start = time()

logits_processors = [DigitLogitsProcessor(tokenizer)]
responses = llm.generate(
    all_prompts,
    vllm.SamplingParams(
        n=1,  # Number of output sequences to return for each prompt.
        top_p=0.9,  # Float that controls the cumulative probability of the top tokens to consider.
        temperature=0,  # Randomness of the sampling (0 means greedy decoding).
        seed=777,  # Seed for reproducibility.
        skip_special_tokens=True,  # Whether to skip special tokens in the output.
        max_tokens=1,  # Maximum number of tokens to generate per output sequence.
        logits_processors=logits_processors,
        logprobs=5,  # Return log probabilities for the top 5 tokens.
    ),
    use_tqdm=True,
)

end = time()
elapsed = (end - start) / 60.0  # minutes
print(f"Inference of {VALIDATE} samples took {elapsed} minutes!")

# Extrapolate from the 128 validation samples to 25,000 test samples, in hours.
submit = 25_000 / 128 * elapsed / 60
print(f"Submit will take {submit} hours")

6. Extract the inference probabilities

results = []
errors = 0

for i, response in enumerate(responses):
    try:
        # Log probabilities of the top candidates for the single generated token.
        x = response.outputs[0].logprobs[0]
        logprobs = []
        for k in KEEP:
            if k in x:
                # Convert the log probability back to a probability.
                logprobs.append(math.exp(x[k].logprob))
            else:
                logprobs.append(0)
                print(f"bad logits {i}")
        logprobs = np.array(logprobs)
        logprobs /= logprobs.sum()  # renormalize over the 3 choices
        results.append(logprobs)
    except Exception:
        # Fall back to a uniform prediction when no usable output is returned.
        #print(f"error {i}")
        results.append(np.array([1/3., 1/3., 1/3.]))
        errors += 1

print(f"There were {errors} inference errors out of {i+1} inferences")
results = np.vstack(results)
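
The renormalization step can be tried in isolation. The sketch below uses plain floats in place of vLLM's log-probability objects (in the code above the value is read from the `.logprob` attribute), and it adds a uniform fallback for the case where none of the three tokens appears in the top-k, which would otherwise divide by zero. The helper name and the example log probabilities are made up; the token ids are the ones printed in step 4.

```python
import math
from typing import Dict, List


def normalize_choice_probs(logprob_by_token: Dict[int, float], keep: List[int]) -> List[float]:
    """Turn top-k log probabilities into a distribution over the `keep` tokens.

    Tokens missing from the top-k get probability 0; if none are present,
    return a uniform distribution instead of dividing by zero.
    """
    probs = [math.exp(logprob_by_token[t]) if t in logprob_by_token else 0.0 for t in keep]
    total = sum(probs)
    if total == 0.0:
        return [1.0 / len(keep)] * len(keep)
    return [p / total for p in probs]


KEEP_IDS = [32, 33, 48731]  # ids for "A", "B", "tie" from the output in step 4
example = {32: math.log(0.6), 33: math.log(0.3), 99999: math.log(0.1)}

with_hits = normalize_choice_probs(example, KEEP_IDS)
no_hits = normalize_choice_probs({}, KEEP_IDS)
print(with_hits)
print(no_hits)
```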

7. Create the submission CSV

sub = pd.read_csv("./lmsys-chatbot-arena/sample_submission.csv")

if len(test) != VALIDATE:
    sub[["winner_model_a", "winner_model_b", "winner_tie"]] = results

sub.to_csv("submission.csv", index=False)
sub.head()

8. Compute the CV score

if len(test) == VALIDATE:
    true = test[['winner_model_a', 'winner_model_b', 'winner_tie']].values
    print(true.shape)

if len(test) == VALIDATE:
    from sklearn.metrics import log_loss
    print(f"CV logloss is {log_loss(true, results)}")
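
To interpret the score, it helps to compute multiclass log loss by hand on a tiny example. The sketch below is a plain-Python illustration, not scikit-learn's implementation (which may differ in details such as clipping). It shows that always predicting 1/3 for each of the three classes gives a log loss of ln(3), so a useful model should score below that.

```python
import math
from typing import List


def multiclass_log_loss(y_true: List[List[int]], y_prob: List[List[float]], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood of the true class, for one-hot `y_true`."""
    total = 0.0
    for truth, probs in zip(y_true, y_prob):
        for t, p in zip(truth, probs):
            if t == 1:
                # Clip to avoid log(0) on overconfident wrong predictions.
                total -= math.log(min(max(p, eps), 1.0))
    return total / len(y_true)


y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
confident = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

uniform_loss = multiclass_log_loss(y_true, uniform)
confident_loss = multiclass_log_loss(y_true, confident)
print(uniform_loss)    # equals ln(3), about 1.0986
print(confident_loss)  # equals -ln(0.8), about 0.2231
```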


小白狮ww
