
This tutorial loads the AWQ-quantized Qwen2.5-3B-Instruct with vLLM on an RTX 4090.

  • For each test question, we use the training data to retrieve a set of similar questions that "support" it.
    • Similarity considers things like the construct and the subject.
  • Using this set of similar questions, we build a conversation that can be fed to our model.
    • The conversation uses the recently supported chat() feature.
    • We generate n responses at a slightly higher temperature to create diverse outputs.
  • For each question/answer pair we now have n inferred misconceptions, and for each of them we retrieve the top 25 misconceptions by embedding similarity (this notebook loads nomic-embed-text-v1.5).
  • The 25 closest misconceptions for each of the n inferred misconceptions per question/answer pair can then be combined with Borda ranking, which is about the simplest possible form of ensembling.

Tutorial link: https://go.openbayes.com/suzcp
Cloud platform: OpenBayes
Sign-up link: http://openbayes.com/console/signup?r=sony_0m6v

Log in at http://OpenBayes.com. On the 「公共教程」 (Public Tutorials) page, select the tutorial 「使用 vLLM 加载大模型进行少样本学习」 (Few-Shot Learning with a Large Model Loaded via vLLM).


After the page opens, click 「克隆」 (Clone) in the upper-right corner to clone the tutorial into your own container.


Select 「NVIDIA GeForce RTX 4090」 and the 「vLLM」 image. OpenBayes has introduced new billing options; choose 「按量付费」 (pay-as-you-go) or a daily/weekly/monthly package as needed, then click 「继续执行」 (Continue). You can use the invitation link at the beginning of this article to get free RTX 4090 compute time!


Wait a moment while the system allocates resources. Once the status changes to 「运行中」 (Running), click 「打开工作空间」 (Open Workspace).


Inside the workspace, open the 「README.ipynb」 file in the directory on the left to see the tutorial's run steps.


The detailed run steps are as follows:

1. Import the required libraries
import os
import gc
import ctypes
import numpy as np
import pandas as pd

from random import sample
from tqdm.auto import tqdm
from eedi_metrics import mapk, apk
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import cosine_similarity

import torch
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer, AutoModel
os.environ["CUDA_VISIBLE_DEVICES"]   = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def clean_memory(deep=False):
    gc.collect()                                # collect Python-level garbage
    if deep:
        ctypes.CDLL("libc.so.6").malloc_trim(0) # return freed heap pages to the OS (glibc only)
    torch.cuda.empty_cache()                    # release cached, unused CUDA memory

2. Load the data

k = 3                   # number of similar support questions to retrieve per row

train_eval = True       # evaluate on a sample of train.csv instead of the test set
n_train_eval_rows = 100

comp_dir = './eedi-mining-misconceptions-in-mathematics'

llm_model_pth   = '/input0/Qwen2.5-3B-Instruct-AWQ'
embed_model_pth = '/input0/nomic-embed-text-v1.5'

if os.getenv("KAGGLE_IS_COMPETITION_RERUN"):
    train_eval = False
if train_eval:
    test       = pd.read_csv(f'{comp_dir}/train.csv').sample(n_train_eval_rows, random_state=3)
    test       = test.sort_values(['QuestionId'], ascending=True).reset_index(drop=True)
else:
    test       = pd.read_csv(f'{comp_dir}/test.csv')

train          = pd.read_csv(f'{comp_dir}/train.csv')
sample_sub     = pd.read_csv(f'{comp_dir}/sample_submission.csv')
misconceptions = pd.read_csv(f'{comp_dir}/misconception_mapping.csv')

len(train), len(test), len(misconceptions)
(1869, 100, 2587)

3. Launch Qwen2.5-3B-Instruct-AWQ with vLLM

If you run into OOM errors, reducing max_num_seqs to 4 or 8, or even 1, may help (the default is 256).

llm = LLM(
    llm_model_pth,
    trust_remote_code=True,
    dtype="half", max_model_len=4096,
    tensor_parallel_size=1, gpu_memory_utilization=0.95, 
)

tokenizer = llm.get_tokenizer()

INFO 11-28 10:39:42 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 11-28 10:39:42 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/input0/Qwen2.5-3B-Instruct-AWQ', speculative_config=None, tokenizer='/input0/Qwen2.5-3B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/input0/Qwen2.5-3B-Instruct-AWQ, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 11-28 10:39:43 model_runner.py:1056] Starting to load model /input0/Qwen2.5-3B-Instruct-AWQ...


INFO 11-28 10:39:44 model_runner.py:1067] Loading model weights took 1.9550 GB
INFO 11-28 10:39:44 gpu_executor.py:122] # GPU blocks: 75545, # CPU blocks: 7281
INFO 11-28 10:39:44 gpu_executor.py:126] Maximum concurrency for 4096 tokens per request: 295.10x
INFO 11-28 10:39:46 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-28 10:39:46 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 11-28 10:39:59 model_runner.py:1523] Graph capturing finished in 13 secs.
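
If OOM does occur, a more conservative engine configuration might look like the following sketch (max_num_seqs and gpu_memory_utilization are standard vLLM engine arguments; the exact values here are illustrative, not tuned):

llm = LLM(
    llm_model_pth,
    trust_remote_code=True,
    dtype="half", max_model_len=4096,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.90,  # reserve a little less memory for the KV cache
    max_num_seqs=8,               # default is 256; fewer concurrent sequences means less memory pressure
)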

4. Post-process the data

answer_cols         = ["AnswerAText", "AnswerBText", "AnswerCText", "AnswerDText"]
misconception_cols  = ["MisconceptionAId", "MisconceptionBId", "MisconceptionCId", "MisconceptionDId"]

keep_cols           = ["QuestionId", "CorrectAnswer", "ConstructName", "SubjectName", "QuestionText" ]

def wide_to_long(df: pd.DataFrame) -> pd.DataFrame:

    # Melt the answer columns
    answers_df = pd.melt(
        df[keep_cols + answer_cols],
        id_vars=keep_cols,
        var_name='Answer', value_name='Value'
    ).sort_values(["QuestionId", "Answer"]).reset_index(drop=True)

    if misconception_cols[0] not in df.columns:  # If test set
        return answers_df

    # Melt the misconception columns
    misconceptions_df = pd.melt(
        df[keep_cols + misconception_cols],
        id_vars=keep_cols,
        var_name='Misconception', value_name='MisconceptionId'
    ).sort_values(["QuestionId", "Misconception"]).reset_index(drop=True)

    # Rows align because both melts sort the A-D columns in the same order
    answers_df[['Misconception', 'MisconceptionId']] = misconceptions_df[['Misconception', 'MisconceptionId']]

    return answers_df

test  = wide_to_long(test)
train = wide_to_long(train)

test['AnswerId']  = test.Answer.str.replace('Answer', '').str.replace('Text', '')
train['AnswerId'] = train.Answer.str.replace('Answer', '').str.replace('Text', '')

train = pd.merge(train, misconceptions, on='MisconceptionId', how='left')
if train_eval:
    test = pd.merge(test, misconceptions, on='MisconceptionId', how='left')
train.head(3)
| | QuestionId | CorrectAnswer | ConstructName | SubjectName | QuestionText | Answer | Value | Misconception | MisconceptionId | AnswerId | MisconceptionName |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A | Use the order of operations to carry out calcu... | BIDMAS | [ 3 \times 2+4-5 ] Where do the brackets ... | AnswerAText | ( 3 \times(2+4)-5 ) | MisconceptionAId | NaN | A | NaN |
| 1 | 0 | A | Use the order of operations to carry out calcu... | BIDMAS | [ 3 \times 2+4-5 ] Where do the brackets ... | AnswerBText | ( 3 \times 2+(4-5) ) | MisconceptionBId | NaN | B | NaN |
| 2 | 0 | A | Use the order of operations to carry out calcu... | BIDMAS | [ 3 \times 2+4-5 ] Where do the brackets ... | AnswerCText | ( 3 \times(2+4-5) ) | MisconceptionCId | NaN | C | NaN |
test.head(3)
| | QuestionId | CorrectAnswer | ConstructName | SubjectName | QuestionText | Answer | Value | Misconception | MisconceptionId | AnswerId | MisconceptionName |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | A | Convert between cm and m | Length Units | [450 \mathrm{~cm}=] [\square \mathrm{~m}] | AnswerAText | ( 4.5 ) | MisconceptionAId | NaN | A | NaN |
| 1 | 31 | A | Convert between cm and m | Length Units | [450 \mathrm{~cm}=] [\square \mathrm{~m}] | AnswerBText | ( 45 ) | MisconceptionBId | 704 | B | Thinks there are 10cm in a metre |
| 2 | 31 | A | Convert between cm and m | Length Units | [450 \mathrm{~cm}=] [\square \mathrm{~m}] | AnswerCText | ( 5 ) | MisconceptionCId | 1272 | C | Gives a rounded whole number instead of a decimal |

5. Helper functions

Get the most similar question_ids given a subject and construct

The function below first returns question IDs whose construct and subject both match. If that does not reach top_k questions, it falls back to questions with a matching subject or construct. If we are still short of question IDs, random questions are selected for the remaining top_k slots.

def get_topk_similar_rows(question_id: int, construct: str, subject: str, top_k: int) -> list[int]:
    """ Gets the top k ids of questions that are most similar to the given construct and subject """

    # Rows with similar construct and subject
    similar_cs_rows = train[(train.ConstructName == construct) & (train.SubjectName == subject)]
    similar_cs_qids = list(set(similar_cs_rows.QuestionId.values.tolist()))

    if train_eval and question_id in similar_cs_qids:
        similar_cs_qids.remove(question_id)

    if len(similar_cs_qids) >= top_k:
        k_similar_cs_qids = sample(similar_cs_qids, top_k)
        return k_similar_cs_qids

    # Rows with similar construct or subject for remainder of top_k
    similar_s_rows = train[(train.ConstructName != construct) & (train.SubjectName == subject)]
    similar_c_rows = train[(train.ConstructName == construct) & (train.SubjectName != subject)]
    similar_c_or_s_qids = list(set(similar_s_rows.QuestionId.values.tolist() + similar_c_rows.QuestionId.values.tolist()))

    if train_eval and question_id in similar_c_or_s_qids:
        similar_c_or_s_qids.remove(question_id)

    if len(similar_c_or_s_qids) >= top_k - len(similar_cs_qids):
        n_similar_c_or_s_qids = sample(similar_c_or_s_qids, top_k - len(similar_cs_qids))
        return similar_cs_qids + n_similar_c_or_s_qids

    # Random rows for remainder of top_k
    not_so_similar_rows = train[(train.ConstructName != construct) & (train.SubjectName != subject)]
    not_so_similar_rows_qids = list(set(not_so_similar_rows.QuestionId.values.tolist()))

    if train_eval and question_id in not_so_similar_rows_qids:
        not_so_similar_rows_qids.remove(question_id)

    # Keep everything collected so far and top up with random questions
    n_not_so_similar_rows_qids = sample(not_so_similar_rows_qids, top_k - len(similar_cs_qids) - len(similar_c_or_s_qids))
    return similar_cs_qids + similar_c_or_s_qids + n_not_so_similar_rows_qids
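
For example, an illustrative call against the frames defined above (the returned IDs depend on your sampled split):

row = test.iloc[0]
support_qids = get_topk_similar_rows(row['QuestionId'], row['ConstructName'], row['SubjectName'], top_k=k)
print(support_qids)  # e.g. a list of k question ids such as [691, 1119, 1774]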

Get the chat conversation for each question

def get_conversation_msgs(question, correct_ans, incorrect_ans, misconception):
    msgs = [
        {'role': 'user',      'content': 'Question: ' + question.strip()},
        {'role': 'assistant', 'content': 'Provide me with the correct answer for a baseline.'},
        {'role': 'user',      'content': 'Correct Answer: ' + correct_ans.strip()},
        {'role': 'assistant', 'content': 'Now provide the incorrect answer and I will analyze the difference to infer the misconception.'},
        {'role': 'user',      'content': 'Incorrect Answer: ' + incorrect_ans.strip()},
    ]
    
    if misconception is not None:
        msgs += [{'role': 'assistant', 'content': 'Misconception for incorrect answer: ' + misconception}]
        
    return msgs
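
To inspect the prompt the model will actually see, you can render a conversation with the tokenizer's chat template (an illustrative snippet: the toy question is made up, and add_generation_prompt=True appends the assistant header so the model completes the missing misconception):

msgs = get_conversation_msgs(
    question = 'What is 3 + 4 x 2?',  # hypothetical toy question
    correct_ans = '11',
    incorrect_ans = '14',
    misconception = None,             # None: this is the pair whose misconception we want inferred
)
print(tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))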

6. Use llm.chat

Note: llm.chat() is fairly new and only available in recent vLLM versions.
We generate n outputs with a higher temperature to create diverse candidate misconceptions, which we can later use to rank the results.

sampling_params = SamplingParams(
    n=10,                     # Number of output sequences to return for each prompt
    # top_p=0.5,              # Cumulative probability of the top tokens to consider
    temperature=0.7,          # Randomness of the sampling
    seed=1,                   # Seed for reproducibility
    skip_special_tokens=True, # Whether to skip special tokens in the output
    max_tokens=64,            # Maximum number of tokens to generate per output sequence
    stop=['\n\n', '. '],      # Strings that stop the generation when they are generated
)
submission = []
for idx, row in tqdm(test.iterrows(), total=len(test)):
    
    if idx % 50 == 0:
        clean_memory()
        clean_memory()
    
    if row['CorrectAnswer'] == row['AnswerId']: continue          # correct answers carry no misconception
    if train_eval and not row['MisconceptionId'] >= 0: continue   # skip rows without a labelled misconception (NaN)
        
    context_qids   = get_topk_similar_rows(row['QuestionId'], row['ConstructName'], row['SubjectName'], k)
    correct_answer = test[(test.QuestionId == row['QuestionId']) & (test.CorrectAnswer == test.AnswerId)].Value.tolist()[0]
    
    messages = []
    for qid in context_qids:
        correct_option = train[(train.QuestionId == qid) & (train.CorrectAnswer == train.AnswerId)]
        incorrect_options = train[(train.QuestionId == qid) & (train.CorrectAnswer != train.AnswerId)]
        for _, incorrect_option in incorrect_options.iterrows():  # '_' avoids shadowing the outer loop's idx
            if type(incorrect_option['MisconceptionName']) == str: # Filter out NaNs
                messages += get_conversation_msgs(
                    question = correct_option.QuestionText.tolist()[0],
                    correct_ans = correct_option.Value.tolist()[0],
                    incorrect_ans = incorrect_option['Value'],
                    misconception = incorrect_option['MisconceptionName'],
                )
                
    # Conversation for the incorrect answer whose misconception we want to infer
    messages += get_conversation_msgs(
        question = row['QuestionText'],
        correct_ans = correct_answer,
        incorrect_ans = row['Value'],
        misconception = None,
    )
    
    output = llm.chat(messages, sampling_params, use_tqdm=False)
    inferred_misconceptions = [imc.text.split(':')[-1].strip() for imc in output[0].outputs]  # keep text after any 'Misconception ...:' prefix
    if not train_eval:
        submission.append([f"{row['QuestionId']}_{row['AnswerId']}", inferred_misconceptions])
    else:
        submission.append([
            f"{row['QuestionId']}_{row['AnswerId']}", 
            inferred_misconceptions, 
            context_qids,
            [int(row['MisconceptionId'])],
            row['MisconceptionName']
        ])
submission = pd.DataFrame(submission, columns=['QuestionId_Answer', 'InferredMisconception', 'TopKQuestionIDs', 
                                               'MisconceptionIdGT', 'MisconceptionNameGT'][:len(submission[0])])

len(submission)
227
submission.head()
| | QuestionId_Answer | InferredMisconception | TopKQuestionIDs | MisconceptionIdGT | MisconceptionNameGT |
|---|---|---|---|---|---|
| 0 | 31_B | [Incorrectly divided by 100 (or multiplied by ... | [691, 1119, 1774] | [704] | Thinks there are 10cm in a metre |
| 1 | 31_C | [Incorrectly divided by 100 (or used the wrong... | [691, 1119, 1774] | [1272] | Gives a rounded whole number instead of a decimal |
| 2 | 31_D | [Multiplied when converting to a larger unit, ... | [691, 1119, 257] | [1651] | Multiplies when converting to a larger unit |
| 3 | 61_D | [Not realizing that the star is halfway betwee... | [457, 1587, 696] | [990] | Does not realise you can use equivalent fracti... |
| 4 | 65_B | [Believes the value under the square root (the... | [1196, 807, 509] | [2316] | Mixes up squaring and multiplying by 2 or doub... |

7. Find the most similar misconceptions

Delete the LLM and clean up memory so the embedding model can be loaded

del llm

clean_memory(deep=True)
clean_memory(deep=True)
tokenizer   = AutoTokenizer.from_pretrained(embed_model_pth, trust_remote_code=True)
embed_model = AutoModel.from_pretrained(embed_model_pth, trust_remote_code=True).to("cuda:0")
<All keys matched successfully>
def generate_embeddings(texts, batch_size=8):
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        inputs = tokenizer(batch_texts, padding=True, truncation=True, return_tensors="pt", max_length=1024).to('cuda:0')
        with torch.no_grad():
            outputs = embed_model(**inputs)
        embeddings = outputs.last_hidden_state[:, 0, :]  # CLS token
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
        all_embeddings.append(embeddings.cpu().numpy())
        
    return np.concatenate(all_embeddings, axis=0)
all_ctx_vector  = generate_embeddings(list(misconceptions.MisconceptionName.values))

all_ctx_vector.shape
(2587, 768)
n_results = []

for results in tqdm(pd.DataFrame(submission.InferredMisconception.values.tolist()).T.values):
    all_text_vector = generate_embeddings(list(results))
    cosine_similarities = cosine_similarity(all_text_vector, all_ctx_vector)
    test_sorted_indices = np.argsort(-cosine_similarities, axis=1)
    n_results.append(test_sorted_indices)

n_results = np.array(n_results)
n_results.shape
(10, 227, 2587)
n_results = np.transpose(n_results, (1, 0, 2))
n_results.shape
(227, 10, 2587)

Combine the rankings from each generated output for every question

Borda count is a very simple ranking mechanism: in each of the n rankings, the item in position i (0-based) earns num_elements - i points, and items are re-ranked by their total points across all rankings.

def borda_count(rankings):
    scores = {}
    num_elements = len(next(iter(rankings)))
    
    for model_ranking in rankings:
        for idx, item in enumerate(model_ranking):
            points = num_elements - idx
            scores[item] = scores.get(item, 0) + points
            
    # Sort the misconceptions based on total points
    final_ranking = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    ranked_results = [r for r, score in final_ranking]
    return ranked_results

# Compute the final ranking
final_rankings = np.array([borda_count(result) for result in n_results])

final_rankings.shape
(227, 2587)
submission['MisconceptionId'] = final_rankings[:, :25].tolist()
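
To make the Borda mechanics concrete, here is a toy run of borda_count (made-up IDs, not competition data):

rankings = [
    [10, 20, 30],  # generation 1: 10 earns 3 points, 20 earns 2, 30 earns 1
    [20, 10, 30],  # generation 2: 20 earns 3 points, 10 earns 2, 30 earns 1
]
print(borda_count(rankings))  # [10, 20, 30] -- 10 and 20 tie on 5 points, 30 trails with 2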

8. Submission

if train_eval:
    submission['apk@25'] = submission.apply(lambda row: apk(row['MisconceptionIdGT'], row['MisconceptionId']), axis=1)
    submission.to_csv('submission_debug.csv', index=False)
    
    print(submission['apk@25'].mean())

0.1415299510916358
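
Here apk is average precision at k, matching the competition's MAP@25 metric. For reference, a minimal sketch of the standard definition (the imported eedi_metrics implementation may differ in details):

def apk_sketch(actual, predicted, k=25):
    """Average precision at k for a single row (standard definition)."""
    predicted = predicted[:k]
    score, hits = 0.0, 0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:  # count each relevant id once
            hits += 1
            score += hits / (i + 1)                 # precision at cut-off i+1
    return score / min(len(actual), k) if actual else 0.0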

submission["MisconceptionId"] = submission["MisconceptionId"].apply(lambda x: ' '.join(map(str, x)))
submission[['QuestionId_Answer', 'MisconceptionId']].to_csv('submission.csv', index=False)
submission.head(25)
| | QuestionId_Answer | InferredMisconception | TopKQuestionIDs | MisconceptionIdGT | MisconceptionNameGT | MisconceptionId | apk@25 |
|---|---|---|---|---|---|---|---|
| 0 | 31_B | [Multiplies by 100 instead of dividing by 100,... | [691, 1119, 1774] | [704] | Thinks there are 10cm in a metre | 2187 1035 2350 1579 2335 2408 2481 752 1408 33... | 0 |
| 1 | 31_C | [Believes there are 100 cm in a metre, Assumes... | [691, 1119, 1774] | [1272] | Gives a rounded whole number instead of a decimal | 613 447 1801 1151 1795 2408 752 2187 1579 566 ... | 0 |
| 2 | 31_D | [Multiplies by 100 instead of dividing by 100,... | [691, 1119, 257] | [1651] | Multiplies when converting to a larger unit | 1341 39 61 2187 2335 1035 2481 975 2134 2350 2... | 0 |
| 3 | 61_D | [Does not recognize that the star is halfway b... | [457, 1587, 696] | [990] | Does not realise you can use equivalent fracti... | 1212 2134 1119 916 1184 684 1309 1807 579 1206... | 0 |
| 4 | 65_B | [Believes the coefficient of ( h ) in the eq... | [1196, 807, 509] | [2316] | Mixes up squaring and multiplying by 2 or doub... | 1743 2372 341 2070 1904 2256 540 2324 1390 116... | 0 |
| 5 | 65_C | [Does not correctly identify the value under t... | [1196, 807, 634] | [2245] | When using the formula to solve a quadratic eq... | 170 340 1735 2245 3 2256 341 265 994 245 1987 ... | 0.25 |
| 6 | 69_A | [Assumes that the sample size is solely respon... | [830, 1606, 1700] | [906] | Does not know that sample size affects reliabi... | 2325 880 1923 1600 63 2065 453 2207 163 2299 4... | 0 |
| 7 | 69_C | [Assumes the sample sizes are equal or that th... | [622, 977, 734] | [906] | Does not know that sample size affects reliabi... | 1923 2065 63 2325 880 1600 906 1225 724 2309 1... | 0.142857 |
| 8 | 69_D | [Assumes reliability is independent of sample ... | [1195, 1827, 1860] | [906] | Does not know that sample size affects reliabi... | 2325 906 1681 1923 2561 880 2065 1912 2207 453... | 0.5 |
| 9 | 70_A | [The student might have added the percentage v... | [59, 1507, 548] | [2023] | Thinks when finding a percentage you divide by... | 388 2276 2408 329 1601 2138 1955 2191 403 2518... | 0 |
| 10 | 81_A | [Orders the numbers from smallest to largest b... | [1834, 1169, 473] | [1468] | Orders integers based on first digits without ... | 2546 561 399 1999 1941 2540 1672 1016 22 1119 ... | 0 |
| 11 | 81_C | [Orders the numbers incorrectly by not conside... | [657, 714, 480] | [1365] | When ordering integers, orders from the digits... | 1365 561 1999 2262 1124 388 1941 1378 1672 251... | 1 |
| 12 | 83_B | [Rounds up to the next significant figure inst... | [920, 1080, 1059] | [1988] | Rounds up instead of down | 1165 1105 794 1591 1157 1705 2116 1988 1817 14... | 0.125 |
| 13 | 83_C | [Rounds to a degree of accuracy that is not ne... | [920, 1059, 1080] | [1744] | Rounded to nearest 100 instead of 1sf | 1529 739 1591 1165 2392 1105 1157 1817 1705 20... | 0 |
| 14 | 85_A | [The correct answer for the first term of the ... | [89, 1029, 437] | [1240] | Thinks the first term of a sequence must be 1 | 1240 108 2475 2472 1354 2376 2139 1716 936 162... | 1 |
| 15 | 85_B | [Assumes the first term is the coefficient of ... | [89, 1029, 456] | [2376] | When finding the nth term of a linear sequence... | 2252 2475 2139 2513 1821 528 2376 849 1240 109... | 0.142857 |
| 16 | 103_A | [Multiplied the slanted height by the length t... | [353, 1538, 1161] | [867] | When finding the area of a parallelogram does ... | 2332 1788 1883 1985 669 2105 307 700 1175 590 ... | 0 |
| 17 | 103_B | [Multiplied the slanted height by the length i... | [1161, 991, 353] | [669] | Has used slant height and base to find area of... | 2332 1788 669 2105 590 1883 1985 1698 342 1926... | 0.333333 |
| 18 | 103_D | [Uses the slanted height instead of the perpen... | [991, 1538, 1161] | [695] | Has found the area of a triangle rather than a... | 2332 669 1788 2105 1883 459 590 1780 2300 396 ... | 0 |
| 19 | 112_A | [Confuses division with subtraction when think... | [258, 1777, 580] | [2093] | Thinks the fraction bar means subtract rather ... | 1672 1941 15 752 566 1971 493 481 2134 357 240... | 0.043478 |
| 20 | 112_C | [Believes the calculation is simply the subtra... | [1281, 1162, 1131] | [2093] | Thinks the fraction bar means subtract rather ... | 752 566 848 1795 1431 1088 1297 1482 2512 477 ... | 0 |
| 21 | 112_D | [Division by a whole number does not equate to... | [759, 1457, 1257] | [1542] | Believes that a fraction means dividing the de... | 58 1042 812 2559 151 839 232 2525 371 1619 200... | 0 |
| 22 | 140_A | [When factorising a quadratic without a non-va... | [847, 1291, 1057] | [838] | When factorising a quadratic without a non var... | 2240 838 2581 2479 1432 265 2142 2068 1871 102... | 0.5 |
| 23 | 140_B | [When factorising the expression ( p^2 - 99p ... | [680, 200, 455] | [838] | When factorising a quadratic without a non var... | 2240 628 2581 2479 838 319 1666 1432 320 2142 ... | 0.2 |
| 24 | 146_A | [Assumes the fraction is represented as a deci... | [47, 1690, 818] | [1637] | Has used the decimal point to separate numerat... | 1825 78 257 1166 2406 72 318 1759 169 1478 157... | 0 |
