- Main Points: Google DeepMind's QuestBench benchmark assesses whether LLMs can identify the one crucial question needed to solve an underspecified logic, planning, or math problem. It formalizes information gathering as an underspecified constraint satisfaction problem (CSP) in which exactly one unknown must be asked about (a toy sketch of this formalization follows the list). The team evaluated state-of-the-art models on structured reasoning tasks with clearly defined ground truth, across several models and prompting settings, and found that LLMs perform well on GSM-Q and GSME-Q but struggle on Logic-Q and Planning-Q. The results underline why LLM evaluation benchmarks are crucial.
- Key Information: QuestBench covers underspecified reasoning tasks that become solvable by asking at most one clarifying question. LLMs are increasingly applied to reasoning tasks, yet real queries are often underspecified. The benchmark formalizes the problem as a 1-sufficient CSP (knowing the value of a single unknown variable suffices to determine the target) and comprises four task categories spanning logic, planning, and math domains: Logic-Q, Planning-Q, GSM-Q, and GSME-Q. State-of-the-art LLMs were evaluated in zero-shot (ZS), chain-of-thought (CoT), and four-shot (4S) settings. The studies report correlations between accuracy and problem difficulty, reveal differences in ability across domains, and draw domain-specific conclusions about model performance.
- Important Details: Each problem instance consists of a user request, a set of candidate questions, and the correct question(s) to ask; a hedged sketch of an evaluation loop over such instances also follows the list. SOTA and near-SOTA LLMs perform very differently across the tasks. The authors analyzed the correlation between model accuracy and several axes of difficulty. Benchmarks like this help characterize model strengths and limitations and guide fine-tuning and model selection. Instructions for running the evaluation in a local environment are provided.
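
The 1-sufficient CSP framing can be made concrete with a toy example. The sketch below is illustrative only and is not QuestBench's actual code; the variable names, the directed constraint encoding, and the helper functions are assumptions made for the example.

```python
# Minimal sketch (not the official QuestBench code) of the "1-sufficient CSP" idea:
# the target cannot be derived from the known facts alone, but becomes derivable
# once the value of exactly one unknown variable is asked for.

# Constraints are encoded as (output_var, set_of_input_vars): the output is
# computable once all inputs are known. For simplicity constraints are treated
# as one-directional; the concrete arithmetic is irrelevant for choosing the question.
CONSTRAINTS = [
    ("total_apples", {"alice_apples", "bob_apples"}),  # total = alice + bob
    ("bob_apples",   {"alice_apples"}),                # bob = alice + 3
]

def derivable(known: set, target: str, constraints) -> bool:
    """Forward-chain over constraints until no new variable becomes known."""
    known = set(known)
    changed = True
    while changed:
        changed = False
        for out, inputs in constraints:
            if out not in known and inputs <= known:
                known.add(out)
                changed = True
    return target in known

def sufficient_questions(known: set, unknowns: set, target: str, constraints):
    """Return the unknowns whose value alone would make `target` derivable."""
    if derivable(known, target, constraints):
        return []  # already fully specified: no question needed
    return [v for v in unknowns if derivable(known | {v}, target, constraints)]

# "How many apples do Alice and Bob have together?" gives no values, so under
# this directed encoding the only sufficient question is to ask for Alice's count.
print(sufficient_questions(known=set(),
                           unknowns={"alice_apples", "bob_apples"},
                           target="total_apples",
                           constraints=CONSTRAINTS))  # -> ['alice_apples']
```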
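
For the per-instance structure and accuracy measurement described above, a zero-shot evaluation loop might look roughly like the following. The `Instance` fields, prompt wording, and `query_llm` stub are hypothetical assumptions for illustration, not the official QuestBench schema or harness.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    request: str        # the underspecified user request
    choices: list       # candidate clarifying questions, one per letter
    correct: set        # letters of the question(s) that are sufficient to ask

def query_llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g., an API client); returns a letter."""
    raise NotImplementedError

def zero_shot_prompt(inst: Instance) -> str:
    options = "\n".join(f"{chr(ord('A') + i)}. {q}" for i, q in enumerate(inst.choices))
    return (
        "The following request is missing one piece of information.\n"
        f"Request: {inst.request}\n"
        f"Which single question should you ask in order to answer it?\n{options}\n"
        "Reply with the letter of the best question."
    )

def evaluate(instances: list) -> float:
    """Accuracy = fraction of instances where the chosen question is a correct one."""
    hits = 0
    for inst in instances:
        answer = query_llm(zero_shot_prompt(inst)).strip().upper()[:1]
        if answer in inst.correct:
            hits += 1
    return hits / len(instances) if instances else 0.0
```

Chain-of-thought and few-shot variants would only change how `zero_shot_prompt` is built (adding a reasoning instruction or worked examples); the scoring against the ground-truth question set stays the same.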