New study accuses LM Arena of gaming its popular AI benchmark

AI Chatbot Proliferation and Ranking Platform: The rapid growth of AI chatbots makes it hard to determine which models are improving. Traditional benchmarks have limitations, leading many to rely on vibes-based analysis from LM Arena. A new study claims LM Arena is rife with unfair practices favoring large companies.
- LM Arena Creation and Function: Created in 2023 as a UC Berkeley research project, LM Arena allows users to feed prompts to two unidentified AI models and vote for their preferred output. The aggregated data forms the leaderboard showing popular models and helps track improvements.
- Companies' Attention to Ranking: As the AI market heats up, companies are paying closer attention to the ranking. Google's Gemini 2.5 Pro sits at the top of the leaderboard, and DeepSeek has performed strongly in Chatbot Arena.
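
To make the vote aggregation described above concrete, here is a minimal sketch of how pairwise preference votes could be turned into a ranking. It uses a plain Elo-style update, which is an assumption chosen for illustration; the text does not say which rating method LM Arena applies, and the function name `elo_leaderboard`, the `k` factor, and the starting rating are all hypothetical.

```python
from collections import defaultdict


def elo_leaderboard(votes, k=32.0, base=1000.0):
    """Rank models from (winner, loser) vote pairs with an Elo-style update.

    Illustrative only: the rating scheme, k, and base are assumptions,
    not LM Arena's documented method.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An unexpected win moves the ratings more than an expected one.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes between three anonymous models.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for name, rating in elo_leaderboard(votes):
    print(f"{name}: {rating:.1f}")
```

Because every rating change is driven by which pairs are shown and how often, a scheme like this is sensitive to how matchups are sampled.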
Study on LM Arena's Unfair Practices: Researchers from Cohere Labs, Princeton, and MIT believe AI developers have over-relied on LM Arena. The new study claims the rankings are distorted by practices that help proprietary chatbots outrank open ones.
- Meta and Google's Private Testing: Some AI developers take advantage of private testing. Meta tested 27 versions of Llama-4 before releasing the one that appears on the leaderboard, and Google tested 10 variants of Gemini and Gemma.
- Promotion of Proprietary Models: LM Arena appears to favor proprietary models like Gemini, ChatGPT, and Claude. Certain models appear in arena faceoffs more often, which gives big companies more attention, while open model developers get less data.
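
The reason private testing can distort a ranking is a selection effect: if several variants are scored with noisy votes and only the best result is published, the published score tends to overstate true quality. The simulation below is a rough sketch of that effect under simplified assumptions (every variant has the same true skill, and measurement noise is Gaussian); it is not the study's methodology, and all function names and parameters are hypothetical.

```python
import random
import statistics


def published_score(n_variants, rng, true_skill=0.0, noise=1.0):
    """Score n equally skilled variants with noise and keep only the best."""
    scores = [rng.gauss(true_skill, noise) for _ in range(n_variants)]
    return max(scores)


def mean_published_score(n_variants, trials=2000, seed=0):
    """Average best-of-n score over many simulated private testing rounds."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return statistics.fmean(
        published_score(n_variants, rng) for _ in range(trials)
    )


# Variant counts 27 and 10 mirror the figures reported above; 1 is the
# baseline of a developer who submits a single model.
for n in (1, 10, 27):
    print(f"variants tested: {n:2d}  mean published score: {mean_published_score(n):.2f}")
```

Under these assumptions the average published score rises with the number of variants tested even though no variant is actually better, which is the kind of inflation that limits on private submissions are meant to curb.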
Suggestions to Make LM Arena Fairer: The study authors suggest limiting the number of models a group can add and retract before release and showing all model results. LM Arena operators disagree with some of the methodology and conclusions.
- Alignment on Unfair Matchups: Both sides may agree on fair sampling to ensure open models appear more in Chatbot Arena. LM Arena will work on making the sampling algorithm more varied.
LM Arena's Future: LM Arena recently formed a corporate entity. Whether it is an objectively better way to evaluate chatbots than academic tests remains debated. Optimizing for vibes-based votes may push models toward sycophantic tendencies, as seen with ChatGPT.

Overall, LM Arena is a popular AI ranking platform facing claims of unfair practices. Efforts are under way to make it fairer, while its future and its impact on chatbot development remain topics of discussion.