OpenAI quietly funded an independent math benchmark before setting a record on it with o3

  • OpenAI funded FrontierMath, a leading AI math benchmark, which only became known when OpenAI announced its record-breaking performance on the test.
  • FrontierMath, introduced in November 2024, tests AI systems' ability to handle complex math problems. Its problems were created by over 60 leading mathematicians.
  • The connection between OpenAI and FrontierMath emerged on December 20 when OpenAI unveiled its new o3 model, achieving a 25.2% success rate on the benchmark's problems.
  • Epoch AI, the benchmark's developer, had an agreement preventing them from revealing OpenAI's support until o3's announcement. They acknowledged the connection in a footnote.
  • More than 60 mathematicians who created the benchmark problems were unaware of OpenAI's involvement even after o3's announcement.
  • Tamay Besiroglu of Epoch AI acknowledged mistakes and said OpenAI had access to many of the benchmark's problems and solutions before o3's announcement. Epoch AI kept a separate holdout set of problems private.
  • They have a verbal agreement with OpenAI prohibiting the company from using the materials to train models.
  • There is a recommendation for more transparency in AI benchmarking, especially as mathematical reasoning is a weakness of language models.
  • Epoch AI lead mathematician Elliot Glazer believes OpenAI's reported results are accurate, but says Epoch AI still needs to independently evaluate the model using the holdout set.
  • The situation highlights the complexity of AI benchmarking and the importance of test results in attracting attention and investment.