OpenAI quietly funded an independent math benchmark before setting a record on it with o3

  • OpenAI funded FrontierMath, a leading AI math benchmark, which only became known when OpenAI announced its record-breaking performance on the test.
  • FrontierMath, introduced in November 2024, tests AI systems' ability to handle complex math problems. Its problems were created by over 60 leading mathematicians.
  • The connection between OpenAI and FrontierMath emerged on December 20 when OpenAI unveiled its new o3 model, achieving a 25.2% success rate on the benchmark's problems.
  • Epoch AI, the benchmark's developer, had an agreement preventing them from revealing OpenAI's support until o3's announcement. They acknowledged the connection in a footnote.
  • More than 60 mathematicians who created the benchmark problems were unaware of OpenAI's involvement even after o3's announcement.
  • Tamay Besiroglu of Epoch AI acknowledged mistakes and said OpenAI had access to many of the benchmark's problems and solutions before o3's announcement. Epoch AI kept a separate holdout set of problems private.
  • They have a verbal agreement with OpenAI prohibiting the company from using the materials to train models.
  • There is a recommendation for more transparency in AI benchmarking, especially as mathematical reasoning is a weakness of language models.
  • Epoch AI lead mathematician Elliot Glazer believes OpenAI's reported results are accurate, but says Epoch AI still needs to independently evaluate the model using the holdout set.
  • The situation highlights the complexity of AI benchmarking and the importance of test results in attracting attention and investment.