AI 现在能做数学吗？一位数学家的思考。

Main points: This week, OpenAI's new language model o3 got 25% on FrontierMath. A language model is like ChatGPT that answers questions. FrontierMath is a secret dataset of hard math questions. The questions are "find this number!" rather than "prove this theorem!". The dataset is valuable for AI in math but grading is hard. An article in Science quoted the author saying AI might be in trouble if it can ace the database. o3's 25% score shocked the author as they expected it to be unattackable for a while. There's confusion about the actual level of the secret dataset. DeepMind's AlphaProof solved some IMO problems with full formalized Lean proofs at high school level. In 2025, machines may perform at gold level in the IMO and we'll face the "grading" issue with language models and theorem provers.
Key information: o3 is a new language model. FrontierMath has "hundreds" of hard math questions. Sample questions have positive whole number solutions. Tao and Borcherds commented on the difficulty. Grading "prove this theorem!" questions is expensive. AlphaProof solved IMO problems. Language models and theorem provers have different grading challenges.
Important details: ChatGPT was the first coherent public language model. There are rumors about the number of questions in FrontierMath. The author could solve some sample questions but not all. The first and second sample questions require advanced knowledge. Frieder et al's article discusses dataset shortcomings. The author was shocked by o3's score. Elliot Glazer claimed 25% of problems were "IMO/undergrad style". In 2025, machines may enter the IMO with different submission types. Marking language model "proofs" is difficult. We want more than just "prove this theorem!" and need both accuracy and human understanding.