Formal Verification · Research Milestone · Jun 16, 2026, 6:09 PM · 6 min read

AI Systems Pass Research-Level Math Benchmark, Ushering in Era of Verified 'Co-Mathematicians'

Leading AI models successfully solved seven out of ten unpublished, research-level problems in a rigorous new benchmark. Paired with formal verification tools like the Lean theorem prover, AI is transforming from a flawed calculator into an ultra-rigorous collaborator for human mathematicians.

By Factlen Editorial Team

Formalization Advocates 40% · Traditional Mathematicians 35% · AI Developers 25%

Formalization Advocates: Researchers who believe AI and formal proof assistants will revolutionize mathematical collaboration.
Traditional Mathematicians: Experts who emphasize the irreplaceable role of human intuition and conceptual framing.
AI Developers: Technology companies focused on scaling AI reasoning to achieve autonomous mathematical discovery.

What's not represented

· Mathematics Educators
· Pure Mathematics Graduate Students

Why this matters

Mathematics is the foundational language of cryptography, physics, and computer science. By pairing AI's generative speed with mathematically guaranteed verification tools, researchers can dramatically accelerate the discovery of new materials, secure software systems, and complex engineering models without the risk of human error.

Key points

A Harvard-led benchmark tested AI systems on 10 unpublished, research-level math problems to prevent data contamination.
Collectively, the four AI systems tested solved 7 of the 10 problems, while the best single system solved 6, demonstrating genuine mathematical reasoning capabilities.
Mathematicians are increasingly pairing AI with the Lean theorem prover to instantly catch logical errors and hallucinations.
Lean acts as an immutable oracle; if an AI-generated proof compiles in Lean, it is guaranteed to follow from the formally stated theorem and Lean's axioms.
AI still struggles with the 'definition-finding gap,' meaning human intuition is required to frame and invent new mathematical concepts.
The field is shifting toward massive, AI-assisted collaborations, allowing researchers to verify complex proofs in weeks rather than years.

7 of 10: Unpublished research problems solved by AI
~120,000: Formalized lemmas in the Mathlib database
3 weeks: Time taken to formalize the Freiman-Ruzsa conjecture

For years, a central question has haunted the intersection of computer science and higher mathematics: Can artificial intelligence actually reason, or is it simply executing highly sophisticated pattern-matching against problems it has already seen on the internet? To find out, a group of 30 leading mathematicians gathered at Harvard University's Center of Mathematical Sciences and Applications in early June 2026. Their goal was to administer a test that no AI could possibly have studied for, aiming to definitively map the boundary between human intuition and machine capability.[1][2]

The project, dubbed "First Proof," was designed with a brilliantly simple safeguard against data contamination. The organizers sourced ten original, unpublished mathematical problems directly from the active research of top-tier mathematicians. Because these specific lemmas and theorems had never appeared in textbooks, published papers, or on preprint servers like arXiv, they were entirely absent from the training data of any large language model. For an AI to solve them, it would have to demonstrate genuine, on-the-fly mathematical reasoning rather than regurgitating memorized proofs.[1][3]

The results, announced in mid-June, marked a watershed moment for the field. Collectively, the four leading AI systems tested, which included frontier models from Google and OpenAI and a specialized system from ETH Zurich, earned passing grades on seven of the ten research-level problems. The blind-grading panel of human experts noted that while some AI solutions required minor revisions, others were deemed "flawless," and in one instance an AI model used a completely novel strategy that deeply impressed the referees.[1][3]

However, the benchmark also highlighted the enduring supremacy of human experts in specific domains. While the top-performing individual AI system solved six of the ten problems, the human mathematicians who contributed the questions were able to solve all of them. The consensus among the organizers was not one of impending doom for human mathematicians, but rather a recognition of "genuine capability." The AI systems proved they are no longer just parlor tricks; they are evolving into highly capable assistants that can handle the rigorous demands of actual research.[1][2][8]

Despite these impressive benchmark victories, a fundamental hurdle has historically prevented mathematicians from fully trusting AI: the problem of hallucination. In advanced mathematics, a proof can easily span over 100 pages of dense, interconnected logic. If an AI model confabulates even a single line—misapplying a theorem or dropping a crucial negative sign—the entire mathematical argument collapses. Because large language models are inherently probabilistic, they have traditionally been viewed as too unreliable for the absolute certainty required by the discipline.[5]

The solution to this trust deficit has come not from making language models less probabilistic, but from pairing them with an entirely different kind of software: the interactive theorem prover. The most prominent of these is Lean 4, a programming language and proof assistant that is rapidly becoming the gold standard in the mathematical community. Lean does not guess or estimate; it operates on strict, axiomatic logic, checking every single step of a mathematical argument against the foundational rules of mathematics.[4][5]

In this new paradigm, Lean acts as an "immutable verification oracle." When an AI system proposes a proof, it must write that proof in the Lean programming language, and Lean's kernel then checks every step. If the code compiles with no unproven placeholders, the proof is guaranteed to follow from the formal statement and Lean's axioms; the remaining risk is whether that statement was formalized faithfully. This architecture largely neutralizes the threat of AI hallucinations inside the proof itself. The language model is free to be creative, intuitive, and even make mistakes, because the Lean kernel will catch and reject any logical flaw before it is accepted as truth.[4][7]
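
To make the compile-or-reject idea concrete, here is a minimal Lean 4 sketch. It is an illustration chosen for this explainer, not code from the First Proof benchmark: the theorem name `add_comm_example` is hypothetical, while `Nat.add_comm` is a lemma from Lean's core library.

```lean
-- A true statement with a valid proof: Lean accepts the file.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Lists the axioms the proof depends on, a quick audit that
-- nothing was left unproven with a placeholder such as `sorry`.
#print axioms add_comm_example

-- A false claim cannot be pushed through. Uncommenting the next
-- line makes Lean report an error instead of accepting the proof:
-- theorem bogus (a : Nat) : a + 1 = a := rfl
```

The check is only as meaningful as the statement being proved: Lean confirms that the proof establishes the theorem as written, so a human still has to confirm that the formal statement says what was intended.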

This synthesis of generative AI and formal verification has given rise to the era of the "AI Co-Mathematician." Instead of functioning as autonomous oracles that spit out finished papers, modern AI systems act as tireless research assistants. They can rapidly search through vast databases of known theorems, propose multiple tactical approaches to a stubborn lemma, and grind through the tedious algebraic manipulations that often bog down human researchers. The human mathematician acts as the director, setting the high-level strategy while the AI handles the tactical execution.[4][6]

The real-world impact of this workflow is already transforming how landmark mathematics is done. In a widely celebrated milestone, a team led by Fields Medalist Terence Tao used Lean to formally verify the proof of the polynomial Freiman-Ruzsa conjecture. What would traditionally have been a painstaking, months-long process of peer review and manual checking was completed by a 25-person collaborative team in just three weeks. The formalization proved that complex, modern research results can be verified almost as quickly as they are written.[5]

This rapid acceleration is being fueled by the massive expansion of Mathlib, Lean's central library of formalized mathematics. Built by a global community of volunteer mathematicians and computer scientists, Mathlib has grown to encompass approximately 120,000 formalized lemmas as of early 2026. This repository functions as a comprehensive, machine-readable map of known mathematics. When an AI system such as ByteDance's Seed Prover or one of OpenAI's latest models attempts to solve a new problem, it can autonomously search Mathlib to find and apply the foundational theorems its proof requires.[5][7]
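
As a rough illustration of what searching and applying the library looks like, the Lean 4 sketch below assumes a project that already has Mathlib installed as a dependency. The goal and the lemma are picked purely as an example, and the article does not describe how any particular AI system performs its own retrieval.

```lean
import Mathlib

-- Library search: `exact?` asks Lean to look for an existing
-- lemma that closes the goal and to suggest it by name.
example (p : ℕ) (hp : p.Prime) : 2 ≤ p := by
  exact?

-- Once the lemma is known, the proof cites it directly.
-- `Nat.Prime.two_le` is the Mathlib fact that every prime is at least 2.
example (p : ℕ) (hp : p.Prime) : 2 ≤ p :=
  hp.two_le
```

The second form is what typically ends up in a finished proof: a named reference to a previously verified result, which Lean re-checks every time the file is compiled.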

Armed with these tools, AI systems are beginning to chip away at longstanding open questions. Recent reports indicate that advanced models, working in tandem with formal verification environments, have contributed to solutions for several open Erdős problems—a famous set of mathematical conjectures posed by the legendary Paul Erdős. By breaking these problems down into smaller, verifiable steps, AI models are demonstrating that they can make novel contributions to the field with minimal human guidance, provided their output is anchored by a formal verifier.[4][6]
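
The article does not show how those proofs were structured, but the general pattern of splitting an argument into small, individually checked steps can be sketched in Lean 4 as follows. The theorem name `two_step_bound` and the statement are hypothetical and deliberately trivial; only core-library lemmas are used.

```lean
theorem two_step_bound (a b c : Nat) (h₁ : a ≤ b) (h₂ : b ≤ c) :
    a ≤ c + 1 := by
  -- Step 1: chain the two hypotheses into a single bound.
  have hac : a ≤ c := Nat.le_trans h₁ h₂
  -- Step 2: record that adding one can only increase a number.
  have hc : c ≤ c + 1 := Nat.le_add_right c 1
  -- Final step: combine the two intermediate facts.
  exact Nat.le_trans hac hc
```

Each `have` line is verified on its own, so a wrong intermediate claim is flagged at the exact step where it occurs instead of silently invalidating the conclusion.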

Yet, for all their tactical brilliance, AI systems still suffer from what researchers call the "definition-finding gap." While an AI can brilliantly execute a proof once a problem is clearly defined, it struggles immensely with the conceptual framing required to invent new mathematics. Human mathematicians rely on taste, intuition, and a deep sense of aesthetic beauty to decide which problems are actually worth solving. AI cannot yet look at a chaotic mathematical landscape and identify the hidden structures that warrant a new definition or a groundbreaking new theory.[1][2]

Consequently, the culture of mathematics is undergoing a profound shift. The romanticized stereotype of the lone genius working in isolation with a chalkboard is giving way to massive, digitally connected collaborations. Proof assistants and AI tools are democratizing the field, allowing researchers to build upon each other's verified work with absolute confidence. As the barrier to verifying complex proofs drops, mathematicians are free to tackle increasingly ambitious and sprawling conjectures that would have been impossible to verify by hand.[1][5]

Ultimately, the results of the First Proof benchmark and the rise of the Lean theorem prover point to an uplifting future for the discipline. Mathematics is not being "solved" or automated away by machines; rather, it is being upgraded. By offloading the burden of rigorous, line-by-line verification to AI and formal proof assistants, human mathematicians are being freed to focus on what they do best: exercising creativity, following their intuition, and dreaming up the next great questions to ask the universe.[1][2]

How we got here

Late 2023
A team led by Terence Tao formalizes the proof of the polynomial Freiman-Ruzsa conjecture in Lean in just three weeks.
2024
AI systems like DeepMind's AlphaProof reach silver-medal performance at the International Mathematical Olympiad.
February 2026
The first batch of the 'First Proof' benchmark is launched to test AI on genuine research-level mathematics.
June 2026
Results from the second batch of First Proof reveal that AI systems successfully solved 7 out of 10 unpublished problems.

Viewpoints in depth

Formalization Advocates

Researchers who believe AI and formal proof assistants will revolutionize mathematical collaboration.

This camp, which includes prominent figures like Fields Medalist Terence Tao and the broader Lean community, views formalization as the inevitable future of mathematics. They argue that as proofs become longer and more complex, human peer review is no longer sufficient to guarantee accuracy. By translating mathematics into code, they believe the field can eliminate errors, democratize access, and allow massive teams of researchers to collaborate on single problems with absolute confidence.

Traditional Mathematicians

Experts who emphasize the irreplaceable role of human intuition and conceptual framing.

Many working mathematicians acknowledge the utility of AI but push back against the narrative that the discipline is being 'solved' by machines. Researchers in this camp emphasize that solving a pre-defined problem is only a fraction of a mathematician's job. The true art of mathematics lies in 'taste'—the ability to look at a chaotic field, identify hidden structures, and formulate the right questions to ask. They view AI as a powerful calculator, but one that fundamentally lacks the aesthetic judgment required for true mathematical discovery.

AI Developers

Technology companies focused on scaling AI reasoning to achieve autonomous mathematical discovery.

For organizations like OpenAI, DeepMind, and ByteDance, mathematics is the ultimate testbed for artificial general intelligence. Because math provides an objective standard of truth, it is the perfect environment to train models in complex, multi-step logical reasoning. This camp is heavily invested in 'agentic' AI systems that can autonomously search libraries, write code, and correct their own errors, with the ultimate goal of building systems that can discover and prove entirely new theorems without human intervention.

What we don't know

Whether AI systems will ever be able to autonomously formulate novel, interesting mathematical conjectures without human prompting.
How the widespread adoption of formal verification tools will impact the funding and structure of university mathematics departments.
Whether scaling up current large language models will eventually overcome the 'definition-finding gap,' or whether an entirely new AI architecture is required.

Key terms

Lean Theorem Prover: A programming language and software tool that mechanically verifies the logical correctness of mathematical proofs.
Formalization: The process of translating traditional, human-written mathematical proofs into computer code that can be verified by software.
Lemma: A smaller, intermediate mathematical proposition used as a stepping stone to prove a larger, more significant theorem.
Mathlib: The central, open-source library of formalized mathematics for the Lean theorem prover, containing roughly 120,000 formalized lemmas as of early 2026.
Hallucination: A phenomenon where an AI model confidently generates false or logically flawed information, a major liability in rigorous mathematics.

Frequently asked

What is the First Proof benchmark?

It is a rigorous test designed by Harvard mathematicians that evaluates AI systems using unpublished, research-level math problems to ensure the AI is genuinely reasoning rather than relying on memorized training data.

What is the Lean theorem prover?

Lean is a programming language and interactive proof assistant that checks mathematical arguments step by step against foundational axioms, guaranteeing that a proof which compiles establishes the statement exactly as it was formalized.

Will AI replace human mathematicians?

Current consensus suggests AI will not replace humans, but rather act as a highly capable 'co-mathematician.' AI excels at tactical execution and verification, while humans remain essential for framing concepts and asking the right questions.

What is Mathlib?

Mathlib is a massive, community-built database of formalized mathematical theorems and lemmas written in Lean, serving as a machine-readable map of known mathematics for AI systems to reference.

Sources

[1] The Washington Post (Traditional Mathematicians): Math illuminates how traffic flows, how our cells build proteins. Is AI an existential threat to math or an impressive tool?
[2] Harvard University (Traditional Mathematicians): Have reports of AI replacing mathematicians been greatly exaggerated?
[3] The Economic Times (AI Developers): Mathematicians solved what AI couldn't: Inside the First Proof benchmark
[4] arXiv (Formalization Advocates): AI Co-Mathematician: Accelerating Mathematicians with Agentic AI
[5] International Mathematical Union (Formalization Advocates): Formalization of Fields Medallists' work and the future of AI in Mathematics
[6] OpenAI (AI Developers): Frontier AI capabilities in science and mathematics
[7] ByteDance (AI Developers): Seed Prover 1.5: Advancing Formal Mathematical Reasoning
[8] Nature (Traditional Mathematicians): Daily briefing: Iron-Age human bones were made into tools before interment
