AI Capabilities · Evidence Review · Jun 13, 2026, 2:10 AM · 5 min read

AI Models Hit a Wall on 'First Proof,' a Rigorous New Benchmark for Research Mathematics

A coalition of top mathematicians tested leading AI models on unpublished, research-level math problems. The results reveal that while AI excels at standardized tests, it still struggles with autonomous mathematical discovery.

By Factlen Editorial Team

First Proof Organizers 40% · AI Developers 30% · Skeptical Academics 30%

First Proof Organizers: Advocate for strict, contamination-free testing to measure true autonomous reasoning.
AI Developers: Argue that AI's true value lies in human-machine collaboration rather than pure zero-shot autonomy.
Skeptical Academics: View the results as proof that LLMs lack fundamental reasoning and self-correction capabilities.

What's not represented

· Early-career mathematicians who might rely on AI tools to compete with larger research labs.
· Educators concerned about how AI's math capabilities impact university curricula.

Why this matters

As artificial intelligence rapidly integrates into education, research, and the workforce, understanding its true capabilities is critical. This benchmark cuts through industry hype, providing a clear, objective measure of where machine computation ends and human creativity begins.

Key points

A coalition of elite mathematicians created the 'First Proof' benchmark to test AI on unpublished research problems.
The test was designed to eliminate 'data contamination,' ensuring AI models couldn't simply regurgitate memorized solutions.
During the formal 'Second Batch' test, AI models were banned from receiving human steering or feedback.
Human mathematicians significantly outperformed the AI models, which frequently generated confident but logically flawed proofs.
The results clarify that while AI is an excellent computational assistant, genuine mathematical discovery remains a human domain.

Unpublished research lemmas tested: 10
Time limit for autonomous AI proofs: 24 hours
Leading mathematicians on the First Proof board: 11

Over the past year, artificial intelligence has conquered some of the world's most prestigious standardized tests, passing the bar exam and even earning gold-medal scores at the International Mathematical Olympiad. These highly publicized milestones led to bold industry claims that AI systems were on the verge of autonomous scientific discovery. But when stripped of human assistance and forced to solve novel, unpublished research problems, the world's most advanced models hit a wall. On June 10, 2026, a coalition of elite mathematicians released the results of the "First Proof" benchmark, revealing a wide gap between AI pattern-matching and genuine mathematical creativity.[1][4]

The First Proof initiative was designed by eleven leading mathematicians from institutions including Harvard, Stanford, Yale, and UC Berkeley to solve a pervasive problem in AI evaluation known as "data contamination." Because large language models are trained on vast, opaque swaths of the internet, it is nearly impossible to tell if an AI is actually reasoning through a complex problem or simply regurgitating a solution it memorized from an obscure 1980s journal. Standard benchmarks have become increasingly unreliable as models inadvertently ingest the test questions during their training phases.[3][4][6]
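One common heuristic for spotting data contamination is to measure how much of a test problem's wording already appears verbatim in a reference corpus. The sketch below illustrates that idea only; it is not the First Proof team's method (they avoided the issue by using unpublished problems), and the function names and the default window size are assumptions chosen for illustration.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    # If the text has fewer than n words, the range is empty and so is the set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(problem: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the problem's n-grams that also occur in `corpus_doc`.

    A value near 1.0 suggests the problem text (or something very close
    to it) is present in the document; 0.0 means no shared n-grams.
    """
    problem_grams = word_ngrams(problem, n)
    if not problem_grams:
        return 0.0
    shared = problem_grams & word_ngrams(corpus_doc, n)
    return len(shared) / len(problem_grams)
```

A check like this only catches near-verbatim leakage. A paraphrased or translated copy of a problem would slip through, which is one reason sourcing unpublished problems is the stronger guarantee.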

To neutralize this memorization loophole, the First Proof team sourced ten highly specific mathematical problems directly from the unpublished, ongoing research of human mathematicians. Because these problems had never appeared online, in any preprint server, or in any public talk, the AI models were forced to reason entirely from scratch. The problems spanned diverse and highly specialized fields, including symplectic geometry, spectral graph theory, stochastic analysis, and algebraic combinatorics.[4][5][7]

First Proof uses unpublished problems to ensure AI models are reasoning, not just reciting memorized data.

The benchmark specifically focused on proving "lemmas"—intermediate propositions that mathematicians use as stepping stones to build larger, more complex theorems. Lemmas represent the actual daily labor of a working research mathematician. They are not artificially constructed puzzles designed for automatic grading, nor do they have clean numerical answers like competition math; they require building rigorous, end-to-end logical arguments that can survive peer review.[4][5][7]

The initiative's initial "First Batch" experiment, conducted in February 2026, highlighted exactly why a rigorous benchmark was necessary. During that informal round, major tech companies including OpenAI and Google DeepMind submitted solutions that appeared highly impressive at first glance. However, subsequent analysis revealed that these successes relied heavily on human intervention, a collaborative process researchers refer to as "Centaur Math."[4][5][7]

In Centaur Math, human experts steer the AI system, correcting its logical gaps, suggesting alternative approaches when it gets stuck, and manually selecting the best output from dozens of generated attempts. While this proves that AI is a highly effective interactive tool, it completely obscures the model's ability to generate a novel mathematical idea autonomously. The First Proof organizers realized that to measure true machine intelligence, they had to sever the human lifeline.[4][7]

The "Second Batch" of the First Proof benchmark, conducted in late May and graded in early June 2026, imposed draconian restrictions on the AI systems to test pure autonomy. The formal benchmark banned human steering entirely. Models were given exactly one shot to generate a proof via an API, with a strict 24-hour time limit and zero human intervention allowed during the generation process.[3][4]

The Second Batch of First Proof banned human steering, forcing models to generate proofs in a single autonomous attempt.
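The single-attempt rule can be pictured as a small harness that calls a model exactly once, enforces a deadline, and never feeds anything back. This is a minimal sketch under stated assumptions: `run_single_attempt`, `Submission`, and `stub_model` are hypothetical names, and the organizers' actual tooling is not described in the sources.

```python
import concurrent.futures
import time
from dataclasses import dataclass
from typing import Callable, Optional

TIME_LIMIT_SECONDS = 24 * 60 * 60  # the 24-hour limit reported for the Second Batch


@dataclass
class Submission:
    problem_id: str
    proof_text: Optional[str]  # None when the deadline was missed
    timed_out: bool


def stub_model(statement: str, delay: float = 0.0) -> str:
    """Hypothetical stand-in for a real model API call."""
    time.sleep(delay)
    return f"Proof attempt for: {statement}"


def run_single_attempt(
    problem_id: str,
    statement: str,
    generate_proof: Callable[[str], str],
    time_limit: float = TIME_LIMIT_SECONDS,
) -> Submission:
    """Call the model once; no retries, no feedback, no picking among drafts."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(generate_proof, statement)
    try:
        proof = future.result(timeout=time_limit)
        return Submission(problem_id, proof, timed_out=False)
    except concurrent.futures.TimeoutError:
        # The late answer is discarded; the worker thread is not killed,
        # it is simply no longer waited on.
        return Submission(problem_id, None, timed_out=True)
    finally:
        executor.shutdown(wait=False)
```

For example, `run_single_attempt("p1", "Prove the lemma.", stub_model)` returns a `Submission` holding one proof attempt, and there is deliberately no code path that sends a referee's comments back to the model.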

The AI-generated proofs were anonymized, assigned animal code names like "Ocelot," "Badger," and "Marmot," and subjected to blind peer review. On June 4 and 5, human experts gathered at Harvard University's Center of Mathematical Sciences and Applications (CMSA) to grade the submissions. The referees applied the exact same rigorous standards used by academic mathematics journals, evaluating the proofs for logical soundness, clarity, and the presence of a core conceptual breakthrough.[3][4]
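The blinding step described above amounts to replacing each model's name with a randomly assigned code name and keeping the mapping sealed until grading is done. The following is an illustrative sketch only: the `anonymize` function and its signature are assumptions, and just the three code names quoted in the article are used.

```python
import random
from typing import Dict, List, Optional, Tuple

# Code names quoted in the article; a real run would need one per submission.
CODE_NAMES = ["Ocelot", "Badger", "Marmot"]


def anonymize(
    submissions: Dict[str, str],
    code_names: List[str],
    seed: Optional[int] = None,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Blind a set of proofs for review.

    `submissions` maps a model name to its proof text. Returns
    (blinded, key): `blinded` maps code name -> proof text and is what
    referees see; `key` maps code name -> model name and stays sealed
    until grading is finished.
    """
    if len(code_names) < len(submissions):
        raise ValueError("need at least one code name per submission")
    rng = random.Random(seed)
    chosen = rng.sample(code_names, k=len(submissions))
    models = list(submissions)
    rng.shuffle(models)  # avoid leaking identity through submission order
    key = dict(zip(chosen, models))
    blinded = {code: submissions[model] for code, model in key.items()}
    return blinded, key
```

Referees would receive only `blinded`; joining their grades back to model names uses `key` after the review is complete.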

The results were sobering for AI advocates. According to Nature, human mathematicians significantly outperformed the AI models across the board. Scientific American characterized the collective AI performance as a "C-," noting that the systems struggled profoundly when forced to formulate the creative leaps required for research-level math without a human guide.[1][2]

Without human steering, the models frequently produced what reviewers called "confident but flawed" proofs. The AI systems proved highly adept at generating pages of plausible-sounding mathematical notation, perfectly mimicking the tone and structure of an academic paper. However, beneath the polished formatting, the proofs frequently contained fatal logical errors, circular reasoning, or fundamental misunderstandings of the core geometric or algebraic concepts.[1][2][6]

The benchmark highlighted a specific vulnerability in current AI architectures: the inability to recognize their own boundaries and simply say, "I don't know." In mathematics, where a single incorrect assumption or hallucinated logical bridge invalidates an entire proof, this tendency to confidently guess is a critical liability. The models lacked the self-correction mechanisms that human mathematicians use to realize when a line of inquiry has hit a dead end.[2][6]

AI systems that ace standardized tests struggle significantly when faced with novel, unpublished research problems.

Despite the low overall scores, the First Proof results were not entirely negative. The models did succeed at executing routine algebraic manipulations, verifying known steps, and translating mathematical concepts between different formats. This suggests that their immediate utility lies in acting as high-speed calculators and verification engines, freeing human mathematicians from tedious symbolic manipulation so they can focus on higher-level theory.[1][7]

The organizers drew a philosophical distinction between computation and conceptual discovery, quoting conceptual artist Sol LeWitt: "the idea becomes a machine that makes the art." The AI models proved excellent at operating the machine—executing the mechanical steps of a proof once a path was chosen—but they consistently failed to generate the initial "idea" or conceptual leap required to solve a novel problem.[4]

Rather than dismissing AI, the First Proof organizers view these results as a vital and even encouraging recalibration for the scientific community. By establishing an objective, contamination-proof baseline, mathematicians now have a clear, transparent picture of where AI excels and where human intuition remains irreplaceable. The organizers' conclusion: while AI will likely accelerate the pace of mathematical research, the spark of genuine discovery remains, for now, a distinctly human domain.[1][3][4][5]

How we got here

July 2025
AI models achieve gold-medal performance on the International Mathematical Olympiad, sparking debate about their reasoning skills.
February 2026
The First Proof team releases its initial batch of 10 problems for an informal, collaborative experiment with AI labs.
March 2026
First Proof announces strict new rules for its 'Second Batch' to eliminate human steering and test pure autonomous reasoning.
June 4-5, 2026
Human mathematicians gather at Harvard University to conduct blind peer reviews of the AI-generated proofs.
June 10, 2026
The Second Batch results are published, revealing that human experts still significantly outperform autonomous AI models.

Viewpoints in depth

First Proof Organizers

Emphasize the need for rigorous, contamination-free testing to measure true autonomous reasoning.

This camp argues that standard benchmarks are fundamentally broken because large language models inadvertently memorize the internet. By using unpublished lemmas and banning "Centaur Math" (human steering), they believe they have created the first honest assessment of AI's mathematical capabilities. They view the "C-" results not as a failure, but as a necessary reality check that highlights the difference between pattern-matching and genuine conceptual discovery.

AI Developers & Optimists

Argue that AI's true value lies in human-machine collaboration rather than pure zero-shot autonomy.

While acknowledging the First Proof results, this camp points out that real-world mathematics is rarely done in a vacuum. They argue that restricting models to "one-shot" API calls without feedback artificially kneecaps the technology. In their view, the fact that AI can solve complex lemmas when iteratively steered by a human expert proves that the models are already highly valuable research assistants, even if they aren't yet autonomous mathematicians.

Skeptical Academics

Point to the benchmark as proof that LLMs lack fundamental reasoning and self-correction capabilities.

This viewpoint focuses on the AI models' tendency to produce "confident but flawed" proofs. Skeptics argue that because LLMs are fundamentally text-prediction engines, they excel at mimicking the syntax of a mathematical proof but fail to grasp the underlying logic. They highlight the models' inability to say "I don't know" as a fatal flaw in rigorous disciplines, warning that relying on AI for autonomous research could flood the literature with plausible-sounding errors.

What we don't know

Whether future AI architectures that combine large language models with symbolic logic engines will overcome these reasoning barriers.
How quickly the gap between human-steered 'Centaur Math' and pure autonomous AI reasoning will close.
The exact threshold at which an AI-generated proof becomes too complex for human referees to reliably verify.

Key terms

Lemma: An intermediate proposition or 'stepping stone' proven to be true and used as a building block to prove a larger mathematical theorem.
Data Contamination: A flaw in AI testing where the model has already been exposed to the test questions during its training phase, allowing it to answer from memory rather than reasoning.
Zero-Shot Inference: Strictly, a setup in which a model receives a prompt with no worked examples. In this article the term is used more loosely for a test where the model must produce its final answer in a single attempt, without any follow-up corrections or human guidance.
Centaur Math: A collaborative approach where a human mathematician steers an AI system, correcting its mistakes and guiding its logic to reach a solution.

Frequently asked

What is the First Proof benchmark?

It is a rigorous test designed by top mathematicians to evaluate whether AI can autonomously solve unpublished, research-level math problems.

Why does the test use unpublished problems?

To prevent 'data contamination.' If a problem is already on the internet, an AI might simply regurgitate a memorized solution rather than actually reasoning through it.

Did the AI models pass the test?

The models struggled significantly. When forced to work without human guidance, they frequently produced confident but logically flawed proofs.

Does this mean AI is useless for mathematics?

No. AI is still highly effective at executing routine calculations, verifying steps, and acting as an assistant when guided by a human mathematician.

Sources

[1] Nature (Skeptical Academics). "Humans outperform AI at this highly rigorous mathematics test."
[2] Scientific American (Skeptical Academics). "AI scores a 'C–' on its hardest math test yet."
[3] Harvard CMSA (First Proof Organizers). "First Proof, Second Batch."
[4] First Proof Initiative (First Proof Organizers). "First Proof Second Batch: Methodology and Results."
[5] The Sociable (AI Developers). "OpenAI submitted models to the hardest math test yet for AI."
[6] Nautilus (Skeptical Academics). "Looking for Signs of Intelligence in Chatbots."
[7] Tech Jacks Solutions (AI Developers). "AI Math Results: What Four Reasoning Breakthroughs in 30 Days Mean for Research Automation."
