New MA-ProofBench Benchmark Evaluates LLM Theorem Proving in Advanced Math

Lushi Pu, Weiming Zhang, Xinheng Xie, Zixuan Fu, Bingxiang He, Hongya Lyu, Xin Li, Jie Zhou, Yudong Wang · June 15, 2026

Summary

A new benchmark, MA-ProofBench, has been introduced to evaluate how well Large Language Models (LLMs) perform formal theorem proving in mathematical analysis. It features 200 formalized theorems across two difficulty levels and shows that even top LLMs such as GPT-5.5 perform poorly, exposing significant gaps in formal reasoning.

Researchers have developed MA-ProofBench, the first formal theorem-proving benchmark specifically designed for mathematical analysis. It addresses a gap in existing evaluation tools, which tend to concentrate on areas that are easier to formalize, such as algebra and number theory.

MA-ProofBench comprises 200 formalized theorems spanning six core topics and 27 subcategories, including complex analysis and measure theory. The problems are split into two difficulty levels of 100 problems each: an undergraduate level (Level I) and a Ph.D. qualifying level (Level II). The theorems were formalized by humans with LLM assistance and then independently reviewed by experts to ensure mathematical fidelity.

Evaluations of general-purpose reasoning models and formal theorem provers on MA-ProofBench showed limited success. Even the highest-performing model, GPT-5.5, achieved only 16% Pass@8 on Level I and 5% on Level II, and most other models scored near 0% on Level II. The analysis identified "Mathlib hallucinations" and incomplete proofs as the primary failure modes, underscoring a significant gap between informal and formal mathematical reasoning in current LLMs.
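To make the Pass@8 figures concrete, here is a minimal Python sketch of how a pass@k score is commonly computed. It uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of sampled proofs per problem and c is the number that verified. The summary does not spell out the paper's exact sampling protocol, so the function names and the per-problem counts below are illustrative assumptions, not the benchmark's data or code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled proofs, c of which verified."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains
    # at least one verified proof. math.comb returns 0 when n - c < k.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(verified_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem pass@k over all problems in a benchmark level."""
    return sum(pass_at_k(n, c, k) for c in verified_counts) / len(verified_counts)


# Hypothetical counts (NOT the paper's data): 100 problems, 8 attempts each,
# and 16 problems where exactly one attempt verified.
counts = [1] * 16 + [0] * 84
score = benchmark_pass_at_k(counts, n=8, k=8)
print(f"{score:.0%}")  # prints 16%
```

When n equals k, as in this example, the estimator reduces to "solved if any of the 8 attempts verified". Under that reading, and only if each problem gets exactly 8 attempts, 16% on a 100-problem level corresponds to 16 solved problems and 5% to 5.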

Why it matters

This benchmark provides a crucial tool for assessing and advancing the formal mathematical reasoning capabilities of AI, which is essential for developing more reliable and trustworthy AI systems in scientific and engineering domains. Professionals can use these findings to understand current LLM limitations in complex logical tasks.

How to implement this in your domain

1. Review MA-ProofBench to understand the current limitations of LLMs in formal mathematical reasoning.
2. Integrate formal verification techniques into AI development workflows for high-stakes applications.
3. Investigate methods to reduce "Mathlib hallucinations" and improve proof completeness in AI-generated formalizations.
4. Collaborate with mathematicians to develop more robust training data and evaluation metrics for advanced AI reasoning.
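As a starting point for tracking the two failure modes named above, the following Python sketch triages a single proof attempt from its source text and the proof checker's output. It is a heuristic under stated assumptions, not the paper's analysis method: it assumes Lean 4 style proofs, reads "Mathlib hallucination" as a reference to a name that does not resolve, and matches the error phrases "unknown identifier", "unknown constant", and "unsolved goals", which may differ across toolchain versions. The function name, labels, and the lemma name in the example are hypothetical.

```python
import re

# Assumed error phrases for names that do not resolve; adjust for your toolchain.
UNKNOWN_NAME_PHRASES = ("unknown identifier", "unknown constant")
# Assumed error phrase for a proof that ends with goals still open.
UNSOLVED_PHRASE = "unsolved goals"
# Placeholders that leave a goal unproved. This simple scan also matches
# occurrences inside comments, so treat the result as a first-pass signal.
PLACEHOLDER = re.compile(r"\b(sorry|admit)\b")


def classify_attempt(proof_source: str, checker_output: str, compiled_ok: bool) -> str:
    """Assign a coarse label to one proof attempt."""
    if PLACEHOLDER.search(proof_source):
        return "incomplete_proof"
    if compiled_ok:
        return "verified"
    lowered = checker_output.lower()
    if any(phrase in lowered for phrase in UNKNOWN_NAME_PHRASES):
        return "possible_hallucination"
    if UNSOLVED_PHRASE in lowered:
        return "incomplete_proof"
    return "other_failure"


# Hypothetical attempt that cites a lemma name made up for this demo.
label = classify_attempt(
    proof_source="exact Real.fake_lemma_for_demo h",
    checker_output="error: unknown identifier 'Real.fake_lemma_for_demo'",
    compiled_ok=False,
)
print(label)  # prints possible_hallucination
```

Counting these labels across many attempts gives a rough picture of where a model fails, which can guide whether to invest in retrieval of real library names or in pushing proofs to completion.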

Who benefits

AI Research, Software Engineering, Academia, Scientific Computing

Key takeaways

MA-ProofBench is the first formal benchmark for LLM theorem proving in mathematical analysis.
Current LLMs perform poorly on advanced formal reasoning tasks: the best model, GPT-5.5, reaches only 16% Pass@8 on Level I and 5% on Level II.
Key failure modes include "Mathlib hallucinations" and incomplete proofs.
The benchmark highlights a significant gap between informal and formal reasoning in LLMs.

Original post by Lushi Pu, Weiming Zhang, Xinheng Xie, Zixuan Fu, Bingxiang He, Hongya Lyu, Xin Li, Jie Zhou, Yudong Wang

"arXiv:2606.13782v1 Announce Type: new Abstract: Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to form…"
