Auditing Reveals Flaws in AI Theorem Proving Benchmarks

Pawan Sasanka Ammanamanchi, Siddharth Bhat, Stella Biderman · June 30, 2026

Summary

An audit of five widely used Lean theorem-proving benchmarks uncovered 4,833 findings, including 398 mechanically certified issues such as counterexamples and unsound axioms. The study highlights that a machine-checked proof verifies only the formal statement; it does not guarantee that the statement accurately reflects the intended informal problem or that the evaluation method is robust.

Benchmarks used to evaluate Large Language Model (LLM)-assisted theorem proving in Lean are often considered intrinsically reliable because every solved problem comes with a machine-checked proof. However, this research points to a critical gap: the Lean kernel only checks that a proof establishes a formal statement, not whether that statement faithfully represents the original informal problem or whether the evaluation method is robust to trivial or adversarial solutions. A comprehensive audit of five prominent Lean theorem-proving benchmarks and their derivatives identified nearly 5,000 issues. Among these, 398 were mechanically certified defects, including counterexamples, vacuous theorems, and unsound axioms. The audit also documented semantic flaws such as missing hypotheses, oversimplified problems, incorrect translations, and Lean-specific specification hazards. Beyond dataset construction, the study examined evaluation-time failures and showed that these defects can both inflate and deflate reported prover scores. To address these issues, the researchers propose a fault taxonomy, a suite of automated checkers, and recall-oriented semantic audit prompts, along with new standards for creating more trustworthy and reproducible formal math datasets and evaluations.
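To make some of these defect categories concrete, here is a minimal Lean 4 sketch. The statements are invented for illustration and are not taken from the audited benchmarks, and the name `vacuous_example` is hypothetical. It assumes a recent Lean 4 toolchain in which the `omega` tactic is available without extra imports.

```lean
-- Hypothetical benchmark-style statements, written for illustration only.

-- Vacuous theorem: the two hypotheses contradict each other,
-- so the kernel accepts a proof of any conclusion, however absurd.
theorem vacuous_example (n : Nat) (h₁ : n < 3) (h₂ : n > 5) : n = 1000 := by
  omega

-- Lean-specific hazard: subtraction on `Nat` truncates at zero,
-- so `2 - 5` evaluates to `0` rather than a negative number.
example : (2 : Nat) - 5 = 0 := rfl

-- Missing hypothesis: "a - b + b = a" is false over `Nat` unless `b ≤ a`.
-- Taking a = 2 and b = 5 gives a counterexample.
example : ¬ ((2 : Nat) - 5 + 5 = 2) := by decide
```

All three declarations are accepted by the kernel, which is the point: the proofs are formally valid, yet the first statement says nothing useful and the last two show how a faithful-looking translation of an informal identity can fail to mean what was intended.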

Why it matters

For professionals developing or relying on AI for formal verification and theorem proving, understanding these benchmark limitations is crucial to evaluating models accurately and to ensuring that AI-generated proofs are reliable.

How to implement this in your domain

  1. Adopt the proposed fault taxonomy and automated checkers when creating or selecting benchmarks for theorem proving.
  2. Implement rigorous semantic audit prompts to ensure formal statements accurately reflect intended informal problems.
  3. Review existing internal benchmarks for potential dataset defects and evaluation failures using the methods outlined.
  4. Prioritize the use of corrected dataset snapshots and adhere to new standards for formal math dataset creation.
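This summary does not describe how the proposed automated checkers work. As one assumed example of a mechanical check in that spirit, Lean's built-in `#print axioms` command reports which axioms a declaration depends on. The axiom `bad_axiom` and the theorem `anything` below are hypothetical and exist only to show what such a check would surface.

```lean
-- A hypothetical unsound axiom of the kind an audit should flag:
-- it claims every natural number is less than zero.
axiom bad_axiom : ∀ n : Nat, n < 0

-- With that axiom in scope, `False` is provable,
-- and therefore so is any other statement.
theorem anything : False :=
  Nat.not_lt_zero 0 (bad_axiom 0)

-- `#print axioms` lists the axioms a declaration depends on.
-- `bad_axiom` appears in the output for `anything`.
#print axioms anything
```

When reviewing a benchmark or a submitted proof this way, the standard axioms `propext`, `Classical.choice`, and `Quot.sound` are generally expected in the output; anything else, including `sorryAx` (which indicates an unfinished proof), deserves a closer look.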

Who benefits

Software Engineering, AI Research, Cybersecurity, Academia, Semiconductor Design

Key takeaways

  • Machine-checked proofs do not guarantee that formal statements accurately encode informal problems.
  • Widely used Lean theorem-proving benchmarks contain significant defects, including unsound axioms and vacuous theorems.
  • Dataset defects can lead to both inflated and deflated AI prover scores.
  • New tools and standards are needed for more reliable and reproducible formal math dataset creation and evaluation.

Original post by Pawan Sasanka Ammanamanchi, Siddharth Bhat, Stella Biderman

"arXiv:2606.29493v1 Announce Type: new Abstract: Benchmarks for LLM-assisted theorem proving in Lean are often treated as intrinsically reliable because every solved instance comes with a machine-checked proof. However, the kernel only checks that a proof establishes a \emph{forma…"
