Expert Review Reveals LLM Autoformalization Weaknesses Beyond Proof Gaps

Vasily Ilin, Brian Nugent · June 15, 2026

Summary

A case study on the semi-autonomous formalization of Grothendieck's vanishing theorem highlights that while LLMs can close proof gaps, they struggle with higher-level aspects like definition choice, theorem generality, and API design. Expert review is crucial for evaluating the reusability and quality of autoformalized mathematical libraries.

This research examines the difference between closing proof gaps in an interactive theorem prover and producing a reusable library contribution through semi-autonomous formalization. The case study is a formalization of Grothendieck's vanishing theorem. The initial LLM-generated formalization compiled without any "sorries" (placeholders that stand in for unproven statements), which suggested a complete proof. An expert review nevertheless found significant problems that concerned the quality and design of the formalization rather than the correctness of its proofs: suboptimal definitions, insufficient theorem generality, poor file organization, and an inadequate API. After a review-driven refactoring and compression pass, a second expert review was conducted. Comparing the initial and refined versions showed that AI agents respond well to local, mechanically verifiable feedback but remain weak at high-level design choices, such as selecting appropriate definitions and designing effective APIs for formal mathematical libraries. The study concludes that autoformalization should be judged not only by the absence of proof gaps but also by whether it withstands expert review for reusability and quality.
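
To make the distinction concrete, here is a toy sketch, not taken from the paper. The summary does not name the proof assistant, so the sketch assumes Lean 4; the theorem names are hypothetical. The first statement has a proof gap, the second is gap-free but too specific to reuse, and the third is the general form a library user would want.

```lean
-- A proof gap: the file still compiles, but Lean warns that `sorry` was used.
theorem add_zero_gap (n : Nat) : n + 0 = n := by
  sorry

-- No gap, but insufficiently general: the statement only covers the number 2.
theorem two_add_zero : 2 + 0 = 2 := rfl

-- No gap and stated for every natural number, so other proofs can reuse it.
theorem nat_add_zero (n : Nat) : n + 0 = n := rfl
```

A check for remaining `sorry` uses would flag only the first theorem. Telling the second from the third is the kind of judgment the study attributes to expert review.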

Why it matters

This study provides critical insights into the limitations of current AI in complex formal reasoning, particularly in generating high-quality, reusable mathematical formalizations. Professionals developing AI for scientific or engineering applications must recognize that "correctness" (e.g., closing proof gaps) does not equate to "utility" or "design quality," necessitating human expert oversight for robust systems.

How to implement this in your domain

  1. Incorporate expert human review as a mandatory step for AI-generated formalizations or complex code.
  2. Focus AI development efforts on improving high-level design capabilities, such as definition selection and API design, rather than just local correctness.
  3. Develop metrics beyond mere proof completion to evaluate the quality and reusability of AI-generated formal content.
  4. Design interactive AI systems that facilitate human-AI collaboration for architectural and design decisions in formalization tasks.
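
As one possible starting point for step 3, the sketch below collects a few crude signals from a Lean source file before it goes to a human reviewer. Everything in it is an assumption made for illustration: the paper does not propose these metrics, and the names `ReviewSignals`, `collect_signals`, `review_flags`, and the 50% docstring threshold are hypothetical. The regexes are deliberately simple and ignore nested comments and string literals.

```python
import re
from dataclasses import dataclass

# Hypothetical proxies for triage only; they cannot judge definition choice or API design.
BLOCK_COMMENT_RE = re.compile(r"/-.*?-/", re.DOTALL)   # non-nested block comments and docstrings
LINE_COMMENT_RE = re.compile(r"--.*")                  # line comments
SORRY_RE = re.compile(r"\bsorry\b")
_DECL = r"(?:(?:private|protected|noncomputable)\s+)*(?:theorem|lemma|def|instance|structure|class)"
DECL_RE = re.compile(r"^\s*" + _DECL + r"\s+\S+", re.MULTILINE)
DOCUMENTED_RE = re.compile(r"-/\s*" + _DECL + r"\s")   # comment block directly before a declaration


@dataclass
class ReviewSignals:
    sorry_count: int    # proof gaps left in the code
    declarations: int   # top-level declarations found
    documented: int     # declarations preceded by a comment block


def collect_signals(source: str) -> ReviewSignals:
    # Strip comments first so a `sorry` mentioned in a comment is not counted.
    code = LINE_COMMENT_RE.sub("", BLOCK_COMMENT_RE.sub("", source))
    return ReviewSignals(
        sorry_count=len(SORRY_RE.findall(code)),
        declarations=len(DECL_RE.findall(code)),
        documented=len(DOCUMENTED_RE.findall(source)),
    )


def review_flags(signals: ReviewSignals) -> list[str]:
    # An empty list does not mean the file is good, only that these checks found nothing.
    flags = []
    if signals.sorry_count > 0:
        flags.append("proof gaps remain")
    if signals.declarations and signals.documented / signals.declarations < 0.5:
        flags.append("most declarations lack docstrings")
    return flags


SAMPLE = """\
/-- Addition of zero on the right. -/
theorem nat_add_zero (n : Nat) : n + 0 = n := rfl

-- TODO: remove this sorry
theorem gap (n : Nat) : 0 + n = n := by
  sorry

def helper (n : Nat) : Nat := n + 1
"""

if __name__ == "__main__":
    signals = collect_signals(SAMPLE)
    print(signals)
    print(review_flags(signals))
```

Signals like these are mechanically verifiable, which is the kind of feedback the study found agents handle well. They are meant to prioritize expert attention, not to replace it.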

Who benefits

AI Research, Software Engineering, Academia, Scientific Computing, Formal Verification

Key takeaways

  • LLMs can close proof gaps but struggle with high-level design in formalization.
  • Expert review is crucial for assessing the reusability and quality of autoformalized content.
  • AI agents adapt well to local feedback but are weak at choosing definitions and designing APIs.
  • Evaluation of autoformalization should go beyond just proof completeness.

Original post by Vasily Ilin, Brian Nugent

"arXiv:2606.13925v1 Announce Type: new Abstract: Large language models can often close proof gaps in interactive theorem provers, but a verified theorem is not the same thing as a reusable library contribution. We study this distinction through a detailed case study: a semi-autono…"
