Mask-Proof Pipeline Automates LLM Evaluation for Mathematical Proofs

Jierui Zhang, Siyuan Tan, Xinhang Li, Longzhuangzhi Lin, Dailin Li, Chengfeng Gu, Xinping Li, Yaxian Hao, Shengjia Liang, Yuxiang Ren, Wenhao Liu · June 16, 2026

Summary

Mask-Proof is an LLM-based pipeline that converts real mathematical proofs into automatically checkable masked-step tasks to measure step-level reasoning. It evaluates model reconstructions using an LLM-based equivalence judge and introduces Mask-ProofBench, a benchmark of 292 curated problems.

Researchers have developed Mask-Proof, a pipeline that uses large language models (LLMs) to automate both the curation and the evaluation of mathematical proofs. Its goal is a scalable, reproducible way to measure step-level reasoning in long and diverse proofs, a gap that currently limits how far AI assistance in scientific work can be trusted.

The pipeline turns real mathematical proofs into masked-step tasks that can be checked automatically. It masks key formula steps within a proof, gives the model the surrounding context, and asks it to reconstruct the missing steps. A separate LLM-based equivalence judge then decides whether each reconstruction matches the original step, using repeated votes to keep its verdicts stable.

The pipeline is accompanied by Mask-ProofBench, a benchmark of 292 curated problems drawn from various research areas. In experiments with 17 models, reasoning-enhanced models outperformed standard models by 12% to 27%. The LLM-based evaluator agreed with expert annotators 96.8% of the time, which supports its use for faithful, reproducible, and comparable measurement of step-level mathematical reasoning.
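The masking and repeated-vote steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names MASK_TOKEN, MaskedTask, make_masked_task, majority_vote and toy_judge are hypothetical, and the toy judge only compares strings after removing whitespace, standing in for the LLM-based equivalence judge that the pipeline actually uses.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List

MASK_TOKEN = "[MASK]"  # hypothetical placeholder, not taken from the paper


@dataclass
class MaskedTask:
    context: List[str]  # proof steps with one step hidden
    answer: str         # the reference step that was hidden
    index: int          # position of the hidden step


def make_masked_task(steps: List[str], index: int) -> MaskedTask:
    """Hide one step of a proof and keep the rest as context."""
    if not 0 <= index < len(steps):
        raise IndexError("mask index out of range")
    context = list(steps)          # copy so the input proof is not mutated
    context[index] = MASK_TOKEN
    return MaskedTask(context=context, answer=steps[index], index=index)


# A judge takes (reference, candidate) and returns True if it deems them equivalent.
Judge = Callable[[str, str], bool]


def majority_vote(judge: Judge, reference: str, candidate: str, votes: int = 5) -> bool:
    """Call the judge several times and accept only on a strict majority of True."""
    tally = Counter(judge(reference, candidate) for _ in range(votes))
    return tally[True] > tally[False]


def toy_judge(reference: str, candidate: str) -> bool:
    """Assumption: stand-in for an LLM judge; ignores whitespace only."""
    return "".join(reference.split()) == "".join(candidate.split())


steps = ["x^2 - 1 = 0", "(x - 1)(x + 1) = 0", "x = 1 or x = -1"]
task = make_masked_task(steps, index=1)
accepted = majority_vote(toy_judge, task.answer, "(x-1)(x+1)=0", votes=3)
```

With a deterministic judge the repeated votes are redundant; they matter only when the judge is stochastic, as an LLM typically is. An odd number of votes avoids ties, and in this sketch a tie counts as a rejection.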

Why it matters

This pipeline provides a robust and scalable method for evaluating the step-level reasoning capabilities of LLMs in mathematics, which is crucial for developing more reliable AI assistants for scientific research and education. Professionals can use this to benchmark and improve AI tools for formal verification and problem-solving.

How to implement this in your domain

1. Use the Mask-Proof pipeline to benchmark the mathematical reasoning capabilities of various LLMs for specific applications.
2. Integrate the LLM-based equivalence judge into automated proof verification systems to enhance accuracy and scalability.
3. Apply masked-step tasks for fine-tuning LLMs on domain-specific mathematical proofs to improve their reasoning.
4. Develop educational tools that leverage this methodology to provide step-by-step feedback on mathematical problem-solving.
5. Collaborate with AI researchers to extend Mask-ProofBench to new areas of mathematics or scientific reasoning.
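For the benchmarking and judge-integration steps above, two numbers are usually worth tracking: each model's accuracy on the masked steps, and how often the automated judge agrees with human annotators. The Python sketch below shows one way to compute both. The function names and the sample labels are illustrative assumptions; none of the values come from the paper.

```python
from typing import Dict, List


def accuracy(verdicts: List[bool]) -> float:
    """Fraction of masked steps the judge accepted for one model."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts) / len(verdicts)


def agreement_rate(judge_labels: List[bool], human_labels: List[bool]) -> float:
    """Fraction of items on which the judge and the human annotator agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("no labels to compare")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Made-up verdicts for two hypothetical models on four masked steps.
verdicts_by_model: Dict[str, List[bool]] = {
    "model_a": [True, False, True, True],
    "model_b": [True, False, False, False],
}
scores = {name: accuracy(v) for name, v in verdicts_by_model.items()}

# Made-up judge and human labels for the same four items.
agreement = agreement_rate([True, False, True, True], [True, False, False, True])
```

Checking agreement on a small human-labeled sample before trusting the judge at scale mirrors the validation the article reports for the LLM-based evaluator.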

Who benefits

Academia, EdTech, Research & Development, Software Engineering, AI Development

Key takeaways

Mask-Proof automates the evaluation of LLM step-level reasoning in mathematical proofs.
The pipeline converts real proofs into automatically checkable masked-step tasks.
Reasoning-enhanced LLMs significantly outperform standard models on mathematical tasks.
The LLM-based evaluator achieves high agreement with human experts, ensuring reliability.

Original post by Jierui Zhang, Siyuan Tan, Xinhang Li, Longzhuangzhi Lin, Dailin Li, Chengfeng Gu, Xinping Li, Yaxian Hao, Shengjia Liang, Yuxiang Ren, Wenhao Liu

"arXiv:2606.15258v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs, yet we still lack a scalable and reproducible way to measure step-level reasoning in long proofs a…"
