New Benchmark Evaluates LLMs for Hardware Formal Verification

Maohua Nie, Jiang Zhu, Jingqun Zhang, Zhichen Zeng, Jiayi Wang, Sibo Zhang, Jialin Wang, C. -J. Richard Shi · June 15, 2026

Summary

Researchers introduce HierSVA, a suite comprising a pipeline, a dataset, and a benchmark for assessing how well large language models handle hierarchical hardware formal verification. The evaluation shows that current LLMs struggle with fault detection and formal core coverage, even though most of the assertions they generate can be proven.

HierSVA is a new suite for evaluating how well large language models (LLMs) generate SystemVerilog Assertions (SVA) for hierarchical register-transfer level (RTL) designs. It has three parts: a data synthesis pipeline (HierSVA-SP), a dataset (HierSVA-DS), and a benchmark (HierSVA-B).

HierSVA-SP combines RTL preprocessing with an LLM-driven formal verification flow to produce reference SVAs. Applying it to BaseJump STL yielded HierSVA-DS, a dataset of 342 modules with varying hierarchy depths, plus a subset of 28 module-bug pairs. HierSVA-B scores assertion quality on six metrics: syntax correctness, proof success, vacuity, specification faithfulness, mutation coverage, and formal core coverage.

An initial evaluation of twelve recent LLMs on HierSVA-B exposed several weaknesses. The models reached a 67.1% module-level compile rate and 82.1% non-vacuous proof success, yet their assertion sets detected only 70.2% of injected faults and covered just 36.2% of the formal core. Agentic mode improved provability and assertion strength, but with diminishing returns.
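To make the gap between "provable" and "useful" assertions concrete, here is a minimal Python sketch of how aggregate rates of this kind could be computed from per-module results. It is not HierSVA's scoring code: the `ModuleResult` record, its field names, the simple covered/total definition of each rate, and the toy numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ModuleResult:
    """Hypothetical per-module record; not HierSVA's actual schema."""
    name: str
    compiles: bool        # assertion set passed the syntax/compile check
    proven: int           # assertions proven non-vacuously
    total: int            # assertions generated
    faults_detected: int  # injected faults caught by at least one assertion
    faults_injected: int
    core_covered: int     # formal-core elements touched by the assertions
    core_total: int


def ratio(num: int, den: int) -> float:
    """Safe division that returns 0.0 for an empty denominator."""
    return num / den if den else 0.0


def summarize(results: List[ModuleResult]) -> Dict[str, float]:
    """Aggregate toy metrics; only compiled modules count toward the later rates."""
    compiled = [r for r in results if r.compiles]
    return {
        "compile_rate": ratio(len(compiled), len(results)),
        "proof_rate": ratio(sum(r.proven for r in compiled),
                            sum(r.total for r in compiled)),
        "fault_detection": ratio(sum(r.faults_detected for r in compiled),
                                 sum(r.faults_injected for r in compiled)),
        "core_coverage": ratio(sum(r.core_covered for r in compiled),
                               sum(r.core_total for r in compiled)),
    }


# Made-up data, chosen only to show the effect; not results from the paper.
results = [
    ModuleResult("fifo", True, 4, 5, 3, 4, 6, 20),
    ModuleResult("arbiter", True, 5, 5, 1, 4, 2, 10),
    ModuleResult("mux", False, 0, 3, 0, 2, 0, 5),
]
summary = summarize(results)
for metric, value in summary.items():
    print(f"{metric}: {value:.3f}")
```

In this toy data, 9 of 10 assertions are proven, yet only 4 of 8 injected faults are caught and 8 of 30 core elements are covered. That is the same pattern the benchmark reports: an assertion set can be almost entirely provable while still constraining little of the design.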

Why it matters

For hardware engineers and AI developers who want to use LLMs for automated verification, the results show where current models fall short, notably in fault detection and formal core coverage, and give a measurable target for building more reliable AI-assisted verification tools.

How to implement this in your domain

1. Review HierSVA benchmark results to understand current LLM limitations in hardware verification.
2. Integrate the HierSVA dataset into internal LLM training pipelines for specialized hardware design tasks.
3. Develop custom evaluation metrics based on HierSVA's six axes to assess LLM-generated SVA quality.
4. Explore agentic LLM modes for SVA generation, focusing on iterative refinement to overcome current plateaus.
5. Collaborate with research teams to contribute to improving LLM capabilities for formal verification.
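The iterative refinement mentioned in the steps above can be sketched as a generate-check-feedback loop that stops once the score plateaus. The Python below is an illustrative sketch, not the paper's agentic mode: `refine`, `make_stub_generator`, `stub_check`, and the `TARGET` property names are all hypothetical. In a real flow the generator would call an LLM and the checker would run a formal tool.

```python
from typing import Callable, List, Tuple

# Hypothetical properties a complete assertion set should cover.
TARGET = {"no_overflow", "no_underflow", "valid_stable"}


def make_stub_generator() -> Callable[[str], List[str]]:
    """Stand-in for an LLM: adds whichever property the feedback names."""
    known = ["no_overflow"]

    def generate(feedback: str) -> List[str]:
        if feedback and feedback not in known:
            known.append(feedback)
        return list(known)

    return generate


def stub_check(assertions: List[str]) -> Tuple[float, str]:
    """Stand-in for a formal tool: returns a score and one missing property."""
    missing = sorted(TARGET - set(assertions))
    score = 1 - len(missing) / len(TARGET)
    return score, (missing[0] if missing else "")


def refine(generate: Callable[[str], List[str]],
           check: Callable[[List[str]], Tuple[float, str]],
           max_rounds: int = 5) -> Tuple[List[str], List[float]]:
    """Run generate/check rounds and stop when the score no longer improves."""
    feedback = ""
    best_score = -1.0
    best: List[str] = []
    history: List[float] = []
    for _ in range(max_rounds):
        candidate = generate(feedback)
        score, feedback = check(candidate)
        history.append(score)
        if score <= best_score:
            break  # plateau: another round would add cost without benefit
        best_score, best = score, candidate
    return best, history


best, history = refine(make_stub_generator(), stub_check)
print(best)
print([round(s, 3) for s in history])
```

The early exit on a non-improving round is a design choice that reflects the diminishing returns reported for agentic mode: each extra round costs an LLM call and a formal run, so it helps to cap rounds and to stop as soon as the score stalls.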

Who benefits

Semiconductor, Electronics Manufacturing, Aerospace, Automotive

Key takeaways

HierSVA provides a new benchmark for evaluating LLMs in hierarchical hardware formal verification.
Current LLMs show promise in SVA generation but struggle with comprehensive fault detection and formal core coverage.
The benchmark assesses assertion quality across six critical metrics.
Agentic LLM modes offer some improvements but face performance plateaus.

Original post by Maohua Nie, Jiang Zhu, Jingqun Zhang, Zhichen Zeng, Jiayi Wang, Sibo Zhang, Jialin Wang, C. -J. Richard Shi

"arXiv:2606.13706v1 Announce Type: cross Abstract: We present HierSVA, an integrated suite that combines a pipeline, dataset, and benchmark for LLM-driven hierarchical hardware formal verification. HierSVA-SP pairs an RTL preprocessing toolchain with an LLM-in-the-loop formal veri…"
