LLMs Benchmarked on Floating-Point Error Classification.

Lisa Taldir (LI-PaRAD), Muhammad Ahmad Saeed (LI-PaRAD), David Defour (LI-PaRAD), Pablo de Oliveira Castro (LI-PaRAD), Eric Petit · July 1, 2026


Summary

This paper introduces InterFLOPBench, a benchmark of 90 C kernels with 1,130 test samples, to evaluate how well Large Language Models can statically detect and classify six categories of floating-point errors in source code. The latest LLMs achieved an F1-score above 0.88, though accuracy differed between explicit and subtle error types.

Floating-point errors are a common source of bugs in numerical software and can be hard to detect. This research investigates whether Large Language Models (LLMs) can statically identify and classify these errors in C code, a task that traditionally requires specialized analysis tools. To support the evaluation, the authors developed InterFLOPBench, a new benchmark of 90 C kernels and 1,130 test samples. It assesses LLMs across six categories of floating-point errors: cancellation, comparison, division by zero, overflow, underflow, and NaN (Not-a-Number).

The study compared 14 LLMs on this benchmark. The evaluation framework treated floating-point error detection as a multi-label classification problem, with the F1-score as the primary metric. The latest generation of LLMs, including Qwen 3 32b, Gemini 2.5 Flash, Phi 4 Reasoning, DeepSeek R1T2, and the gpt-oss models, achieved an overall F1-score greater than 0.88. Performance varied considerably across error categories, however: LLMs did better on explicit errors such as division by zero (average F1-score: 0.8479) than on subtler numerical phenomena such as underflow (0.6059) and cancellation (0.6164).
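To make the six categories concrete, here is a small set of C functions that can exhibit each one. These are illustrative sketches written for this summary, not kernels from InterFLOPBench; the function names are hypothetical, and the behavior described assumes IEEE 754 double precision with default rounding and no aggressive flags such as -ffast-math.

```c
#include <math.h>     /* isinf, isnan (used by callers to inspect results) */
#include <stdbool.h>

/* Cancellation: for tiny x, (1.0 + x) rounds to 1.0, so the subtraction
   returns 0.0 and every significant digit of x is lost. */
double cancel_small(double x)
{
    return (1.0 + x) - 1.0;
}

/* Comparison: exact equality on values that already carry rounding error. */
bool equals_exact(double a, double b, double expected)
{
    return a + b == expected;
}

/* Division by zero: b == 0 with a != 0 yields +/-infinity.
   NaN: a == 0 and b == 0 yields Not-a-Number. */
double ratio(double a, double b)
{
    return a / b;
}

/* Overflow: a large |x| pushes x * x past DBL_MAX, giving infinity.
   Underflow: a tiny |x| pushes x * x below the smallest subnormal, giving 0. */
double square(double x)
{
    return x * x;
}
```

Under those assumptions, cancel_small(1e-16) returns 0.0 rather than 1e-16, equals_exact(0.1, 0.2, 0.3) is false, ratio(1.0, 0.0) is infinite, ratio(0.0, 0.0) is NaN, square(1e200) is infinite, and square(1e-200) is 0.0. The division case is visible in the source as a bare division, while the cancellation and underflow cases depend on the magnitudes of the inputs, which may be one reason the subtler categories scored lower.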

Why it matters

For software engineers, quality assurance professionals, and AI developers, this research demonstrates LLMs' potential as powerful tools for static code analysis and bug detection, particularly for complex numerical errors, which can significantly improve software reliability and development efficiency.

How to implement this in your domain

1. Integrate LLMs into static code analysis pipelines to automatically detect and classify floating-point errors.
2. Fine-tune LLMs on domain-specific codebases and error patterns to improve their accuracy in identifying subtle numerical issues.
3. Develop custom prompts and few-shot examples to guide LLMs in recognizing specific floating-point error categories.
4. Use LLM-generated error classifications to prioritize and streamline code review processes for numerical stability.
5. Create internal benchmarks similar to InterFLOPBench to continuously evaluate and improve LLM performance for code quality tasks.
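An internal benchmark as in step 5 needs a scoring routine. The following is a minimal sketch of a multi-label F1 computation, assuming each sample's labels are packed into a bitmask with one bit per category and that scores are micro-averaged (true positives, false positives, and false negatives pooled over all labels). Both the bitmask encoding and the choice of micro-averaging are assumptions made for illustration; this summary does not state how the paper averages its F1-scores.

```c
#include <stddef.h>

/* One bit per category named in the summary; this encoding is hypothetical. */
enum fp_error {
    FP_CANCELLATION = 1u << 0,
    FP_COMPARISON   = 1u << 1,
    FP_DIV_BY_ZERO  = 1u << 2,
    FP_OVERFLOW     = 1u << 3,
    FP_UNDERFLOW    = 1u << 4,
    FP_NAN          = 1u << 5
};
#define FP_NUM_CATEGORIES 6

/* Micro-averaged F1 over n samples. truth[i] and pred[i] are bitmasks of
   the categories present in, and predicted for, sample i. */
double micro_f1(const unsigned *truth, const unsigned *pred, size_t n)
{
    unsigned long tp = 0, fp = 0, fn = 0;

    for (size_t i = 0; i < n; i++) {
        for (unsigned bit = 0; bit < FP_NUM_CATEGORIES; bit++) {
            unsigned mask = 1u << bit;
            int t = (truth[i] & mask) != 0;   /* label really present */
            int p = (pred[i] & mask) != 0;    /* label predicted      */
            tp += (unsigned long)(t && p);
            fp += (unsigned long)(!t && p);
            fn += (unsigned long)(t && !p);
        }
    }

    /* F1 = 2TP / (2TP + FP + FN). With no labels and no predictions the
       ratio would be 0/0, so that case is defined here as a perfect 1.0. */
    unsigned long denom = 2 * tp + fp + fn;
    return denom == 0 ? 1.0 : (2.0 * (double)tp) / (double)denom;
}
```

For example, with ground truth {FP_DIV_BY_ZERO | FP_NAN, FP_CANCELLATION, 0} and predictions {FP_DIV_BY_ZERO, FP_CANCELLATION | FP_UNDERFLOW, 0}, there are two true positives, one false positive, and one false negative, so the score is 4/6. Per-category scores like those quoted above could be obtained by running the same counts for one bit at a time.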

Who benefits

Software Development, Financial Services, Scientific Computing, Aerospace, Automotive

Key takeaways

LLMs can effectively detect and classify floating-point errors in C code.
InterFLOPBench is a new benchmark for evaluating LLM performance on numerical errors.
Latest LLMs achieve high F1-scores, especially for explicit error types.
Performance varies, with subtle errors like underflow and cancellation being more challenging.

Original post by Lisa Taldir (LI-PaRAD), Muhammad Ahmad Saeed (LI-PaRAD), David Defour (LI-PaRAD), Pablo de Oliveira Castro (LI-PaRAD), Eric Petit

"arXiv:2606.31308v1 Announce Type: new Abstract: This paper investigates the capability of Large Language Models (LLMs) to detect and classify floating-point errors statically in software code. We introduce InterFLOPBench, a benchmark of 90 C kernels with 1 130 test samples design…"

