LLMs Benchmarked on Floating-Point Error Classification

Lisa Taldir (LI-PaRAD), Muhammad Ahmad Saeed (LI-PaRAD), David Defour (LI-PaRAD), Pablo de Oliveira Castro (LI-PaRAD), Eric Petit · July 1, 2026

Summary

This paper introduces InterFLOPBench, a benchmark of 90 C kernels with 1,130 test samples, to evaluate how well Large Language Models statically detect and classify six categories of floating-point errors in source code. The latest LLMs achieved an overall F1-score above 0.88, though accuracy differed between explicit and subtle error types.

Floating-point errors are a common source of bugs in numerical software and are often hard to detect. This research investigates whether Large Language Models (LLMs) can statically identify and classify these errors in C code, a task that traditionally requires specialized analysis tools. To support the evaluation, the authors developed InterFLOPBench, a new benchmark comprising 90 C kernels and 1,130 test samples. It assesses LLMs across six categories of floating-point errors: cancellation, comparison, division by zero, overflow, underflow, and NaN (Not-a-Number).

The study compared 14 LLMs on this benchmark. The evaluation framework treats floating-point error detection as a multi-label classification problem, with the F1-score as the primary metric. The latest generation of LLMs, including Qwen 3 32b, Gemini 2.5 Flash, Phi 4 Reasoning, DeepSeek R1T2, and the gpt-oss models, achieved an overall F1-score greater than 0.88. However, performance varied significantly across error categories: LLMs did better on explicit errors such as division by zero (average F1-score: 0.8479) than on subtler numerical phenomena such as underflow (0.6059) and cancellation (0.6164).
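To make the six categories concrete, the C sketch below triggers one instance of each. It was written for this summary and is not a kernel from InterFLOPBench; the function names are hypothetical, and the comments assume IEEE 754 binary64 `double` arithmetic with default rounding and no trapping on exceptions.

```c
#include <float.h>
#include <math.h>

/* volatile forces each intermediate to be stored as a double, so the
   results do not depend on extended-precision registers. */

/* Cancellation: an x below half an ulp of 1.0 is absorbed by the addition,
   and the subtraction then cancels everything that is left. */
double absorb_then_cancel(double x) {
    volatile double sum = 1.0 + x;
    return sum - 1.0;
}

/* Comparison: exact equality on rounded results, e.g. 0.1 + 0.2 vs 0.3. */
int sum_equals(double a, double b, double expected) {
    volatile double sum = a + b;
    return sum == expected;
}

/* Division by zero: a nonzero numerator over 0.0 yields an infinity. */
double divide(double numerator, double denominator) {
    return numerator / denominator;
}

/* Overflow: the result exceeds DBL_MAX and rounds to +infinity. */
double overflow_demo(void) {
    volatile double big = DBL_MAX;
    return big * 2.0;
}

/* Underflow: the result is far below the smallest subnormal and rounds to 0.0. */
double underflow_demo(void) {
    volatile double tiny = DBL_MIN;
    return tiny * tiny;
}

/* NaN: infinity minus infinity is an invalid operation. */
double nan_demo(void) {
    volatile double inf = INFINITY;
    return inf - inf;
}
```

Under those assumptions, `absorb_then_cancel(1e-17)` returns 0.0 although the mathematical result is 1e-17, and `sum_equals(0.1, 0.2, 0.3)` returns 0. Neither raises any visible signal at run time, which is consistent with the summary's point that cancellation and underflow are harder to spot than a literal division by zero.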

Why it matters

For software engineers, quality assurance professionals, and AI developers, this research suggests that LLMs can serve as static analysis aids for numerical code. They already classify explicit floating-point errors reliably, which can improve software reliability and shorten review cycles, while subtler errors such as underflow and cancellation still call for closer scrutiny.

How to implement this in your domain

  1. Integrate LLMs into static code analysis pipelines to automatically detect and classify floating-point errors.
  2. Fine-tune LLMs on domain-specific codebases and error patterns to improve their accuracy in identifying subtle numerical issues.
  3. Develop custom prompts and few-shot examples to guide LLMs in recognizing specific floating-point error categories.
  4. Use LLM-generated error classifications to prioritize and streamline code review processes for numerical stability.
  5. Create internal benchmarks similar to InterFLOPBench to continuously evaluate and improve LLM performance for code quality tasks.
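An internal benchmark of the kind suggested in the last step needs a scoring routine for multi-label predictions. The sketch below is one possible way to compute it in C, storing each sample's label set as a bitmask. It assumes micro-averaged F1 (true positives, false positives, and false negatives pooled over all samples and categories); the summary does not say which averaging the paper uses, and the type and function names here are hypothetical.

```c
#include <stddef.h>

/* One bit per error category named in the benchmark description. */
enum fp_error {
    FP_CANCELLATION = 1 << 0,
    FP_COMPARISON   = 1 << 1,
    FP_DIV_BY_ZERO  = 1 << 2,
    FP_OVERFLOW     = 1 << 3,
    FP_UNDERFLOW    = 1 << 4,
    FP_NAN          = 1 << 5,
    FP_ALL          = (1 << 6) - 1
};

/* Ground-truth and predicted label sets for one test sample. */
typedef struct {
    unsigned truth;
    unsigned predicted;
} labeled_sample;

/* Count the set bits in a label mask. */
static unsigned count_bits(unsigned mask) {
    unsigned n = 0;
    while (mask) {
        n += mask & 1u;
        mask >>= 1;
    }
    return n;
}

/* Micro-averaged F1 = 2*TP / (2*TP + FP + FN), pooled over all samples.
   Convention chosen here: return 1.0 when there is nothing to find and
   nothing was predicted. */
double micro_f1(const labeled_sample *samples, size_t count) {
    unsigned long tp = 0, fp = 0, fn = 0;
    for (size_t i = 0; i < count; i++) {
        unsigned truth = samples[i].truth & FP_ALL;
        unsigned pred  = samples[i].predicted & FP_ALL;
        tp += count_bits(pred & truth);            /* predicted and present */
        fp += count_bits(pred & ~truth & FP_ALL);  /* predicted but absent  */
        fn += count_bits(~pred & truth & FP_ALL);  /* present but missed    */
    }
    if (tp + fp + fn == 0) {
        return 1.0;
    }
    return (2.0 * tp) / (2.0 * tp + fp + fn);
}
```

For example, if one sample is labeled cancellation and overflow but only cancellation is predicted, and another is labeled division by zero but division by zero and NaN are predicted, the pooled counts are TP = 2, FP = 1, FN = 1, giving an F1 of 4/6, about 0.667. Computing the same counts per category instead would produce per-category scores comparable in form to the per-category averages quoted in the summary.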

Who benefits

Software Development, Financial Services, Scientific Computing, Aerospace, Automotive

Key takeaways

  • LLMs can effectively detect and classify floating-point errors in C code.
  • InterFLOPBench is a new benchmark for evaluating LLM performance on numerical errors.
  • Latest LLMs achieve high F1-scores, especially for explicit error types.
  • Performance varies, with subtle errors like underflow and cancellation being more challenging.

Original post by Lisa Taldir (LI-PaRAD), Muhammad Ahmad Saeed (LI-PaRAD), David Defour (LI-PaRAD), Pablo de Oliveira Castro (LI-PaRAD), Eric Petit

"arXiv:2606.31308v1 Announce Type: new Abstract: This paper investigates the capability of Large Language Models (LLMs) to detect and classify floating-point errors statically in software code. We introduce InterFLOPBench, a benchmark of 90 C kernels with 1 130 test samples design…"

