CDR-Bench Reveals LLMs Fail Compositional, Order-Sensitive Data Refinement.

Yuchen Huang, Xiang Li, Zhenqing Ling, Sijia Li, Qianli Shen, Daoyuan Chen, Yi R. Fung, Yaliang Li · July 1, 2026

Summary

CDR-Bench is a new benchmark evaluating LLMs' ability to faithfully execute multi-step, order-sensitive data refinement recipes, revealing that current models struggle significantly with compositional tasks and procedural faithfulness. This highlights a critical gap in LLM capabilities for reliable data processing.

Data refinement tasks often involve executing multiple sequential operations on evolving text states, where both the combination and the order of these operations are crucial for the final outcome. While existing benchmarks have explored text editing or combined it with code execution, it has been unclear whether large language models (LLMs) can directly and reliably perform these complex, compositional, and order-sensitive data refinement recipes. To address this gap, researchers introduced CDR-Bench, a comprehensive benchmark comprising 3,462 high-quality tasks across four real-world data refinement domains and 29 distinct operators. The benchmark evaluates LLMs in atomic, order-agnostic, and order-sensitive settings, using deterministic reference outputs for precise evaluation.

Experiments conducted with over ten state-of-the-art LLMs revealed consistent failure patterns. Model performance sharply declined in compositional settings, and success rates for order-sensitive recipes collapsed. These findings underscore a significant deficiency in current LLMs regarding the procedural faithfulness required for dependable compositional data refinement, indicating a need for further research in this area.
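To make the order-sensitivity point concrete, here is a minimal Python sketch. The operators `remove_urls`, `truncate_words`, and `lowercase`, and the helper `run_recipe`, are hypothetical illustrations chosen for this example; they are not taken from CDR-Bench's 29 operators. The sketch only shows how the same two operators can yield different outputs depending on execution order, while another pair happens to commute on this input.

```python
import re
from functools import reduce


def remove_urls(text: str) -> str:
    """Hypothetical operator: drop URLs, then normalize spacing."""
    stripped = re.sub(r"https?://\S+", "", text, flags=re.IGNORECASE)
    return " ".join(stripped.split())


def truncate_words(text: str, limit: int = 3) -> str:
    """Hypothetical operator: keep only the first `limit` words."""
    return " ".join(text.split()[:limit])


def lowercase(text: str) -> str:
    """Hypothetical operator: lowercase the whole text."""
    return text.lower()


def run_recipe(text: str, operators) -> str:
    """Apply each operator to the evolving text state, in list order."""
    return reduce(lambda state, op: op(state), operators, text)


sample = "Visit https://example.com for more details today"

# Same operators, different order, different result (order-sensitive).
urls_then_truncate = run_recipe(sample, [remove_urls, truncate_words])
truncate_then_urls = run_recipe(sample, [truncate_words, remove_urls])

# These two operators commute on this input (order-agnostic).
lower_first = run_recipe(sample, [lowercase, remove_urls])
lower_last = run_recipe(sample, [remove_urls, lowercase])

print(urls_then_truncate)
print(truncate_then_urls)
print(lower_first == lower_last)
```

Removing the URL first leaves three content words to keep, whereas truncating first spends one of the three slots on a URL that is deleted afterwards. A model that silently reorders steps would therefore produce a different, incorrect output.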

Why it matters

Professionals relying on LLMs for automated data cleaning, transformation, or complex text processing workflows need to be aware of these limitations, as current models may not reliably execute multi-step, order-dependent refinement tasks.

How to implement this in your domain

1. Exercise caution when designing LLM-based workflows for multi-step data refinement, especially where order matters.
2. Implement rigorous validation and human-in-the-loop checks for LLM-generated data transformations.
3. Break down complex data refinement tasks into smaller, atomic, and less order-sensitive steps for LLMs.
4. Explore alternative or hybrid approaches that combine LLMs with deterministic scripting for critical, order-sensitive operations.
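One way to act on the validation and hybrid-scripting suggestions is to audit model output against a deterministic reference, in the spirit of the benchmark's exact-match evaluation. The following is a rough sketch under stated assumptions: `strip_edges`, `collapse_spaces`, `reference_output`, `audit`, and `stub_model` are all hypothetical names introduced here, and `stub_model` merely stands in for a real LLM call (it deliberately skips one step to simulate a procedural failure).

```python
from typing import Callable, List, Tuple

Operator = Callable[[str], str]


def strip_edges(text: str) -> str:
    """Hypothetical operator: trim leading and trailing whitespace."""
    return text.strip()


def collapse_spaces(text: str) -> str:
    """Hypothetical operator: collapse runs of whitespace to one space."""
    return " ".join(text.split())


def reference_output(text: str, recipe: List[Operator]) -> str:
    """Deterministically apply the recipe, one operator at a time, in order."""
    for op in recipe:
        text = op(text)
    return text


def audit(
    samples: List[str],
    recipe: List[Operator],
    model_call: Callable[[str], str],
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Return the exact-match rate and the (input, expected, got) failures."""
    if not samples:
        raise ValueError("audit needs at least one sample")
    failures = []
    for sample in samples:
        expected = reference_output(sample, recipe)
        got = model_call(sample)
        if got != expected:
            failures.append((sample, expected, got))
    rate = 1 - len(failures) / len(samples)
    return rate, failures


def stub_model(text: str) -> str:
    """Assumed stand-in for an LLM call; it skips the collapse step."""
    return text.strip()


samples = ["  hello   world ", "clean text", " a  b"]
rate, failures = audit(samples, [strip_edges, collapse_spaces], stub_model)

print(f"exact-match rate: {rate:.2f}")
for original, expected, got in failures:
    print(repr(original), "expected", repr(expected), "got", repr(got))
```

In a workflow like this, the failures list could be routed to human review, while the deterministic output serves as the fallback for critical, order-sensitive steps.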

Who benefits

Data Science, Software Development, Business Intelligence, Marketing, Legal

Key takeaways

LLMs struggle with compositional and order-sensitive data refinement tasks.
CDR-Bench reveals a lack of procedural faithfulness in current LLMs.
Performance degrades significantly when multiple operations are combined.
Reliable execution of multi-step data processing remains a challenge for LLMs.

Original post by Yuchen Huang, Xiang Li, Zhenqing Ling, Sijia Li, Qianli Shen, Daoyuan Chen, Yi R. Fung, Yaliang Li

"arXiv:2606.31435v1 Announce Type: new Abstract: Data refinement involves executing multi-step recipes over evolving text states, where both composition and execution order of processing operators determine the outcome. While existing benchmarks either isolate text editing or enta…"
