CODA-BENCH Evaluates AI Agents on Data-Intensive Coding Tasks

Yuxin Zhang, Ju Fan, Meihao Fan, Shaolei Zhang, Xiaoyong Du · June 16, 2026

Summary

CODA-BENCH is a novel benchmark designed to assess the combined code and data intelligence of AI agents in realistic, data-intensive environments. It reveals that even advanced agents struggle to effectively integrate data discovery with code execution, highlighting a significant gap in current agentic capabilities for complex data tasks.

As advanced AI agents increasingly show they can operate as autonomous engineers, evaluation benchmarks need to reflect the complexity of real-world development. Real projects typically combine intricate code with large-scale data, often spread across complex file systems. Existing benchmarks, however, tend to evaluate code-centric or data-centric abilities in isolation, so they miss the challenge of handling both at once.

To bridge this gap, the authors introduce CODA-BENCH, which they describe as the first benchmark designed to jointly evaluate code and data intelligence within a data-intensive environment. The benchmark runs in a Linux sandbox built on the Kaggle ecosystem and populated with hundreds of datasets. Agents must navigate complex file hierarchies to identify the relevant resources, then generate code that solves data-driven analytical tasks. CODA-BENCH comprises 1,009 tasks across 31 communities, and each task environment contains an average of 980 files, which simulates realistic data scale and noise. In the reported evaluations, top-performing agents reach a success rate of only 61.1%, indicating that they still struggle to integrate data discovery with code execution. The authors point to this integration as a critical area for future research on agentic capabilities.
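To make the data discovery step concrete, here is a minimal Python sketch of how an agent might shortlist candidate data files in a noisy directory tree before writing any analysis code. It is not taken from CODA-BENCH: the function name shortlist_data_files, the extension list, and the keyword-in-path scoring are all assumptions chosen for illustration, and a real agent would likely also inspect file contents and metadata.

```python
import os

# Assumed set of extensions that count as "data files" for this sketch.
DATA_EXTENSIONS = {".csv", ".tsv", ".json", ".parquet", ".xlsx"}


def shortlist_data_files(root, keywords, top_k=5):
    """Rank data files under `root` by how many task keywords appear in their path.

    Hypothetical helper: path-keyword matching is a simple stand-in for
    whatever discovery strategy a real agent would use.
    """
    keywords = [k.lower() for k in keywords]
    scored = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            if extension not in DATA_EXTENSIONS:
                continue  # skip notebooks, readmes, images, and other noise
            rel_path = os.path.relpath(os.path.join(dirpath, name), root)
            score = sum(1 for k in keywords if k in rel_path.lower())
            if score > 0:
                scored.append((score, rel_path))
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _score, path in scored[:top_k]]
```

With roughly 980 files per task environment, a cheap filter like this narrows the search space; the agent would then open the shortlisted files to confirm schemas before generating analysis code.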

Why it matters

For professionals developing or deploying AI agents for software engineering or data science tasks, CODA-BENCH highlights current limitations and provides a crucial tool for developing more capable and robust agents that can handle the full complexity of real-world data environments.

How to implement this in your domain

  1. Utilize CODA-BENCH to evaluate the performance of your AI agents on integrated code and data tasks.
  2. Focus agent development efforts on improving data discovery and contextual understanding within complex file systems.
  3. Design agent architectures that better integrate code generation with data exploration and manipulation.
  4. Analyze failure modes on CODA-BENCH to identify specific weaknesses in agentic reasoning for data-intensive scenarios.
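The evaluation and failure analysis steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the TaskResult record, its fields, and the failure labels such as "wrong_file" are hypothetical and do not reflect the actual CODA-BENCH output format, which this summary does not describe.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are assumptions."""
    task_id: str
    community: str
    passed: bool
    failure_mode: Optional[str] = None  # e.g. "wrong_file", "runtime_error"


def summarize(results):
    """Compute overall success rate, failure-mode counts, and per-community rates."""
    total = len(results)
    passed = sum(1 for r in results if r.passed)
    # Count failure labels only for tasks that did not pass.
    failure_modes = Counter(
        r.failure_mode or "unlabeled" for r in results if not r.passed
    )
    by_community = defaultdict(lambda: [0, 0])  # community -> [passed, total]
    for r in results:
        by_community[r.community][0] += int(r.passed)
        by_community[r.community][1] += 1
    return {
        "success_rate": passed / total if total else 0.0,
        "failure_modes": failure_modes,
        "community_rates": {c: p / n for c, (p, n) in by_community.items()},
    }


if __name__ == "__main__":
    demo = [
        TaskResult("t1", "finance", True),
        TaskResult("t2", "finance", False, "wrong_file"),
        TaskResult("t3", "health", False, "runtime_error"),
    ]
    print(summarize(demo))
```

Separating discovery failures (the agent picked the wrong file) from execution failures (the code crashed or produced a wrong answer) is one way to see whether the weakness lies in data discovery, code generation, or the hand-off between them.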

Who benefits

Software Development, Data Science, AI Engineering, Research & Development, EdTech

Key takeaways

  • CODA-BENCH is the first benchmark to evaluate AI agents on combined code and data intelligence.
  • It simulates real-world data-intensive environments using a Kaggle-based sandbox.
  • Current advanced agents struggle with integrating data discovery and code execution.
  • The benchmark highlights a significant gap in agentic capabilities for complex data tasks.

Original post by Yuxin Zhang, Ju Fan, Meihao Fan, Shaolei Zhang, Xiaoyong Du

"arXiv:2606.15300v1 Announce Type: new Abstract: Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of real-world development. Such environments typically…"
