Analyzing AI Agent Trajectories Reveals Model-Specific Problem-Solving Behaviors

Gaurav Gupta, Vatshank Chaturvedi, Jun Huan, Anoop Deoras · June 17, 2026

Summary

This research formalizes the "intent-execution gap" in AI agents, highlighting the mismatch between a model's intended actions and the agent harness's execution. It introduces the Simple Strands Agent (SSA) harness to reproduce and improve performance on benchmarks, and analyzes 138,000 trajectories to uncover model-level differences in problem-solving beyond simple pass rates.

The performance of an AI agent depends not only on the underlying model but also on the "agent harness": the structured execution layer that includes prompts, memory, tools, and interaction loops. A central challenge is the "intent-execution gap," the discrepancy between what the model aims to do and what the harness actually executes. Bridging this gap is as important as any other aspect of harness design. To investigate it, the authors developed a customizable harness called "Simple Strands Agent" (SSA). SSA aims to capture patterns common to various large language model families while also accounting for model-specific preferences. Using SSA, the authors reproduced or improved pass@1 performance on several popular agentic benchmarks, including SWE-Pro and Terminal-Bench-2.

Beyond pass rates, an analysis of 138,000 trajectories generated by SSA revealed deeper insights into model behavior. Mapping these trajectories into code state-spaces showed that different models take distinct problem-solving approaches. Finer-grained metrics, such as edit frequency, testing activity, and phase transitions, gave a detailed view of how individual models allocate their effort across the stages of autonomous problem-solving.
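One way to make the intent-execution gap concrete is to log, for every step, the call the model requested next to the call the harness actually ran, and then count the mismatches. The sketch below is illustrative only: the `Step` record, its field names, and the `gap_rate` function are assumptions made for this example and are not taken from the paper or from SSA.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One agent step (hypothetical schema, not SSA's actual log format)."""
    intended_tool: str                      # tool the model asked for
    executed_tool: str                      # tool the harness actually ran
    intended_args: dict = field(default_factory=dict)
    executed_args: dict = field(default_factory=dict)


def gap_rate(steps):
    """Return the fraction of steps where the executed call differs from the intended one."""
    if not steps:
        return 0.0
    mismatches = sum(
        1
        for s in steps
        if (s.intended_tool, s.intended_args) != (s.executed_tool, s.executed_args)
    )
    return mismatches / len(steps)


# Toy trajectory: the third step was rewritten by the harness
# (the model asked for a patch, the harness ran a whole-file write instead).
trajectory = [
    Step("read_file", "read_file", {"path": "a.py"}, {"path": "a.py"}),
    Step("search", "search", {"query": "foo"}, {"query": "foo"}),
    Step("apply_patch", "write_file", {"path": "a.py"}, {"path": "a.py"}),
    Step("run_tests", "run_tests"),
]
print(gap_rate(trajectory))
```

A per-step comparison like this only catches mismatches that show up in the tool call itself. Gaps caused by prompt formatting or memory truncation would need other signals.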

Why it matters

For AI engineers and developers, understanding the intent-execution gap and how different models behave within agent harnesses is crucial for optimizing agent performance, debugging failures, and designing more robust and efficient AI systems.

How to implement this in your domain

  1. Analyze agent trajectories in detail to identify discrepancies between model intent and harness execution.
  2. Develop custom agent harnesses that are aligned with the specific capabilities and preferences of the chosen LLM.
  3. Implement fine-grained metrics like edit frequency and testing activity to evaluate agent problem-solving processes.
  4. Benchmark agent performance not just on final pass rates but also on intermediate behaviors and resource allocation.
  5. Iteratively refine agent harness design based on insights from trajectory analysis to minimize the intent-execution gap.
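The metrics named in the steps above (edit frequency, testing activity, phase transitions) can be sketched over a plain list of action names. This is a minimal illustration under stated assumptions: the action names, the `PHASE_OF` mapping, and the metric definitions are hypothetical choices for this example, and the paper may define these quantities differently.

```python
from collections import Counter

# Hypothetical mapping from harness action names to coarse problem-solving phases.
PHASE_OF = {
    "read_file": "explore",
    "search": "explore",
    "edit_file": "edit",
    "run_tests": "test",
}


def trajectory_metrics(actions):
    """Summarize how one trajectory spends its steps.

    Returns the share of steps that are edits, the share that are test runs,
    and a count of each transition between two different phases.
    """
    phases = [PHASE_OF.get(action, "other") for action in actions]
    total = len(phases)
    counts = Counter(phases)
    # Count only changes of phase; consecutive steps in the same phase are skipped.
    transitions = Counter(
        (prev, nxt) for prev, nxt in zip(phases, phases[1:]) if prev != nxt
    )
    return {
        "edit_frequency": counts["edit"] / total if total else 0.0,
        "test_frequency": counts["test"] / total if total else 0.0,
        "phase_transitions": dict(transitions),
    }


# Toy trajectory: explore twice, then alternate editing and testing.
actions = ["read_file", "search", "edit_file", "run_tests", "edit_file", "run_tests"]
metrics = trajectory_metrics(actions)
print(metrics)
```

Aggregating these per-trajectory summaries by model is one way to compare how models allocate effort, for example whether a model tests after every edit or batches its edits before testing.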

Who benefits

Software Development, AI Engineering, Robotics, Autonomous Systems, Quality Assurance

Key takeaways

  • AI agent performance is heavily influenced by the "intent-execution gap" between the model and its harness.
  • Customizable harnesses can significantly improve benchmark performance across diverse LLMs.
  • Analyzing agent trajectories reveals model-specific problem-solving strategies beyond simple pass rates.
  • Finer-grained metrics offer deeper insights into how models allocate effort during autonomous tasks.

Original post by Gaurav Gupta, Vatshank Chaturvedi, Jun Huan, Anoop Deoras

"arXiv:2606.17454v1 Announce Type: new Abstract: AI agent performance is not just a modeling problem, it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior ca…"
