MedEvoEval: Evaluating Doctor Agents' Continual Evolution

Hui Zhang · June 30, 2026

Summary

MedEvoEval is a new longitudinal evaluation framework for doctor agents, simulating outpatient clinical episodes to assess how agents acquire evidence, use resources, and evolve their behavior across episodes through memory and updates. It exposes process costs and supports analysis of memory maturation and transfer.

This paper introduces MedEvoEval, an executable longitudinal evaluation framework for assessing the continual evolution of "doctor agents" in simulated clinical environments. Rather than scoring single-turn question answering, the framework focuses on outpatient episodes in which an agent must acquire evidence, use examination and consultation resources, and decide on a diagnosis and management plan. MedEvoEval also evaluates how agent behavior changes across episodes through mechanisms such as memory, retrieval, and reflection. The framework converts source clinical cases into role-specific patient, examination, and manager views, and reveals evidence only in response to valid actions. Each episode produces a structured trace that records observations, actions, final outputs, scores, and optional experience write-back. Experiments on 700 processed episodes show that MedEvoEval can expose process costs that final-answer scoring hides, show how resources are reallocated during multi-disciplinary team-style consultations, and support longitudinal analyses of memory maturation, transfer, and retention. The framework thus gives a concrete basis for testing whether doctor agents actually improve with experience and retain their capabilities over time.
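The episode mechanics described above (role-specific views, evidence revealed only through valid actions, and a structured trace) can be illustrated with a small sketch. This is not the paper's implementation: the `Episode` class, the `ask` and `order_exam` action names, the exact-match scoring, and the toy case are all assumptions made for illustration, and the manager view is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One simulated outpatient case split into role-specific views (hypothetical layout)."""
    patient_view: dict   # what the simulated patient can report, keyed by topic
    exam_view: dict      # examination results, keyed by test name
    diagnosis: str       # ground truth, never exposed through step()
    trace: list = field(default_factory=list)

    def step(self, action: str, arg: str) -> str:
        """Reveal evidence only in response to a valid action, and log the step."""
        if action == "ask":
            obs = self.patient_view.get(arg, "no information")
        elif action == "order_exam":
            obs = self.exam_view.get(arg, "exam not available")
        else:
            obs = "invalid action"
        self.trace.append({"action": action, "arg": arg, "observation": obs})
        return obs

    def finalize(self, predicted: str) -> dict:
        """Score the final answer and return a structured trace record."""
        n_exams = sum(1 for s in self.trace if s["action"] == "order_exam")
        return {
            "steps": list(self.trace),
            "final_output": predicted,
            # Assumed scoring rule: exact match on the diagnosis string.
            "score": 1.0 if predicted == self.diagnosis else 0.0,
            # Process costs that a final-answer score alone would not show.
            "n_actions": len(self.trace),
            "n_exams": n_exams,
        }


# Toy case invented for this sketch; not taken from the paper's data.
episode = Episode(
    patient_view={"chief_complaint": "cough for three weeks"},
    exam_view={"chest_xray": "right lower lobe infiltrate"},
    diagnosis="pneumonia",
)
episode.step("ask", "chief_complaint")
episode.step("order_exam", "chest_xray")
record = episode.finalize("pneumonia")
print(record["score"], record["n_actions"], record["n_exams"])
```

Keeping the action counts next to the score in the same record is what lets two agents with identical accuracy be compared on how much evidence gathering each one needed.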

Why it matters

The framework gives developers a concrete way to test whether AI doctor agents actually learn and adapt over time, which matters for building reliable clinical decision support systems.

How to implement this in your domain

  1. Adopt MedEvoEval as a standard benchmark for developing and testing AI agents in healthcare applications.
  2. Integrate longitudinal evaluation methodologies into the development lifecycle of AI-powered diagnostic tools.
  3. Use the framework to identify and address weaknesses in agent memory, reasoning, and decision-making processes over extended interactions.
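As a rough illustration of the third step, per-episode records can be compared across time to see whether an agent's accuracy and process cost change with experience. The sketch below assumes each record carries a `score` and an `n_actions` field; those field names, the `window_summary` helper, and the sample records are hypothetical and are not results or APIs from the paper.

```python
from statistics import mean


def window_summary(records, window):
    """Compare the first and last `window` episodes on score and action count."""
    if len(records) < 2 * window:
        raise ValueError("need at least 2 * window episode records")
    early, late = records[:window], records[-window:]
    return {
        # Positive value: later episodes score higher than earlier ones.
        "score_delta": mean(r["score"] for r in late) - mean(r["score"] for r in early),
        # Negative value: later episodes needed fewer actions.
        "action_delta": mean(r["n_actions"] for r in late)
        - mean(r["n_actions"] for r in early),
    }


# Made-up records purely for illustration, listed in episode order.
records = [
    {"score": 0.0, "n_actions": 8},
    {"score": 1.0, "n_actions": 8},
    {"score": 1.0, "n_actions": 6},
    {"score": 1.0, "n_actions": 4},
]
summary = window_summary(records, window=2)
print(summary)
```

A simple early-versus-late comparison like this cannot separate learning from differences in case difficulty, so in practice the episode order and case mix would need to be controlled.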

Who benefits

Healthcare, Medical Research, Pharmaceuticals, AI Development

Key takeaways

  • MedEvoEval is a new framework for evaluating evolving AI doctor agents.
  • It simulates longitudinal outpatient clinical episodes.
  • The framework reveals process costs and supports analysis of agent learning and memory.
  • It helps assess how agents improve with experience and retain capabilities.

Original post by Hui Zhang

"arXiv:2606.28900v1 Announce Type: new Abstract: Doctor agents are moving beyond single-turn answer generation toward evolving clinical decision systems. Within an outpatient episode, they acquire evidence, use examination and consultation resources, and decide when to finalize a…"

Originally posted by Hui Zhang on X
