IHBench Evaluates Voice Agent Recovery from User Interruptions

Ahmad Salimi, Wentao Ma, Yuzhi Tang, Dongming Shen, Mu Li, Alex Smola · June 19, 2026

Summary

A new benchmark, IHBench, has been introduced to evaluate how voice agents recover from user interruptions within structured workflows. It assesses task fulfillment and recovery quality across 10 enterprise domains and various interruption types.

Voice agents operating in structured environments, such as customer service or healthcare scheduling, frequently encounter user interruptions. Existing benchmarks focus on the timing of these interruptions and often overlook what happens *after* one: whether the agent can correctly resume the workflow, address the interjection, and avoid repeating information.

To fill this gap, researchers have developed IHBench, the Interruption Handling Benchmark. It evaluates how well voice agents recover from interruptions while executing state-machine-driven workflows across ten enterprise domains. The benchmark injects six types of interruptions at controlled points during an utterance and scores each interruption on both task fulfillment and recovery quality.

An evaluation of 27 audio-language model configurations, covering major providers such as OpenAI and Google as well as open-weight models, revealed significant performance disparities. Closed-weight models were consistently more robust to interruptions: they outperformed open-weight models in task fulfillment, degraded more slowly as conversations grew longer, and showed no modality gap between audio and text inputs. A human study validated the accuracy of the LLM judge and confirmed that recovery quality is a distinct and important capability for voice agents.
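The setup described above (a state-machine workflow, with an interruption injected at a controlled point in an agent utterance) can be pictured with a minimal sketch. This is not IHBench's code or data format: the workflow states, the `Interruption` fields, the `example_question` label, and the helper functions are all hypothetical names chosen for illustration, and the cut point is assumed to be a fraction of the utterance's words.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical linear workflow; each state is one step the agent must complete.
WORKFLOW = ["greet", "verify_identity", "collect_date", "confirm_booking"]


@dataclass
class Interruption:
    kind: str        # placeholder label, not one of the benchmark's six types
    position: float  # assumed: fraction of the utterance spoken before the cut
    user_text: str   # what the user says when barging in


def cut_utterance(utterance: str, position: float) -> Tuple[str, str]:
    """Split an agent utterance into spoken and unspoken parts by word count."""
    words = utterance.split()
    cut = int(len(words) * position)
    return " ".join(words[:cut]), " ".join(words[cut:])


def next_state(current: str, step_completed: bool) -> str:
    """Advance only when the current step is done; otherwise stay and resume."""
    i = WORKFLOW.index(current)
    if step_completed and i + 1 < len(WORKFLOW):
        return WORKFLOW[i + 1]
    return current


utterance = "Please tell me the date you would like to book"
event = Interruption(kind="example_question", position=0.5,
                     user_text="Wait, is Friday open?")
spoken, unspoken = cut_utterance(utterance, event.position)

# After the interruption the step is still incomplete, so the agent should
# answer the interjection and then resume "collect_date" rather than restart.
state_after = next_state("collect_date", step_completed=False)
```

In this toy version, a good recovery means the agent addresses `event.user_text`, stays in `collect_date`, and continues from the unspoken part without repeating what the user already heard.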

Why it matters

For professionals developing or deploying voice agents, IHBench provides a critical tool to ensure robust and user-friendly interactions. Evaluating post-interruption recovery is essential for building reliable conversational AI systems that can maintain user satisfaction and task completion in real-world, dynamic environments.

How to implement this in your domain

1. Use IHBench to rigorously test the interruption handling capabilities of your existing or in-development voice agents.
2. Analyze the performance of your voice agents across different interruption types to identify areas for improvement.
3. Prioritize development efforts on improving recovery quality, especially for critical enterprise workflows.
4. Consider the findings regarding closed-weight versus open-weight models when selecting platforms for voice agent deployment.
5. Implement continuous evaluation using IHBench-like metrics to monitor and enhance voice agent resilience.
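The per-interruption-type analysis in the steps above could look roughly like the following sketch. The record fields (`interruption`, `task_fulfilled`, `recovery_quality`), the `type_a`/`type_b` labels, and the 0-to-1 recovery scale are assumptions made for illustration, and the numbers are made up; they are not the benchmark's schema or results.

```python
from collections import defaultdict
from statistics import mean

# Made-up evaluation records for illustration only.
results = [
    {"interruption": "type_a", "task_fulfilled": True, "recovery_quality": 0.75},
    {"interruption": "type_a", "task_fulfilled": False, "recovery_quality": 0.25},
    {"interruption": "type_b", "task_fulfilled": True, "recovery_quality": 0.5},
]


def summarize(records):
    """Group records by interruption type and average both scores."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["interruption"]].append(record)
    return {
        kind: {
            "n": len(rows),
            "fulfillment_rate": mean(1.0 if r["task_fulfilled"] else 0.0 for r in rows),
            "mean_recovery": mean(r["recovery_quality"] for r in rows),
        }
        for kind, rows in grouped.items()
    }


summary = summarize(results)
for kind, stats in sorted(summary.items()):
    print(kind, stats)
```

Reporting the two scores side by side keeps task fulfillment and recovery quality separate, which matches the article's point that recovery is a distinct capability.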

Who benefits

Customer Service, Healthcare, Telecommunications, Banking, Retail

Key takeaways

IHBench is a new benchmark for evaluating voice agent recovery after user interruptions.
It assesses task fulfillment and recovery quality in structured, multi-step workflows.
Closed-weight models generally show superior robustness to interruptions compared to open-weight models.
Effective interruption handling is a distinct and crucial capability for production-ready voice agents.

Original post by Ahmad Salimi, Wentao Ma, Yuzhi Tang, Dongming Shen, Mu Li, Alex Smola

"arXiv:2606.19595v1 Announce Type: new Abstract: Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for sp…"

