External Feedback Outperforms Self-Refinement for LLM Improvement

Bartłomiej Cupiał, Jan Łojek, Mikołaj Garstecki, Szymon Pobłocki, Alicja Ziarko, Piotr Miłoś · July 1, 2026

Summary

A study investigates the true impact of natural language feedback on LLM performance, finding that significant improvements beyond mere repeated attempts primarily stem from strong external teachers rather than self-generated feedback. The research emphasizes that a student model's ability to utilize feedback is a key bottleneck.

In multi-turn language agent settings, final accuracy often improves, but it is frequently unclear whether the gains come from genuinely useful feedback or from other factors such as resampling, format correction, or additional test-time computation. This research introduces a controlled student-teacher protocol to isolate when natural-language feedback drives improvement beyond what repeated attempts alone can achieve. The study evaluated thirteen open-weight models across various tasks under three conditions: external feedback, self-feedback, and unguided self-refinement. The findings show that multi-turn improvement is often weak evidence of effective feedback use; self-generated feedback, for instance, offered minimal gains beyond unguided self-refinement. The most substantial feedback-specific improvements appeared when strong external teachers provided guidance that went beyond generic retry instructions. The student model's ability to use feedback was a more critical driver of interactive gains than the specific identity of the teacher, though teacher quality remains important. These results underscore the need to evaluate feedback-based agents against repeated-attempt baselines and suggest that an agent's capacity to act on feedback is a central bottleneck for interactive improvement.
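The three conditions can be pictured as one multi-turn loop in which only the source of the between-turn message changes. The following is a minimal sketch under stated assumptions, not the authors' protocol: `run_episode`, `Student`, `FeedbackFn`, the `GENERIC_RETRY` wording, the three-turn budget, and stopping on the first correct answer are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces (not from the paper):
# a student maps (task, history) to an answer,
# a feedback source maps (task, answer) to a natural-language message.
Student = Callable[[str, List[str]], str]
FeedbackFn = Callable[[str, str], str]
Checker = Callable[[str, str], bool]

# Assumed wording for the unguided retry condition.
GENERIC_RETRY = "Your answer was incorrect. Please try again."


def run_episode(
    task: str,
    student: Student,
    is_correct: Checker,
    feedback_fn: Optional[FeedbackFn] = None,
    max_turns: int = 3,
) -> bool:
    """Return True if the student solves the task within max_turns.

    feedback_fn=None  -> unguided retry (generic message only)
    feedback_fn=teacher -> external feedback
    feedback_fn=a wrapper around the student itself -> self-feedback
    """
    history: List[str] = []
    for _ in range(max_turns):
        answer = student(task, history)
        if is_correct(task, answer):
            return True
        # Only the content of this message differs between conditions.
        message = feedback_fn(task, answer) if feedback_fn else GENERIC_RETRY
        history.append(answer)
        history.append(message)
    return False
```

Holding the turn budget and the loop fixed across conditions is the point of the sketch: any difference in solve rate can then be attributed to the message content rather than to extra attempts.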

Why it matters

Teams designing and deploying LLM-based interactive systems should not assume that all feedback is equally valuable. Real performance gains depend on high-quality external feedback mechanisms and on improving the model's ability to act on the feedback it receives.

How to implement this in your domain

  1. Design LLM evaluation metrics that differentiate between true feedback-driven improvement and gains from mere retries or format corrections.
  2. Prioritize developing robust external feedback mechanisms from human experts or highly capable 'teacher' models.
  3. Focus on training LLMs to better interpret and integrate external feedback into their reasoning processes.
  4. Implement A/B testing for different feedback strategies to identify what truly drives performance improvements in your applications.
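For the first step, one simple way to separate feedback-driven improvement from retry effects is to report the gain over a retry-only baseline rather than over the single-turn score. The sketch below is illustrative only: the function names, the dictionary keys, and the per-task 0/1 outcome lists are assumptions, not metrics defined in the study.

```python
from typing import Dict, Sequence


def accuracy(outcomes: Sequence[int]) -> float:
    """Fraction of tasks solved; outcomes are 1 (solved) or 0 (not solved)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def decompose_gain(
    single_turn: Sequence[int],
    retry_only: Sequence[int],
    with_feedback: Sequence[int],
) -> Dict[str, float]:
    """Split the multi-turn gain into a retry part and a feedback-specific part.

    All three sequences are assumed to cover the same tasks with the
    same turn budget for the two multi-turn conditions.
    """
    base = accuracy(single_turn)
    retry = accuracy(retry_only)
    feedback = accuracy(with_feedback)
    return {
        "multi_turn_gain": feedback - base,          # what is usually reported
        "retry_gain": retry - base,                  # obtainable without feedback
        "feedback_specific_gain": feedback - retry,  # attributable to feedback
    }


# Toy outcomes for four tasks (made-up numbers, for illustration only).
gains = decompose_gain(
    single_turn=[1, 0, 0, 0],
    retry_only=[1, 1, 0, 0],
    with_feedback=[1, 1, 1, 0],
)
print(gains)
```

In this toy case half of the apparent multi-turn gain comes from retrying alone, which is the kind of confound a repeated-attempt baseline is meant to expose.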

Who benefits

AI Development · Customer Service · Education · Software Development · Robotics

Key takeaways

  • Multi-turn LLM improvement often isn't solely due to effective feedback.
  • Strong external teachers provide significantly more useful feedback than self-generated feedback.
  • A student LLM's ability to use feedback is a primary bottleneck for interactive improvement.
  • Feedback-based agents should be evaluated against repeated-attempt baselines.

Original post by Bartłomiej Cupiał, Jan Łojek, Mikołaj Garstecki, Szymon Pobłocki, Alicja Ziarko, Piotr Miłoś

"arXiv:2606.30774v1 Announce Type: new Abstract: We study when natural-language feedback produces improvement beyond the gains obtainable from repeated attempts alone. In multi-turn language agent setting, higher final accuracy can reflect useful feedback, but it can also arise fr…"


