Fine-Tuning Vision-Language Models for Visual Grounding Creates Controllable Interference

Chenyu Zhou, Qiliang Jiang, Boguang Pan · June 15, 2026

Summary

Research shows that fine-tuning vision-language models to generate dense coordinate lists for visual grounding can introduce an "interference surface," affecting how models serialize and terminate structured outputs. This behavior, characterized by repeated output tails, can be measured and controlled without compromising performance.

This research investigates the effects of fine-tuning vision-language models (VLMs) to produce dense coordinate lists, a technique used to improve visual grounding. The study found that this specific fine-tuning method creates an "interference surface" within the model, which alters how it generates and concludes structured outputs. A key observation was the emergence of repeated output tails, where the model duplicates parts of its generated coordinate lists. For instance, in Gemma 4 12B, high-capacity LoRA significantly improved visual grounding but also led to a notable duplicate rate. Crucially, the study demonstrates that this interference is controllable. By implementing object-level repeat-stop mechanisms, researchers were able to eliminate duplicate records while maintaining or even slightly improving the model's F1 scores for visual grounding. This suggests that the introduced behavior is a structure-bound, cross-family phenomenon that can be managed.
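
To make the idea of an object-level repeat-stop concrete, here is a minimal sketch. It is not the paper's implementation, which this summary does not describe in detail. It assumes the model streams a JSON list of objects such as {"label": ..., "box": [...]}; the function name `repeat_stop` and the record layout are hypothetical. The sketch parses each object as soon as it is complete and halts at the first one that exactly repeats an earlier record.

```python
import json
from typing import Iterable, List


def repeat_stop(chunks: Iterable[str]) -> List[dict]:
    """Collect streamed JSON objects and stop at the first exact repeat.

    Assumption: each grounded object is a flat JSON object inside a list.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    seen = set()
    records: List[dict] = []
    for chunk in chunks:
        buffer += chunk
        while True:
            start = buffer.find("{")
            if start == -1:
                break  # no object has started yet
            try:
                obj, end = decoder.raw_decode(buffer, start)
            except json.JSONDecodeError:
                break  # object is incomplete; wait for more text
            buffer = buffer[end:]
            key = json.dumps(obj, sort_keys=True)  # canonical form for comparison
            if key in seen:
                return records  # repeated object: stop consuming output
            seen.add(key)
            records.append(obj)
    return records


stream = [
    '[{"label": "cat", "box": [1, 2, 3, 4]}, {"lab',
    'el": "dog", "box": [5, 6, 7, 8]}, ',
    '{"label": "dog", "box": [5, 6, 7, 8]}, ',
    '{"label": "bird", "box": [0, 0, 1, 1]}]',
]
kept = repeat_stop(stream)
```

In a real decoding loop the same check would more likely sit inside a stopping criterion so that generation ends early instead of discarding text after the fact. Note that stopping at the first repeat also drops any genuinely new objects that would have followed it, as with the "bird" record here.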

Why it matters

Professionals working with vision-language models for tasks like object detection or image captioning need to understand how fine-tuning impacts model behavior beyond primary performance metrics, especially concerning output quality and control. This research offers insights into managing unintended side effects like output repetition, ensuring more reliable and efficient model deployment.

How to implement this in your domain

  1. Implement post-processing filters to detect and remove repeated elements in structured outputs from fine-tuned VLMs.
  2. Experiment with different LoRA ranks and adapter capacities during fine-tuning to observe and mitigate interference effects.
  3. Develop custom evaluation metrics that specifically track output repetition rates alongside standard performance metrics for visual grounding tasks.
  4. Investigate the internal mechanisms of VLMs to pinpoint where coordinate-list fine-tuning induces changes in serialization logic.
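
As a rough illustration of the first and third steps, the sketch below removes exact repeats from a parsed coordinate list and reports a repetition rate. It assumes each record has already been parsed into a (label, box) tuple. The names `dedupe` and `duplicate_rate`, and the definition of the rate as the share of records that repeat an earlier one, are assumptions for illustration and not the paper's metric.

```python
from typing import List, Tuple

# Hypothetical record layout: (label, (x1, y1, x2, y2))
Record = Tuple[str, Tuple[int, ...]]


def dedupe(records: List[Record]) -> List[Record]:
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique: List[Record] = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def duplicate_rate(records: List[Record]) -> float:
    """Share of records that exactly repeat an earlier record."""
    if not records:
        return 0.0
    return 1.0 - len(set(records)) / len(records)


predictions: List[Record] = [
    ("cat", (1, 2, 3, 4)),
    ("dog", (5, 6, 7, 8)),
    ("dog", (5, 6, 7, 8)),
    ("dog", (5, 6, 7, 8)),
]
cleaned = dedupe(predictions)
rate = duplicate_rate(predictions)
```

Exact matching is the simplest choice. Near-duplicate boxes that differ by a pixel or two would need a tolerance such as an overlap threshold, which is a separate design decision.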

Who benefits

Robotics, Autonomous Vehicles, Image Analysis, Augmented Reality, E-commerce

Key takeaways

  • Fine-tuning VLMs for dense coordinate lists can introduce output repetition.
  • This "interference surface" is measurable and can be controlled.
  • Controlling repetition does not necessarily degrade visual grounding performance.
  • The effect is specific to structured coordinate outputs, not general JSON.

Original post by Chenyu Zhou, Qiliang Jiang, Boguang Pan

"arXiv:2606.14507v1 Announce Type: new Abstract: Fine-tuning vision-language models to emit dense coordinate lists improves visual grounding but also changes how models serialize, repeat, and terminate structured outputs. We study this behavior as a generation and control surface.…"

Originally posted by Chenyu Zhou, Qiliang Jiang, Boguang Pan on X
