LLM Agents Blindly Defer to GNN Tools, Stronger Models Defer More

Zhongyuan Wang, Pratyusha Vemuri · June 15, 2026

Summary

Research indicates that large language model agents, when equipped with graph neural network tools, tend to defer almost entirely to the tool's output, often bypassing their own reasoning. This blind deference increases with the LLM's capability, even when the tool provides suboptimal results.

A new study identifies a critical limitation in how large language model (LLM) agents use external tools, specifically graph neural networks (GNNs). Given a GNN as a callable tool, the agents did not exercise independent judgment: they agreed with the GNN's predictions 97.6–99.2% of the time, which effectively turns the agent into a "GNN parrot" and overrides its own reasoning. This deference is not a weakness confined to smaller models. It increases with LLM capability, and its cost does not shrink as models get stronger. The agent defers even in scenarios where its internal reasoning or a simpler alternative tool could outperform the GNN; for instance, a simple neighbor-label tool sometimes surpassed the GNN, yet the agent continued to rely on the GNN.

The findings suggest that current evaluations of agent-plus-tool systems may be overestimating the agent's judgment. Selective invocation mechanisms need to be explicitly designed and integrated, because they are unlikely to emerge from model scale alone. The study serves as a cautionary note for developers building tool-augmented LLM agents.
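To make the agreement figure concrete, here is a minimal sketch of how such a rate could be computed from logged predictions. It is not the study's evaluation code: the function names, the toy labels, and the second metric (how often the agent departs from the tool when the tool is wrong) are assumptions made for illustration.

```python
from typing import Hashable, Sequence


def agreement_rate(agent_final: Sequence[Hashable],
                   tool_pred: Sequence[Hashable]) -> float:
    """Fraction of examples where the agent's final answer equals the tool's output."""
    if len(agent_final) != len(tool_pred):
        raise ValueError("inputs must have the same length")
    if not agent_final:
        raise ValueError("inputs must be non-empty")
    matches = sum(a == t for a, t in zip(agent_final, tool_pred))
    return matches / len(agent_final)


def override_rate_when_tool_wrong(agent_final: Sequence[Hashable],
                                  tool_pred: Sequence[Hashable],
                                  gold: Sequence[Hashable]) -> float:
    """Among examples where the tool is wrong, fraction where the agent departs from it."""
    wrong = [(a, t) for a, t, g in zip(agent_final, tool_pred, gold) if t != g]
    if not wrong:
        return float("nan")  # undefined when the tool never errs
    return sum(a != t for a, t in wrong) / len(wrong)


# Toy data (hypothetical, not from the paper).
agent_final = ["A", "B", "B", "C"]
tool_pred = ["A", "B", "B", "A"]
gold = ["A", "B", "C", "C"]

print(agreement_rate(agent_final, tool_pred))                       # 0.75
print(override_rate_when_tool_wrong(agent_final, tool_pred, gold))  # 0.5
```

A high agreement rate alone does not show blind deference, since a good tool deserves agreement. The second number is the more telling one: an agent exercising judgment should override at least some of the tool's mistakes.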

Why it matters

Professionals designing or deploying LLM agents with external tools must be aware that agents may not exercise judgment, potentially leading to suboptimal or incorrect outputs. This highlights the need for explicit control mechanisms over tool invocation.

How to implement this in your domain

  1. Implement explicit selective invocation gates for LLM agents to decide when and how much to rely on external tools.
  2. Design evaluation protocols for agent-tool systems that specifically test the agent's judgment and ability to override suboptimal tool outputs.
  3. Develop mechanisms for agents to assess the confidence or reliability of tool outputs before deferring.
  4. Consider simpler, more robust alternative tools or internal reasoning paths for agents, rather than assuming complex tools are always superior.
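The steps above can be sketched as a small gate that sits between the tool and the agent's final answer. This is an illustrative design, not a mechanism from the study: `ToolResult`, `gated_answer`, the `min_confidence` threshold of 0.8, and the assumption that the tool exposes a calibrated confidence are all hypothetical. The neighbor-label majority vote stands in for the "simpler alternative tool" idea.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class ToolResult:
    label: str
    confidence: float  # assumed to be a calibrated probability in [0, 1]


def neighbor_majority(neighbor_labels: Sequence[str]) -> Optional[str]:
    """Cheap baseline: the most common label among a node's neighbors, if any."""
    if not neighbor_labels:
        return None
    return Counter(neighbor_labels).most_common(1)[0][0]


def gated_answer(tool: ToolResult,
                 neighbor_labels: Sequence[str],
                 agent_own: str,
                 min_confidence: float = 0.8) -> Tuple[str, str]:
    """Return (label, source) instead of deferring to the tool unconditionally."""
    baseline = neighbor_majority(neighbor_labels)
    # Accept the tool outright only when it is confident enough.
    if tool.confidence >= min_confidence:
        return tool.label, "tool"
    # A low-confidence tool output is still accepted if the baseline agrees.
    if baseline is not None and baseline == tool.label:
        return tool.label, "tool+baseline"
    # Otherwise fall back to the agent's own answer, noting baseline support.
    if baseline is not None and baseline == agent_own:
        return agent_own, "agent+baseline"
    return agent_own, "agent"


# Hypothetical usage: a low-confidence tool output contradicted by the neighbors.
print(gated_answer(ToolResult("A", 0.5), ["B", "B", "A"], agent_own="B"))
```

Returning the source alongside the label makes the gate auditable, so an evaluation can count how often each path is taken rather than assuming the agent used judgment. The threshold and the fallback order are choices that would need tuning on held-out data.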

Who benefits

AI/ML Engineering · Software Development · Data Science · Cybersecurity · Autonomous Systems

Key takeaways

  • LLM agents tend to defer blindly to external tools like GNNs, even when the tool's output is suboptimal.
  • Stronger LLMs exhibit greater deference to tools, not less.
  • Agent evaluations must test for judgment rather than assume it.
  • Selective invocation will not emerge on its own; it must be designed in explicitly for effective tool use.

Original post by Zhongyuan Wang, Pratyusha Vemuri

"arXiv:2606.14476v1 Announce Type: new Abstract: A growing line of work equips large language model (LLM) agents with graph neural networks (GNNs) as callable tools, assuming the agent exercises judgment over when and how much to rely on such a tool. We test this directly. We expo…"

Originally posted by Zhongyuan Wang, Pratyusha Vemuri on X
