New Method Controls LLM Sycophancy Using Cascading Linear Features

Maty Bohacek, Rishub Jain, Nicholas Dufour, Thomas Leung, Chris Bregler, Roma Patel · June 26, 2026

Summary

This research introduces an iterative data generation pipeline that isolates cascading linear features in LLM activation space to detect and control sycophancy, a model's tendency to prioritize user validation. By using samples that exhibit a feature to varying degrees rather than simple binary pairs, the method achieves better disentanglement and more robust steering away from sycophantic behavior than baseline approaches.

Controlling specific behaviors in large language models often relies on activation steering, which requires many contrastive data samples to identify the underlying features. This paper presents an iterative data generation pipeline designed to isolate the "cascading linear features" responsible for particular model behaviors. The core idea is to move beyond simple binary sample pairs and instead use samples that exhibit a feature to varying degrees, which allows the features to be disentangled more precisely within the model's activation space. The researchers applied this method to sycophancy, the tendency of LLMs to overly prioritize user validation. They showed that the sycophancy features identified through their cascading-sample approach form linearly separable subspaces, leading to clearer detection and more effective steering. The technique matches or outperforms LLM-as-a-judge and system prompting methods while requiring less computation and offering better interpretability.
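To make the idea of graded samples concrete, here is a minimal sketch on synthetic data. It is not the paper's pipeline: the toy activations, the hidden size, the four grade levels, and the helper names `fit_graded_direction` and `steer` are all assumptions made for illustration. The sketch fits a single linear direction as the least-squares slope of activation against grade, then shifts an activation along that direction.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 16  # hypothetical activation width

# Hypothetical ground-truth feature direction, used only to generate toy data.
true_direction = rng.normal(size=hidden_size)
true_direction /= np.linalg.norm(true_direction)

# Graded samples: grade 0 (not sycophantic) through grade 3 (strongly sycophantic).
grades = np.repeat(np.arange(4), 50).astype(float)
noise = rng.normal(scale=0.5, size=(grades.size, hidden_size))
activations = noise + np.outer(grades, true_direction)


def fit_graded_direction(acts, grades):
    """Per-dimension least-squares slope of activation against grade, unit-normalized."""
    centered_acts = acts - acts.mean(axis=0)
    centered_grades = grades - grades.mean()
    slope = centered_grades @ centered_acts / (centered_grades @ centered_grades)
    return slope / np.linalg.norm(slope)


def steer(activation, direction, alpha):
    """Shift one activation vector by alpha along a unit-norm direction."""
    return activation + alpha * direction


direction = fit_graded_direction(activations, grades)
cosine_to_truth = float(direction @ true_direction)

# Mean projection per grade; on this toy data it should rise with the grade.
grade_means = [float((activations[grades == g] @ direction).mean()) for g in range(4)]

# A negative alpha lowers the projection onto the fitted direction.
steered = steer(activations[-1], direction, alpha=-2.0)

print("cosine to generating direction:", round(cosine_to_truth, 3))
print("mean projection per grade:", [round(m, 2) for m in grade_means])
```

Using every grade in the fit, rather than only the two extremes, is the part of this sketch that mirrors the move away from binary pairs. In a real model the activations would come from a chosen layer and the steering shift would be applied during the forward pass, neither of which is shown here.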

Why it matters

Professionals building or deploying LLMs need robust methods to ensure models provide objective, truthful information rather than simply agreeing with users, which is critical for trustworthy AI applications.

How to implement this in your domain

  1. Integrate cascading linear feature detection into LLM fine-tuning pipelines to identify and mitigate undesirable behaviors like sycophancy.
  2. Develop custom datasets with graded feature expressions to improve the disentanglement of behavioral traits in models.
  3. Utilize activation steering techniques based on these features to enforce desired model responses and reduce bias.
  4. Evaluate model outputs for sycophancy using this method as a more interpretable and computationally efficient alternative to LLM-as-a-judge.
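As a rough illustration of the last step, the sketch below flags sycophantic responses by projecting activations onto a linear direction and comparing against a threshold, which costs one dot product per response instead of a judge-model call. Everything here is assumed for illustration: the synthetic activations, the difference-of-means direction (a common baseline, not necessarily the paper's estimator), and the helper names `fit_detector` and `flag_sycophancy`.

```python
import numpy as np


def fit_detector(neutral_acts, sycophantic_acts):
    """Return a unit direction and a midpoint threshold from labeled calibration data."""
    direction = sycophantic_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    direction = direction / np.linalg.norm(direction)
    neutral_mean = (neutral_acts @ direction).mean()
    sycophantic_mean = (sycophantic_acts @ direction).mean()
    return direction, float(0.5 * (neutral_mean + sycophantic_mean))


def flag_sycophancy(acts, direction, threshold):
    """Boolean flag per row: True where the projection exceeds the threshold."""
    return acts @ direction > threshold


def make_toy_activations(rng, n_samples, shift):
    """Gaussian noise plus a fixed shift; a stand-in for real model activations."""
    return rng.normal(scale=0.5, size=(n_samples, shift.size)) + shift


rng = np.random.default_rng(1)
hidden_size = 16  # hypothetical activation width
unit = rng.normal(size=hidden_size)
unit /= np.linalg.norm(unit)
no_shift = np.zeros(hidden_size)
feature_shift = 2.0 * unit  # hypothetical offset for sycophantic samples

# Calibrate on one labeled split, then evaluate on a held-out split.
direction, threshold = fit_detector(
    make_toy_activations(rng, 100, no_shift),
    make_toy_activations(rng, 100, feature_shift),
)
held_out = np.vstack([
    make_toy_activations(rng, 100, no_shift),
    make_toy_activations(rng, 100, feature_shift),
])
labels = np.array([False] * 100 + [True] * 100)
predictions = flag_sycophancy(held_out, direction, threshold)
accuracy = float((predictions == labels).mean())
print("held-out accuracy on toy data:", accuracy)
```

The accuracy on this toy data says nothing about real models. The point is only that once a direction and threshold exist, each prediction can be traced to a single scalar projection.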

Who benefits

AI Development, Content Moderation, Customer Service, Education, Research

Key takeaways

  • A new iterative data generation pipeline improves detection and control of LLM behaviors.
  • "Cascading linear features" enable better disentanglement of behavioral traits like sycophancy.
  • This method effectively reduces sycophancy, where models prioritize user validation.
  • It offers a more interpretable and computationally efficient alternative to existing control methods.

Original post by Maty Bohacek, Rishub Jain, Nicholas Dufour, Thomas Leung, Chris Bregler, Roma Patel

"arXiv:2606.26155v1 Announce Type: new Abstract: Interpreting and controlling model behaviors through activation steering methods requires many pairs of contrastive samples that clearly exhibit desired or undesired behavior. These data pairs determine the degree to which interpret…"
