Qwen3-Instruct SAEs Uncover Millions of Interpretable LLM Features

XinYang He, Wei Wang, Bing Zhao, Xuan Ren, WenBo Li, WeiXu Qiao, Hu Wei, Lin Qu · June 26, 2026


Summary

This work introduces Qwen3-Instruct SAE, a suite of Sparse Autoencoders (SAEs) trained on the Qwen3 instruction-tuned model family. The SAEs decompose language model representations into sparse, interpretable features, and a refusal case study shows that these features can be used to steer model behavior.

Qwen3-Instruct SAE is a released collection of SAEs built specifically for the Qwen3 family of instruction-tuned language models. An SAE takes the dense, "superposed" internal representations of a large language model and breaks them down into sparse features that are easier for humans to interpret. The authors trained layer-wise SAEs at three kinds of activation sites (residual streams, MLP outputs, and attention outputs) for different Qwen3 model sizes. A systematic evaluation found that the trade-off between sparsity and fidelity varies with the layer and the component being analyzed.

The authors also demonstrated a practical application through a case study on refusal steering. By manipulating specific SAE features, they causally influenced Qwen3 models to exhibit refusal behavior, which points to the potential for fine-grained control and understanding of LLM mechanisms. The release is intended as a resource for further research into sparse representations and behavioral interventions in LLMs.
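To make the decomposition idea concrete, here is a minimal sketch of an SAE forward pass, along with the two quantities behind the sparsity/fidelity trade-off: the number of active features (L0) and the reconstruction error. It assumes a plain ReLU autoencoder with tiny hand-picked weights. The summary does not state which SAE architecture or weights the release uses, so every name and number below is illustrative only.

```python
# Toy SAE forward pass in pure Python.
# ASSUMPTION: a plain ReLU autoencoder with hand-picked weights; this is not
# the architecture or the weights of the Qwen3-Instruct SAE release.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_encode(activation, w_enc, b_enc):
    """Map a dense activation to non-negative feature activations."""
    pre_activations = matvec(w_enc, activation)
    return [max(0.0, p + b) for p, b in zip(pre_activations, b_enc)]

def sae_decode(features, w_dec, b_dec):
    """Rebuild the activation as a weighted sum of feature directions."""
    out = list(b_dec)
    for value, direction in zip(features, w_dec):  # one direction per feature
        for i, d in enumerate(direction):
            out[i] += value * d
    return out

def l0(features):
    """Sparsity: how many features are active."""
    return sum(1 for f in features if f > 0.0)

def mse(a, b):
    """Fidelity: mean squared reconstruction error."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# A 2-dimensional activation expanded into 4 features (illustrative values).
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_DEC = [0.0, 0.0]

activation = [2.0, -3.0]
features = sae_encode(activation, W_ENC, B_ENC)
reconstruction = sae_decode(features, W_DEC, B_DEC)

print("features:", features)
print("active features (L0):", l0(features))
print("reconstruction error (MSE):", mse(activation, reconstruction))
```

In this toy case only two of the four features fire and the reconstruction is exact. A real SAE has many more features than the activation has dimensions, and trades some reconstruction error for fewer active features.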

Why it matters

This research provides tools and insights for better understanding and controlling the internal workings of large language models, which is critical for improving their reliability, safety, and interpretability in professional applications. It enables more precise interventions and debugging.

How to implement this in your domain

  1. Explore the Qwen3-Instruct SAE suite to analyze internal representations of Qwen3 models.
  2. Utilize the identified interpretable features to debug unexpected model behaviors or biases.
  3. Experiment with feature-level interventions to steer LLM outputs for specific use cases, such as enhancing safety or adherence to guidelines.
  4. Integrate SAEs into LLM development workflows to gain deeper insights into model decision-making processes.
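The feature-level intervention in step 3 can be sketched as follows. One common way to steer with an SAE feature is to add a scaled copy of that feature's decoder direction to the model's activation. The summary does not describe the exact intervention used in the refusal case study, so treat this as an assumption. The direction, encoder row, and strength below are hypothetical values, not taken from the release.

```python
# Toy feature steering in pure Python.
# ASSUMPTION: steering adds strength * decoder_direction to the activation.
# The "refusal" direction and all numbers here are hypothetical.

def feature_activation(activation, encoder_row, bias):
    """ReLU activation of a single SAE feature for a given model activation."""
    pre = sum(w * x for w, x in zip(encoder_row, activation)) + bias
    return max(0.0, pre)

def steer(activation, feature_direction, strength):
    """Push the activation along one feature's decoder direction."""
    return [a + strength * d for a, d in zip(activation, feature_direction)]

# Hypothetical unit-norm direction for a refusal-related feature.
REFUSAL_DIRECTION = [0.6, 0.8]
REFUSAL_ENCODER_ROW = [0.6, 0.8]
REFUSAL_BIAS = 0.0

activation = [1.0, -1.0]
before = feature_activation(activation, REFUSAL_ENCODER_ROW, REFUSAL_BIAS)

steered = steer(activation, REFUSAL_DIRECTION, strength=4.0)
after = feature_activation(steered, REFUSAL_ENCODER_ROW, REFUSAL_BIAS)

print("feature activation before steering:", before)
print("feature activation after steering:", after)
```

In a real workflow the steered activation would be written back into the model at the same layer and site the SAE was trained on, and the effect would be judged from the model's generated text rather than from the feature value alone.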

Who benefits

AI/ML Development, Natural Language Processing, AI Safety, Content Moderation

Key takeaways

  • Qwen3-Instruct SAEs provide interpretable features for Qwen3 language models.
  • SAEs can decompose complex LLM representations into sparse, understandable components.
  • Feature-level interventions can causally steer LLM behaviors like refusal.
  • This resource aids in studying sparse representations and behavioral control in LLMs.

Original post by XinYang He, Wei Wang, Bing Zhao, Xuan Ren, WenBo Li, WeiXu Qiao, Hu Wei, Lin Qu

"arXiv:2606.26620v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for decomposing superposed language model representations into sparse and interpretable features. However, training SAEs is computationally expensive, and available open-sou…"

