Rational Sparse Autoencoder Improves Mechanistic Interpretability.

Naiyu Yin, Yue Yu · June 16, 2026

Summary

This paper introduces the Rational Sparse Autoencoder (RSAE), which replaces the fixed encoder nonlinearity of a sparse autoencoder with a trainable rational function to improve the reconstruction-versus-sparsity trade-off. Across several open-weight language models, RSAE consistently outperforms baseline SAEs on reconstruction and downstream behavior metrics without compromising feature-level interpretability.

Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability in large language models, helping to show how models represent information. Existing SAE families, however, rely on a fixed encoder nonlinearity such as ReLU, JumpReLU, or TopK. Fixing the nonlinearity hard-codes one sparsity mechanism, which can distort the trade-off between reconstruction quality and sparsity.

The Rational Sparse Autoencoder (RSAE) addresses this limitation by replacing the fixed encoder activation with a trainable rational function, that is, a ratio of two polynomials with learnable coefficients. Rational functions can approximate the activation primitives used by current SAE families while offering a richer function class that adapts to the geometry of the pre-activations.

RSAE is built with a two-stage pipeline: an initialization step that copies the baseline SAE weights and calibrates the rational coefficients, followed by a fine-tuning phase that uses the standard sparsity-regularized reconstruction objective.

Empirically, RSAE consistently improved on baseline SAEs across various open-weight language models and activation families. It showed gains in both reconstruction metrics and downstream behavior metrics without compromising feature-level interpretability. These improvements held across different host models and sparsity levels, and the upgrade required minimal additional parameters and compute.
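To make the idea of a trainable rational encoder activation concrete, here is a minimal NumPy sketch. It is not taken from the paper: the "safe" denominator form Q(x) = 1 + |...| (borrowed from Padé-style activation units so the function has no poles), the sharing of one coefficient set across all latents, and every name in the code (`rational`, `rsae_forward`, `W_enc`, `p`, `q`, and so on) are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as poly


def rational(x, p, q):
    """Evaluate R(x) = P(x) / Q(x) elementwise.

    P(x) = p[0] + p[1]*x + ... + p[m]*x**m
    Q(x) = 1 + |q[0]*x + q[1]*x**2 + ...|   (always >= 1, so no poles)

    The denominator form is an assumption of this sketch, not the paper's.
    """
    x = np.asarray(x, dtype=float)
    # Prepend a zero constant term so that q[j] multiplies x**(j + 1).
    q_full = np.concatenate(([0.0], q))
    return poly.polyval(x, p) / (1.0 + np.abs(poly.polyval(x, q_full)))


def rsae_forward(x, W_enc, b_enc, W_dec, b_dec, p, q):
    """Encode with a rational activation, then decode linearly."""
    pre = x @ W_enc + b_enc      # encoder pre-activations
    z = rational(pre, p, q)      # learnable activation replaces ReLU/TopK
    x_hat = z @ W_dec + b_dec    # reconstruction
    return z, x_hat


# Toy shapes only: 4 activation vectors of width 8, 16 latents.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_enc = rng.normal(size=(8, 16)) * 0.1
W_dec = rng.normal(size=(16, 8)) * 0.1
p = np.array([0.0, 0.5, 0.5])    # hypothetical numerator coefficients
q = np.array([0.0, 0.1])         # hypothetical denominator coefficients
z, x_hat = rsae_forward(x, W_enc, np.zeros(16), W_dec, np.zeros(8), p, q)
```

In a real training setup `p` and `q` would be trainable parameters in an autodiff framework alongside the encoder and decoder weights. With only a handful of coefficients, the extra parameter count is small compared with the weight matrices, which fits the summary's note on minimal overhead.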

Why it matters

For AI researchers and engineers, RSAE offers a more powerful and flexible tool for mechanistic interpretability, enabling a deeper understanding of how large language models function and potentially leading to more robust and controllable AI systems.

How to implement this in your domain

  1. Explore integrating Rational Sparse Autoencoders into mechanistic interpretability workflows for LLMs.
  2. Replace fixed encoder activations in existing SAEs with trainable rational functions.
  3. Implement the two-stage initialization and fine-tuning pipeline for RSAE deployment.
  4. Benchmark RSAE performance against traditional SAEs on reconstruction and downstream tasks.
  5. Apply RSAE to analyze feature representations in various open-weight language models.
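The two-stage pipeline in the steps above might look roughly like the following NumPy sketch. The summary does not describe how the paper calibrates the rational coefficients, so the approach here is an assumption: the denominator is held fixed and only the numerator is fitted by linear least squares to match a baseline ReLU. The L1 form of the sparsity penalty and all names (`calibrate_numerator`, `sae_loss`, `lam`, `q0`) are likewise hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as poly


def rational(x, p, q):
    """R(x) = P(x) / (1 + |q[0]*x + q[1]*x**2 + ...|), elementwise."""
    x = np.asarray(x, dtype=float)
    den = 1.0 + np.abs(poly.polyval(x, np.concatenate(([0.0], q))))
    return poly.polyval(x, p) / den


def calibrate_numerator(pre_acts, target_fn, q, degree):
    """Stage 1 (sketch): fit p so that R(x) approximates target_fn(x).

    With q held fixed, R(x) is linear in p, so this reduces to an
    ordinary linear least-squares problem.
    """
    x = np.asarray(pre_acts, dtype=float).ravel()
    den = 1.0 + np.abs(poly.polyval(x, np.concatenate(([0.0], q))))
    # Column k of the design matrix is x**k / Q(x).
    A = poly.polyvander(x, degree) / den[:, None]
    p, *_ = np.linalg.lstsq(A, target_fn(x), rcond=None)
    return p


def sae_loss(x, x_hat, z, lam):
    """Stage 2 objective (sketch): reconstruction error plus L1 sparsity."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    return recon + lam * sparsity


def relu(x):
    return np.maximum(x, 0.0)


# Stage 1: the baseline SAE weights would be copied unchanged; here we
# only calibrate the activation against ReLU on a stand-in grid of
# pre-activation values.
grid = np.linspace(-3.0, 3.0, 601)
q0 = np.array([0.0, 1.0])            # Q(x) = 1 + x**2, an arbitrary choice
p0 = calibrate_numerator(grid, relu, q0, degree=5)
fit_mse = np.mean((rational(grid, p0, q0) - relu(grid)) ** 2)
print(f"calibration MSE vs. ReLU: {fit_mse:.2e}")
```

Stage 2 would then minimize `sae_loss` over the weights and the rational coefficients together, using an autodiff framework; that training loop is omitted here. For the benchmarking step, the same reconstruction term can be reported for the baseline SAE and the RSAE at matched sparsity.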

Who benefits

AI Research · Natural Language Processing · Machine Learning Engineering · Explainable AI

Key takeaways

  • RSAE uses trainable rational functions for encoder activations, improving SAE flexibility.
  • It consistently outperforms traditional SAEs in reconstruction and downstream metrics.
  • RSAE enhances mechanistic interpretability without sacrificing feature-level clarity.
  • The upgrade is computationally efficient, requiring minimal additional parameters.

Original post by Naiyu Yin, Yue Yu

"arXiv:2606.14990v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability, but current SAE families are constrained by fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK. This hard-codes a particular sparsity mechani…"
