Flexformer Introduces Flexible Linear Transformers with Learnable Attention Kernels.

Haoran Zhang, Feng Zhou · June 29, 2026

Summary

This paper proposes Flexformer, a linear Transformer that avoids the quadratic cost of standard attention by learning its attention kernels from data. It treats spectral frequencies as trainable parameters, which lets the model learn a wide family of attention kernels instead of relying on a fixed one.

Standard Transformers capture long-range dependencies well, but the cost of their attention mechanism grows quadratically with sequence length, which limits their use on very long sequences. Kernel-based linear attention reduces this cost, but it often relies on fixed or only weakly learnable kernels, which can limit the model's expressiveness and performance.

Flexformer addresses this by learning its attention kernels entirely from data: it treats spectral frequencies as trainable parameters, which lets the model learn a much broader family of attention kernels. The architecture comes in stationary and nonstationary variants, with the nonstationary variant being the more expressive of the two. According to the authors, experiments on language modeling and sequence classification show Flexformer consistently outperforming existing baselines. They also report that it can be distilled from pre-trained Transformers to recover softmax attention behavior, and that its learned kernels transfer well across domains, giving both efficiency and competitive performance on long-sequence tasks.
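To make the mechanism more concrete, here is a minimal sketch of kernel-based linear attention built from cosine and sine features of projected inputs. This is a standard random-Fourier-feature construction, not the paper's own code. The names `feature_map`, `freqs`, `quadratic_kernel_attention`, and `linear_kernel_attention` are illustrative assumptions; `freqs` stands in for the spectral frequencies that Flexformer would train, but here it is only randomly initialized, and normalization and causal masking are left out for brevity.

```python
import math
import random


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def feature_map(x, freqs):
    """Map x to [cos(w.x), sin(w.x)] over all frequencies w, scaled by 1/sqrt(m)."""
    scale = 1.0 / math.sqrt(len(freqs))
    proj = [dot(w, x) for w in freqs]
    return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]


def quadratic_kernel_attention(Q, K, V, freqs):
    """Reference version: scores every query against every key, O(n^2) in length."""
    phi_k = [feature_map(k, freqs) for k in K]
    d_v = len(V[0])
    out = []
    for q in Q:
        fq = feature_map(q, freqs)
        weights = [dot(fq, fk) for fk in phi_k]
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_v)])
    return out


def linear_kernel_attention(Q, K, V, freqs):
    """Same output, but keys and values are summarized once, O(n) in length."""
    d_v = len(V[0])
    n_feat = 2 * len(freqs)
    # summary[f][c] = sum over positions j of phi(k_j)[f] * v_j[c]
    summary = [[0.0] * d_v for _ in range(n_feat)]
    for k, v in zip(K, V):
        fk = feature_map(k, freqs)
        for f in range(n_feat):
            for c in range(d_v):
                summary[f][c] += fk[f] * v[c]
    out = []
    for q in Q:
        fq = feature_map(q, freqs)
        out.append([sum(fq[f] * summary[f][c] for f in range(n_feat))
                    for c in range(d_v)])
    return out


random.seed(0)
n, d, m = 6, 4, 8  # sequence length, head dimension, number of frequencies
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
# Hypothetical stand-in for trainable spectral frequencies (no training here).
freqs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]

slow = quadratic_kernel_attention(Q, K, V, freqs)
fast = linear_kernel_attention(Q, K, V, freqs)
max_gap = max(abs(a - b) for rs, rf in zip(slow, fast) for a, b in zip(rs, rf))
print("largest difference between the two versions:", max_gap)
```

Because the feature dot product equals the average of cos(w · (q - k)) over the frequencies, this construction yields a stationary kernel, one that depends only on q - k. Changing `freqs` changes the kernel, which is why making the frequencies trainable lets the kernel itself be learned. How the paper builds its nonstationary variant is not covered by this sketch.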

Why it matters

Practitioners working with long sequences in NLP or other domains could use Flexformer to build more efficient and scalable Transformer models while keeping performance competitive.

How to implement this in your domain

  1. Evaluate existing Transformer implementations for performance bottlenecks on long sequence data.
  2. Explore integrating Flexformer's architecture into new or existing model designs for improved efficiency.
  3. Experiment with distilling pre-trained Transformer knowledge into Flexformer for specific applications.
  4. Benchmark Flexformer's performance against current state-of-the-art linear Transformers on relevant tasks.
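As a starting point for the first step, a back-of-the-envelope cost model can indicate whether attention is likely to be the bottleneck at a given sequence length. The sketch below counts rough multiply-adds for one attention head. The function names, the dense-projection assumption for the feature map, and the example sizes are all illustrative assumptions, and the count ignores softmax exponentials, memory traffic, and hardware effects, so it is no substitute for a real benchmark.

```python
def softmax_attention_madds(n: int, d: int) -> int:
    """Rough multiply-add count for one head of standard attention."""
    scores = n * n * d  # Q K^T: every query against every key
    mixing = n * n * d  # attention weights times V
    return scores + mixing


def linear_attention_madds(n: int, d: int, r: int) -> int:
    """Rough multiply-add count for one head of kernel-based linear attention.

    r is the number of kernel features per query or key (an assumed parameter).
    """
    features = 2 * n * r * d  # project queries and keys to r features each
    summary = n * r * d       # accumulate the sum of phi(k_j) v_j^T over positions
    readout = n * r * d       # multiply each phi(q_i) by the summary
    return features + summary + readout


d, r = 64, 128  # hypothetical head dimension and feature count
for n in (256, 4096, 16384):
    ratio = softmax_attention_madds(n, d) / linear_attention_madds(n, d, r)
    print(f"n={n:>6}  softmax/linear cost ratio = {ratio:.1f}")
```

Under these assumptions the ratio simplifies to n / (2r), so linear attention only pays off once the sequence is longer than roughly twice the feature count, and the advantage then grows in proportion to sequence length.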

Who benefits

AI/Tech · Software Development · Natural Language Processing · Data Science

Key takeaways

  • Flexformer is a linear Transformer that learns attention kernels from data.
  • It addresses the quadratic complexity of traditional Transformers, improving scalability.
  • The model treats spectral frequencies as trainable parameters for enhanced expressiveness.
  • Flexformer outperforms baselines in language modeling and sequence classification.

Original post by Haoran Zhang, Feng Zhou

"arXiv:2606.27748v1 Announce Type: new Abstract: Transformer models rely on attention mechanism to capture long-range dependencies but suffer from quadratic complexity, limiting their scalability to long sequences. Kernel-based linear attention reduces this complexity but typicall…"

Originally posted by Haoran Zhang, Feng Zhou on X
