OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
Abstract
Orthogonal Sparse Autoencoders (OrtSAE) mitigate feature absorption and composition by enforcing orthogonality, leading to better feature discovery and improved performance on spurious correlation removal.
Sparse autoencoders (SAEs) are a technique for sparse decomposition of neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features, creating representation holes, and from feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach that mitigates these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and feature composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance with traditional SAEs on other downstream tasks.
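To make the mechanism concrete, the sketch below (not the authors' code) shows one way a pairwise cosine-similarity penalty on SAE decoder directions could be implemented. The block-sampling trick, the ReLU-squared form of the penalty, and the `ort_coeff` weighting are illustrative assumptions chosen so the cost grows roughly linearly with the dictionary size; the exact scheme used by OrtSAE may differ.

```python
# Minimal sketch of an orthogonality penalty on SAE decoder directions.
# Assumption: similarities are computed only within randomly sampled blocks of
# features, so the number of compared pairs grows linearly with the dictionary size.
import torch
import torch.nn.functional as F

def orthogonality_penalty(decoder_weight: torch.Tensor,
                          block_size: int = 1024) -> torch.Tensor:
    """decoder_weight: (d_model, n_features) matrix of feature directions."""
    # Unit-normalize each feature direction so dot products are cosine similarities.
    directions = F.normalize(decoder_weight, dim=0)           # (d_model, n_features)
    n_features = directions.shape[1]

    # Shuffle features and split them into blocks; only intra-block pairs are penalized.
    perm = torch.randperm(n_features, device=directions.device)
    penalty = directions.new_zeros(())
    n_blocks = 0
    for start in range(0, n_features, block_size):
        block = directions[:, perm[start:start + block_size]]        # (d_model, b)
        sims = block.T @ block                                        # (b, b) cosine sims
        sims = sims - torch.eye(sims.shape[0], device=sims.device)    # drop self-similarity
        # Penalize only high positive similarities (near-duplicate features).
        penalty = penalty + F.relu(sims).pow(2).mean()
        n_blocks += 1
    return penalty / max(n_blocks, 1)

# During training, this term would be added to the usual SAE objective, e.g.:
# loss = reconstruction_loss + l1_coeff * sparsity_loss + ort_coeff * orthogonality_penalty(W_dec)
```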
Community
Orthogonal SAE (OrtSAE) is a new method that improves sparse autoencoders by enforcing orthogonality between learned features. This prevents features from merging together or creating gaps in representation. Compared to traditional SAEs, OrtSAE finds 9% more distinct features, reduces feature absorption by 65% and composition by 15%, and improves spurious correlation removal by 6%—all with minimal computational overhead.
The following papers were recommended by the Semantic Scholar API:
- AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations (2025)
- Analysis of Variational Sparse Autoencoders (2025)
- AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features (2025)
- Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders (2025)
- Distribution-Aware Feature Selection for SAEs (2025)
- Attention Layers Add Into Low-Dimensional Residual Subspaces (2025)
- Binary Sparse Coding for Interpretability (2025)