arXiv:2503.24277

Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality

Published on Mar 31, 2025

Abstract

AI-generated summary: A theoretical result lets sparse autoencoders be trained without manually setting the number of nonzero activations, yielding a new evaluation method and a dynamic activation function.

Sparse autoencoders (SAEs) are widely used in mechanistic interpretability research for large language models; however, the state-of-the-art method of using k-sparse autoencoders lacks a theoretical grounding for selecting the hyperparameter k, the number of nonzero activations, often denoted by ℓ_0. In this paper, we reveal a theoretical link: the ℓ_2-norm of the sparse feature vector can be approximated by the ℓ_2-norm of the dense vector with a closed-form error, which allows sparse autoencoders to be trained without the need to manually determine ℓ_0. Specifically, we validate two applications of our theoretical findings. First, we introduce a new methodology that assesses the feature activations of pre-trained SAEs by computing the theoretically expected value from the input embedding, a quantity overlooked by existing SAE evaluation methods and loss functions. Second, we introduce a novel activation function, top-AFA, built on our formulation of approximate feature activation (AFA). It enables top-k style activation without a constant hyperparameter k to tune, dynamically determining the number of activated features for each input. By training SAEs on three intermediate layers to reconstruct GPT-2 hidden embeddings for over 80 million tokens from the OpenWebText dataset, we demonstrate the empirical merits of this approach and compare it with current state-of-the-art k-sparse autoencoders. Our code is available at: https://github.com/SewoongLee/top-afa-sae.
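To make the top-AFA idea concrete, below is a minimal PyTorch sketch of one plausible reading of the mechanism the abstract describes: per input, keep the smallest set of largest encoder pre-activations whose ℓ_2 norm reaches that of the dense embedding being reconstructed. The function name top_afa_sketch and the prefix-norm stopping rule are illustrative assumptions, not the paper's exact closed-form AFA criterion, which is implemented in the linked repository.

import torch

def top_afa_sketch(pre_acts: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Hypothetical top-AFA-style activation (illustrative only): per input,
    # keep the fewest largest pre-activations whose l2 norm reaches the
    # l2 norm of the dense embedding x.
    # pre_acts: (batch, n_features) encoder pre-activations (e.g. after ReLU).
    # x:        (batch, d_model)   dense hidden embeddings being reconstructed.
    target = x.norm(dim=-1, keepdim=True)                   # (batch, 1)
    vals, idx = pre_acts.abs().sort(dim=-1, descending=True)
    prefix = vals.pow(2).cumsum(dim=-1).sqrt()              # l2 norm of top-j prefix
    # Smallest j whose prefix norm reaches the target (assumed >= stopping rule).
    k = (prefix < target).sum(dim=-1, keepdim=True) + 1
    k = k.clamp(max=pre_acts.shape[-1])                     # per-input dynamic l0
    keep = torch.arange(pre_acts.shape[-1], device=pre_acts.device) < k
    mask = torch.zeros_like(pre_acts).scatter(-1, idx, keep.to(pre_acts.dtype))
    return pre_acts * mask

The same comparison between the realized activation norm and the norm expected from the input embedding also suggests how the paper's AFA-based evaluation of pre-trained SAEs could be scored, though the precise metric is defined in the paper.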
