Cross-Layer Transcoders and Sparse Autoencoders

The problem with neural networks is that it's difficult to understand how they reach their predictions. Take a transformer model for example, which consists of a stack of transformer layers, each with multiple attention heads and an MLP. Given an input, one could investigate which neurons of an MLP are active and how they contribute to the final prediction, but neuron activations themselves are usually not easily interpretable.

If you think about it, the "number of things that exist" is way larger than the number of neurons in a neural network. So it's hard to imagine how there could be a one-to-one mapping in which one neuron only activates for a single concept. It's more likely that a given neuron activates to different degrees in response to different things/concepts, and that different combinations of neurons activate simultaneously when the model is confronted with specific concepts.

The fact that neurons can activate for a variety of unrelated concepts is what makes them polysemantic and inherently difficult to interpret. One hypothesis for why polysemanticity arises is superposition: neural networks try to cram as much information as possible into their activations by overlaying many features within the same dimensions.

How then can we make sense of the activations? One way is to view activations as a dense, low-dimensional representation of a higher-dimensional, sparse feature space. The objective of interpretability is then to reconstruct the sparse features from the activations, in the hope of finding out which sparse features are active when a certain concept is present.

For this to be useful, we would want the sparse features to have high specificity. That is, we don't want a feature to be activated by lots of unrelated concepts; we want it to reliably activate when a certain concept is present. We also want features to have a causal effect on the prediction. This would give us confidence that we didn't just train a nice auxiliary model, but that we actually learned something relevant to the inner workings of the original model we are trying to interpret.

If this was the case, we would be able to run interventions: we could for example downweight a certain feature (e.g. a feature related to racism) and reduce the likelihood that the model makes a certain prediction (e.g. a racist joke).

Turns out, this is possible! We can find sparse features, explain LLM predictions and steer model behaviour to some extent.

An approach that is receiving a lot of research attention is the use of Cross-Layer Transcoders (CLTs), which aim to decompose the MLP within a transformer layer into sparse features. This is an evolution of Sparse Autoencoders (SAEs), which tried to decompose just the MLP activations (i.e. just the outputs of MLPs) into sparse features.

In this blog post we're gonna take a look at both approaches and how they differ. CLTs are probably more promising for reasons that will become clear later, but it's interesting to see how we got from SAEs to CLTs.

If you're interested in what can be done to interpret attention heads, check out my blog post on induction circuits.

Sparse Autoencoders (SAEs)

SAEs try to reconstruct the output/activations of an MLP using a sparse set of features. This is done in 2 steps:[3][6]

  • a linear encoder $\textbf{f}$ maps the $n$-dimensional MLP activations $\textbf{x}$ into an $m$-dimensional space, where $m \ge n$, and we then apply a non-linear activation function (ReLU):
$$\textbf{f}(\textbf{x}) = \text{ReLU}(\textbf{W}_{e}\textbf{x} + \textbf{b}_{e})$$
  • a linear decoder projects $\textbf{f}$ back into activation space:
$$\hat{\textbf{x}} = \textbf{W}_{d}\textbf{f} + \textbf{b}_{d}$$

Here, $\textbf{W}_{e} \in \mathbb{R}^{m \times n}$ and $\textbf{W}_{d} \in \mathbb{R}^{n \times m}$ are the encoder and decoder weights, and $\textbf{b}_{e} \in \mathbb{R}^{m}$ and $\textbf{b}_{d} \in \mathbb{R}^{n}$ are the encoder and decoder biases.
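
To make the two steps concrete, here is a minimal PyTorch sketch of such an encoder/decoder pair. The class and variable names are mine, not from the cited papers, and real implementations add details like careful weight initialization and normalization:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: encode n-dim MLP activations into m >= n sparse features."""

    def __init__(self, n: int, m: int):
        super().__init__()
        self.W_e = nn.Parameter(torch.randn(m, n) * 0.01)  # encoder weights, R^{m x n}
        self.b_e = nn.Parameter(torch.zeros(m))            # encoder bias, R^m
        self.W_d = nn.Parameter(torch.randn(n, m) * 0.01)  # decoder weights, R^{n x m}
        self.b_d = nn.Parameter(torch.zeros(n))            # decoder bias, R^n

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # f(x) = ReLU(W_e x + b_e); x: (batch, n) -> (batch, m)
        return torch.relu(x @ self.W_e.T + self.b_e)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # x_hat = W_d f + b_d; f: (batch, m) -> (batch, n)
        return f @ self.W_d.T + self.b_d

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        return self.decode(f), f
```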

Graphically, this is what it looks like:

[Figure: SAEs decompose MLP activations into sparse features (Bricken et al., 2023)]

The sparsity of feature activations is encouraged via the loss function:

$$\mathcal{L} = \frac{1}{|X|} \sum_{\textbf{x} \in \textbf{X}} \left\| \textbf{x} - \hat{\textbf{x}} \right\|_{2}^{2} + \lambda \sum_{i} |\textbf{f}_{i}| \left\| \textbf{W}_{d,i} \right\|_{2}$$

where $\textbf{X}$ is the set of all MLP activations, $\hat{\textbf{x}}$ is the reconstructed activation, and $\lambda$ is a regularization parameter. Specifically, the MSE term $\frac{1}{|X|} \sum_{\textbf{x} \in \textbf{X}} \left\| \textbf{x} - \hat{\textbf{x}} \right\|_{2}^{2}$ makes the model reconstruct the activations as closely as possible, while the sparsity term $\lambda \sum_{i} |\textbf{f}_{i}| \left\| \textbf{W}_{d,i} \right\|_{2}$ encourages the model to use only a few features to reconstruct them.
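
A hedged sketch of this loss, reusing the hypothetical names from the SAE sketch above. The value of $\lambda$ (`lam`) is an arbitrary example, and I average the sparsity term over the batch, which is one common convention:

```python
import torch

def sae_loss(x, x_hat, f, W_d, lam=5.0):
    """Sketch of the SAE loss above; lam is an arbitrary example value.

    x, x_hat: (batch, n) original and reconstructed MLP activations
    f:        (batch, m) feature activations
    W_d:      (n, m) decoder weights; column i is the decoder vector of feature i
    """
    mse = ((x - x_hat) ** 2).sum(dim=-1).mean()   # (1/|X|) sum ||x - x_hat||_2^2
    decoder_norms = W_d.norm(dim=0)               # ||W_d,i||_2 for each feature i
    sparsity = (f.abs() * decoder_norms).sum(dim=-1).mean()
    return mse + lam * sparsity
```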

The exact formulation has gone through iterations. Earlier versions, for example, just used the L1 norm of $\textbf{f}$ as the sparsity regularization term in the loss function, with no term involving the decoder weights $\textbf{W}_{d,i}$ [2].

Cross-Layer Transcoders (CLTs)

Transcoders try to reconstruct the actual computation of an MLP rather than just its activations. Since they operate on both the input and the output of the MLP, they're also sometimes called input-output SAEs.

A transcoder takes as input the pre-MLP activations (i.e. the residual stream) and tries to reconstruct the post-MLP activations as a sparse linear combination of feature vectors. If a transcoder also provides outputs to later layers, as displayed below, it's called a cross-layer transcoder (CLT).

[Figure: A transcoder reconstructs the actual computation of an MLP (Ameisen et al., 2025)]

The CLT also has 2 steps:[1]

  • a linear encoder maps the $n$-dimensional pre-MLP activations $\textbf{x}^{l}$ at layer $l$ into an $m$-dimensional space, where $m \ge n$, and we then apply a non-linear activation function. The result is called the feature activations $\textbf{a}^{l}$ at layer $l$:
$$\textbf{a}^{l} = \text{JumpReLU}(\textbf{W}_{e}\textbf{x}^{l})$$

Here, $\text{JumpReLU}$ [5] is a variant of the $\text{ReLU}$ activation function that introduces a trainable parameter $\theta$ representing a minimum threshold below which the output is zero.

  • a linear decoder projects the feature activations back into post-MLP activation space; the reconstruction at layer $l$ takes into account the feature activations $\textbf{a}^{l'}$ of all layers $l' \le l$:
$$\hat{\textbf{y}}^{l} = \sum_{l'=1}^{l} \textbf{W}_{d}^{l' \rightarrow l} \textbf{a}^{l'}$$

where $\textbf{W}_{d}^{l' \rightarrow l} \in \mathbb{R}^{n \times m}$ is the decoder weight matrix from layer $l'$ to layer $l$ and $\textbf{a}^{l'}$ are the feature activations at layer $l'$.
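
Putting the per-layer encoder and the cross-layer decoder together, a minimal PyTorch sketch might look as follows. All names are illustrative (not from [1]), and the straight-through gradient machinery needed to actually train the JumpReLU thresholds is omitted:

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """CLT sketch: layer l encodes its pre-MLP activations x^l, and the
    reconstruction of layer l's MLP output sums decoder contributions from
    the features of all layers l' <= l."""

    def __init__(self, n: int, m: int, num_layers: int):
        super().__init__()
        self.num_layers = num_layers
        self.W_e = nn.ParameterList(
            [nn.Parameter(torch.randn(m, n) * 0.01) for _ in range(num_layers)]
        )
        # one decoder matrix per (source layer l', target layer l >= l') pair
        self.W_d = nn.ParameterDict({
            f"{src}_{tgt}": nn.Parameter(torch.randn(n, m) * 0.01)
            for src in range(num_layers)
            for tgt in range(src, num_layers)
        })
        self.theta = nn.Parameter(torch.zeros(num_layers, m))  # JumpReLU thresholds

    def jumprelu(self, z: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
        # pass values above the threshold, zero out the rest; the real JumpReLU [5]
        # trains theta with straight-through estimators, which are omitted here
        return z * (z > theta)

    def forward(self, x: list[torch.Tensor]):
        # x[l]: (batch, n) pre-MLP activations at layer l
        a = [self.jumprelu(x[l] @ self.W_e[l].T, self.theta[l])
             for l in range(self.num_layers)]
        # y_hat^l = sum over l' <= l of W_d^{l'->l} a^{l'}
        y_hat = [
            sum(a[src] @ self.W_d[f"{src}_{tgt}"].T for src in range(tgt + 1))
            for tgt in range(self.num_layers)
        ]
        return y_hat, a
```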

The loss function again has an MSE term between the reconstructed and the actual post-MLP activations $\hat{\textbf{y}}^{l}$ and $\textbf{y}^{l}$, plus a sparsity penalty with two hyperparameters $\lambda$ and $c$:

$$\mathcal{L} = \sum_{l=1}^{L} \left\| \hat{\textbf{y}}^{l} - \textbf{y}^{l} \right\|^{2} + \lambda \sum_{l=1}^{L} \sum_{i=1}^{N} \tanh\left(c \cdot \left\| \textbf{W}_{d,i}^{l} \right\| \cdot a^{l}_{i}\right)$$
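
And a corresponding loss sketch, continuing with the hypothetical names from the CLT sketch above. I treat $\left\| \textbf{W}_{d,i}^{l} \right\|$ as the norm of feature $i$'s decoder vectors concatenated across all target layers (an assumption on my part), and `lam` and `c` are arbitrary example values:

```python
import torch

def clt_loss(y_hat, y, a, W_d, lam=1.0, c=0.1):
    """Sketch of the CLT loss above; lam and c are arbitrary example values.

    y_hat, y: lists of (batch, n) reconstructed / actual post-MLP activations per layer
    a:        list of (batch, m) feature activations per layer
    W_d:      dict of (n, m) decoder matrices keyed by "src_tgt", as in the sketch above
    """
    num_layers = len(y)
    mse = sum(((y_hat[l] - y[l]) ** 2).sum(dim=-1).mean() for l in range(num_layers))

    sparsity = 0.0
    for l in range(num_layers):
        # per-feature norm of layer-l features' decoder vectors, concatenated across
        # all target layers -- a stand-in for ||W_d,i^l|| in the formula above
        W_cat = torch.cat([W_d[f"{l}_{tgt}"] for tgt in range(l, num_layers)], dim=0)
        decoder_norms = W_cat.norm(dim=0)
        sparsity = sparsity + torch.tanh(c * decoder_norms * a[l]).sum(dim=-1).mean()

    return mse + lam * sparsity
```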

Comparison

As alluded to above, the main difference between SAEs and CLTs is that SAEs operate at a single point in the model, the post-MLP activations, while CLTs operate at two points, the pre-MLP and post-MLP activations. This means that SAEs are able to decompose the MLP outputs into sparse feature vectors, but can't really tell us much about the actual computation of the MLP. We can understand which features are important on a given input, but not why.

CLTs, on the other hand, are a sparse approximation of the MLP computation itself. The computation is straightforward to interpret: calculate feature activations and then use them as coefficients in a weighted sum of decoder vectors.[4] Once we have this for every MLP layer in the model, we can construct a sparse computational graph of the entire model. Say, for example, we identified a feature of interest in some middle layer of the model. We can then use that feature's encoder vector to find out which features in the previous layers are important for it, and repeat this process all the way back to the input tokens.

This is something that can't be done with SAEs, because they only explain the MLP outputs, not the computation itself. Once you identify a feature of interest, you can't easily go back to earlier layers and find out which features feed into it, because you don't have an explanation for the non-linear computation inside the MLP.
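
To give a flavour of this "walking backwards" idea, here is a very rough sketch of scoring how much each feature at an earlier layer contributes, through the residual stream, to a chosen feature at a later layer. It reuses the hypothetical names from the CLT sketch above, ignores attention and the other residual-stream contributions, and is not the full attribution-graph method of [1]:

```python
import torch

def direct_feature_attribution(a, W_d, W_e, src_layer, tgt_layer, tgt_feature, k=10):
    """Rough direct-path attribution sketch (illustrative only); assumes src_layer < tgt_layer.

    a:   list of (m,) feature activation vectors for a single token position
    W_d: dict of (n, m) decoder matrices keyed by "src_tgt"
    W_e: list of (m, n) encoder matrices per layer
    """
    # read direction of the target feature in layer tgt_layer's input space
    enc_vec = W_e[tgt_layer][tgt_feature]                                       # (n,)
    # what src_layer's features write into the MLP outputs before tgt_layer,
    # which flows into the residual stream that layer tgt_layer reads from
    writes = sum(W_d[f"{src_layer}_{t}"] for t in range(src_layer, tgt_layer))  # (n, m)
    # activation-weighted alignment of each source feature's write direction
    # with the target feature's read direction
    scores = a[src_layer] * (enc_vec @ writes)                                  # (m,)
    return torch.topk(scores, k)  # top contributing source features
```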

So in terms of building interpretable computational graphs, CLTs are more useful than SAEs. What about performance? Two standard evaluation metrics (sketched in code after the list) are:

  • interpretability: what's the mean number of features that are active for any given input?
  • fidelity: what's the increase in cross-entropy loss relative to the original model when replacing a layer with an SAE/CLT?
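
Both metrics can be computed along these lines, assuming you have already collected feature activations plus the logits of the original and the patched model (the hooking code that splices the SAE/CLT into a transformer is omitted):

```python
import torch
import torch.nn.functional as F

def mean_active_features(a: torch.Tensor) -> torch.Tensor:
    """Interpretability proxy: mean number of non-zero feature activations per token.
    a: (num_tokens, m) feature activations."""
    return (a != 0).float().sum(dim=-1).mean()

def ce_loss_increase(orig_logits, patched_logits, targets) -> torch.Tensor:
    """Fidelity: increase in next-token cross-entropy when a layer is replaced by
    its SAE/CLT reconstruction. Logits: (num_tokens, vocab); targets: token ids."""
    return F.cross_entropy(patched_logits, targets) - F.cross_entropy(orig_logits, targets)
```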

As it turns out, CLTs are at least on par with SAEs, if not better.[4]

[Figure: CLTs achieve a similar Pareto frontier to SAEs [1]]

Conclusion

To sum up the main points of this blog post:

  • We train SAEs or CLTs to obtain sparse reconstructions of MLP activations inside a transformer model
  • SAEs produce sparse representations of the MLP outputs, while CLTs produce sparse representations of the MLP computation
  • Because of this, CLTs allow us to build interpretable computational graphs of the model

That's all for now. I'll be back with more on how exactly you can use CLTs for interpretability soon!

References

[1] Ameisen, et al., Circuit Tracing: Revealing Computational Graphs in Language Models, Transformer Circuits, 2025. [paper]

[2] Bricken, et al., Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Transformer Circuits Thread, 2023. [paper]

[3] Conerly, et al., Circuits Updates - April 2024, Transformer Circuits Thread, 2024. [paper]

[4] Dunefsky, et al., Transcoders enable fine-grained interpretable circuit analysis for language models, LessWrong, 2024. [post]

[5] Rajamanoharan, et al., Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders, 2024. [paper]

[6] Templeton, et al., Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Transformer Circuits Thread, 2024. [paper]