Induction Circuits - LLMs are more than next-token predictors

Jonas Vetterle (@jvetterle)
Present-day LLMs are auto-regressive language models using the Transformer architecture and are trained on a next-token prediction task. As such, they are often thought of as purely "next-token predictors", learning the statistical distribution of token sequences that occur in the data they're trained on.
However, this is selling them short: transformers can also be shown to learn mini algorithms that "run" at inference time. One of these algorithms is induction which works via so-called induction circuits that form as a combination of 2 attention heads with specific attention patterns.
The algorithm is fairly simple and is a sort of pattern matching. It ensures that if there is a sequence of two tokens A B in the context window, and the LLM later encounters another A, the probability that the next token will be B is very high. That is, given a sequence A B ... A, the model induces that the next token will likely be B.
The fascinating thing is that this doesn't rely on statistical correlations between the tokens in the training data. If the pattern was Harry Potter ... Harry, it would be unsurprising if the model predicted Potter as the next token, because it could easily have learned that Harry and Potter often go together in the training corpus.
Instead, this also works with completely random tokens. So if we gave the model a random pattern like symphony dioxide ... symphony, it would still predict dioxide with high probability.
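If you want to see this for yourself, here is a minimal sketch of such an experiment, assuming you have PyTorch and the transformer_lens library installed. It loads gpt2-small (which is known to contain induction heads), feeds it a repeated sequence of random tokens, and measures how often the greedy prediction in the second half correctly anticipates the repetition:

```python
import torch
from transformer_lens import HookedTransformer

# gpt2-small is known to contain induction heads.
model = HookedTransformer.from_pretrained("gpt2-small")

# 50 random tokens, repeated once, with a BOS token in front.
torch.manual_seed(0)
rand_tokens = torch.randint(0, model.cfg.d_vocab, (1, 50))
bos = torch.tensor([[model.tokenizer.bos_token_id]])
tokens = torch.cat([bos, rand_tokens, rand_tokens], dim=1).to(model.cfg.device)  # shape [1, 101]

with torch.no_grad():
    logits = model(tokens)  # shape [1, 101, d_vocab]

# In the second half, the "correct" next token is simply the token 50 positions back.
# Positions 51..99 predict tokens 52..100, which are copies of tokens 2..50.
preds = logits[0, 51:100].argmax(dim=-1)
targets = tokens[0, 52:101]
accuracy = (preds == targets).float().mean()
print(f"Induction accuracy on repeated random tokens: {accuracy.item():.2f}")
```

For a model without induction heads (for example a single-layer transformer) you would expect this number to be much lower.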
Induction circuits may implement just a simple algorithm, but they are really important for in-context learning. Apart from induction circuits, there is also evidence of other algorithms in the model weights. And given how young the field of Mechanistic Interpretability is, one may wonder what other algorithms will be found in the future.
There is already a lot of material about induction circuits, including this LessWrong post and colab by Callum McDougall and this paper by Elhage et al. All of these were super helpful for me in building my own understanding, and this blog post builds heavily on them. I'm a firm believer that when trying to understand complex subject matter, having it explained to you by different people and from different angles can be tremendously helpful. So in that spirit, I hope this post can serve as a starting point for you!
Recap of Transformer basics
Before getting into the details it's worth recalling a few things about transformers:
The input to a transformer is a piece of text, which is turned into a sequence of tokens by a tokenizer and then embedded into a sequence of token embedding vectors (these have a certain dimension $d_{model}$), to which positional embeddings are added.
A transformer consists of a number of layers which apply their own "transformations" to the output of the previous layer. These transformations are additive - so any transformations that happen to an input vector are just added to it.
As all the transformations are additive, you can imagine the original embedding that enters the first layer is flowing through the transformer stack and things are just getting added to it in each layer. The vectors that get handed as output from one layer to the next are referred to as the residual stream.
This residual stream can be thought of as a sort of memory: each layer reads from it and writes back to it.
Each token is embedded and processed by the transformer separately, so each token has its own residual stream. However, attention heads can read from the residual stream of one token and write to that of another.
The dimensionality of the residual stream is much higher than the dimensionality that each attention head operates in. Each attention head can read/write from/to a subspace of the residual stream.
Each layer of a transformer has the following structure:
- Layer normalization followed by a number of attention heads whose outputs are added to the residual stream
- Another layer normalization followed by an MLP, whose output is also added to the residual stream
The output of the very last layer enters an unembedding layer, which projects the output into a vector of dimension $n_{vocab}$, where $n_{vocab}$ is the number of tokens in the vocabulary. A softmax can be applied to obtain a probability for each token.
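To make the shapes and the additive residual stream concrete, here is a minimal numpy sketch of this data flow, under the simplifying assumptions used later in this post (attention-only, no layer norm, no biases) and with random, untrained weights - purely illustrative, not a faithful implementation of any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model, d_head, n_seq, n_layers = 100, 32, 8, 10, 2

# Random, untrained weights, just to illustrate shapes and data flow.
W_E = rng.normal(size=(d_model, n_vocab))   # embedding matrix
W_U = rng.normal(size=(n_vocab, d_model))   # unembedding matrix
heads = [dict(W_Q=rng.normal(size=(d_head, d_model)),
              W_K=rng.normal(size=(d_head, d_model)),
              W_V=rng.normal(size=(d_head, d_model)),
              W_O=rng.normal(size=(d_model, d_head))) for _ in range(n_layers)]

def causal_softmax(scores):
    # Each query position may only attend to itself and earlier positions.
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

tokens = rng.integers(0, n_vocab, size=n_seq)
t = np.eye(n_vocab)[:, tokens]            # one-hot inputs, [n_vocab, n_seq]
x = W_E @ t                               # residual stream, [d_model, n_seq]

for h in heads:                           # one attention-only layer per head here
    A = causal_softmax((h["W_Q"] @ x).T @ (h["W_K"] @ x))   # attention pattern, [n_seq, n_seq]
    x = x + h["W_O"] @ (h["W_V"] @ x) @ A.T                 # additive write into the residual stream

logits = W_U @ x                          # [n_vocab, n_seq], a softmax would give probabilities
print(logits.shape)
```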
Below is a simplified visualisation of this (just one layer displayed in detail, minus the layer normalisations, positional embeddings and softmax).

High level overview of Induction Circuits
Induction circuits form as a combination of 2 attention heads in 2 different layers that have particular attention patterns (which means that single-layer transformers can't have induction circuits). The first one is called the previous token head and the one that follows is the induction head.
Sticking with the terminology established so far, the previous token head looks for information about which token preceded the current token, and writes this information into the residual stream of the current token. So in the example of the A B ... A pattern, after the previous token head is done processing the B token, the residual stream of the B token will contain the fact that it followed A.
The induction head then uses the output of the previous token head as its key input. By the time the induction head is processing the second A token in the A B ... A sequence, it will attend to tokens whose keys contain the fact that they followed the A token. It will then copy the information that the B token should likely follow into the residual stream, thereby increasing the logit of the B token.
The Maths of Induction Circuits
Many papers and tutorials focus on toy transformers that are much simpler than the LLMs we all love and use every day. Common simplifications are the use of attention-only transformers (without any MLPs), the removal of bias terms (keeping just the weight matrices), and the removal of layer normalizations. This is the approach taken by Elhage et al.[1] and Callum McDougall's awesome colab, and we'll adopt the same approach here.
With these simplifications in place, a single attention head consists of just 4 weight matrices: $W_Q$, $W_K$ and $W_V$, which are all [d_head, d_model] dimensional, and $W_O$, which is [d_model, d_head] dimensional. We'll see in a bit that, because we made these simplifications, the attention head is almost entirely linear; the only non-linearity left is the softmax operation in the attention pattern calculation.
We denote by $x$ the [d_model, n_seq] matrix representing the input to any transformer layer (where n_seq is the number of tokens in the input), i.e. the residual stream. In the case of the first transformer layer it's just the raw token embeddings $x = W_E t$, where $t$ is the one-hot encoded input sequence ([n_vocab, n_seq]) and $W_E$ ([d_model, n_vocab]) is the embedding matrix of the model.
Decomposing the attention mechanism
The attention mechanism relies on an [n_seq, n_seq] matrix $A$. Every row of $A$ contains the attention scores from the query token to each key token that precedes it ($A$ is made lower triangular using a mask).

$A$ is constructed by computing, for each input $x_i$, the query and key vectors $q_i = W_Q x_i$ and $k_i = W_K x_i$. The dot product $q_i^T k_j$ yields the attention score of query $i$ on key $j$. Doing this for all combinations at once, masking, and applying a row-wise softmax gives

$$A = \mathrm{softmax}^*\left( (W_Q x)^T (W_K x) \right) = \mathrm{softmax}^*\left( x^T W_Q^T W_K\, x \right),$$

where $\mathrm{softmax}^*$ denotes the masked, row-wise softmax.
Given $A$, the attention mechanism then works as follows:
- Obtain value vectors $v_i = W_V x_i$ for each input token
- Create a linear combination of value vectors using the attention pattern: $z_i = \sum_j A_{ij} v_j$
- Create the output vector of this attention head for each token: $r_i = W_O z_i$
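As a sanity check, here is a small numpy sketch (random weights, made-up dimensions) that follows these three steps literally, one token at a time, and then confirms they match the usual vectorised formulation:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, n_seq = 16, 4, 6

W_Q, W_K, W_V = (rng.normal(size=(d_head, d_model)) for _ in range(3))
W_O = rng.normal(size=(d_model, d_head))
x = rng.normal(size=(d_model, n_seq))            # residual stream input

# Attention pattern A: masked, row-wise softmax of the query-key dot products.
scores = (W_Q @ x).T @ (W_K @ x)                 # [n_seq, n_seq]
scores = np.where(np.tril(np.ones((n_seq, n_seq), dtype=bool)), scores, -np.inf)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)

# Step 1: value vector v_i = W_V x_i for each token.
v = [W_V @ x[:, i] for i in range(n_seq)]
# Step 2: linear combination z_i = sum_j A_ij v_j.
z = [sum(A[i, j] * v[j] for j in range(n_seq)) for i in range(n_seq)]
# Step 3: output r_i = W_O z_i.
r = np.stack([W_O @ z[i] for i in range(n_seq)], axis=1)   # [d_model, n_seq]

# Same computation in one vectorised expression.
assert np.allclose(r, W_O @ (W_V @ x) @ A.T)
```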
Now we're going to show how to decompose these operations into linear maps using tensor (Kronecker) products, like in Elhage et al.[1], but with some extra steps which should be useful for those not familiar with tensor products.

The three steps above in tensor math look as follows:

$$v = (I_{n_{seq}} \otimes W_V)\, x$$
$$z = (A \otimes I_{d_{head}})\, v$$
$$r = (I_{n_{seq}} \otimes W_O)\, z$$
A couple of points to note here:
- $x$ is [d_model, n_seq], but we can equally flatten it out into a d_model ⋅ n_seq dimensional vector, with the embedding of token $i$ occupying the $i$-th block.
- $I_{n_{seq}} \otimes W_V$ is a [d_head ⋅ n_seq, d_model ⋅ n_seq] dimensional block-diagonal matrix with $W_V$ repeated along its diagonal. So it basically performs $v_i = W_V x_i$ on each input and stacks the resulting value vectors in a d_head ⋅ n_seq dimensional vector, let's call it $v$.
- $A \otimes I_{d_{head}}$ is a [d_head ⋅ n_seq, d_head ⋅ n_seq] dimensional matrix whose $(i, j)$-th block is $A_{ij} I_{d_{head}}$. So it performs $z_i = \sum_j A_{ij} v_j$ and stacks the result vectors in a d_head ⋅ n_seq dimensional vector, let's call it $z$.
- Finally, $I_{n_{seq}} \otimes W_O$ is a [d_model ⋅ n_seq, d_head ⋅ n_seq] dimensional block-diagonal matrix with $W_O$ repeated along its diagonal. So it computes $r_i = W_O z_i$ for every result vector and stores the results in a d_model ⋅ n_seq dimensional vector, which can just be reshaped into a [d_model, n_seq] matrix.

The trick to understanding this derivation is to realise that you can always flatten a matrix into a vector and reshape it back into a matrix again.
We can rewrite the above as:

$$r = (I_{n_{seq}} \otimes W_O)(A \otimes I_{d_{head}})(I_{n_{seq}} \otimes W_V)\, x = (A \otimes W_O W_V)\, x,$$

which reveals the linear nature of the attention mechanism of this toy transformer: apart from the softmax operation to compute $A$, we're just applying the transformations $W_Q^T W_K$ (inside the softmax) and $W_O W_V$. Because these matrices always appear together, we denote them as $W_{QK} := W_Q^T W_K$ and $W_{OV} := W_O W_V$ respectively.
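If Kronecker products are new to you, it can help to verify this identity numerically. The following sketch (random weights, small made-up dimensions) checks that the three tensor-product steps, their collapsed form $(A \otimes W_O W_V)$, and the ordinary per-token computation all agree:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_head, n_seq = 8, 3, 5

W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))
A = rng.random((n_seq, n_seq))
A /= A.sum(axis=-1, keepdims=True)                 # stand-in attention pattern
x = rng.normal(size=(d_model, n_seq))

# Flatten x so that token i's embedding occupies the i-th block of the vector.
x_vec = x.T.reshape(-1)                            # [d_model * n_seq]

# The three tensor-product steps.
v_vec = np.kron(np.eye(n_seq), W_V) @ x_vec        # value vectors, stacked
z_vec = np.kron(A, np.eye(d_head)) @ v_vec         # attention-weighted sums
r_vec = np.kron(np.eye(n_seq), W_O) @ z_vec        # per-token outputs, stacked

# Collapsing the three maps into a single one: (A ⊗ W_O W_V).
r_vec_combined = np.kron(A, W_O @ W_V) @ x_vec

# The ordinary matrix version of the same computation.
r_direct = W_O @ (W_V @ x) @ A.T                   # [d_model, n_seq]

assert np.allclose(r_vec, r_vec_combined)
assert np.allclose(r_vec, r_direct.T.reshape(-1))
print("tensor-product decomposition checks out")
```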
Of course we're working with just a toy model where we deliberately removed most of the non-linearities. However, this way we can more easily reason about and demonstrate induction circuits.
QK and OV circuits
Remember that with our toy transformer, we don't have any MLPs or layer normalisations. So if we were to look at a single-layer model, all we would do is embed an input sequence, apply the attention head(s) as per the above, add that to the residual stream, and unembed the result.
Elhage et al.[1] have an awesome diagram for this. Here I'm providing some extra steps to make the derivation more explicit. The whole single-layer model, from one-hot tokens to logits, can be written as

$$T = (I_{n_{seq}} \otimes W_U)\,\left(I_{n_{seq}} \otimes I_{d_{model}} + A \otimes W_O W_V\right)\,(I_{n_{seq}} \otimes W_E),$$

where:
- $I_{n_{seq}} \otimes W_E$ is a [d_model ⋅ n_seq, n_vocab ⋅ n_seq] matrix (it embeds every one-hot token)
- $I_{n_{seq}} \otimes I_{d_{model}} + A \otimes W_O W_V$ is a [d_model ⋅ n_seq, d_model ⋅ n_seq] matrix (the identity is the residual connection, the second term is the attention head)
- $W_U$ is the [n_vocab, d_model] unembedding matrix (mapping the residual stream to logits), and so $I_{n_{seq}} \otimes W_U$ is a [n_vocab ⋅ n_seq, d_model ⋅ n_seq] matrix
Multiplying the product out we get:

$$T = I_{n_{seq}} \otimes W_U W_E + A \otimes W_U W_O W_V W_E$$

The first term, $I_{n_{seq}} \otimes W_U W_E$, represents the direct path through the transformer (the residual stream without any attention). The second term contains the QK and OV circuits.
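Again, a quick numerical check can build confidence in this decomposition. The sketch below (random weights, made-up sizes) computes the logits the usual way, token by token, and compares them to the two-term tensor expression above:

```python
import numpy as np

rng = np.random.default_rng(3)
n_vocab, d_model, d_head, n_seq = 20, 8, 3, 5

W_E = rng.normal(size=(d_model, n_vocab))
W_U = rng.normal(size=(n_vocab, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))
A = rng.random((n_seq, n_seq))
A /= A.sum(axis=-1, keepdims=True)                 # stand-in attention pattern

tokens = rng.integers(0, n_vocab, size=n_seq)
t = np.eye(n_vocab)[:, tokens]                     # one-hot inputs, [n_vocab, n_seq]

# Usual computation: embed, apply the head, add to the residual stream, unembed.
x = W_E @ t
logits = W_U @ (x + W_O @ (W_V @ x) @ A.T)         # [n_vocab, n_seq]

# Tensor form: Id ⊗ W_U W_E  +  A ⊗ W_U W_O W_V W_E, applied to the flattened one-hots.
T = np.kron(np.eye(n_seq), W_U @ W_E) + np.kron(A, W_U @ W_O @ W_V @ W_E)
logits_vec = T @ t.T.reshape(-1)

assert np.allclose(logits_vec, logits.T.reshape(-1))
print("single-layer decomposition checks out")
```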
The QK circuit
The QK circuit refers to the term $W_E^T W_Q^T W_K W_E = W_E^T W_{QK} W_E$ and it signifies where information is moved from and to. It's hidden inside the attention pattern $A = \mathrm{softmax}^*\left(t^T W_E^T W_{QK} W_E\, t\right)$, where $t$ is the [n_vocab, n_seq] matrix of the one-hot encoded input sequence. The QK circuit is a [n_vocab, n_vocab] matrix, which is huge, and never explicitly computed. For any specific query token $t_i$ and key token $t_j$, the QK circuit gives the (pre-softmax) attention paid by token $t_i$ to token $t_j$:

$$t_i^T\, W_E^T W_{QK} W_E\, t_j$$
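For a toy-sized vocabulary you can materialise this matrix and read off entries directly. A minimal sketch with random placeholder weights (in a real model $n_{vocab}$ is far too large to ever build it explicitly):

```python
import numpy as np

rng = np.random.default_rng(4)
n_vocab, d_model, d_head = 50, 8, 3

# Random placeholder weights; in practice these come from a trained model.
W_E = rng.normal(size=(d_model, n_vocab))
W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))

# Full QK circuit, [n_vocab, n_vocab]: entry (i, j) is the pre-softmax attention
# score that query token i gives to key token j.
qk_circuit = W_E.T @ W_Q.T @ W_K @ W_E

i, j = 7, 42                               # two arbitrary token ids
t_i, t_j = np.eye(n_vocab)[:, i], np.eye(n_vocab)[:, j]
score = t_i @ qk_circuit @ t_j             # same as qk_circuit[i, j]
assert np.isclose(score, qk_circuit[i, j])
```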
The OV circuit
The OV circuit is the term $W_U W_O W_V W_E = W_U W_{OV} W_E$ and it describes what information is moved (given the attention pattern, which establishes where information flows from and to). It's also a [n_vocab, n_vocab] matrix. Given a specific token $t$ that is being attended to, and the embedding of that token $x = W_E t$:
- $W_{OV} x = W_O W_V W_E\, t$ is what the attention head would write into the residual stream of the destination token, if it only paid attention to $t$ (i.e. if there were no weighted averaging over all tokens being done by $A$)
- and $W_U W_O W_V W_E\, t$ is the contribution of this attention head to the logits of the destination token.
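The analogous sketch for the OV circuit, again with random placeholder weights (so the entries themselves are meaningless; the point is the shapes and how to read the matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
n_vocab, d_model, d_head = 50, 8, 3

# Random placeholder weights; in practice these come from a trained model.
W_E = rng.normal(size=(d_model, n_vocab))
W_U = rng.normal(size=(n_vocab, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

j = 42                                     # some token that is being attended to
t_j = np.eye(n_vocab)[:, j]

residual_write = W_O @ W_V @ W_E @ t_j     # what the head writes into the destination's residual stream
logit_contrib = W_U @ residual_write       # its contribution to the destination's logits

# The full OV circuit, [n_vocab, n_vocab]: column j is exactly logit_contrib.
ov_circuit = W_U @ W_O @ W_V @ W_E
assert np.allclose(ov_circuit[:, j], logit_contrib)
```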
Note that these are the circuits for a single-layer, attention-only toy model. Once you add a second layer, which we need for an induction circuit to form, the QK and OV circuits of the later head will take as input the output of the first layer's heads. In other words, the input will no longer be just the raw token embeddings, but also terms involving the previous layer's $W_O$ and $W_V$ matrices. This is called Q-, K- or V-composition (depending on whether the earlier head's output feeds into the query, key or value side), and it is necessary for induction circuits to work.
Identifying Induction Circuits
So much for the theory. We can now look at the model weights of a trained 2-layer toy transformer and identify induction circuits.
We're first going to look at the QK circuit of a previous token head and see what the attention pattern looks like. Remember from above that the previous token head writes information about the previous token into the residual stream of the current token. So in the case of the A B ... A pattern, the previous token head will write the information that A preceded B into the residual stream of the B token.
Then, we're going to check the QK circuit of an induction head and look for evidence of K-composition. As we know, this will ensure that when the induction head is processing the second A token in the A B ... A sequence, it will attend to tokens whose keys contain the fact that they followed the A token - i.e. the B token.
Lastly, we will look at the OV circuit of the induction head and verify that it is simply a copying circuit. In other words, while the QK circuit of the induction head will make sure we attend to the B token when processing the second A token, the OV circuit will copy the B token into the residual stream of the second A token, increasing the probability that we predict B as the next token.
The QK circuit of the previous token head
The QK circuit of the previous token head can be identified using that head's $W_{QK}$ circuit together with the positional embeddings $W_{pos}$, leading to $W_{pos}^T W_{QK} W_{pos}$, which is a [n_seq, n_seq] matrix. Notice that the large values lie on a diagonal, but shifted by one position off the main diagonal. This leads to the current token attending to the previous token, and is what makes this head a previous token head.
[Figure: position-by-position attention scores of the previous token head, showing the shifted diagonal. Source: Callum McDougall's colab.]
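Here is a rough sketch of how you could compute such a score matrix yourself. The weights below are random placeholders (so you won't see the clean shifted diagonal a trained previous token head produces), and I'm assuming, as above, that the positional embeddings carry the relevant signal for this head:

```python
import numpy as np

rng = np.random.default_rng(6)
d_model, d_head, n_seq = 8, 3, 20

# Placeholder weights; in practice take the trained toy model's layer-0 head
# that turns out to be the previous token head, plus its positional embeddings.
W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))
W_pos = rng.normal(size=(d_model, n_seq))      # positional embeddings, one column per position

# Position-by-position attention scores: entry (i, j) says how much the query at
# position i "wants" to attend to position j, based on positions alone.
pos_scores = (W_Q @ W_pos).T @ (W_K @ W_pos)   # [n_seq, n_seq]

# For a previous token head, the largest score in row i should sit at column i - 1.
prev_token_hits = (pos_scores[1:].argmax(axis=-1) == np.arange(n_seq - 1)).mean()
print(f"fraction of positions attending most strongly to their predecessor: {prev_token_hits:.2f}")
```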
The QK circuit of the induction head
Here we want to see evidence of K-composition.
- Imagine we are processing the second A token in the A B ... A sequence.
- So the query vector of the induction head is just the embedding of the A token.
- The keys of the induction head are provided by the output of the previous token head.
The QK circuit of the induction head basically takes a query token (which is the embedding of the A token) and looks for keys that contain the fact that they followed the A token. In other words, the attention pattern should show high values for keys that contain the fact that they followed the A token.
We can easily test this by taking a random input sequence, repeating it, and feeding it to the model to produce the artifact we need: the output of the previous token head when given this random, repeated input sequence. We then feed this output into the key side of the induction head's QK circuit to compute keys, and use the token embeddings to compute queries for every position in the sequence.

Put differently, we compute $x^T W_{QK}^{\mathrm{ind}}\, y = (W_Q x)^T (W_K y)$, where $y$ is the output of the previous token head in layer 0, $x$ are the token embeddings, and $W_{QK}^{\mathrm{ind}} = W_Q^T W_K$ is the QK circuit of the induction head in layer 1.
This will produce a [n_seq, n_seq] matrix with unnormalized attention scores like the one below. Notice the distinct diagonal pattern off the main diagonal.

It starts exactly in the middle of the sequence, which makes sense, because we produced a random token sequence of 50 tokens and repeated it. So the middle of the sequence is exactly where we would expect the induction head to start paying a lot of attention to earlier tokens: token 51 (i.e. the second A in our A B ... A sequence) is the first one that gets repeated, so we would expect it to pay a lot of attention to token 2 (i.e. the first B in our A B ... A sequence).
[Figure: unnormalized attention scores of the induction head on the repeated random sequence, with the off-diagonal stripe starting at the midpoint. Source: Callum McDougall's colab.]
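A sketch of that computation in numpy. Both `x_emb` and `y_prev` are random placeholders here; in practice you would take the token embeddings and the cached output of the layer-0 previous token head from a forward pass of the trained model on the repeated sequence:

```python
import numpy as np

rng = np.random.default_rng(7)
d_model, d_head, n_seq = 8, 3, 100             # e.g. 50 random tokens, repeated once

W_Q_ind = rng.normal(size=(d_head, d_model))   # induction head (layer 1) query weights, placeholder
W_K_ind = rng.normal(size=(d_head, d_model))   # induction head (layer 1) key weights, placeholder

x_emb = rng.normal(size=(d_model, n_seq))      # placeholder: token embeddings of the repeated sequence
y_prev = rng.normal(size=(d_model, n_seq))     # placeholder: output of the layer-0 previous token head

# Queries come from the token embeddings, keys from the previous token head's output.
scores = (W_Q_ind @ x_emb).T @ (W_K_ind @ y_prev)    # [n_seq, n_seq], unnormalized

# For a real induction head on a sequence of length 2L, row i (with i >= L) should put
# most of its weight on column i - L + 1: the token right after the earlier copy.
print(scores.shape)
```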
The OV circuit of the induction head
The OV circuit of the induction head is fairly simple: it's (approximately) the identity matrix. It's also possible that a single head's OV circuit isn't quite the identity, but taken together with the OV circuits of other induction heads, it looks more like one. The reason this matrix is the identity is that it basically just copies the information that the attention pattern demands into the residual stream.

Its job is simply to copy the B token into the residual stream of the second A token.
[Figure: OV circuit of the induction head, approximately the identity matrix. Source: Callum McDougall's colab.]
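One way to quantify "approximately the identity" is to check how often the largest entry in each column of the OV circuit lies on the diagonal - i.e. attending to a token boosts that same token's logit more than any other. A hedged sketch with random placeholder weights (a trained induction head would score close to 1 here, random weights won't):

```python
import numpy as np

rng = np.random.default_rng(8)
n_vocab, d_model, d_head = 200, 8, 3

# Placeholder weights; in practice use the trained toy model's W_E, W_U and the
# induction head's W_O, W_V.
W_E = rng.normal(size=(d_model, n_vocab))
W_U = rng.normal(size=(n_vocab, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

ov_circuit = W_U @ W_O @ W_V @ W_E             # [n_vocab, n_vocab]

# "Copying score": fraction of columns whose largest entry lies on the diagonal.
copying_score = (ov_circuit.argmax(axis=0) == np.arange(n_vocab)).mean()
print(f"copying score: {copying_score:.2f}")
```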
Conclusion
Thanks for reading - I hope this post was useful for getting started with induction circuits. I skipped over a few details, such as the finer points of K-, Q- and V-composition, but you can find more about those in the references below. There is also a lot more to be said about induction circuits in the "wild", i.e. in the context of models with more than 2 layers and without the simplifications we made here. Finally, this is just the starting point for those who want to dive deeper into the field of Mechanistic Interpretability. There is so much about transformers that we don't know yet, so it's an exciting time to work on this.
References
[1] Elhage, N., et al. "A Mathematical Framework for Transformer Circuits", Transformer Circuits Thread, 2021. [paper]
[2] McDougall, C. "Induction heads - illustrated", 2023. [LessWrong]
[3] McDougall, C. "Intro to Mechanistic Interpretability: TransformerLens & induction circuits". [colab]