Attention is the mathematical operation within a transformer that allows different parts of the input to figure out how important they are to each other.

Info

How attention fits into transformers is described in Transformers > Attention. This page describes different ways attention can be implemented.

Full attention

“Full attention” is the form of attention where every token interacts with every other token. It takes an input (the activations tensor) and creates three derived tensors:

  • Query tensor $Q$, which captures what information each token needs from the other tokens
  • Key tensor $K$, which captures what information each token has to offer the other tokens
  • Value tensor $V$, which captures the features that get shared by other tokens when the query and key are similar

All three tensors are simply the activations multiplied by learned weights (which are model parameters). If the activation tensor (aka the “residual stream”) is $X$, then

$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$

where these weight tensors have dimensions $d \times d$, where $d$ is the hidden dimension of the transformer. This means $Q$, $K$, and $V$ have the same dimensions as the activations tensor ($n \times d$, where $n$ is the transformer’s sequence length).

Roughly, attention works like this:

  1. It first calculates a score between each token and every other token by comparing their projected representations ($Q$ vs $K$). The more similar they are, the higher the score. After this, we have a tensor of shape $n \times n$ describing how much every token attends to every other token. These scores are then normalized with softmax.
  2. For each token $i$, the normalized score for every token $j$ is multiplied by that token’s value vector $v_j$. These weighted value vectors are added up, producing a new context vector for token $i$. Doing this for all tokens results in an output tensor of shape $n \times d$.
  3. This output tensor is finally multiplied by one last matrix of learned weights, $W_O$ (shape $d \times d$), to produce the new activations that feed into the next part of the transformer layer.

Mathematically, attention looks like this:

$$\text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V$$

Because $Q$, $K$, and $V$ have the dimensions $n \times d$ (remember, $n$ is sequence length, or when the context isn’t completely filled, the number of tokens in the context so far), there are two quadratic terms here:

  • $Q K^\top$ multiplies tensors of shape $n \times d$ by $d \times n$ and outputs an $n \times n$ tensor
  • $\mathrm{softmax}(Q K^\top / \sqrt{d})\, V$ then multiplies tensors of shape $n \times n$ and $n \times d$ and outputs an $n \times d$ tensor

So the computational cost of attention scales as $O(n^2 d)$: quadratic in the sequence length.
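The three numbered steps above can be sketched in plain NumPy. This is a minimal single-head sketch; the weight matrices here are random stand-ins for learned parameters:

```python
import numpy as np

def full_attention(x, w_q, w_k, w_v, w_o):
    """Single-head full attention over a sequence of n token vectors.

    x: (n, d) activations; w_q, w_k, w_v, w_o: (d, d) learned weights.
    """
    d = x.shape[-1]
    q, k, v = x @ w_q, x @ w_k, x @ w_v                 # each (n, d)
    scores = q @ k.T / np.sqrt(d)                       # (n, n): every token vs. every token
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = scores / scores.sum(axis=-1, keepdims=True)  # row-wise softmax
    context = weights @ v                               # (n, d): blend the value vectors
    return context @ w_o                                # final output projection

rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
w = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)]
out = full_attention(x, *w)
assert out.shape == (n, d)
```

The two quadratic terms show up as the `q @ k.T` and `weights @ v` lines, both of which touch an $n \times n$ intermediate.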

Multi-head Attention (MHA)

Multi-head attention splits the attention part of each transformer layer into multiple attention heads, and each head gets a slice of every token’s hidden dimension. Each head decides the importance of every token with respect to the others, but only in its slice of the hidden dimension. Each head does this independently (and in parallel), and only after each head calculates the attention values for its slice is there a global mixing across heads.

A little more precisely, each token vector (of length $d$) going into the attention block is divided up over the $h$ attention heads. For example, if a model has a hidden dimension of $d$ features and $h$ attention heads, each attention head would be assigned a slice that contains $d / h$ features.

As a result, each head has its own subset of $Q$, $K$, and $V$ and their associated weights $W_Q$, $W_K$, and $W_V$. Each head computes attention on its subslice of the hidden dimension independently, resulting in $h$ attention output tensors. These per-head outputs are all concatenated together, then multiplied with another set of learned weights, $W_O$, to give you a fully mixed attention output tensor.

Properties

Multi-head attention is generally a good idea because it allows the model to learn multiple attention patterns in parallel. The output projection $W_O$ is also a more compact way of representing the mixing of features than materializing a full attention tensor across the entire hidden dimension.

If you have too many attention heads, many of them will either zero themselves out or become redundant. At the extreme, the subslice of the feature space that each head gets is so small that it cannot learn useful attention patterns.

Worked Example

As an example, let’s say we have an attention block with hidden dimension $d = 4$ and two heads ($h = 2$), and we pass a single token through it. The activations vector for that token might be

$$x = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 \end{bmatrix}$$

It gets divided over the two heads:

$$x^{(1)} = \begin{bmatrix} x_1 & x_2 \end{bmatrix}, \qquad x^{(2)} = \begin{bmatrix} x_3 & x_4 \end{bmatrix}$$

After each head processes its slice, there are two activation outputs $o^{(1)}$ and $o^{(2)}$:

$$o^{(1)} = \begin{bmatrix} o_1 & o_2 \end{bmatrix}, \qquad o^{(2)} = \begin{bmatrix} o_3 & o_4 \end{bmatrix}$$

They get concatenated along the feature axis:

$$o = \begin{bmatrix} o_1 & o_2 & o_3 & o_4 \end{bmatrix}$$

Let’s say $W_O$ is a $4 \times 4$ learned matrix that looks like:

$$W_O = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \\ w_{31} & w_{32} & w_{33} & w_{34} \\ w_{41} & w_{42} & w_{43} & w_{44} \end{bmatrix}$$

The final attention output is the concatenated output tensor multiplied by this weight matrix:

$$y = o\, W_O, \qquad y_j = \sum_{i=1}^{4} o_i\, w_{ij}$$

The resulting attention tensor coming out of multi-head attention would then be:

$$y = \begin{bmatrix} y_1 & y_2 & y_3 & y_4 \end{bmatrix}$$
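The split/compute/concatenate/mix flow above can be sketched in NumPy. The weights are random stand-ins for learned parameters, and real implementations usually use one fused projection per tensor rather than explicit per-head slicing:

```python
import numpy as np

def multi_head_attention(x, heads, w_o):
    """x: (n, d); heads: list of h (w_q, w_k, w_v) triples, each (d/h, d/h);
    w_o: (d, d) output projection that mixes features across heads."""
    n, d = x.shape
    slice_size = d // len(heads)
    outputs = []
    for i, (w_q, w_k, w_v) in enumerate(heads):
        xs = x[:, i * slice_size:(i + 1) * slice_size]   # this head's slice of every token
        q, k, v = xs @ w_q, xs @ w_k, xs @ w_v
        s = q @ k.T / np.sqrt(slice_size)
        s = np.exp(s - s.max(axis=-1, keepdims=True))
        outputs.append((s / s.sum(axis=-1, keepdims=True)) @ v)
    return np.concatenate(outputs, axis=-1) @ w_o        # concat heads, then global mix

rng = np.random.default_rng(1)
n, d, h = 6, 4, 2
heads = [tuple(rng.standard_normal((d // h, d // h)) for _ in range(3)) for _ in range(h)]
w_o = rng.standard_normal((d, d))
y = multi_head_attention(rng.standard_normal((n, d)), heads, w_o)
assert y.shape == (n, d)
```

Note that the only place features cross head boundaries is the final `@ w_o`.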

Grouped-Query Attention (GQA)

In grouped-query attention, groups of query heads share a single key/value head (rather than every query head having its own), which shrinks the KV cache. GQA was used by Llama-3.1.
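As a sketch of the idea (illustrative head counts and random tensors, not Llama-3.1’s actual configuration): in GQA, each key/value head is shared by a group of query heads, so the KV cache shrinks by the ratio of query heads to key/value heads:

```python
import numpy as np

# Grouped-query attention sketch: h_q query heads share h_kv key/value heads.
# (h_q == h_kv gives ordinary MHA; h_kv == 1 gives multi-query attention.)
n, d_head = 8, 16
h_q, h_kv = 8, 2                  # each KV head serves h_q // h_kv = 4 query heads
rng = np.random.default_rng(2)
q = rng.standard_normal((h_q, n, d_head))
k = rng.standard_normal((h_kv, n, d_head))
v = rng.standard_normal((h_kv, n, d_head))

group = h_q // h_kv
out = np.empty_like(q)
for i in range(h_q):
    kk, vv = k[i // group], v[i // group]      # query head i reuses its group's K/V
    s = q[i] @ kk.T / np.sqrt(d_head)
    s = np.exp(s - s.max(axis=-1, keepdims=True))
    out[i] = (s / s.sum(axis=-1, keepdims=True)) @ vv

kv_cache_gqa = k.nbytes + v.nbytes             # h_kv heads' worth of K and V
kv_cache_mha = 2 * h_q * n * d_head * 8        # what per-head K/V would need
assert kv_cache_mha // kv_cache_gqa == h_q // h_kv
```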

Latent attention (MLA)

Latent attention factorizes the key/value projection into two smaller matrices. One matrix produces a compressed latent that goes into the KV cache, and the other holds the weights used to expand it back out. Both take less memory than the original, and you rehydrate the full matrix only when needed.

This approach effectively stores a compressed representation (a latent representation) of the $K$ and $V$ tensors that is shared across the heads in each transformer layer. During a forward pass, each attention head rehydrates its real $K$ and $V$ from this latent representation on demand. This rehydration is more computationally expensive than storing $K$ and $V$ in memory outright, but it saves a ton of memory and allows you to fit a larger model in the same GPU memory footprint.

This method was used by DeepSeek-R1.
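A minimal sketch of the latent factorization, assuming a simple down-projection/up-projection split (the matrix names and rank here are illustrative, not DeepSeek’s actual parameterization):

```python
import numpy as np

# Latent-attention sketch: instead of caching full K and V (each n x d),
# cache a low-rank latent (n x r, with r << d) and rehydrate K/V on demand.
rng = np.random.default_rng(3)
n, d, r = 32, 64, 8
x = rng.standard_normal((n, d))

w_down = rng.standard_normal((d, r))   # compresses activations into the latent
w_up_k = rng.standard_normal((r, d))   # rehydrates keys from the latent
w_up_v = rng.standard_normal((r, d))   # rehydrates values from the latent

latent = x @ w_down                    # (n, r): this is what goes into the KV cache
k = latent @ w_up_k                    # (n, d): recomputed during each forward pass
v = latent @ w_up_v                    # (n, d)

# Cache cost per token: r floats instead of 2 * d floats.
assert latent.shape == (n, r) and k.shape == v.shape == (n, d)
```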

Linear attention

Linear attention is a family of techniques that replaces full attention with an approximation that scales as $O(n)$ rather than $O(n^2)$. Recall that full attention does the following:

| Step | Operation | Complexity | Output shape |
| --- | --- | --- | --- |
| 1 | Multiply $Q$ by $K^\top$ | $O(n^2 d)$ | $n \times n$ |
| 2 | Multiply the previous result by $V$ | $O(n^2 d)$ | $n \times d$ |

Linear attention does this:

| Step | Operation | Complexity | Output shape |
| --- | --- | --- | --- |
| 1 | Compute a summary of $K$ and $V$: multiply $\phi(K)^\top$ by $V$ | $O(n d^2)$ | $d \times d$ |
| 2 | Multiply $\phi(Q)$ with the previous summary | $O(n d^2)$ | $n \times d$ |

The summary transformation is a nonlinear feature mapping. Remember that the key $K$, value $V$, and query $Q$ tensors are really each a bunch of per-token vectors. Each token $i$ has its own key vector $k_i$, value vector $v_i$, and query vector $q_i$ within $K$, $V$, and $Q$, and these vectors all have length $d$. With this in mind,

  1. Linear attention uses a function $\phi$, which is a feature map or kernel map. This turns a key vector $k_i$ into a feature-mapped key vector $\phi(k_i)$ of length $d$.
  2. For each token $i$, we take the outer product of the feature-mapped key vector $\phi(k_i)$ and the unmodified value vector $v_i$. This results in a tensor $\phi(k_i)\, v_i^\top$ of shape $d \times d$ that encodes how token $i$’s key and value interact.
  3. Each token’s $d \times d$ matrix is then added together to form a summary matrix $S = \sum_i \phi(k_i)\, v_i^\top$ of shape $d \times d$, which encodes a summary of how tokens’ keys and values interact. However, there is no mixing across tokens yet!
  4. Now for each query vector $q_i$, calculate its feature-mapped query vector $\phi(q_i)$ using the same feature map we used to generate $\phi(k_i)$.
  5. This feature-mapped query vector is multiplied by the summary matrix ($\phi(q_i)^\top S$), resulting in an output vector in the same space as the value vectors. Its length is still $d$, and it now encodes a blended weight of all keys and values (which were represented in the summary matrix) for each feature of the query.

Stacking all $n$ of these output vectors back into a tensor gives you an $n \times d$ attention tensor, but you never had to calculate a full $n \times n$ matrix-matrix product.
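The five steps above can be sketched in NumPy. The feature map here, $\mathrm{elu}(x) + 1$, and the normalization term standing in for softmax are common choices from the linear-attention literature, not taken from any particular model:

```python
import numpy as np

def linear_attention(q, k, v):
    """q, k, v: (n, d). Linear attention via a d x d summary matrix."""
    phi = lambda t: np.where(t > 0, t + 1.0, np.exp(t))  # elu(x)+1: a positive feature map
    qf, kf = phi(q), phi(k)
    summary = kf.T @ v                     # (d, d): sum_i phi(k_i) v_i^T, costs O(n d^2)
    norm = qf @ kf.sum(axis=0)             # (n,): normalizer in place of softmax's row sums
    return (qf @ summary) / norm[:, None]  # (n, d); no n x n matrix is ever formed

rng = np.random.default_rng(4)
n, d = 10, 4
out = linear_attention(*(rng.standard_normal((n, d)) for _ in range(3)))
assert out.shape == (n, d)
```

Both matrix products touch only $d \times d$ or $n \times d$ shapes, which is where the $O(n d^2)$ scaling comes from.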

I first became aware of linear attention after reading the Jet-Nemotron paper, which selectively replaces full attention with linear attention in layers where doing so does not compromise model quality. The corollary is that linear attention does negatively impact model quality.

Ring attention

Ring attention is a method for distributing Full attention across multiple nodes. It uses a specialized version of the generalized distributed (e.g., block-cyclic) GEMMs that the HPC world has been doing for decades. Because it is specifically designed for attention, it performs a 1D decomposition (partitioning along the sequence length, so that nodes get blocks of token vectors) and incorporates the distributed softmax along with the GEMM.

Ring attention is likely the method by which Google Gemini offers sequence lengths of millions of tokens, since its creator works at DeepMind.