
Why Streaming LLMs Need Attention Sinks


Imagine you are running an LLM as a streaming service: a chat that never ends, an agent that runs for days, a monitor that watches a log forever. You cannot keep growing the KV cache without bound. It is O(N) memory and O(N) attention compute per generated token, and unboundedly many tokens means unbounded resources.

The obvious fix is a sliding window. Keep the most recent W tokens in the cache, drop everything older.
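
In cache terms, the sliding window is just an eviction policy. A minimal sketch (the class name and the one-entry-per-token layout are mine, for illustration, not any particular library's API):

```python
class SlidingWindowKVCache:
    """Keep only the most recent `window` key/value entries."""

    def __init__(self, window: int):
        self.window = window
        self.entries = []            # each item: (key_vector, value_vector)

    def append(self, k, v):
        self.entries.append((k, v))
        if len(self.entries) > self.window:
            self.entries.pop(0)      # evict the absolute oldest token
```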

This does not work.

The moment the window pushes past the very first tokens of the input (the BOS, the opener, that one stray period), perplexity does not gradually degrade. It explodes. Often by 100x within a few hundred tokens. And the kicker: this happens whether or not the early tokens carry any semantic content. Eviction kills the model even when the evicted tokens were meaningless.

It is not a context problem. It is structural to how softmax allocates weight. This post walks through what attention sinks are, why softmax produced them by accident, and how a four-token reservation lets sliding-window inference run to four million tokens with no quality loss.


Where the term comes from

The phenomenon got its name from Xiao et al. (2023), Efficient Streaming Language Models with Attention Sinks, which appeared at ICLR 2024.

The setup: the authors were trying to make sliding-window attention work for unbounded streams. It kept failing in a strange way. Perplexity exploded right at the eviction boundary, regardless of context. They went looking for the culprit.

What they found: every trained decoder-only transformer they looked at (Llama, Mistral, Falcon, Pythia) was doing the same thing. A large fraction of attention from every later token, in every head, was pointing at the first one to four positions in the sequence. The first tokens were acting as a magnet for attention mass, even when those tokens were semantically blank.

The paper traces the cause, proposes the fix, and shows it generalises. The mechanism turns out to be a property of softmax, not of any particular model.

What an attention sink looks like

If you pick a mid-network layer of a trained transformer and visualise its attention pattern, you see a striking thing. Across the attention map of any head (query tokens on one axis, key tokens on the other), the columns corresponding to the first few positions are noticeably brighter than everything else.

That brightness is the share of attention each later query is paying to the early keys. Brighter means more weight. And the pattern is consistent: it shows up across heads, across layers, and across totally unrelated inputs. Whether the prompt is a paragraph from a news article, a snippet of Python, or a casual chat message, the first one to four tokens stay hot. They typically absorb 30 to 80 percent of the total attention mass per query.
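
If you want to reproduce this, one way is to pull attention maps out of a Hugging Face model and look at where the final token's attention goes. A rough sketch, assuming a Llama-style checkpoint you have access to (the model name, prompt, and layer choice are arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # any decoder-only checkpoint shows the same pattern
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

ids = tok("Attention sinks show up on basically any input.", return_tensors="pt").input_ids
with torch.no_grad():
    attns = model(ids, output_attentions=True).attentions  # one [batch, heads, query, key] map per layer

layer = attns[len(attns) // 2][0]         # a mid-network layer, batch dim dropped
last_query = layer[:, -1, :].mean(dim=0)  # where the final token's attention goes, averaged over heads
print(last_query[:4].sum().item())        # typically a large share of the total budget of 1.0
```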

This is strange. The BOS token at position 0 has no semantic content. The opening of the system prompt is usually boilerplate. A stray period at position 2 is not the most informative thing in the prompt. Yet later tokens keep paying attention to them as if they were.

The phenomenon is emergent: nothing in the architecture or training objective tells the model to behave this way. It learns to. And it learns to consistently enough that “the first few positions are special” is a reliable property of any trained decoder.

The implication is what makes this a problem. If the model has learned to route attention through the early tokens, those tokens are load-bearing, not for what they say but for the role they play in the attention dynamics. Drop them out of the KV cache, and you have just kicked out a piece of structure the model is depending on.

Why softmax produces sinks

The mechanism is a property of softmax. For each query, the attention head computes scores over all keys, then runs them through softmax to get a probability distribution. The resulting weights are non-negative and sum to exactly one:

$$a_i = \frac{\exp(s_i)}{\sum_{j=1}^N \exp(s_j)}, \qquad \sum_{i=1}^N a_i = 1.$$

That last constraint is the load-bearing part. The attention budget for each query is fixed at exactly one. There is no way for a head to say “for this query, I have nothing to contribute, please put my budget at zero.” Whatever mass you do not allocate to one place, you must allocate to somewhere else.

Now consider what happens to a head that, at this query, genuinely has no useful information to retrieve from the keys. Every key looks roughly equivalent (say, low-relevance). The head still has to assign one unit of mass somewhere. If it spreads the mass uniformly, it pulls in a small slice of every value vector, which is approximately the average of all value vectors, which is generally not what you want. If it could route the leftover mass out of the way, it would.
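
You can see the "small slice of every value vector" problem in a couple of lines. With flat scores, the softmax output is uniform and the head's output is exactly the mean of the value vectors (toy shapes, no real model involved):

```python
import torch

scores = torch.zeros(8)              # every key looks equally (ir)relevant to this query
weights = torch.softmax(scores, 0)   # uniform: each key gets 1/8, and the weights still sum to 1
values = torch.randn(8, 4)           # 8 keys, value dimension 4

out = weights @ values
print(torch.allclose(out, values.mean(dim=0)))  # True: the head injects the average of all values
```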

Models discover an alternative: dump the leftover mass on a small set of fixed positions whose value vectors do not disturb the residual stream very much. The first few tokens are the natural choice. They are visible to every later position. Their position is reliable and unchanging. The model has many training steps to learn that these positions are safe to use as a parking lot.

So a sink is not a bug. It is an equilibrium of an optimization problem under an architectural constraint that says “your attention must sum to exactly one.” Take the constraint away and the sink does not form.

Evan Miller proposed exactly this, under the name Attention Is Off By One. Replace softmax with a variant where the denominator has an extra +1 term, equivalent to introducing a synthetic null position that absorbs unwanted mass. Attention can then sum to anything between zero and one, the leftover mass drains into the null slot, and the early tokens never have to play parking lot.
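
A minimal way to implement that variant is to append a synthetic zero logit before the softmax and drop it afterwards, which is numerically equivalent to the +1 in the denominator (a sketch of the idea, not Miller's code):

```python
import torch

def softmax_one(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Softmax with an implicit extra logit of 0: exp(s_i) / (1 + sum_j exp(s_j)).
    The weights over the real keys can now sum to anything in (0, 1)."""
    zeros = torch.zeros_like(scores.select(dim, 0)).unsqueeze(dim)
    padded = torch.cat([scores, zeros], dim=dim)   # the synthetic null position
    return torch.softmax(padded, dim=dim).narrow(dim, 0, scores.size(dim))
```

Feed it uniformly terrible scores and the weights sum to nearly zero instead of being forced to one; the leftover mass has drained into the null slot.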

The same phenomenon shows up in vision transformers. Darcet et al. (2023) showed that ViTs produce outlier features in low-information patch tokens for the same structural reason, and proposed adding explicit “register” tokens to absorb that mass intentionally. Different domain, same story.

The fix: keep the sinks permanently

Once you see attention sinks as load-bearing structure, the StreamingLLM fix writes itself: never evict them.

Reserve the first K tokens of the sequence permanently in the KV cache. Slide your eviction window over everything else. When the cache fills, drop the next-oldest token instead of the absolute oldest. The early tokens stay anchored.

That is it. K is usually 4. Memory cost: four extra KV cache entries, completely negligible. Compute cost: a tiny amount of extra attention to those positions. Quality: matches dense attention closely while keeping sliding-window compute. The trick is so cheap it feels like a typo.
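
In cache terms, the change from the plain sliding window above is one line: protect the first K entries and evict the oldest unprotected one instead. A sketch with the same illustrative layout as before:

```python
class SinkedKVCache:
    """Sliding-window KV cache that never evicts the first `sink_size` tokens."""

    def __init__(self, window: int, sink_size: int = 4):
        self.window = window          # total cache budget, sink entries included
        self.sink_size = sink_size
        self.entries = []             # each item: (key_vector, value_vector)

    def append(self, k, v):
        self.entries.append((k, v))
        if len(self.entries) > self.window:
            del self.entries[self.sink_size]  # drop the oldest token *after* the protected sinks
```

(One implementation detail from the paper: positional encodings are assigned from positions within the cache rather than positions in the original stream, so the relative distances the model sees stay in-distribution.)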

What the fix is doing is making sure the model never has its parking lot pulled out from under it. The model is going to route some attention mass through positions 0 to 3 no matter what; that is how it was trained. Sliding window without sinks evicts those positions and the model has nowhere left to dump the mass. Sliding window with sinks keeps those positions available forever. The model behaves the way it does in normal inference, just over a longer effective horizon.

Important detail: the sinks do not have to be the original tokens. The paper shows that you can prepend a few learnable “sink tokens” during pretraining or fine-tuning, and the model will use those as the parking lot instead. This is a cleaner solution architecturally (the sinks become an explicit part of the design rather than an emergent property of the original prompt), but it requires training, so most deployed solutions just reserve the first K real tokens.
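
A sketch of what the learnable-sink variant might look like at the embedding layer; the module name and the init scale are mine, not the paper's:

```python
import torch
import torch.nn as nn

class LearnedSinkPrefix(nn.Module):
    """Prepend `num_sinks` trained embeddings to every input sequence, so the
    attention parking lot is an explicit part of the model rather than whatever
    tokens happened to come first."""

    def __init__(self, d_model: int, num_sinks: int = 4):
        super().__init__()
        self.sinks = nn.Parameter(torch.randn(num_sinks, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: [batch, seq, d_model]
        prefix = self.sinks.unsqueeze(0).expand(token_embeddings.size(0), -1, -1)
        return torch.cat([prefix, token_embeddings], dim=1)
```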

Does it actually work

The headline result from the paper: stable perplexity out to four million tokens.

Sliding window alone, on Llama-2-7B with a window of about 1024: perplexity sits near the normal range until just past the original context length, then climbs into the thousands within a few thousand more tokens. The model effectively breaks at the eviction boundary. This is reproducible across Llama, Mistral, Falcon, and Pythia.

Sliding window plus 4 sink tokens reserved: perplexity stays in the normal range across the entire four-million-token sweep. No drift, no eventual collapse, no sign that the model is heading anywhere bad.

The dense-attention baseline, for reference, also stays low until you exceed the model’s training context window, after which it climbs (the model is now extrapolating positional encodings). Dense attention is the gold standard for quality but uneconomical for unbounded streams. Sliding window plus sinks gets effectively the same quality, with sliding-window memory and compute, indefinitely.

The same paper validates this on multiple model families. It is not a Llama-specific quirk. Every trained decoder has sinks, and every sliding-window scheme that does not preserve them blows up.

Where this has gone

The fix is in production. Several inference stacks reserve sink tokens by default for streaming use cases, and OpenAI’s gpt-oss open-weights release exposes per-head attention sink biases as learned parameters, making explicit what most models do implicitly.
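
A hedged sketch of what a per-head sink bias can look like: one learned logit per head that joins the softmax denominator but contributes no value vector. This is the general shape of the idea, not gpt-oss's actual implementation:

```python
import torch
import torch.nn as nn

class SinkBiasedAttention(nn.Module):
    """Attention weights with one learned per-head sink logit that absorbs
    mass without contributing anything to the output."""

    def __init__(self, num_heads: int):
        super().__init__()
        self.sink_logit = nn.Parameter(torch.zeros(num_heads))

    def forward(self, scores: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # scores: [batch, heads, q_len, k_len]; values: [batch, heads, k_len, d_head]
        b, h, q, k = scores.shape
        sink = self.sink_logit.view(1, h, 1, 1).expand(b, h, q, 1)
        weights = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
        return weights[..., :k] @ values  # the sink column's mass is simply discarded
```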

Darcet et al.'s register tokens, mentioned earlier, are the same move on the vision side: give the model explicit no-op slots so it does not have to repurpose real ones.

Evan Miller’s softmax+1 is the cleanest theoretical fix: change the softmax denominator and the sink phenomenon never forms. The downside is that it requires retraining models from scratch, so existing pretrained checkpoints cannot benefit.

Some interpretability work has reframed sinks as a kind of no-op or ignore attention pattern that is useful in its own right. The model has learned to use them as a steering signal: there are real heads that fire on positions 0–3 only when they genuinely have nothing to retrieve for the current query.

Further reading

- Xiao et al. (2023), Efficient Streaming Language Models with Attention Sinks (ICLR 2024)
- Darcet et al. (2023), Vision Transformers Need Registers
- Evan Miller, Attention Is Off By One
