Tag: inference
All the articles with the tag "inference".
- Why Streaming LLMs Need Attention Sinks
A walkthrough of attention sinks: what they are, why softmax produces them by accident, why naive sliding-window inference collapses without them, and how reserving four tokens lets streaming inference run to four million tokens with no quality loss.