Tag: inference
All the articles with the tag "inference".
- Why Streaming LLMs Need Attention Sinks
A walkthrough of attention sinks: what they are, why softmax produces them by accident, why naive sliding-window inference collapses without them, and how reserving four tokens lets streaming inference run to four million tokens with no quality loss.