
How to Mitigate the Lost-in-the-Middle Effect in LLMs

Recently I’ve been building agents that run for a while: long tasks, many tool calls, plenty of intermediate state piling up in the prompt. After enough turns, the model would start ignoring older instructions. Constraints from a few tool calls back stopped being respected. Sub-tasks I had asked for never got done, even though the request was still right there in the prompt.

I started reading, and it turns out a lot of people are hitting the same thing. The phenomenon has names: lost-in-the-middle, and a more general follow-up version called context rot. In long contexts, models are much better at using information near the start and the end of the prompt than information buried in the middle.

The cheapest practical fix in real agent loops is recitation: periodically rewrite the goal, plan, or open sub-tasks at the tail end of the context, and the model starts conditioning on a fresh copy at decision time. This post walks through what recitation is, where it shows up in real systems, why long contexts become harder for transformers to use, and where the technique fails.


What “recitation” actually means here

The technical use of the word started with Recitation-Augmented Language Models (Sun et al., 2023). The setup there is simple: instead of doing retrieval against an external corpus the way RAG does, prompt the LLM to first recite relevant passages from its own parametric memory, then answer the question conditioned on what it just recited. RECITE works by splitting the task in two. The first step (sample what you remember) mimics the pretraining objective and is something the model is already good at. The second step (answer) is grounded in the recited text, which is now sitting fresh at the end of the context.
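As a sketch, the two-pass shape looks like this in code, with `llm` standing in for any text-completion call (a hypothetical stub, not Sun et al.'s implementation):

```python
def recite_then_answer(llm, question: str) -> str:
    """Two-pass RECITE-style prompting: recite, then answer grounded in the recitation."""
    # Pass 1: sample what the model remembers about the topic. This mimics
    # the pretraining objective, which the model is already good at.
    recitation = llm(
        f"Recite any passages you know that are relevant to answering:\n{question}"
    )
    # Pass 2: answer conditioned on the recited text, which now sits fresh
    # at the end of the context instead of buried in parametric memory.
    return llm(
        f"Passages:\n{recitation}\n\nUsing only the passages above, answer:\n{question}"
    )
```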

In agent systems, the same word covers a slightly broader idea: any time the agent deliberately writes down its goals, plan, retrieved facts, or state into the live context window before its next action. The thing being recited can be different (a goal, a todo list, a summary of what just happened, a behavioural style), but the mechanism is the same: pull the important content into the recent part of the context so the model actually conditions on it.

A few flavours people use in practice:

  - Goal recitation: re-state the task's objectives just before the next action, so they sit at the tail of the prompt.
  - Plan or todo rewriting: the agent rewrites its todo.md (or equivalent) every step, ticking off what is done and restating what is open.
  - State summaries: compress what just happened into a short recap that stands in for pages of raw tool output.
  - Behavioural reminders: restate style or safety constraints the model has started to drift from.

These look different on the surface, but they are doing the same physical thing to the prompt: every one of them moves a piece of important content from “somewhere in the conversation” to “right before the next decision.”

The goal-recitation case is the easiest to picture step by step. Without recitation the objectives sit at position 0 forever and get pushed deeper into the past as new actions and observations append. With recitation a fresh copy of the objectives reappears further down at step n+1, riding back into the recency window:

[Figure: two timelines of the agent loop at step n and step n+1. Without recitation, the objectives stay pinned at the top while actions and observations stack beneath them. With recitation, a fresh copy of the objectives is appended at step n+1, just before the newest action and observation. After the figure in Yichao "Peak" Ji, "Context Engineering for AI Agents: Lessons from Building Manus" (2025).]

The problem: long contexts quietly break LLMs

Why does that placement matter so much? Two well-known effects, both worth treating as load-bearing:

Lost in the Middle

Liu et al. (2023) put a single key fact (a “needle”) at different positions inside a long document and asked the model to retrieve it. The accuracy curve as a function of where the needle lives is U-shaped: high at the beginning, high at the end, and visibly lower when the needle is buried in the middle. The effect is not subtle, and it appears even on models that are explicitly trained for long contexts.

[Figure: a document drawn as stacked bars, one bar per line; opacity ≈ how reliably content at that position influences the next prediction. The start and end are used reliably; the middle is often dropped.]

The intuition behind the dip is that models tend to privilege the boundaries of the prompt. The beginning matters because instructions usually live there and instruction tuning reinforces that pattern. The end matters because it sits closest to the next few decoding steps, and many heads retrieve nearby content more reliably than distant content. Either way, anything load-bearing that lives in the middle of a long prompt has a harder retrieval problem.

Context Rot

The Lost in the Middle paper is from 2023 and the U-shape became folklore. The follow-up question was: does this actually go away with frontier models and 1M-token context windows? Chroma’s “Context Rot” study (2025) ran a careful version of this question across 18 frontier models (Claude Opus 4 and Sonnet 4, GPT-4.1, GPT-4o, Gemini 2.5, Qwen 3, etc.) on extended needle-in-a-haystack and conversational QA benchmarks. The headline finding: every single model degrades as input length grows, and the degradation is not a sharp cliff at the limit but a gradual slide that starts well before. There is no immune model. Long-context training mitigates the problem; it does not solve it.

So we have two empirical facts agents have to live with: middle positions are weak, and total length itself eats accuracy. Now the question is why, because the answer is what justifies the fix.

Where the problem comes from: dilution, distance, and position

There is not one single mechanism behind long-context failure, but the cleanest place to start is geometric. A self-attention head computes, for each query $q$ and every key $k_i$, a score $s_i = q^\top k_i / \sqrt{d}$, then turns those scores into weights via softmax:

$$a_i = \frac{\exp(s_i)}{\sum_{j=1}^N \exp(s_j)}.$$

The output of the head is then $\sum_i a_i v_i$, a weighted average of value vectors. Two things follow from that softmax that matter for long contexts.

First, the attention budget is bounded. The $a_i$ are non-negative and sum to one. Adding more tokens does not give the head more attention to spend; it forces it to redistribute the same unit of mass across more candidates. If the relevant key has score $s^*$ and there are $N - 1$ irrelevant keys with scores roughly $\bar s$, then the relevant key's attention weight is approximately

$$a^* \approx \frac{\exp(s^*)}{\exp(s^*) + (N - 1)\exp(\bar s)} = \frac{1}{1 + (N - 1)\exp(\bar s - s^*)}.$$
See the sigmoid form and how the curve shifts right as N grows

Letting $g = s^* - \bar s$ stand in for the score gap, this is exactly a sigmoid: $a^* = \sigma\bigl(g - \ln(N-1)\bigr)$. The midpoint of the sigmoid (where the head puts half its mass on the relevant key) sits at $g = \ln(N - 1)$. So as $N$ grows, the whole curve translates to the right: a larger score gap is needed to claim the same fraction of attention.

[Plot: $a^*$ (attention on the right key) versus score gap $g = s^* - \bar s$ from 0 to 10, for $N = 10$, $N = 100$, and $N = 1000$.]

Reading the plot: at a score gap of about 2, the head puts roughly half of its mass on the right key when $N = 10$, but only about 7% when $N = 100$ and effectively nothing when $N = 1000$. To get back to the half-mass mark in a 1000-token context, the score gap has to grow from 2 to almost 7. Doubling $N$ adds $\ln 2 \approx 0.69$ to where the curve is centered, so each context-doubling tightens the threshold for “large enough advantage” by the same amount.
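Those numbers fall directly out of the closed form; a quick check in plain Python, no model involved:

```python
import math

def attn_share(gap: float, n: int) -> float:
    """Attention on the relevant key when the other n-1 keys all score
    gap lower. Algebraically equal to sigmoid(gap - ln(n - 1))."""
    return 1.0 / (1.0 + (n - 1) * math.exp(-gap))

for n in (10, 100, 1000):
    print(n, round(attn_share(2.0, n), 3))
# A gap of 2 claims ~45% of the mass at n=10 but ~7% at n=100;
# reclaiming half the mass at n=1000 requires gap = ln(999) ≈ 6.9.
print(round(attn_share(math.log(999), 1000), 3))
```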

[Figure: attention weights per token for the same head and query in a short context ($N$ small) and a long context ($N$ large). In both, the needle gets the tallest bar, but in the long context the noise floor rises and the needle's share of the total mass shrinks.]

The picture above is the same head with the same query, looking for the same fact. On the left, the irrelevant keys are few and the right key wins by a wide margin. On the right, the noise floor of “vaguely related” keys has risen high enough that the right key still wins per token, but its share of the total mass is much smaller. The model is conditioning on a flatter, noisier weighted average and the answer it produces becomes correspondingly less crisp.

Second, distance and positional structure make long-range retrieval harder. RoPE-style encodings carry rich relative-position signal at short range, but that signal becomes coarser and more phase-ambiguous at long range. That makes it harder for attention heads to discriminate one distant position from another. By itself this does not fully explain the U-shape: the beginning of the prompt is also privileged by prompt layout, instruction tuning, and sometimes attention-sink behaviour, while the end is privileged because it sits closest to the next decoding steps. Put together, those effects make the middle the weakest place to hide important information.

See why the wave structure of RoPE forces this trade-off

RoPE encodes position $m$ by rotating each pair of embedding dimensions by an angle $m\theta_k$, with $\theta_k$ ranging from very fast (high frequency, short period) at one end of the embedding to very slow (long period) at the other. Each pair of dims is effectively a sine and cosine sampled at $m\theta_k$, and dot products between two positions $m$ and $n$ depend on $m - n$ through these sinusoids. Each frequency channel contributes differently:

[Plot: a high-frequency channel (short period) and a low-frequency channel (long period) over token positions $m = 0$ to 300.]

The high-frequency channel goes through nearly twenty cycles in 300 tokens. Two nearby positions sit at clearly different points on the wave, so the head can tell them apart easily. Two far-apart positions have so many cycles between them that the high-frequency channel is essentially random with respect to their distance: it is just as likely to put them at similar phase as different phase. The signal has aliased.

The low-frequency channel does the opposite. In 300 tokens it covers less than one cycle. Far-apart positions sit at meaningfully different points on the wave, so the model can still tell which side of the document a token is on. But near positions are practically identical: the channel changes by a tiny fraction of a cycle between them, so it carries no useful information for short-range distinctions.

There is no single channel that stays both highly precise and unambiguous across arbitrarily long distances. At long range, the usable signal becomes coarser. The model may still recover broad location, but exact relative position gets harder to pin down.
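The trade-off is easy to reproduce numerically. A small sketch with two hypothetical channel frequencies (illustrative values, not any real model's $\theta_k$):

```python
import math

def channel_signal(dist: int, theta: float) -> float:
    """One RoPE frequency channel's contribution to the query-key dot
    product between two positions; it depends only on their distance."""
    return math.cos(dist * theta)

fast = math.pi / 8   # short period: exactly 16 tokens (hypothetical value)
slow = 0.01          # long period: ~628 tokens (hypothetical value)

# Short range: the fast channel cleanly separates distance 3 from 4,
# while the slow channel barely moves between them.
print(round(channel_signal(3, fast), 3), round(channel_signal(4, fast), 3))
print(round(channel_signal(3, slow), 3), round(channel_signal(4, slow), 3))

# Long range: distances 3 and 163 differ by ten full periods of the fast
# channel, so it returns the same value for both: the phase has aliased.
print(round(channel_signal(3, fast), 6) == round(channel_signal(163, fast), 6))

# Only the slow channel still tells 3 and 163 apart, and only coarsely.
print(round(channel_signal(3, slow), 2), round(channel_signal(163, slow), 2))
```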

Recitation does not fix long-context weakness in any deep sense. What it does is shorten the effective distance between “the thing you want me to use” and “the place I am about to make a decision,” and move that content back into a part of the prompt the next few decoding steps can access more reliably.

Why moving content to the end actually helps

There are three reinforcing effects.

Recency bias from locality-biased heads. For the next token, every prior token competes in the same softmax; recent tokens are not special because they face fewer competitors. They are special because they are close. Many trained heads are strongly local, so nearby content is easier to retrieve reliably than facts buried thousands of positions earlier. Putting the goal near the end helps because the next few decoding steps can reach it at short range.

Attention sinks at the beginning. Some attention heads over-attend to a few fixed tokens near the start of the prompt, such as the BOS token or the first system tokens. That helps explain why the beginning of the context can stay unusually salient. But those sink positions are fixed; they are not a writable memory slot. The end of the context is the boundary you can actually rewrite on the fly.

Distractors crowd the recent window. In an agent loop, the text immediately before the next decision is often noisy tool output and intermediate scratch, not the user’s actual goal. If you do nothing, locality-biased heads may spend their strongest short-range attention on that noise. Reciting the goal after a noisy interlude replaces those distractors with a fresh anchor.

So the agent rewriting its todo.md is not just a memory trick. It is moving the most important text back into the part of the prompt the model is most likely to use next.

[Figure: two prompt timelines ending in "decide". Without recitation, the goal sits at the primacy end, followed by a long stretch of tool calls, results, and scratch. With goal recitation, a rewritten goal' sits immediately before "decide", inside the recency bump.]

The same content, different placement. In the second timeline, “decide” attends to the rewritten goal in its high-attention recency bump rather than digging across half a million tokens of tool output to find the original.

When recitation hurts: how you phrase the reminder matters

Recitation works because it gets the model to condition on the right text right before its next decision. But that text is itself a prompt, and prompts have implications. The same mechanism that anchors the agent on the goal can also push it toward a specific answer, stop it from exploring, or prime it to capitulate the moment something looks ambiguous. Most of the failure modes I have hit with recitation are not about whether to recite; they are about how the recitation is phrased.

The clearest version of this is leading questions in self-checks. Compare:

  - “Are you sure about that? Double-check before you commit.”
  - “Re-derive the result from the original constraints and compare it with your current answer.”

The first is not really a request to re-evaluate. It is a request to either defend or capitulate, and which one the model picks has more to do with the recent context than with whether the answer is actually correct. Models are well-documented to flip correct answers to wrong ones under this kind of phrasing; Sharma et al. (2023) on sycophancy is the canonical reference. The second leaves the model room to actually compare two things.

The trap goes the other way too. If your recitation tells the model “you are now ready to produce the final answer,” or “the analysis is complete and you can summarise,” you have implicitly said the task is done. The model takes the cue and stops exploring, even when there are open questions or contradictions sitting in the context.

A few patterns I have settled on:

  - Phrase self-checks as comparisons against the original constraints, not as yes/no confidence questions.
  - Never assert progress in the reminder itself: recite open questions and contradictions alongside the goal rather than declaring the analysis complete.
  - Recite the goal verbatim but vary the framing sentence, so the reminder does not itself become a pattern the model parrots.

There is no universal best wording. The right tone depends on whether you want the model to be willing to flip (mid-draft reviews) or to stay on the rails (long-horizon execution). The consistent rule is that leading reminders trade reasoning for compliance, and you should know which one you are buying.

A separate but related failure is the one Yan et al. (2025) studied with RoR-Bench: on elementary-school reasoning problems with a single condition quietly changed, frontier models lose about 60% accuracy because they recite the canonical solution path instead of reasoning about the new constraints. That problem is upstream of the prompt: it is about how the model was trained to recite. The phrasing problem above is downstream: it is about how you write the reminder. You can do something about the second one even if you cannot fix the first.

Practical takeaways

What this means in practice:

  1. The last few hundred tokens are your only reliable steerable attention slot. Whatever has to shape the next decision should be there.
  2. Re-emit goals every kk steps, not just once. A few hundred tokens of duplication beats a stale plan buried under tool output.
  3. Externalise bulky state to a file. There are two ways to bring a current view of it into the live context. The first: have the agent rewrite the file every step (Manus’s todo.md pattern). The rewrite is a tool call, so its content already sits in the recent context once the agent has emitted it. It does not necessarily land at the literal end of the prompt, since a few more actions and observations may stack on after the rewrite before the next decision, but it stays in the recent half of the context until the next rewrite, which is the part of the window the model attends to most reliably. That buys near-guaranteed recency at the cost of duplicating the plan every step. The second: keep the file external and expose it as a tool the agent reads on demand. Cleaner context, but the agent has to remember to look.
  4. Vary the wording slightly between recitations so the agent does not few-shot itself into a rut from its own transcript.
  5. Watch for the reciter overriding the user. If the agent confidently outputs the canonical answer to the wrong question, the problem is in the model, not the loop.
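The first four takeaways combine into a simple loop shape. A minimal sketch, with `llm` and `run_tool` as hypothetical stand-ins rather than any particular framework:

```python
def agent_loop(llm, run_tool, goal: str, max_steps: int = 50, recite_every: int = 3):
    """Agent loop that re-emits the goal every `recite_every` steps so a
    fresh copy lands near the end of the prompt instead of rotting at position 0."""
    context = [f"Goal:\n{goal}"]
    for step in range(max_steps):
        if step > 0 and step % recite_every == 0:
            # Recitation: ride the goal back into the recency window.
            # The framing varies with the step counter so the agent does not
            # few-shot itself into a rut from its own transcript.
            context.append(f"Reminder (step {step}), the goal is still:\n{goal}")
        action = llm("\n\n".join(context))
        if action == "DONE":
            break
        context.append(f"Action: {action}")
        context.append(f"Observation: {run_tool(action)}")
    return context
```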

Frontier models have not abolished the U-shape; they have moved the curve. The cheapest engineering move is still to put the important text where the model will actually look, and to keep putting it there as the context grows.

References

  - Sun et al. (2023). “Recitation-Augmented Language Models.” ICLR 2023.
  - Liu et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.”
  - Chroma Research (2025). “Context Rot: How Increasing Input Tokens Impacts LLM Performance.”
  - Sharma et al. (2023). “Towards Understanding Sycophancy in Language Models.”
  - Yan et al. (2025). “Recitation over Reasoning” (RoR-Bench).
  - Yichao “Peak” Ji (2025). “Context Engineering for AI Agents: Lessons from Building Manus.”
