
How to Mitigate the Lost-in-the-Middle Effect in LLMs

Recently I’ve been building agents that run for a while: long tasks, many tool calls, plenty of intermediate state piling up in the prompt. After enough turns, the model would start ignoring older instructions. Constraints from a few tool calls back stopped being respected. Sub-tasks I had asked for never got done, even though the request was still right there in the prompt.

I started reading, and it turns out a lot of people are hitting the same thing. The phenomenon has names: lost-in-the-middle, and a more general follow-up version called context rot. In long contexts, models are much better at using information near the start and the end of the prompt than information buried in the middle.

The cheapest practical fix in real agent loops is recitation: periodically rewrite the goal, plan, or open sub-tasks at the tail end of the context, and the model starts conditioning on a fresh copy at decision time. This post walks through what recitation is, where it shows up in real systems, why long contexts become harder for transformers to use, and where the technique fails.


What “recitation” actually means here

The technical use of the word started with Recitation-Augmented Language Models (Sun et al., 2023). The setup there is simple: instead of doing retrieval against an external corpus the way RAG does, prompt the LLM to first recite relevant passages from its own parametric memory, then answer the question conditioned on what it just recited. RECITE works by splitting the task in two. The first step (sample what you remember) mimics the pretraining objective and is something the model is already good at. The second step (answer) is grounded in the recited text, which is now sitting fresh at the end of the context.
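As a sketch, the two-pass shape looks like this in code, with `llm` standing in for any text-completion call (a hypothetical stub, not Sun et al.'s implementation):

```python
def recite_then_answer(llm, question: str) -> str:
    """Two-pass RECITE-style prompting: recite, then answer grounded in the recitation."""
    # Pass 1: sample what the model remembers about the topic. This mimics
    # the pretraining objective, which the model is already good at.
    recitation = llm(
        f"Recite any passages you know that are relevant to answering:\n{question}"
    )
    # Pass 2: answer conditioned on the recited text, which now sits fresh
    # at the end of the context instead of buried in parametric memory.
    return llm(
        f"Passages:\n{recitation}\n\nUsing only the passages above, answer:\n{question}"
    )
```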

In agent systems, the same word covers a slightly broader idea: any time the agent deliberately writes down its goals, plan, retrieved facts, or state into the live context window before its next action. The thing being recited can be different (a goal, a todo list, a summary of what just happened, a behavioural style), but the mechanism is the same: pull the important content into the recent part of the context so the model actually conditions on it.

A few flavours people use in practice:

  - Goal recitation: re-state the task's objectives just before the next action, so they sit at the tail of the prompt.
  - Plan or todo rewriting: the agent rewrites its todo.md (or equivalent) every step, ticking off what is done and restating what is open.
  - State summaries: compress what just happened into a short recap that stands in for pages of raw tool output.
  - Behavioural reminders: restate style or safety constraints the model has started to drift from.

These look different on the surface, but they are doing the same physical thing to the prompt: every one of them moves a piece of important content from “somewhere in the conversation” to “right before the next decision.”

The goal-recitation case is the easiest to picture step by step. Without recitation the objectives sit at position 0 forever and get pushed deeper into the past as new actions and observations append. With recitation a fresh copy of the objectives reappears further down at step n+1, riding back into the recency window:

[Figure: two timelines of the agent loop at step n and step n+1. Without recitation, the objectives stay pinned at the top while actions and observations stack beneath them. With recitation, a fresh copy of the objectives is appended at step n+1, just before the newest action and observation. After the figure in Yichao "Peak" Ji, "Context Engineering for AI Agents: Lessons from Building Manus" (2025).]

The problem: long contexts quietly break LLMs

Why does that placement matter so much? Two well-known effects, both worth treating as load-bearing:

Lost in the Middle

Liu et al. (2023) put a single key fact (a “needle”) at different positions inside a long document and asked the model to retrieve it. The accuracy curve as a function of where the needle lives is U-shaped: high at the beginning, high at the end, and visibly lower when the needle is buried in the middle. The effect is not subtle, and it appears even on models that are explicitly trained for long contexts.

[Figure: a document drawn as stacked bars, one bar per line; opacity ≈ how reliably content at that position influences the next prediction. The start and end are used reliably; the middle is often dropped.]

The intuition behind the dip is that models tend to privilege the boundaries of the prompt. The beginning matters because instructions usually live there and instruction tuning reinforces that pattern. The end matters because it sits closest to the next few decoding steps, and many heads retrieve nearby content more reliably than distant content. Either way, anything load-bearing that lives in the middle of a long prompt has a harder retrieval problem.

Context Rot

The Lost in the Middle paper is from 2023 and the U-shape became folklore. The follow-up question was: does this actually go away with frontier models and 1M-token context windows? Chroma’s “Context Rot” study (2025) ran a careful version of this question across 18 frontier models (Claude Opus 4 and Sonnet 4, GPT-4.1, GPT-4o, Gemini 2.5, Qwen 3, etc.) on extended needle-in-a-haystack and conversational QA benchmarks. The headline finding: every single model degrades as input length grows, and the degradation is not a sharp cliff at the limit but a gradual slide that starts well before. There is no immune model. Long-context training mitigates the problem; it does not solve it.

So we have two empirical facts agents have to live with: middle positions are weak, and total length itself eats accuracy. Now the question is why, because the answer is what justifies the fix.

Where the problem comes from: dilution, distance, and position

There is not one single mechanism behind long-context failure, but the cleanest place to start is geometric. A self-attention head computes, for each query $q$ and every key $k_i$, a score $s_i = q^\top k_i / \sqrt{d}$, then turns those scores into weights via softmax:

$$a_i = \frac{\exp(s_i)}{\sum_{j=1}^N \exp(s_j)}.$$

The output of the head is then $\sum_i a_i v_i$, a weighted average of value vectors. Two things follow from that softmax that matter for long contexts.

First, the attention budget is bounded. The $a_i$ are non-negative and sum to one. Adding more tokens does not give the head more attention to spend; it forces it to redistribute the same unit of mass across more candidates. If the relevant key has score $s^*$ and there are $N - 1$ irrelevant keys with scores roughly $\bar s$, then the relevant key's attention weight is approximately

$$a^* \approx \frac{\exp(s^*)}{\exp(s^*) + (N - 1)\exp(\bar s)} = \frac{1}{1 + (N - 1)\exp(\bar s - s^*)}.$$
See the sigmoid form and how the curve shifts right as N grows

Letting $g = s^* - \bar s$ stand in for the score gap, this is exactly a sigmoid: $a^* = \sigma\bigl(g - \ln(N-1)\bigr)$. The midpoint of the sigmoid (where the head puts half its mass on the relevant key) sits at $g = \ln(N - 1)$. So as $N$ grows, the whole curve translates to the right: a larger score gap is needed to claim the same fraction of attention.

[Plot: $a^*$ (attention on the right key) versus score gap $g = s^* - \bar s$ from 0 to 10, for $N = 10$, $N = 100$, and $N = 1000$.]

Reading the plot: at a score gap of about 2, the head puts roughly half of its mass on the right key when $N = 10$, but only about 7% when $N = 100$ and effectively nothing when $N = 1000$. To get back to the half-mass mark in a 1000-token context, the score gap has to grow from 2 to almost 7. Doubling $N$ adds $\ln 2 \approx 0.69$ to where the curve is centered, so each context-doubling tightens the threshold for “large enough advantage” by the same amount.
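Those numbers fall directly out of the closed form; a quick check in plain Python, no model involved:

```python
import math

def attn_share(gap: float, n: int) -> float:
    """Attention on the relevant key when the other n-1 keys all score
    gap lower. Algebraically equal to sigmoid(gap - ln(n - 1))."""
    return 1.0 / (1.0 + (n - 1) * math.exp(-gap))

for n in (10, 100, 1000):
    print(n, round(attn_share(2.0, n), 3))
# A gap of 2 claims ~45% of the mass at n=10 but ~7% at n=100;
# reclaiming half the mass at n=1000 requires gap = ln(999) ≈ 6.9.
print(round(attn_share(math.log(999), 1000), 3))
```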

[Figure: attention weights per token for the same head and query in a short context ($N$ small) and a long context ($N$ large). In both, the needle gets the tallest bar, but in the long context the noise floor rises and the needle's share of the total mass shrinks.]

The picture above is the same head with the same query, looking for the same fact. On the left, the irrelevant keys are few and the right key wins by a wide margin. On the right, the noise floor of “vaguely related” keys has risen high enough that the right key still wins per token, but its share of the total mass is much smaller. The model is conditioning on a flatter, noisier weighted average and the answer it produces becomes correspondingly less crisp.

Second, distance and positional structure make long-range retrieval harder. RoPE-style encodings carry rich relative-position signal at short range, but that signal becomes coarser and more phase-ambiguous at long range. That makes it harder for attention heads to discriminate one distant position from another. By itself this does not fully explain the U-shape: the beginning of the prompt is also privileged by prompt layout, instruction tuning, and sometimes attention-sink behaviour, while the end is privileged because it sits closest to the next decoding steps. Put together, those effects make the middle the weakest place to hide important information.

See why the wave structure of RoPE forces this trade-off

RoPE encodes position $m$ by rotating each pair of embedding dimensions by an angle $m\theta_k$, with $\theta_k$ ranging from very fast (high frequency, short period) at one end of the embedding to very slow (long period) at the other. Each pair of dims is effectively a sine and cosine sampled at $m\theta_k$, and dot products between two positions $m$ and $n$ depend on $m - n$ through these sinusoids. Each frequency channel contributes differently:

[Plot: a high-frequency channel (short period) and a low-frequency channel (long period) over token positions $m = 0$ to 300.]

The high-frequency channel goes through nearly twenty cycles in 300 tokens. Two nearby positions sit at clearly different points on the wave, so the head can tell them apart easily. Two far-apart positions have so many cycles between them that the high-frequency channel is essentially random with respect to their distance: it is just as likely to put them at similar phase as different phase. The signal has aliased.

The low-frequency channel does the opposite. In 300 tokens it covers less than one cycle. Far-apart positions sit at meaningfully different points on the wave, so the model can still tell which side of the document a token is on. But near positions are practically identical: the channel changes by a tiny fraction of a cycle between them, so it carries no useful information for short-range distinctions.

There is no single channel that stays both highly precise and unambiguous across arbitrarily long distances. At long range, the usable signal becomes coarser. The model may still recover broad location, but exact relative position gets harder to pin down.
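The trade-off is easy to reproduce numerically. A small sketch with two hypothetical channel frequencies (illustrative values, not any real model's $\theta_k$):

```python
import math

def channel_signal(dist: int, theta: float) -> float:
    """One RoPE frequency channel's contribution to the query-key dot
    product between two positions; it depends only on their distance."""
    return math.cos(dist * theta)

fast = math.pi / 8   # short period: exactly 16 tokens (hypothetical value)
slow = 0.01          # long period: ~628 tokens (hypothetical value)

# Short range: the fast channel cleanly separates distance 3 from 4,
# while the slow channel barely moves between them.
print(round(channel_signal(3, fast), 3), round(channel_signal(4, fast), 3))
print(round(channel_signal(3, slow), 3), round(channel_signal(4, slow), 3))

# Long range: distances 3 and 163 differ by ten full periods of the fast
# channel, so it returns the same value for both: the phase has aliased.
print(round(channel_signal(3, fast), 6) == round(channel_signal(163, fast), 6))

# Only the slow channel still tells 3 and 163 apart, and only coarsely.
print(round(channel_signal(3, slow), 2), round(channel_signal(163, slow), 2))
```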

Recitation does not fix long-context weakness in any deep sense. What it does is shorten the effective distance between “the thing you want me to use” and “the place I am about to make a decision,” and move that content back into a part of the prompt the next few decoding steps can access more reliably.

Why moving content to the end actually helps

There are three reinforcing effects.

Recency bias from locality-biased heads. For the next token, every prior token competes in the same softmax; recent tokens are not special because they face fewer competitors. They are special because they are close. Many trained heads are strongly local, so nearby content is easier to retrieve reliably than facts buried thousands of positions earlier. Putting the goal near the end helps because the next few decoding steps can reach it at short range.

Attention sinks at the beginning. Some attention heads over-attend to a few fixed tokens near the start of the prompt, such as the BOS token or the first system tokens. That helps explain why the beginning of the context can stay unusually salient. But those sink positions are fixed; they are not a writable memory slot. The end of the context is the boundary you can actually rewrite on the fly.

Distractors crowd the recent window. In an agent loop, the text immediately before the next decision is often noisy tool output and intermediate scratch, not the user’s actual goal. If you do nothing, locality-biased heads may spend their strongest short-range attention on that noise. Reciting the goal after a noisy interlude replaces those distractors with a fresh anchor.

So the agent rewriting its todo.md is not just a memory trick. It is moving the most important text back into the part of the prompt the model is most likely to use next.

[Figure: two prompt timelines ending in "decide". Without recitation, the goal sits at the primacy end, followed by a long stretch of tool calls, results, and scratch. With goal recitation, a rewritten goal' sits immediately before "decide", inside the recency bump.]

The same content, different placement. In the second timeline, “decide” attends to the rewritten goal in its high-attention recency bump rather than digging across half a million tokens of tool output to find the original.

When recitation hurts: how you phrase the reminder matters

Recitation works because it gets the model to condition on the right text right before its next decision. But that text is itself a prompt, and prompts have implications. The same mechanism that anchors the agent on the goal can also push it toward a specific answer, stop it from exploring, or prime it to capitulate the moment something looks ambiguous. Most of the failure modes I have hit with recitation are not about whether to recite; they are about how the recitation is phrased.

The clearest version of this is leading questions in self-checks. Compare:

  - “Are you sure about that? Double-check before you commit.”
  - “Re-derive the result from the original constraints and compare it with your current answer.”

The first is not really a request to re-evaluate. It is a request to either defend or capitulate, and which one the model picks has more to do with the recent context than with whether the answer is actually correct. Models are well-documented to flip correct answers to wrong ones under this kind of phrasing; Sharma et al. (2023) on sycophancy is the canonical reference. The second leaves the model room to actually compare two things.

The trap goes the other way too. If your recitation tells the model “you are now ready to produce the final answer,” or “the analysis is complete and you can summarise,” you have implicitly said the task is done. The model takes the cue and stops exploring, even when there are open questions or contradictions sitting in the context.

A few patterns I have settled on:

  - Phrase self-checks as comparisons against the original constraints, not as yes/no confidence questions.
  - Never assert progress in the reminder itself: recite open questions and contradictions alongside the goal rather than declaring the analysis complete.
  - Recite the goal verbatim but vary the framing sentence, so the reminder does not itself become a pattern the model parrots.

There is no universal best wording. The right tone depends on whether you want the model to be willing to flip (mid-draft reviews) or to stay on the rails (long-horizon execution). The consistent rule is that leading reminders trade reasoning for compliance, and you should know which one you are buying.

A separate but related failure is the one Yan et al. (2025) studied with RoR-Bench: on elementary-school reasoning problems with a single condition quietly changed, frontier models lose about 60% accuracy because they recite the canonical solution path instead of reasoning about the new constraints. That problem is upstream of the prompt: it is about how the model was trained to recite. The phrasing problem above is downstream: it is about how you write the reminder. You can do something about the second one even if you cannot fix the first.

Practical takeaways

What this means in practice:

  1. The last few hundred tokens are your only reliable steerable attention slot. Whatever has to shape the next decision should be there.
  2. Re-emit goals every kk steps, not just once. A few hundred tokens of duplication beats a stale plan buried under tool output.
  3. Externalise bulky state to a file. There are two ways to bring a current view of it into the live context. The first: have the agent rewrite the file every step (Manus’s todo.md pattern). The rewrite is a tool call, so its content already sits in the recent context once the agent has emitted it. It does not necessarily land at the literal end of the prompt, since a few more actions and observations may stack on after the rewrite before the next decision, but it stays in the recent half of the context until the next rewrite, which is the part of the window the model attends to most reliably. That buys near-guaranteed recency at the cost of duplicating the plan every step. The second: keep the file external and expose it as a tool the agent reads on demand. Cleaner context, but the agent has to remember to look.
  4. Vary the wording slightly between recitations so the agent does not few-shot itself into a rut from its own transcript.
  5. Watch for the reciter overriding the user. If the agent confidently outputs the canonical answer to the wrong question, the problem is in the model, not the loop.
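The first four takeaways combine into a simple loop shape. A minimal sketch, with `llm` and `run_tool` as hypothetical stand-ins rather than any particular framework:

```python
def agent_loop(llm, run_tool, goal: str, max_steps: int = 50, recite_every: int = 3):
    """Agent loop that re-emits the goal every `recite_every` steps so a
    fresh copy lands near the end of the prompt instead of rotting at position 0."""
    context = [f"Goal:\n{goal}"]
    for step in range(max_steps):
        if step > 0 and step % recite_every == 0:
            # Recitation: ride the goal back into the recency window.
            # The framing varies with the step counter so the agent does not
            # few-shot itself into a rut from its own transcript.
            context.append(f"Reminder (step {step}), the goal is still:\n{goal}")
        action = llm("\n\n".join(context))
        if action == "DONE":
            break
        context.append(f"Action: {action}")
        context.append(f"Observation: {run_tool(action)}")
    return context
```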

Frontier models have not abolished the U-shape; they have moved the curve. The cheapest engineering move is still to put the important text where the model will actually look, and to keep putting it there as the context grows.

References

  - Sun et al. (2023). “Recitation-Augmented Language Models.” ICLR 2023.
  - Liu et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.”
  - Chroma Research (2025). “Context Rot: How Increasing Input Tokens Impacts LLM Performance.”
  - Sharma et al. (2023). “Towards Understanding Sycophancy in Language Models.”
  - Yan et al. (2025). “Recitation over Reasoning” (RoR-Bench).
  - Yichao “Peak” Ji (2025). “Context Engineering for AI Agents: Lessons from Building Manus.”
