Recently I’ve been building agents that run for a while: long tasks, many tool calls, plenty of intermediate state piling up in the prompt. After enough turns, the model would start ignoring older instructions. Constraints from a few tool calls back stopped being respected. Sub-tasks I had asked for never got done, even though the request was still right there in the prompt.
I started reading, and it turns out a lot of people are hitting the same thing. The phenomenon has names: lost-in-the-middle, and a more general follow-up called context rot. In long contexts, models are much better at using information near the start and the end of the prompt than information buried in the middle.
The cheapest practical fix in real agent loops is recitation: periodically rewrite the goal, plan, or open sub-tasks at the tail end of the context, and the model starts conditioning on a fresh copy at decision time. This post walks through what recitation is, where it shows up in real systems, why long contexts become harder for transformers to use, and where the technique fails.
What “recitation” actually means here
The technical use of the word started with Recitation-Augmented Language Models (Sun et al., 2023). The setup there is simple: instead of doing retrieval against an external corpus the way RAG does, prompt the LLM to first recite relevant passages from its own parametric memory, then answer the question conditioned on what it just recited. RECITE works by splitting the task in two. The first step (sample what you remember) mimics the pretraining objective and is something the model is already good at. The second step (answer) is grounded in the recited text, which is now sitting fresh at the end of the context.
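As a concrete sketch, the whole pattern is two model calls. The `complete` wrapper and the prompt wording below are illustrative placeholders, not from the paper:

```python
def complete(prompt: str) -> str:
    """Placeholder for a single LLM call through whatever chat API you use."""
    raise NotImplementedError

def recite_then_answer(question: str) -> str:
    # Step 1: recitation. Sampling relevant passages from parametric
    # memory mimics the pretraining objective, which the model is
    # already good at.
    recitation = complete(
        "Recite any passages you know that are relevant to the following "
        f"question, without answering it yet:\n{question}"
    )
    # Step 2: answer, grounded in the recited text, which now sits
    # fresh at the end of the context.
    return complete(
        f"Recited passages:\n{recitation}\n\n"
        f"Question: {question}\n"
        "Answer using the passages above."
    )
```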
In agent systems, the same word covers a slightly broader idea: any time the agent deliberately writes down its goals, plan, retrieved facts, or state into the live context window before its next action. The thing being recited can be different (a goal, a todo list, a summary of what just happened, a behavioural style), but the mechanism is the same: pull the important content into the recent part of the context so the model actually conditions on it.
A few flavours people use in practice:
- Knowledge recitation as in RECITE: sample relevant facts from the model itself before answering.
- Style recitation as in StyleChat (Li et al., 2024): recite a learned style profile before generating, so the dialogue inherits the right tone.
- Goal / plan recitation as in Manus: keep a `todo.md` file and rewrite it at every step. The Manus team is explicit that this is attention manipulation, not bookkeeping for the user. Yichao “Peak” Ji puts it directly: “By constantly rewriting the todo list, Manus is reciting its objectives into the end of the context. This pushes the global plan into the model’s recent attention span, avoiding ‘lost-in-the-middle’ issues and reducing goal misalignment.”
- Self-reflection as in Reflexion (Shinn et al., 2023): after a failure, the agent writes a verbal critique of what went wrong and keeps that critique in an episodic memory it conditions on next round. The reflection itself is a recitation of “what I should remember about that mistake.”
These look different on the surface but they are doing the same physical thing to the prompt: every one of them moves a piece of important content from “somewhere in the conversation” to “right before the next decision.”
The goal-recitation case is the easiest to picture step by step. Without recitation the objectives sit at position 0 forever and get pushed deeper into the past as new actions and observations append. With recitation a fresh copy of the objectives reappears further down at step n+1, riding back into the recency window.
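Here is a minimal sketch of that loop, assuming hypothetical `llm` and `run_tool` plumbing in place of a real agent framework:

```python
def llm(messages: list[dict]) -> str:
    """Placeholder for a single chat-model call."""
    raise NotImplementedError

def run_tool(action: str) -> str:
    """Placeholder for tool execution."""
    raise NotImplementedError

def step(messages: list[dict], todo: str) -> list[dict]:
    """One agent step with the plan recited as the final message."""
    # Recitation: append a fresh copy of the objectives LAST, so the
    # next decision reads them at short range instead of digging for
    # the original copy at position 0.
    action = llm(messages + [
        {"role": "user", "content": f"Current objectives and open items:\n{todo}"}
    ])
    observation = run_tool(action)
    # Persist only the action and observation; the recitation is not
    # kept, it is re-emitted fresh on the next step.
    return messages + [
        {"role": "assistant", "content": action},
        {"role": "tool", "content": observation},
    ]
```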
The problem: long contexts quietly break LLMs
Why does that placement matter so much? Two well-known effects, both worth treating as load-bearing:
Lost in the Middle
Liu et al. (2023) put a single key fact (a “needle”) at different positions inside a long document and asked the model to retrieve it. The accuracy curve as a function of where the needle lives is U-shaped: high at the beginning, high at the end, and visibly lower when the needle is buried in the middle. The effect is not subtle, and it appears even on models that are explicitly trained for long contexts.
The intuition behind the dip is that models tend to privilege the boundaries of the prompt. The beginning matters because instructions usually live there and instruction tuning reinforces that pattern. The end matters because it sits closest to the next few decoding steps, and many heads retrieve nearby content more reliably than distant content. Either way, anything load-bearing that lives in the middle of a long prompt has a harder retrieval problem.
Context Rot
The Lost in the Middle paper is from 2023 and the U-shape became folklore. The follow-up question was: does this actually go away with frontier models and 1M-token context windows? Chroma’s “Context Rot” study (2025) tested exactly this across 18 frontier models (Claude Opus 4 and Sonnet 4, GPT-4.1, GPT-4o, Gemini 2.5, Qwen 3, and others) on extended needle-in-a-haystack and conversational QA benchmarks. The headline finding: every single model degrades as input length grows, and the degradation is not a sharp cliff at the context limit but a gradual slide that starts well before it. No model is immune. Long-context training mitigates the problem; it does not solve it.
So we have two empirical facts agents have to live with: middle positions are weak, and total length itself eats accuracy. Now the question is why, because the answer is what justifies the fix.
Where the problem comes from: dilution, distance, and position
There is not one single mechanism behind long-context failure, but the cleanest place to start is geometric. A self-attention head computes, for each query $q_i$ and every key $k_j$, a score $s_{ij} = q_i \cdot k_j / \sqrt{d}$, then turns those scores into weights via softmax:

$$a_{ij} = \frac{\exp(s_{ij})}{\sum_{j'} \exp(s_{ij'})}$$

The output of the head is then $o_i = \sum_j a_{ij} v_j$, a weighted average of value vectors. Two things follow from that softmax that matter for long contexts.

First, the attention budget is bounded. The $a_{ij}$ are non-negative and sum to one. Adding more tokens does not give the head more attention to spend; it forces it to redistribute the same unit of mass across more candidates. If the relevant key has score $s^\star$ and there are $n$ irrelevant keys with scores roughly $s$, then the relevant key’s attention weight is approximately

$$a^\star \approx \frac{e^{s^\star}}{e^{s^\star} + n \, e^{s}},$$

which falls toward zero as $n$ grows, even when the per-token margin $s^\star - s$ stays constant.
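To see the collapse concretely, here is a toy calculation (plain Python, made-up scores, no model involved) that fixes the scores and grows only the number of distractors:

```python
import math

def relevant_weight(s_star: float, s_noise: float, n: int) -> float:
    """Softmax weight of the one relevant key against n distractor keys."""
    return math.exp(s_star) / (math.exp(s_star) + n * math.exp(s_noise))

# The relevant key scores 4.0, every distractor scores 1.0, so the
# per-token margin (e^3, about 20x) never changes. Only n grows.
for n in (10, 100, 1_000, 10_000):
    print(n, round(relevant_weight(4.0, 1.0, n), 4))
# 10 0.6676
# 100 0.1673
# 1000 0.0197
# 10000 0.002
```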
The toy calculation above is the same head with the same query, looking for the same fact. With a handful of irrelevant keys, the right key wins by a wide margin. With thousands, the noise floor of “vaguely related” keys has risen high enough that the right key still wins per token, but its share of the total mass has collapsed. The model is conditioning on a flatter, noisier weighted average and the answer it produces becomes correspondingly less crisp.
Second, distance and positional structure make long-range retrieval harder. RoPE-style encodings carry rich relative-position signal at short range, but that signal becomes coarser and more phase-ambiguous at long range. That makes it harder for attention heads to discriminate one distant position from another. By itself this does not fully explain the U-shape: the beginning of the prompt is also privileged by prompt layout, instruction tuning, and sometimes attention-sink behaviour, while the end is privileged because it sits closest to the next decoding steps. Put together, those effects make the middle the weakest place to hide important information.
Recitation does not fix long-context weakness in any deep sense. What it does is shorten the effective distance between “the thing you want me to use” and “the place I am about to make a decision,” and move that content back into a part of the prompt the next few decoding steps can access more reliably.
Why moving content to the end actually helps
There are three reinforcing effects.
Recency bias from locality-biased heads. For the next token, every prior token competes in the same softmax; recent tokens are not special because they face fewer competitors. They are special because they are close. Many trained heads are strongly local, so nearby content is easier to retrieve reliably than facts buried thousands of positions earlier. Putting the goal near the end helps because the next few decoding steps can reach it at short range.
Attention sinks at the beginning. Some attention heads over-attend to a few fixed tokens near the start of the prompt, such as the BOS token or the first system tokens. That helps explain why the beginning of the context can stay unusually salient. But those sink positions are fixed; they are not a writable memory slot. The end of the context is the boundary you can actually rewrite on the fly.
Distractors crowd the recent window. In an agent loop, the text immediately before the next decision is often noisy tool output and intermediate scratch, not the user’s actual goal. If you do nothing, locality-biased heads may spend their strongest short-range attention on that noise. Reciting the goal after a noisy interlude replaces those distractors with a fresh anchor.
So the agent rewriting its `todo.md` is not just a memory trick. It is moving the most important text back into the part of the prompt the model is most likely to use next.
The same content, different placement: after the rewrite, “decide” attends to the fresh copy of the goal in its high-attention recency bump rather than digging across half a million tokens of tool output to find the original.
When recitation hurts: how you phrase the reminder matters
Recitation works because it gets the model to condition on the right text right before its next decision. But that text is itself a prompt, and prompts have implications. The same mechanism that anchors the agent on the goal can also push it toward a specific answer, stop it from exploring, or prime it to capitulate the moment something looks ambiguous. Most of the failure modes I have hit with recitation are not about whether to recite; they are about how the recitation is phrased.
The clearest version of this is leading questions in self-checks. Compare:
- “Are you sure that answers the user’s question?”
- “Restate the user’s question and your current answer side by side, then say whether they match.”
The first is not really a request to re-evaluate. It is a request to either defend or capitulate, and which one the model picks has more to do with the recent context than with whether the answer is actually correct. Models are well-documented to flip correct answers to wrong ones under this kind of phrasing; Sharma et al. (2023) on sycophancy is the canonical reference. The second leaves the model room to actually compare two things.
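In an agent loop the difference is literally just which string you append before the final pass. A small sketch with illustrative wording:

```python
# Leading: a request to defend or capitulate, not to re-evaluate.
LEADING = "Are you sure that answers the user's question?"

# Neutral: forces both objects onto the table before any judgement.
NEUTRAL = (
    "Restate the user's question and your current answer side by side, "
    "then say whether they match. If they do not, describe the gap."
)

def self_check(messages: list[dict], neutral: bool = True) -> list[dict]:
    """Append the final-pass check as the last message before review."""
    return messages + [
        {"role": "user", "content": NEUTRAL if neutral else LEADING}
    ]
```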
The trap goes the other way too. If your recitation tells the model “you are now ready to produce the final answer,” or “the analysis is complete and you can summarise,” you have implicitly said the task is done. The model takes the cue and stops exploring, even when there are open questions or contradictions sitting in the context.
A few patterns I have settled on:
- In execution loops (long agent runs): keep the recitation neutral and structural. Goal, what is open, what was just done. Do not editorialise at every step. Editorialising nudges the agent toward your wording instead of the task.
- In self-review (one final pass before output): if you want the model to actually catch its own mistakes, phrase the prompt to surface alternatives rather than to defend a position. “Where might this answer be wrong?” reliably beats “is this correct?”. The neutral framing leaves room to flip; the leading framing usually just collects a defence.
- When you genuinely want a flip: leading questions are useful when you have already decided the model should reconsider, and you just want the words to push it. Just be aware that you are buying a course correction, not an evaluation.
There is no universal best wording. The right tone depends on whether you want the model to be willing to flip (mid-draft reviews) or to stay on the rails (long-horizon execution). The consistent rule is that leading reminders trade reasoning for compliance, and you should know which one you are buying.
A separate but related failure is the one Yan et al. (2025) studied with RoR-Bench: on elementary-school reasoning problems with a single condition quietly changed, frontier models lose about 60% accuracy because they recite the canonical solution path instead of reasoning about the new constraints. That problem is upstream of the prompt: it is about how the model was trained to recite. The phrasing problem above is downstream: it is about how you write the reminder. You can do something about the second one even if you cannot fix the first.
Practical takeaways
What this means in practice:
- The last few hundred tokens are your only reliable steerable attention slot. Whatever has to shape the next decision should be there.
- Re-emit goals every few steps, not just once. A few hundred tokens of duplication beats a stale plan buried under tool output.
- Externalise bulky state to a file, then bring a current view of it into the live context in one of two ways (both sketched after this list). The first: have the agent rewrite the file every step. The rewrite is a tool call, so its content already sits in the recent context once the agent has emitted it. It does not necessarily land at the literal end of the prompt (a few more actions and observations may stack on after the rewrite before the next decision), but it stays in the recent half of the context until the next rewrite, which is the part of the window the model attends to most reliably. Guaranteed recency, at the cost of duplicating the plan every step (Manus’s `todo.md` pattern). The second: keep the file external and expose it as a tool the agent reads on demand. Cleaner context, but the agent has to remember to look.
- Vary the wording slightly between recitations so the agent does not few-shot itself into a rut from its own transcript.
- Watch for the reciter overriding the user. If the agent confidently outputs the canonical answer to the wrong question, the problem is in the model, not the loop.
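A sketch of the two externalisation options from the list above, with a hypothetical file path and tool names:

```python
from pathlib import Path

TODO = Path("todo.md")  # hypothetical path for the externalised plan

# Option 1: rewrite the file every step. The rewrite is itself a tool
# call, so the fresh plan text enters the recent context as a side
# effect of the action. Guaranteed recency, duplicated tokens.
def rewrite_todo(plan: str) -> str:
    TODO.write_text(plan)
    return plan  # the returned content lands in the transcript

# Option 2: expose the file as a read-on-demand tool. Cleaner context,
# but recitation now depends on the agent deciding to call it.
def read_todo() -> str:
    return TODO.read_text()
```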
Frontier models have not abolished the U-shape; they have moved the curve. The cheapest engineering move is still to put the important text where the model will actually look, and to keep putting it there as the context grows.
References
- Liu et al. Lost in the Middle: How Language Models Use Long Contexts. 2023. arXiv:2307.03172
- Hong et al. Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Research, 2025. trychroma.com/research/context-rot
- Sun et al. Recitation-Augmented Language Models. ICLR 2023. arXiv:2210.01296
- Li et al. StyleChat: Learning Recitation-Augmented Memory in LLMs for Stylized Dialogue Generation. 2024. arXiv:2403.11439
- Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366
- Yan et al. Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems? 2025. arXiv:2504.00509
- Sharma et al. Towards Understanding Sycophancy in Language Models. 2023. arXiv:2310.13548
- Ji, Y. Context Engineering for AI Agents: Lessons from Building Manus. 2025. manus.im