<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>Adam&apos;s Blog</title><description>A blog documenting my thoughts and research on AI and LLMs</description><link>https://adambutterworth.com/</link><item><title>Setting Logits to Negative Infinity: How LLMs Actually Output JSON</title><link>https://adambutterworth.com/posts/setting-logits-to-negative-infinity/</link><guid isPermaLink="true">https://adambutterworth.com/posts/setting-logits-to-negative-infinity/</guid><description>Structured outputs aren&apos;t a validation layer; they&apos;re a decoding-time intervention. How logit masking actually works, why token boundaries make it hard, and why reordering one field in your Pydantic schema can move accuracy by 90 points.</description><pubDate>Mon, 11 May 2026 20:00:00 GMT</pubDate></item><item><title>Breaking Down Agent Evals (Part 3): τ²-bench and τ³-bench</title><link>https://adambutterworth.com/posts/tau-bench-successors/</link><guid isPermaLink="true">https://adambutterworth.com/posts/tau-bench-successors/</guid><description>Part 3 of 3. How τ²-bench introduced dual control by giving the user its own tools, what τ³-bench added with sprawling document retrieval and full-duplex voice, and what production agent eval still does not measure.</description><pubDate>Sun, 10 May 2026 17:00:00 GMT</pubDate></item><item><title>LLMs playing Just One: Why Same-Model LLM Ensembles Mode-Collapse</title><link>https://adambutterworth.com/posts/just-one-llm-bench/</link><guid isPermaLink="true">https://adambutterworth.com/posts/just-one-llm-bench/</guid><description>Four Claude Haiku instances asked independently for a clue for &apos;toast&apos; all reply &apos;bread&apos;. Four Sonnets do it more often. Four Opuses do it even more often. I built a tiny benchmark using the board game Just One to measure when LLM ensembles collapse and what makes them stop. 
The mixed-family ensemble + anti-correlation prompt hits 3.25× the single-model baseline.</description><pubDate>Wed, 22 Apr 2026 17:00:00 GMT</pubDate></item><item><title>Breaking Down Agent Evals (Part 2): τ-bench Deep Dive</title><link>https://adambutterworth.com/posts/tau-bench/</link><guid isPermaLink="true">https://adambutterworth.com/posts/tau-bench/</guid><description>Part 2 of 3. How τ-bench unified a simulated user, domain policies, and a real-world consequence model into one benchmark, why pass^k changed how the field talks about agent quality, and how its design principles transfer to your own eval suite.</description><pubDate>Sun, 15 Mar 2026 17:00:00 GMT</pubDate></item><item><title>Breaking Down Agent Evals (Part 1): A Practitioner&apos;s Guide</title><link>https://adambutterworth.com/posts/breaking-down-agent-evals/</link><guid isPermaLink="true">https://adambutterworth.com/posts/breaking-down-agent-evals/</guid><description>Part 1 of a 3-part series. Why traces (not code) are the source of truth in agents, the three observability primitives, run types, the metrics that matter at each level, the pass^k reliability metric, a five-step methodology for building an eval suite, and a filter funnel approach to why no single eval method is enough.</description><pubDate>Tue, 10 Feb 2026 17:00:00 GMT</pubDate></item><item><title>Why Streaming LLMs Need Attention Sinks</title><link>https://adambutterworth.com/posts/attention-sinks/</link><guid isPermaLink="true">https://adambutterworth.com/posts/attention-sinks/</guid><description>A walkthrough of attention sinks: what they are, why softmax produces them by accident, why naive sliding-window inference collapses without them, and how a four-token reservation lets streaming inference run to four million tokens with no quality loss.</description><pubDate>Wed, 12 Nov 2025 17:00:00 GMT</pubDate></item><item><title>How PPO Actually Works</title><link>https://adambutterworth.com/posts/from-trpo-to-ppo/</link><guid isPermaLink="true">https://adambutterworth.com/posts/from-trpo-to-ppo/</guid><description>PPO walked through from vanilla policy gradients, through the trust region story that motivates it, to the clipped objective you actually run. Intuition first, math when it pays off. Written for ML people who have not done much RL.</description><pubDate>Mon, 15 Sep 2025 17:00:00 GMT</pubDate></item><item><title>How to Mitigate the Lost-in-the-Middle Effect in LLMs</title><link>https://adambutterworth.com/posts/why-recitation-works/</link><guid isPermaLink="true">https://adambutterworth.com/posts/why-recitation-works/</guid><description>A look at why long contexts quietly break LLMs, why important information is easier to use at the boundaries than in the middle, and why agents that periodically restate their goals at the end of the context often work better.</description><pubDate>Fri, 15 Aug 2025 17:00:00 GMT</pubDate></item></channel></rss>