Tag: llm
All the articles with the tag "llm".
-
Anatomy of an Agent Harness
In March 2026, LangChain moved their coding agent from 30th to 5th on a benchmark by changing only the scaffolding around the model. The model weights didn't change; what changed was the harness. A worked-example tour of what an agent harness actually is, built around an inbox triage agent.
-
GEPA: How an LLM Can Write a Better Prompt Than RL Can Train One
A walkthrough of GEPA (Agrawal et al., ICLR 2026), the reflective prompt optimiser that beats GRPO with up to 35× fewer rollouts by reading its own trace logs in plain English. The four-step loop, a worked iteration on a multi-hop QA system, the Pareto trick that keeps the candidate pool diverse, and where 98% of the rollout budget actually goes.
-
Inside MIPROv2: Bootstrap, Propose, Search
A walkthrough of MIPROv2 (Opsahl-Ong et al., 2024), DSPy's flagship prompt optimiser. The three-phase pipeline (bootstrap, propose, search), how Bayesian Optimisation makes the discrete combinatorial space tractable, what changes between the baseline and the compiled prompt, and a decision rule for when to run it.
-
Setting Logits to Negative Infinity: How LLMs Actually Output JSON
Structured outputs aren't a validation layer; they're a decoding-time intervention. How logit masking actually works, why token boundaries make it hard, and why reordering one field in your Pydantic schema can move accuracy by 90 points.
-
Prompts are Hyperparameters
A practitioner's tour of DSPy, MIPROv2 and GEPA. The reframe (prompts are parameters of an LLM program, not the artefact you ship), the five axes any optimiser can tune, how MIPROv2 and GEPA actually work, where this set of methods quietly disappoints, and a decision tree for picking one.
-
LLMs playing Just One: Why Same-Model LLM Ensembles Mode-Collapse
Four Claude Haiku instances asked independently for a clue for 'toast' all reply 'bread'. Four Sonnets do it more often. Four Opuses do it even more often. I built a tiny benchmark using the board game Just One to measure when LLM ensembles collapse and what makes them stop. A mixed-family ensemble combined with an anti-correlation prompt hits 3.25× the single-model baseline.
-
What an eval suite is, and how to build one
An eval suite is not one thing. It is a layered set of checks with different costs, latencies, and confidence levels. This post walks through what the layers are, how to build the dataset (the part most teams under-do), how grading actually works in practice, and how the whole thing wires into your CI.
-
Why Streaming LLMs Need Attention Sinks
A walkthrough of attention sinks: what they are, why softmax produces them by accident, why naive sliding-window inference collapses without them, and how a four-token reservation lets streaming inference run to four million tokens with no quality loss.
-
TextGrad: Automatic Differentiation Through LLM Critiques
A walkthrough of TextGrad (Yuksekgonul et al., Nature 2025), an autograd engine where the gradients are natural-language critiques. The PyTorch-shaped API, the four-step optimisation loop, why one framework optimises prompts, code and molecules with the same machinery, and where DSPy is the better bet.
-
Context Engineering for Long Agent Loops: The Case for Recitation
A look at why long contexts quietly break LLMs, why important information is easier to use at the boundaries than in the middle, and why agents that periodically restate their goals at the end of the context often work better.