Tag: agents
All the articles with the tag "agents".
-
A Mental Model for Ambient Agents
A four-part loop wrapped in a three-part human interface, why coding agents have product-market fit while other domains don't, and the two real blockers I keep landing on.
-
Anatomy of an Agent Harness
In March 2026, LangChain moved their coding agent from 30th to 5th on a benchmark by changing only the scaffolding around the model. The model weights didn't change; what changed was the harness. A worked-example tour of what an agent harness actually is, built around an inbox triage agent.
-
What an eval suite is, and how to build one
An eval suite is not one thing. It is a layered set of checks with different costs, latencies, and confidence levels. This post walks through what the layers are, how to build the dataset (the part most teams under-do), how grading actually works in practice, and how the whole thing wires into your CI.
-
Breaking Down Agent Evals (Part 3): τ²-bench and τ³-bench
Part 3 of 3. How τ²-bench introduced dual control by giving the user its own tools, what τ³-bench added with sprawling document retrieval and full-duplex voice, and what production agent eval still does not measure.
-
Breaking Down Agent Evals (Part 2): τ-bench Deep Dive
Part 2 of 3. How τ-bench unified a simulated user, domain policies, and a real-world consequence model into one benchmark, why pass^k changed how the field talks about agent quality, and how its design principles transfer to your own eval suite.
-
Breaking Down Agent Evals (Part 1B): Eval Calibration
A primer on eval calibration: what it means for your scoring pipeline to be trustworthy, the four levels (rubric, human-to-human, LLM-to-human, LLM-to-LLM), the common biases that turn a good-looking dashboard into a fiction, and how to read Cohen's kappa without the textbook. Built around small interactive applets.
-
Breaking Down Agent Evals (Part 1A): Building the Eval Suite, Hands-On
The code companion to Part 1. The same five-step methodology, walked file by file: the toy agent, the eval-case schema, the JSONL dataset, an exact-match grader, an LLM judge, and the runner that ties it together and exits non-zero on regression.
-
Breaking Down Agent Evals (Part 1): A Practitioner's Guide
Part 1 of a 3-part series. Why traces (not code) are the source of truth in agents, the three observability primitives, run types, the metrics that matter at each level, the pass^k reliability metric, a five-step methodology for building an eval suite, and a filter-funnel view of why no single eval method is enough.
-
Context Engineering for Long Agent Loops: The Case for Recitation
A look at why long contexts quietly break LLMs, why important information is easier to use at the boundaries than in the middle, and why agents that periodically restate their goals at the end of the context often work better.