Part 1 of 3. Part 2 walks through τ-bench step by step. Part 3 covers the successors τ²-bench (dual-control coordination) and τ³-bench (knowledge retrieval and voice). Concepts adapted in part from Anthropic’s Demystifying Evals for AI Agents.
Why agent evals aren’t software tests
In traditional software, the code is the source of truth. Read the function, know what happens. Inputs are constrained (forms, buttons, typed parameters), outputs are deterministic, and behaviour is fully specified before runtime.
Agents break every one of those assumptions:
- Non-deterministic outputs. Same input, different trajectories.
- Unconstrained inputs. Natural language is unbounded.
- Emergent behaviour. The agent decides actions, calls tools, and mutates state autonomously.
So in agents, the traces are the source of truth. The code just defines a prompt and a set of tools. You don’t know what the agent does until you run it. That single shift is why observability and evaluation in agents are tightly coupled in a way they never are in conventional software: you cannot test what you cannot observe, and you cannot reason about what you have not traced.
This reframes debugging too. Software debugging is finding the failed function in a stack trace. Agent debugging is debugging reasoning: what went into the LLM, what came out, what context was available, which tools were called and in what order, what the model decided to do with the tool’s response. The bug is rarely a bad line of code. It is usually a bad decision in the middle of a trajectory you didn’t know existed until you saw the trace.
The three observability primitives
Almost every agent observability platform builds on the same three nested concepts. The names vary slightly between vendors but the shapes are the same.
Run (Single step). One atomic operation: one LLM call, one tool invocation, or one retrieval step. Has inputs, outputs, latency, cost, metadata. The smallest unit you can observe and evaluate.
Trace (Full turn). A full agent execution from one user message to the agent’s final response, with no human intervention in between. The agent loops through multiple runs (LLM calls, tool calls, retrievals) until it decides it is done.
Thread (Multiple turns). A full conversation: multiple traces linked by human turns. Each new user message kicks off a new trace; the thread groups them.
Read top to bottom: the message stream of an agent conversation, with three brackets on the right marking three scopes.
No level alone tells the full story. A run can be perfect (the LLM made a sensible call) inside a trace that fails (the agent picked the wrong tool earlier and never recovered) inside a thread that succeeds (the user re-prompted and the agent fixed itself in the next trace). You evaluate at all three.
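As a concrete sketch of how the three scopes nest (the field names and messages are illustrative, not any particular vendor’s schema):

```python
# Hypothetical nesting of the three primitives: a thread groups traces,
# a trace groups runs. All names and fields here are invented for illustration.
thread = {
    "id": "thread-1",
    "traces": [
        {   # one trace per user turn
            "id": "trace-1",
            "user_message": "Where is my order?",
            "runs": [
                {"type": "llm", "name": "plan"},         # model decides to look up the order
                {"type": "tool", "name": "get_order"},   # tool invocation
                {"type": "generation", "name": "reply"}, # final user-visible answer
            ],
        },
    ],
}

# One trace per user message; its runs are the agent's internal loop.
assert len(thread["traces"][0]["runs"]) == 3
```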
Types of runs
Runs come in a few standard shapes. Most agent stacks emit these four:
- LLM-call run. Model name, prompt, completion, token counts, cost, latency, finish reason. The atom of “what the model said.”
- Tool-call run. Tool name, arguments (schema-validated), return value, error if any, latency.
- Retrieval run. Query, retrieved documents, similarity scores, store identifier. Logged separately because retrieval failures and tool failures look different.
- Generation / output run. The final user-visible response for the trace.
Custom runs are common too: input pre-processing, guardrail checks, validators, post-hoc summarisation steps. Anything you might want to evaluate independently should be its own run.
A minimal schema looks roughly like this:
```python
from dataclasses import dataclass, field
from typing import Any, Literal

@dataclass
class Run:
    id: str
    parent_id: str | None  # None for the root run of a trace
    trace_id: str
    type: Literal["llm", "tool", "retrieval", "generation", "custom"]
    name: str  # e.g. "gpt-4o", "search_docs", "rerank"
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    started_at: float
    ended_at: float
    cost_usd: float | None = None
    metadata: dict[str, Any] = field(default_factory=dict)
```
Two things matter about this shape. First, the parent_id link makes the run tree reconstructible, so you can render a waterfall view of any trace. Second, inputs and outputs are stored verbatim. You will want them later. Re-running an LLM call against a captured input is how you bisect failures; replaying a tool call against captured args is how you reproduce a flaky integration.
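To make the first point concrete, here is a minimal tree reconstruction over plain dicts. The run names are invented; only the `id`/`parent_id` fields from the schema above are assumed:

```python
from collections import defaultdict

# Illustrative runs from one trace; only the fields needed for the tree.
runs = [
    {"id": "a", "parent_id": None, "name": "agent_loop", "type": "custom"},
    {"id": "b", "parent_id": "a", "name": "gpt-4o", "type": "llm"},
    {"id": "c", "parent_id": "a", "name": "search_docs", "type": "tool"},
    {"id": "d", "parent_id": "c", "name": "rerank", "type": "retrieval"},
]

def waterfall(runs: list[dict]) -> list[str]:
    """Render the run tree as indented lines, depth-first."""
    children = defaultdict(list)
    for r in runs:
        children[r["parent_id"]].append(r)
    lines = []
    def walk(parent_id, depth):
        for r in children[parent_id]:
            lines.append("  " * depth + f'{r["type"]}:{r["name"]}')
            walk(r["id"], depth + 1)
    walk(None, 0)
    return lines

for line in waterfall(runs):
    print(line)
# custom:agent_loop
#   llm:gpt-4o
#   tool:search_docs
#     retrieval:rerank
```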
What to evaluate at each level
Different levels admit different metrics. Picking the wrong level is the single most common eval-design mistake.
Run level. Did this individual operation work? Latency, cost, schema validity (did the tool call have the right arguments?), token usage. Programmatic checks dominate here.
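A run-level check can be a plain function. A sketch of argument schema validation for a tool call, where the tool and its fields are invented for illustration:

```python
# Hypothetical tool schema: field name -> (type, required). Purely illustrative.
REFUND_SCHEMA = {
    "order_id": (str, True),
    "amount_usd": (float, True),
    "reason": (str, False),
}

def validate_tool_args(args: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    errors = []
    for name, (typ, required) in schema.items():
        if name not in args:
            if required:
                errors.append(f"missing required field: {name}")
        elif not isinstance(args[name], typ):
            errors.append(f"{name}: expected {typ.__name__}, got {type(args[name]).__name__}")
    errors += [f"unexpected field: {f}" for f in args if f not in schema]
    return errors

assert validate_tool_args({"order_id": "o-1", "amount_usd": 12.5}, REFUND_SCHEMA) == []
assert validate_tool_args({"order_id": 7}, REFUND_SCHEMA) == [
    "order_id: expected str, got int",
    "missing required field: amount_usd",
]
```

Checks like this run on every tool-call run for free, which is why programmatic checks dominate at this level.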
Trace level. Did this full agent attempt achieve the goal?
- Outcome: end-state match against an annotated goal. Cleanest signal you can get.
- Path quality: number of tool calls, number of retries, whether the policy was followed, whether the agent looped.
- Rule compliance: did the trace violate domain constraints (e.g. refunding outside the allowed window)? Often best graded with a programmatic policy checker, sometimes an LLM judge.
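A programmatic policy checker for the refund-window example might look like this. The 30-day rule, the `issue_refund` tool name, and the trace shape are all assumptions for the sketch:

```python
from datetime import datetime, timedelta

REFUND_WINDOW = timedelta(days=30)  # assumed domain rule: refunds only within 30 days

def check_refund_policy(tool_runs: list[dict], now: datetime) -> list[str]:
    """Flag refund tool calls issued outside the allowed window."""
    violations = []
    for run in tool_runs:
        if run["name"] != "issue_refund":
            continue
        purchased = datetime.fromisoformat(run["inputs"]["purchase_date"])
        if now - purchased > REFUND_WINDOW:
            violations.append(f'refund for {run["inputs"]["order_id"]} outside 30-day window')
    return violations

trace = [{"name": "issue_refund",
          "inputs": {"order_id": "o-9", "purchase_date": "2024-01-01"}}]
assert check_refund_policy(trace, now=datetime(2024, 6, 1)) == [
    "refund for o-9 outside 30-day window"
]
```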
Thread level. Across the whole conversation, did the user end up where they wanted? Often the thing that actually matters in production. Hardest to grade automatically because it depends on user intent that may not be in any single message. LLM-as-judge with a careful rubric, or human review.
A useful sanity check: if your only metric is at the run level, you are evaluating the LLM, not the agent. If your only metric is at the thread level, you cannot tell why anything failed. You want all three.
Reliability is its own axis: pass^k
A 70%-pass agent that flips outcomes on identical inputs is not safe to deploy. The eval problem is no longer “did it pass on average” but “does it pass every time on the same input.” That is a separate metric.
τ-bench formalises this with pass^k: run each task k independent times and record the fraction of tasks that succeed on all k runs. A respectable pass^1 (one-shot) score of 70% can collapse to a pass^8 below 25% if the agent is genuinely inconsistent. The decay between pass^1 and pass^k is the reliability signal. We will go deeper on this in Part 2; for now, it’s enough to have it as a metric in your toolbox.
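Both metrics fall out directly from per-task trial results. A sketch, assuming each task maps to its k boolean outcomes:

```python
def pass_hat_k(results: dict[str, list[bool]]) -> float:
    """pass^k: fraction of tasks that succeed on ALL k trials."""
    return sum(all(trials) for trials in results.values()) / len(results)

def pass_at_k(results: dict[str, list[bool]]) -> float:
    """pass@k: fraction of tasks that succeed on AT LEAST ONE of k trials."""
    return sum(any(trials) for trials in results.values()) / len(results)

# Two tasks, k=4 trials each. task_a is flaky, task_b is solid.
results = {
    "task_a": [True, False, True, True],
    "task_b": [True, True, True, True],
}
assert pass_at_k(results) == 1.0   # both tasks passed at least once
assert pass_hat_k(results) == 0.5  # only task_b passed every single time
```

The gap between the two numbers on the same trial data is exactly the flakiness you cannot see in a single-run eval.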
pass@k and pass^k diverge as trials increase. At k=1 they are identical (both equal the per-trial success rate). By k=10 they tell opposite stories: pass@k approaches 100% while pass^k falls toward 0%. Framing adapted from Anthropic’s Demystifying Evals for AI Agents.
The practical implication is that you cannot ship reliability with a single-run eval. Every task in your suite needs to be runnable k times, with isolated state, and your harness needs to compute pass^k as a first-class metric.
Four steps to build an eval suite that doesn’t lie to you
Most eval suites fail in the same predictable ways. The loop you actually want is: ship, observe, mine failures, fix, validate, repeat.
- Mine production for focused test sets. Don’t guess failure modes up-front. Ship a thin agent early, instrument everything, and let real users surface the edge cases. Curate these into small, focused test sets: twenty well-curated tasks isolating specific concepts beat two hundred sloppy ones. In my experience, the hardest scenarios aren’t the happy paths, but bizarre edge cases—like a user asking to change a shipping address after a refund was already initiated. You don’t think of these up-front; you find them in the logs.
- Isolate trials and divide the labor. Run each task against a completely fresh state. Reusing state across trials causes correlated failures and hides real reliability problems. For instance, if Trial 1 leaves a dummy user in a mock CRM, Trial 2 might fail simply because the email is no longer unique. In τ-bench, every single trial spins up a completely fresh mock database to guarantee isolation. To maintain this rigor without burning out, split the work: a dedicated evals team owns the infrastructure, while domain experts contribute the tasks.
- Grade outcomes, not paths. Define success at the highest-value granularity. Compare final states against goal states programmatically whenever possible. If you mandate that an agent must call `search_kb` before `reply`, you’ll fail an agent that correctly answers from its context window. In τ-bench, success is defined purely by whether the database state matches the expected end state, regardless of how many tool calls it took. When you must use LLM-as-judge graders, score one rubric dimension at a time, build in partial credit, and keep the prompts boring.
- Read the transcripts. Sample failed traces and read them manually. You won’t know whether your agent is actually failing or your graders are just wrong until you read fifty failures in one sitting. I once spent two days debugging why an agent’s pass rate tanked, only to read the transcripts and realize the LLM-as-judge was penalizing the agent for being “too polite.” The agent was fine; the grader was broken. Failures should always seem fair; when one doesn’t, suspect the grader before the agent.
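The isolation point in step 2 can be made concrete with a factory that hands every trial its own state. A minimal sketch; the CRM shape and the dummy-user side effect are invented:

```python
import copy

SEED_CRM = {"users": [{"email": "jane@example.com"}]}  # illustrative seed data

def fresh_state() -> dict:
    """Give each trial its own deep copy so mutations cannot leak across trials."""
    return copy.deepcopy(SEED_CRM)

def run_trial(state: dict) -> None:
    """Hypothetical agent side effect: create a user with a unique email."""
    assert all(u["email"] != "dummy@example.com" for u in state["users"]), "email taken"
    state["users"].append({"email": "dummy@example.com"})

for _ in range(3):            # three trials, each against a fresh copy: all pass
    run_trial(fresh_state())

shared = fresh_state()
run_trial(shared)             # first trial mutates the shared state
try:
    run_trial(shared)         # second trial fails only because of leaked state
except AssertionError as e:
    leaked_failure = str(e)
assert leaked_failure == "email taken"
```

The second failure has nothing to do with the agent; that is the correlated-failure trap.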
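Outcome grading from step 3 can be as simple as a state diff. A sketch that assumes the end state is a flat dict; a real harness would diff an entire mock database the same way:

```python
def grade_outcome(final_state: dict, goal_state: dict) -> tuple[bool, list[str]]:
    """Pass iff every annotated goal field matches, ignoring how the agent got there."""
    diffs = [f"{k}: expected {v!r}, got {final_state.get(k)!r}"
             for k, v in goal_state.items() if final_state.get(k) != v]
    return (not diffs, diffs)

# Illustrative order record after the agent ran, vs the annotated goal.
final = {"order_status": "refunded", "refund_usd": 25.0, "notes": "customer called"}
goal = {"order_status": "refunded", "refund_usd": 25.0}

passed, diffs = grade_outcome(final, goal)
assert passed and diffs == []  # extra fields like "notes" don't matter; only the goal does
```

Note what is absent: no tool-call count, no required ordering, no transcript inspection. The grader cannot fail an agent for taking an unexpected but correct path.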
No single method is enough: the filter funnel
No single method catches every failure. Think of evaluation as a series of increasingly fine sieves in a filter funnel. Each layer catches a different type of error, and tying your methods directly to the observability primitives (Run, Trace, Thread) ensures you don’t leave gaps.
| Primitive | Eval Methods | What it Catches | Limitations |
|---|---|---|---|
| Run (Single step) | Automated schema validation, latency monitors, token counting, static analysis. | Malformed tool calls, API timeouts, context window overflows. | Tells you nothing about whether the agent achieved the user’s goal. |
| Trace (Full turn) | LLM-as-judge rubrics, programmatic state verification, policy compliance checkers. | Hallucinations, logic loops, failure to complete the task, domain rule violations. | Can diverge from real-world usage if the simulated task is unrealistic. |
| Thread (Multiple turns) | User feedback (thumbs up/down), A/B testing on completion rates, manual transcript review. | User frustration, multi-turn drift, UX issues, unanticipated edge cases. | Slow, sparse, and often lacks ground truth for why it failed. |
The takeaway: I’d advise you to use a framework, e.g. LangSmith or Arize. They all give you the same primitives, and the differences are mostly dashboards and integrations. Invest your energy in high-quality test cases and graders. The frameworks are only as good as the tasks you run through them.
What’s next
Part 2 walks through τ-bench (Yao et al., 2024), the benchmark that crystallised most of these ideas: a database-grounded conversational eval with policy documents, a simulated user, and the pass^k metric we sketched here. Part 3 covers the successors τ²-bench (dual-control coordination, where the agent and user can both act on the world) and τ³-bench (retrieval and voice).
If you only had time to study one agent benchmark, τ-bench would be it. If you are shipping agents in 2026, τ²-bench and τ³-bench are closer to your reality.
References
- Yao et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. 2024. arXiv:2406.12045