
Breaking Down Agent Evals: A Practitioner's Guide

Part 1 of 3. Part 2 walks through τ-bench step by step. Part 3 covers the successors τ²-bench (dual-control coordination) and τ³-bench (knowledge retrieval and voice). Concepts adapted in part from Anthropic’s Demystifying Evals for AI Agents.


Why agent evals aren’t software tests

In traditional software, the code is the source of truth. Read the function, know what happens. Inputs are constrained (forms, buttons, typed parameters), outputs are deterministic, and behaviour is fully specified before runtime.

Agents break every one of those assumptions:

  - Inputs are open-ended: free-form natural language rather than forms, buttons, or typed parameters.
  - Outputs are non-deterministic: the same input can produce a different trajectory on the next run.
  - Behaviour is not fully specified before runtime: it emerges from the model's decisions, call by call.

So in agents, the traces are the source of truth. The code just defines a prompt and a set of tools. You don’t know what the agent does until you run it. That single shift is why observability and evaluation in agents are tightly coupled in a way they never are in conventional software: you cannot test what you cannot observe, and you cannot reason about what you have not traced.

This reframes debugging too. Software debugging is finding the failed function in a stack trace. Agent debugging is debugging reasoning: what went into the LLM, what came out, what context was available, which tools were called and in what order, what the model decided to do with the tool’s response. The bug is rarely a bad line of code. It is usually a bad decision in the middle of a trajectory you didn’t know existed until you saw the trace.

The three observability primitives

Almost every agent observability platform builds on the same three nested concepts. The names vary slightly between vendors but the shapes are the same.

Run (Single step). One atomic operation: one LLM call, one tool invocation, or one retrieval step. Has inputs, outputs, latency, cost, metadata. The smallest unit you can observe and evaluate.

Trace (Full Turn). A full agent execution from one user message to the agent’s final response, with no human intervention in between. The agent loops through multiple runs (LLM calls, tool calls, retrievals) until it decides it is done.

Thread (Multiple Turns). A full conversation: multiple traces linked by human turns. Each new user message kicks off a new trace; the thread groups them.

[Diagram: the message stream of an agent conversation, read top to bottom (User → LLM → Tool → LLM → …), with three brackets on the right marking the three scopes: individual runs, the traces they belong to, and the thread that groups the traces.]

No level alone tells the full story. A run can be perfect (the LLM made a sensible call) inside a trace that fails (the agent picked the wrong tool earlier and never recovered) inside a thread that succeeds (the user re-prompted and the agent fixed itself in the next trace). You evaluate at all three.

Types of runs

Runs come in a few standard shapes. Most agent stacks emit these four:

  - LLM runs: a single model call, with the prompt and completion captured.
  - Tool runs: one tool or API invocation, with its arguments and response.
  - Retrieval runs: one retrieval step against a search index or vector store.
  - Generation runs: a model call whose output is the user-facing text rather than an intermediate step.

Custom runs are common too: input pre-processing, guardrail checks, validators, post-hoc summarisation steps. Anything you might want to evaluate independently should be its own run.

A minimal schema looks roughly like this:

from dataclasses import dataclass, field
from typing import Literal, Any

@dataclass
class Run:
    id: str
    parent_id: str | None              # None for the root run of a trace
    trace_id: str
    type: Literal["llm", "tool", "retrieval", "generation", "custom"]
    name: str                          # e.g. "gpt-4o", "search_docs", "rerank"
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    started_at: float
    ended_at: float
    cost_usd: float | None = None
    metadata: dict[str, Any] = field(default_factory=dict)

Two things matter about this shape. First, the parent_id link makes the run tree reconstructible, so you can render a waterfall view of any trace. Second, inputs and outputs are stored verbatim. You will want them later. Re-running an LLM call against a captured input is how you bisect failures; replaying a tool call against captured args is how you reproduce a flaky integration.
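
For example, the parent_id links are enough to rebuild the run tree of a trace and print a crude waterfall. A minimal sketch over the Run shape above (print_waterfall is illustrative, not part of any particular platform's API):

from collections import defaultdict

def print_waterfall(runs: list[Run]) -> None:
    """Render one trace's runs as an indented tree, siblings ordered by start time."""
    children: defaultdict[str | None, list[Run]] = defaultdict(list)
    for run in runs:
        children[run.parent_id].append(run)

    def walk(parent_id: str | None, depth: int) -> None:
        for run in sorted(children[parent_id], key=lambda r: r.started_at):
            duration_ms = (run.ended_at - run.started_at) * 1000
            print(f"{'  ' * depth}{run.type}:{run.name}  {duration_ms:.0f} ms")
            walk(run.id, depth + 1)

    walk(None, 0)  # root runs of the trace have parent_id = None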

What to evaluate at each level

Different levels admit different metrics. Picking the wrong level is the single most common eval-design mistake.

Run level. Did this individual operation work? Latency, cost, schema validity (did the tool call have the right arguments?), token usage. Programmatic checks dominate here.

Trace level. Did this full agent attempt achieve the goal? Task completion, sensible tool selection and sequencing, policy compliance, and whether the final state matches the expected state. Programmatic state checks and LLM-as-judge rubrics both live at this level.

Thread level. Across the whole conversation, did the user end up where they wanted? Often the thing that actually matters in production. Hardest to grade automatically because it depends on user intent that may not be in any single message. LLM-as-judge with a careful rubric, or human review.

A useful sanity check: if your only metric is at the run level, you are evaluating the LLM, not the agent. If your only metric is at the thread level, you cannot tell why anything failed. You want all three.
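
Run-level checks are mostly plain assertions over the captured run. Here is a sketch of one such check for tool-call argument validity, assuming the tool's call arguments live under inputs["args"] (that key, and the argument names in the usage comment, are assumptions for illustration):

def check_tool_call_args(run: Run, required_args: dict[str, type]) -> list[str]:
    """Run-level check: did this tool call carry the arguments its schema requires?"""
    errors: list[str] = []
    args = run.inputs.get("args", {})  # assumed location of the captured call arguments
    for arg_name, arg_type in required_args.items():
        if arg_name not in args:
            errors.append(f"missing argument: {arg_name}")
        elif not isinstance(args[arg_name], arg_type):
            errors.append(f"{arg_name}: expected {arg_type.__name__}, got {type(args[arg_name]).__name__}")
    return errors

# e.g. check_tool_call_args(run, {"order_id": str, "quantity": int}) returns [] for a well-formed call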

Reliability is its own axis: pass^k

A 70%-pass agent that flips outcomes on identical inputs is not safe to deploy. The eval problem is no longer “did it pass on average” but “does it pass every time on the same input.” That is a separate metric.

τ-bench formalises this with pass^k: run each task k independent times and record the fraction of tasks that succeed on all k runs. A respectable pass^1 (one-shot) score of 70% can collapse to a pass^8 below 25% if the agent is genuinely inconsistent. The decay between pass^1 and pass^k is the reliability signal. We will go deeper on this in Part 2; for now, it’s enough to have it as a metric in your toolbox.

pass@k and pass^k diverge as trials increase. At k=1 they are identical (both equal the per-trial success rate). By k=10 they tell opposite stories: pass@k approaches 100% while pass^k falls toward 0%. Framing adapted from Anthropic’s Demystifying Evals for AI Agents.

The math, and how to estimate from a finite sample

Given a per-trial success probability $p$, the two metrics are:

$$\text{pass@}k = 1 - (1 - p)^k \qquad\qquad \text{pass}^{k} = p^k$$

The first is the probability that at least one of $k$ independent trials succeeds. The second is the probability that all $k$ succeed. At $k = 1$ both equal $p$; the curves only diverge after that.

In practice you do not know $p$ exactly. You ran each task $n$ times and observed $c$ successes. The unbiased estimators (given $n \geq k$) come from straight combinatorics over which $k$-subsets of the $n$ runs are successful:

$$\widehat{\text{pass@}k} = 1 - \binom{n - c}{k} \Big/ \binom{n}{k} \qquad\qquad \widehat{\text{pass}^{k}} = \binom{c}{k} \Big/ \binom{n}{k}$$

from math import comb

def pass_caret_k(num_correct: int, num_trials: int, k: int) -> float:
    """Unbiased pass^k estimator: P(all k of k succeed | c of n succeeded)."""
    if num_trials < k:
        raise ValueError("need at least k trials")
    return comb(num_correct, k) / comb(num_trials, k)

def pass_at_k(num_correct: int, num_trials: int, k: int) -> float:
    """Unbiased pass@k estimator: P(at least one of k succeeds | c of n succeeded)."""
    if num_trials < k:
        raise ValueError("need at least k trials")
    if num_trials - num_correct < k:
        return 1.0
    return 1.0 - comb(num_trials - num_correct, k) / comb(num_trials, k)
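
As a quick sanity check on the numbers quoted earlier, assume a single task with 70 successes out of 100 trials (a 70% per-trial success rate):

print(pass_at_k(70, 100, k=8))      # ~0.99997: at least one of 8 runs succeeding is near-certain
print(pass_caret_k(70, 100, k=8))   # ~0.05: all 8 runs succeeding is rare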

The practical implication is that you cannot ship reliability with a single-run eval. Every task in your suite needs to be runnable k times, with isolated state, and your harness needs to compute pass^k as a first-class metric.
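
A minimal harness sketch under those constraints, assuming two placeholders you supply yourself: make_fresh_state(task), which builds an isolated environment (e.g. a fresh mock database) for one trial, and run_task(task, state), which executes the agent and reports whether the goal state was reached. It reuses pass_caret_k from above:

from statistics import mean

def evaluate_suite(tasks, run_task, make_fresh_state, n_trials: int = 8, k: int = 8) -> float:
    """Run every task n_trials times against isolated state, then average the per-task pass^k."""
    per_task = []
    for task in tasks:
        successes = 0
        for _ in range(n_trials):
            state = make_fresh_state(task)              # fresh state per trial, never reused
            successes += int(bool(run_task(task, state)))
        per_task.append(pass_caret_k(successes, n_trials, k))
    return mean(per_task)  # suite-level pass^k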

Four steps to build an eval suite that doesn’t lie to you

Most eval suites fail in the same predictable ways. The loop you actually want is: ship, observe, mine failures, fix, validate, repeat.

  1. Mine production for focused test sets. Don’t guess failure modes up-front. Ship a thin agent early, instrument everything, and let real users surface the edge cases. Curate these into small, focused test sets: twenty well-curated tasks isolating specific concepts beat two hundred sloppy ones. In my experience, the hardest scenarios aren’t the happy paths, but bizarre edge cases—like a user asking to change a shipping address after a refund was already initiated. You don’t think of these up-front; you find them in the logs.
  2. Isolate trials and divide the labor. Run each task against a completely fresh state. Reusing state across trials causes correlated failures and hides real reliability problems. For instance, if Trial 1 leaves a dummy user in a mock CRM, Trial 2 might fail simply because the email is no longer unique. In τ-bench, every single trial spins up a completely fresh mock database to guarantee isolation. To maintain this rigor without burning out, split the work: a dedicated evals team owns the infrastructure, while domain experts contribute the tasks.
  3. Grade outcomes, not paths. Define success at the highest-value granularity. Compare final states against goal states programmatically whenever possible (a minimal sketch follows this list). If you mandate that an agent must call search_kb before reply, you’ll fail an agent that correctly answers from its context window. In τ-bench, success is purely defined by whether the database state matches the expected end state, regardless of how many tool calls it took. When you must use LLM-as-judge graders, score one rubric dimension at a time, build in partial credit, and keep the prompts boring.
  4. Read the transcripts. Sample failed traces and read them manually. You won’t know if your agent is actually failing or if your graders are just wrong until you read fifty failures in one sitting. I once spent two days debugging why an agent’s pass rate tanked, only to read the transcripts and realize the LLM-as-judge was penalizing the agent for being “too polite.” The agent was fine; the grader was broken. Failures should always seem fair.
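
Here is the sketch referenced in step 3: outcome grading as a field-by-field comparison, assuming the final and goal states can both be flattened into plain dicts (the keys in the usage example are hypothetical):

def grade_outcome(final_state: dict, goal_state: dict) -> tuple[bool, list[str]]:
    """Outcome-based grading: check the goal fields, ignore how the agent got there."""
    diffs = [
        f"{key}: expected {expected!r}, got {final_state.get(key)!r}"
        for key, expected in goal_state.items()
        if final_state.get(key) != expected
    ]
    return (len(diffs) == 0, diffs)

# e.g. grade_outcome(db_snapshot, {"order_123.status": "refunded"}) returns (True, []) on success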

No single method is enough: the filter funnel

No single method catches every failure. Think of evaluation as a series of increasingly fine sieves in a filter funnel. Each layer catches a different type of error, and tying your methods directly to the observability primitives (Run, Trace, Thread) ensures you don’t leave gaps.

| Primitive | Eval methods | What it catches | Limitations |
| --- | --- | --- | --- |
| Run (single step) | Automated schema validation, latency monitors, token counting, static analysis | Malformed tool calls, API timeouts, context window overflows | Tells you nothing about whether the agent achieved the user’s goal |
| Trace (full turn) | LLM-as-judge rubrics, programmatic state verification, policy compliance checkers | Hallucinations, logic loops, failure to complete the task, domain rule violations | Can diverge from real-world usage if the simulated task is unrealistic |
| Thread (multiple turns) | User feedback (thumbs up/down), A/B testing on completion rates, manual transcript review | User frustration, multi-turn drift, UX issues, unanticipated edge cases | Slow, sparse, and often lacks ground truth for why it failed |

The takeaway: I’d advise using an existing framework such as LangSmith or Arize. They all give you the same primitives, and the differences are mostly dashboards and integrations. Invest your energy in high-quality test cases and graders; the frameworks are only as good as the tasks you run through them.

What’s next

Part 2 walks through τ-bench (Yao et al., 2024), the benchmark that crystallised most of these ideas: a database-grounded conversational eval with policy documents, a simulated user, and the pass^k metric we sketched here. Part 3 covers the successors τ²-bench (dual-control coordination, where the agent and user can both act on the world) and τ³-bench (retrieval and voice).

If you only had time to study one agent benchmark, τ-bench would be it. If you are shipping agents in 2026, τ²-bench and τ³-bench are closer to your reality.

