
Breaking Down Agent Evals (Part 2): τ-bench Deep Dive

Part 2 of 3. Part 1 covered the fundamentals of agent evaluation. Part 3 covers the successors τ²-bench and τ³-bench.


Why τ-bench matters

Most agent benchmarks before τ-bench (SWE-bench, HumanEval, BFCL) hand the agent everything it needs up front and grade the final answer. There is no human in the loop, no policies to follow, no consequences for a wrong tool call beyond a failed test.

τ-bench broke that mould by combining three things no prior benchmark unified:

  1. A simulated user. A second LLM with a persona and a goal talks to your agent in natural language across multiple turns.
  2. Domain-specific policy documents. Real rules the agent must follow, like “basic economy flights cannot be modified” or “exchange-or-modify can only be called once per order.”
  3. A real-world consequence model. APIs that successfully execute even when policy says they should not. The agent is responsible for refusing.

The third point is the one that matters most. The API will happily let your agent modify a basic-economy reservation. The reward function will fail you because the database now reflects the change, and the goal state said it should not.

This is exactly how production agents work, and it is exactly why pre-τ-bench benchmarks did not predict production behaviour.

The POMDP formulation

Each τ-bench task is formulated as a partially observable Markov decision process with state, action, observation, transition, reward, and user-utterance components.

The state space is split into a database side and a user side. The action space is similarly split: the agent issues database actions through API tools, and user-facing actions as natural-language messages. The agent is given a policy document that partially describes the world model.

Crucially, the agent cannot see the user’s instruction, and the user cannot see the agent’s tool interactions. Each side has private context, and they must converge on a shared goal through dialogue alone.
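In symbols, a compact restatement of the prose above (the notation is mine, not necessarily the paper’s exact formulation):

S = S_db × S_user        A = A_db ∪ A_user
T_db: deterministic (Python code)     T_user: stochastic (LM sampling)
r ∈ {0, 1}, assigned once, at episode end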

τ-bench setup and an example airline trajectory: Tools on the left, Agent LLM in the middle, User LLM on the right. The Agent receives the domain policy; the user receives a persona and goal. Tools read and write a database.
Source: Sierra's τ-bench blog post (Yao et al., 2024).

The three components per environment

Realistic databases and APIs. Mutable state (reservations, orders, user accounts) modified through tool calls. The transition function on the database side is deterministic Python code.

Domain-specific policy documents. Rules the agent must follow, mostly not enforced by the API. Some restrictions are checked (a non-existent payment ID returns an error), but most domain rules require the agent to internalise and apply them.

User simulator. An LLM with a persona and goal generates open-ended utterances. The transition is stochastic: the agent’s message is appended to chat history, then the user LM samples a new response. When the user issues ###STOP###, the episode ends and the agent is evaluated.
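To make the loop concrete, here is a minimal sketch of a τ-bench-style episode in Python. The agent_llm and user_llm callables are hypothetical stand-ins, not the benchmark’s actual code:

def run_episode(agent_llm, user_llm, tools, max_turns=30):
    # User LM and agent LM alternate. Tool calls mutate the database
    # deterministically; the user's next utterance is a stochastic sample.
    history = []
    user_msg = user_llm(history)  # user opens with its (private) goal
    for _ in range(max_turns):
        history.append({"role": "user", "content": user_msg})
        reply = agent_llm(history)  # returns a message or a tool call
        while reply.get("tool_call"):  # database-side transition
            call = reply["tool_call"]
            result = tools[call["name"]](**call["args"])
            history.append({"role": "tool", "content": result})
            reply = agent_llm(history)
        history.append({"role": "assistant", "content": reply["text"]})
        user_msg = user_llm(history)  # user-side transition
        if "###STOP###" in user_msg:  # user ends the episode
            break
    return history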

The two original domains

|                   | τ-retail                             | τ-airline                                   |
|-------------------|--------------------------------------|---------------------------------------------|
| Databases         | 500 users, 50 products, 1,000 orders | 500 users, 300 flights, 2,000 reservations  |
| API tools         | 7 write, 8 read                      | 6 write, 7 read                             |
| Tasks             | 115                                  | 50                                          |
| Policy complexity | Closer to commonsense                | Ad-hoc, multi-hop                           |
τ-retail vs τ-airline summary stats: number of users, products/flights, orders/reservations, write APIs, read APIs, and tasks.
Source: Sierra's τ-bench blog post (Yao et al., 2024).

τ-airline is harder by design. Baggage allowance varies by membership tier × cabin class. Flight changes must preserve origin, destination, and trip type. Basic economy cannot be modified but can be cancelled within 24 hours. The rules require multi-hop reasoning the agent has to perform from a long policy document, not from training data.

State comparison, not trajectory comparison

At the end of each conversation, τ-bench compares the actual database state to the annotated goal state. It does not compare conversation trajectories or tool-call sequences.

This matters for two reasons.

First, the agent can take any conversational path as long as the end state is correct. Creative solutions are (mostly) not penalised. The agent can ask follow-up questions in any order, call read APIs in any sequence, and phrase responses however it likes.

Second, evaluation is fast and objective. No human grading required. No subjective rubric. The DB either matches the goal or it does not.

The reward function is binary: r = r_action · r_output, with each factor in {0, 1}.
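In code, the grader is small. A sketch, assuming r_output means “every piece of information the task requires the agent to surface appears in its messages” (the names here are illustrative, not the benchmark’s API):

def grade(final_db, goal_db, agent_messages, required_outputs):
    # r_action: does the end-state database match the annotated goal state?
    r_action = int(final_db == goal_db)
    # r_output: did the agent surface every required piece of information?
    transcript = " ".join(agent_messages).lower()
    r_output = int(all(s.lower() in transcript for s in required_outputs))
    return r_action * r_output  # binary: both factors must be 1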

This is the same principle as the broader guidance from Part 1: grade outcomes, not trajectories. τ-bench made it operational.

pass^k: the consistency metric τ-bench introduced

For tasks like code generation, the community defines pass@k (“pass at k”) as the chance that at least one out of k i.i.d. trials succeeds. This captures possibility, the value of inference-time compute scaling.

For real-world agents requiring reliability, τ-bench proposed pass^k (“pass hat k”): the chance that all k i.i.d. trials succeed. This captures consistency.

Given n total trials with c successes, the unbiased estimators are:

pass^k  = E_task[ C(c, k) / C(n, k) ]
pass@k  = 1 - E_task[ C(n - c, k) / C(n, k) ]

At k=1 they are identical. After that, they diverge sharply. An agent succeeding on 6 of 8 trials (75% raw) gives:

| k | pass@k | pass^k |
|---|--------|--------|
| 1 | 0.75   | 0.75   |
| 2 | 0.96   | 0.54   |
| 4 | 1.00   | 0.21   |
| 8 | 1.00   | 0.00   |
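Both estimators are a few lines of Python. This sketch computes them per task (τ-bench then averages over tasks) and reproduces the table above for n=8, c=6:

from math import comb

def pass_hat_k(n, c, k):
    # P(all k of k i.i.d. trials succeed); comb(c, k) is 0 when k > c
    return comb(c, k) / comb(n, k)

def pass_at_k(n, c, k):
    # P(at least one of k i.i.d. trials succeeds)
    return 1 - comb(n - c, k) / comb(n, k)

for k in (1, 2, 4, 8):
    print(k, round(pass_at_k(8, 6, k), 2), round(pass_hat_k(8, 6, k), 2))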
pass^k vs pass@k across frontier models on τ-retail and τ-airline. pass^k decays sharply with k while pass@k rises, and even the strongest model drops below 25 percent pass^8 on retail.
Source: Sierra's τ-bench blog post (Yao et al., 2024, Figure 4).

This is the headline finding. Even with greater than 60% pass^1, the consistency wall hits hard. Production reliability is the unsolved problem, not capability. That insight changed how the field talks about agent quality.

A concrete task walkthrough

Here is how a task actually plays out.

System prompt to agent (excerpt):

You are an airline customer service agent. Current time: 2024-05-15 15:00 EST.

Tools:
- get_reservation(reservation_id) → reservation details
- search_flights(origin, dest, date) → available flights
- update_reservation(reservation_id, ...) → modify booking
- cancel_reservation(reservation_id) → cancel booking

POLICIES:
- Basic economy flights CANNOT be modified.
- Each reservation can have at most 5 passengers.
- Agent must collect first name, last name, and DOB for each passenger.
- Users can add but NOT remove checked bags.
- Flight changes cannot change origin, destination, or trip type.

User simulator prompt (hidden from agent):

You are Sarah Chen, user_id=USR-4821. You booked reservation RES-7734,
a basic economy round-trip from SFO to JFK. You want to change your
outbound flight from May 20 to May 22. You don't know your fare class.
If the agent asks for your user ID, provide it. If the agent correctly
tells you the flight can't be changed, accept it and ask about alternatives.

Annotated goal state:

{
  "reservation_RES-7734": {
    "status": "unchanged",
    "flights": ["original_outbound", "original_return"],
    "notes": "Agent should have denied modification due to basic economy policy"
  }
}

The failure path is the interesting one. The agent is not refused by the API; it is refused by the reward function, after the fact, because the policy says no. Nothing stops the agent from doing the wrong thing except its own reasoning. That is exactly how production looks.

The fault taxonomy

The τ-bench paper analysed 36 failed gpt-4o trajectories on τ-retail and broke them into four buckets. These map almost 1:1 to the failure modes you will see in production.

Failure breakdown of 36 failed gpt-4o τ-retail trajectories: wrong argument 33.3 percent, wrong decision 25.0 percent, wrong info 22.2 percent, partial resolution 19.4 percent.
Source: Sierra's τ-bench blog post (Yao et al., 2024).

The benchmark codebase also automates fault classification along two axes:

Fault assignment (who caused it): agent / user / environment.

Fault type (what went wrong): wrong argument / wrong decision / wrong info / partial resolution.

If you are building an agent eval suite, these four labels are a strong starting taxonomy. They cover the failure space well in practice.

One caveat the paper soft-pedals: the buckets are not orthogonal. Wrong argument and wrong decision in particular bleed into each other. An agent that passes the wrong product ID to exchange_order is making a wrong-argument error on the surface, but the root cause is usually a wrong decision one step earlier about which item the user meant or which constraint applied. Counting it as one or the other is a labelling choice, not a clean partition. In your own eval suite, expect to either accept the fuzziness or push the categorisation one layer deeper into causes (retrieval failure, constraint omission, premature commitment, and so on).

Three failure deep dives

Failure 1: Wrong argument (33%), the complex database reasoning problem

The largest failure bucket is not tool selection. It is numerical and constraint reasoning over complex inventory. From the paper:

The user wants to exchange a lamp for a less bright one and prefers an AC adapter over battery or USB power source. The agent fails to reason over the complex inventory of lamps and find the unique option given such a preference.

Multi-attribute preference reasoning over a structured space is genuinely hard. The agent has to filter by product type (lamp), match “less bright” to the brightness attribute, prefer one power source over the others, and find the unique item. Current models fail at this surprisingly often.

Weaker models fail in a more basic way: gpt-4o function-calling makes 0.46 tool calls per task with non-existent IDs, but gpt-3.5-turbo Act makes 6.34. That is ID hallucination at scale.

Failure 2: Wrong decision (25%), the domain rule following problem

These failures occur when the agent does not internalise a domain-specific rule. From the paper:

The user wants to exchange a couple of items, and according to the policy: “Exchange or modify order tools can only be called once. Be sure that all items to be exchanged are collected into a list before making the tool call.” However, the agent omits this rule and exchanges one item first, locking out the second.

The paper’s policy ablation experiment is striking:

| Model   | τ-retail (with → without policy) | τ-airline (with → without policy) |
|---------|----------------------------------|-----------------------------------|
| gpt-4o  | 61.2 → 56.8 (−4.4)               | 35.2 → 10.8 (−24.4)               |
| gpt-3.5 | 20.0 → 14.5 (−5.5)               | 10.8 → 9.6 (−1.2)                 |

τ-retail rules are close to commonsense, so removing them barely hurts. τ-airline rules are ad-hoc, so removing them devastates strong models. The implication: gpt-4o is genuinely reading and applying the airline policy. gpt-3.5 mostly is not, which is why removing it barely hurts gpt-3.5 on τ-airline (it never used the policy).

Failure 3: Partial resolution (19%), compound requests

When a task involves multiple user requests, agents drop one. Sometimes explicit (user mentions two things, agent handles one and asks “anything else?”). Sometimes implicit (user asks to “fix all my orders” but the agent stops after the first one).

Chart: pass^1 (%) against the number of required write actions in the ground truth (0, 1, 2, 3, ≥4), for gpt-4-turbo and gpt-3.5-turbo. Tasks with more required write actions are sharply harder.

The drop-off is steep. On τ-retail, gpt-4-turbo goes from roughly 70% pass^1 on tasks that need zero write actions, to about 35% on tasks that need two writes, to under 20% on tasks needing four or more. gpt-3.5-turbo follows the same curve a band lower: roughly 35% at zero writes, near 10% at two, almost nothing at four. The picture implicates long-context memory, attention decay (the user’s first request is by now buried under tool outputs), and weak request decomposition. It is also the most fixable failure mode of the four: explicit request enumeration plus a checklist verification pass typically claws back most of the loss.
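A concrete version of that mitigation, sketched as a final verification call before the agent is allowed to close the conversation (a hypothetical pattern, not something from the paper):

CHECKLIST_PROMPT = """Before ending the conversation:
1. List every distinct request the user has made so far.
2. Mark each request resolved, in progress, or not started.
3. If anything is unresolved, continue working instead of closing."""

def verify_before_close(agent_llm, history):
    # One extra agent call that forces explicit request enumeration.
    return agent_llm(history + [{"role": "system", "content": CHECKLIST_PROMPT}])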

Key results

| Model         | τ-retail pass^1 | τ-airline pass^1 | Average |
|---------------|-----------------|------------------|---------|
| gpt-4o        | 61.2%           | 35.2%            | 48.2%   |
| gpt-4-turbo   | 57.7%           | 32.4%            | 45.1%   |
| gpt-4-32k     | 56.5%           | 33.0%            | 44.8%   |
| claude-3-opus | 44.2%           | 34.7%            | 39.5%   |
| mistral-large | 30.7%           | 22.4%            | 26.6%   |
| gpt-3.5-turbo | 20.0%           | 10.8%            | 15.4%   |
pass^1 ranking across 12 frontier models on τ-retail and τ-airline, with gpt-4o at the top and weaker open-weight models at the bottom.
Source: Sierra's τ-bench blog post (Yao et al., 2024).

A few takeaways worth knowing:

  1. Function calling beats text-formatted ReAct consistently across strong models.
  2. Adding a “think” tool to FC agents did not help. Most FC models are not trained for that pattern, so giving them an explicit reasoning slot does not move the needle.
  3. For weaker models, ReAct beats Act-only. Reasoning traces help bridge the gap between observations and unfamiliar action formats. For stronger models, native FC dominates.
  4. τ-airline is much harder than τ-retail for every model.
  5. Open-weight models lag. llama-3-70b and mixtral-8x22b have a meaningful gap to closed-weight frontier models, even on this customer-service-style task that should be in their training distribution.

Cost analysis worth knowing

When pairing a gpt-4o FC agent with a gpt-4 user simulator on τ-retail, 95.9% of the agent cost is input tokens and only 4.1% is output. The cost is dominated by the long system prompt (domain policy + function definitions), not by what the agent generates.

This is a useful design constraint: if you are building a τ-bench-style evaluation in-house, optimise policy documents for token efficiency. Long policies are expensive at evaluation scale.
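The arithmetic is easy to sanity-check. Assuming, purely for illustration, a 6,000-token policy-plus-tools prompt resent on each of roughly 15 agent calls per episode and about 150 output tokens per call (these numbers are assumptions, not from the paper):

# Rough illustration of why input tokens dominate agent cost.
input_tokens = 15 * (6_000 + 500)   # system prompt + some history per call
output_tokens = 15 * 150
print(input_tokens / (input_tokens + output_tokens))  # ≈ 0.98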

Designing your own evals using τ-bench principles

Even if you do not run τ-bench itself, the design principles transfer directly to almost any agent eval suite.

  1. Simulate the user. Use a second LLM with a persona and goal. Do not pre-script utterances. The stochasticity is what surfaces consistency failures.
  2. Grade end-state, not trajectory. Compare the database (or whatever your “world” is) at task end to an annotated goal state. Do not check exact tool sequences.
  3. Do not enforce policy in the API. Let the API succeed when the agent does the wrong thing. Catch the violation in the reward function. This forces the agent to internalise policy rather than relying on the environment to fail safely.
  4. Run multiple trials per task. Report pass^k, not just pass@k. One trial per task is misleading at best.
  5. Annotate goal states precisely. Where one outcome is genuinely unique (a specific reservation must be cancelled, a specific record must be created), write it as a single DB state. Where multiple resolutions are acceptable (refund vs replacement vs partial credit), use a goal predicate or a set of acceptable end states rather than forcing a single ground truth (see the sketch after this list). The τ-bench paper used iterative agent runs to flush out ambiguities (run with gpt-4-turbo, examine trajectories, polish instructions), but see the limitations section below for why a single ground-truth state is often the wrong abstraction in the first place.
  6. Use the four-bucket fault taxonomy. Wrong argument, wrong decision, wrong info, partial resolution. Almost every failure fits.
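For principle 5, a goal predicate is just a function over the final database state. A hypothetical sketch (the order ID and field names are illustrative):

def goal_reached(db):
    # Any of refund, replacement, or partial credit counts as resolved.
    order = db["orders"]["ORD-1001"]
    return (
        order["status"] in ("refunded", "replaced")
        or order.get("credit_issued", 0) > 0
    )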

When τ-bench is the right benchmark, and when it is not

Use τ-bench (or its principles) when: your agent holds multi-turn conversations with users, has to follow written policy rather than hard API constraints, and mutates stateful systems through tool calls.

Do not use τ-bench when: you only need to isolate single-step function-calling accuracy (BFCL does that), or when your deployment looks nothing like conversational customer service over a database.

BFCL and τ-bench are complementary, not redundant. BFCL isolates single-step function-calling accuracy. τ-bench measures whether the agent can hold a multi-turn conversation, internalise policy, and produce the right end state consistently. A model can top BFCL and still collapse on τ-airline.

What τ-bench changed, and what it left out

Before τ-bench, agent benchmarks made it look like 2023-era LLMs were close to production-ready. After τ-bench, the picture clarified: capability was respectable but consistency was abysmal, and consistency is what production needs. pass^k is the metric that made that visible.

The contributions are by now table stakes for serious agent evaluation: a simulated LLM user, policy documents the environment does not enforce, end-state grading, and multi-trial pass^k reporting.

The benchmark has real limitations though, and several of them matter more than the paper itself dwells on.

One annotated goal state per task. State-equality grading assumes there is a single correct outcome the database should end in. For a lot of real customer-service work that is false. “Help me fix my order” can be resolved by a refund, a replacement, or a partial credit, all of which a human agent would call correct. τ-bench’s grader marks all but one of those wrong. The paper acknowledges this in passing (the authors used iterative agent runs to flush out ambiguous annotations), but the structural fix would be to allow goal sets or goal predicates rather than a single ground-truth DB state. None of the successors do this either.

The user simulator bounds the evaluation. Every τ-bench number is a measurement of agent quality given a particular simulated user. If the user LM is too cooperative, the benchmark gets easier and agents look better than they are; if it is too adversarial or off-distribution, the benchmark measures something other than the deployment task. The paper does not really interrogate this. There is no ablation of the user model, no human-in-the-loop calibration of how realistic the simulated dialogues are, no discussion of how the choice of gpt-4 as user simulator biases relative rankings. If you build a τ-bench-style suite for your own product, the user simulator is the largest single confound in your numbers and worth as much engineering attention as the agent.

State equality misses safety-relevant trajectory differences. Two agents that produce the same final DB state are not equivalent if one of them got there by, say, leaking another customer’s PII on the way, or by silently overriding a policy and then undoing the change after the user pushed back. Outcome grading is the right default (it is fast, objective, and resists trajectory-hacking), but it cannot catch “right answer for the wrong reason” failures, which are exactly the failures you most want to catch in a regulated domain. A serious deployment suite needs to layer trajectory-level safety checks on top of τ-bench-style outcome grading, not replace one with the other.

Other gaps. The benchmark assumes a single agent talking to a passive user (no coordination between agents, no tool-using user). The policy is handed to the agent in clean text rather than buried across multiple internal docs the way it would be in production. It is text-only, so it misses the entire failure surface of voice agents (disfluencies, interruptions, ASR errors).

Each of those gaps motivated a successor. Part 3 covers what τ²-bench and τ³-bench added, what they fixed, and what they still do not.

← Part 1
Breaking Down Agent Evals: A Practitioner's Guide
Foundations: traces, run types, pass^k, four-step methodology, why no single eval method is enough.
Part 3 →
τ²-bench and τ³-bench
Dual control, document-sprawl retrieval, full-duplex voice, and what production eval still does not measure.

References

Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045.
