Part 3 of 3. Part 1 covered the fundamentals. Part 2 was a deep dive on the original τ-bench.
The Opus loophole story
In late 2025, Anthropic ran Opus 4.5 against τ²-bench. On one flight-booking task, the agent “failed”. The annotated goal state expected one outcome; Opus produced a different one. It had spotted a policy loophole that gave the user a better result than the canonical solution.
By the eval’s lights: failure. By any reasonable definition of customer service: a creative win.
It is a small story, but the structural lesson is large. Rigid rule-based grading penalises creative problem-solving. State grading improves on trajectory matching, but it is still imperfect when the goal state implicitly encodes one specific solution path.
The pragmatic answer is to instrument these failures for human review. Roughly half will be eval bugs (your benchmark was wrong). The other half will be agents finding cleverness you did not anticipate. Both are signal.
The deeper answer is that as agents get smarter, evals need to allow for solution diversity. τ²-bench moved in that direction by changing what the user could do.
What τ-bench missed
In retrospect, the original τ-bench is the simplest version of the agent-eval problem:
- Single-agent, passive user. The user talks. The agent acts. The world updates.
- Clean policy in the system prompt. No retrieval, no document sprawl, no “it is in Confluence somewhere”.
- Text-only. No voice, no interruption, no latency sensitivity.
- Single conversation. No persistent memory across sessions, no long-running tasks.
Each simplification is a place where production looks different. The two successors target the first three; the fourth is still open.
τ²-bench: the user has tools too
τ²-bench (Barres et al., 2025), released by Sierra and Princeton, adds the dimension τ-bench was missing: dual control.
In the original, only the agent acts on the world. The user is purely a conversation partner. In τ²-bench, the user has its own tools and must take real actions to resolve the task. The canonical example is telecom support: the agent can run diagnostics and check the account, but only the user can physically restart their modem.
That single change shifts the agent’s role from executor to coordinator.
New failure modes the original could not surface
When the user has tools, three failure categories appear that τ-bench had no way to expose:
- Forgets to delegate. The agent diagnoses the problem but does not ask the user to take the corrective action.
- Misses the user’s action. The user restarts their modem; the agent does not notice and asks them to do it again.
- Does the user’s job. The agent attempts something that only exists on the user’s side, hallucinating capability.
These are coordination failures, qualitatively different from the execution failures τ-bench measures. Production agents in IT support, healthcare scheduling, and financial onboarding all see them at meaningful rates.
The telecom domain and what it revealed
τ²-bench’s telecom domain is harder than either of the original two. The drop is sharp: gpt-4.1 goes from 74% pass^1 on retail to 34% pass^1 on telecom. Coordination is harder than execution.
This is not because telecom is intrinsically more complex than retail. The data schema and API surface are comparable. The difference is the user-side action space. The agent has to maintain a mental model of what the user can do, has done, and still needs to do, on top of its own capabilities.
What this means for your eval design
If your production agent involves any user-side action (“please confirm via email”, “restart your device”, “upload your ID”), the original τ-bench framework will systematically underestimate your failure rate. You need a τ²-style dual-control evaluation.
The pattern to borrow is straightforward: model the user as having a tool list. The user’s task is not just to talk; it is to take specific actions when prompted, sometimes fail to take them, and require the agent to follow up. The reward function then checks both the database state and whether user-side actions were correctly delegated.
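A minimal sketch of what that grading can look like, assuming a Python harness. None of the names below come from the τ²-bench repo; `Task`, `grade_episode`, and the field names are illustrative placeholders for your own types.

```python
# Sketch of dual-control grading: score the environment outcome and the
# coordination behaviour separately. Illustrative names, not tau2-bench code.
from dataclasses import dataclass


@dataclass
class Task:
    expected_db_state: dict          # annotated goal state of the environment
    user_only_tools: set[str]        # actions only the user can perform, e.g. {"restart_modem"}
    required_delegations: set[str]   # user-side actions the agent must explicitly request


def grade_episode(task: Task,
                  final_db_state: dict,
                  agent_tool_calls: set[str],
                  delegations_requested: set[str]) -> dict:
    state_ok = final_db_state == task.expected_db_state
    # "Forgets to delegate": a required user-side action was never requested.
    missed_delegations = task.required_delegations - delegations_requested
    # "Does the user's job": the agent called a tool that only exists on the user's side.
    overreach = agent_tool_calls & task.user_only_tools
    return {
        "state_ok": state_ok,
        "missed_delegations": sorted(missed_delegations),
        "overreach": sorted(overreach),
        "pass": state_ok and not missed_delegations and not overreach,
    }
```

Scoring delegation separately from state is what lets you tell a coordination failure ("forgot to ask the user to restart the modem") apart from an execution failure, rather than collapsing both into a single pass/fail.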
τ³-bench: meeting production reality
Sierra released τ³-bench in 2026. It expands the framework along two axes that finally close the gap to production: knowledge retrieval and voice.
τ-Knowledge: messy multi-document retrieval
The original τ-bench gives the agent its policy in clean Markdown, inside the system prompt. Real production agents do not get that.
Real production agents get pointed at SharePoint, Notion, Confluence, an internal wiki, and a Google Drive folder, and have to retrieve the right answer at the right time from a sprawl of policy manuals, product catalogues, standard operating procedures, and outdated FAQs.
τ-Knowledge formalises this. It tests whether agents can operate over large collections of internal company documents spread across systems and formats. The first instance is a banking_knowledge domain.
The failure surface shifts. The new dominant failure mode is not reasoning over a known policy; it is retrieving the right policy from a sprawling corpus. Common new failures:
- Right policy not in top-k. Retrieval misses the relevant document entirely.
- Conflicting policies retrieved. Multiple documents have overlapping rules; the agent picks the wrong one.
- Stale policy retrieved. An older version of a policy outranks the current one.
- Plausible policy hallucinated. The agent fills in what it thinks the policy should say.
If you have worked on RAG-heavy agents, none of this is new. What is new is having a public benchmark that measures it.
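If you want to instrument these buckets in your own retrieval eval before a public harness settles, a rough sketch follows. It assumes each task is annotated with the gold policy document and its current version; `RetrievedDoc` and `classify_retrieval` are illustrative names, not τ³-bench code.

```python
# Sketch of bucketing retrieval failures per episode, assuming gold-document
# annotations on each task. Illustrative names only.
from dataclasses import dataclass


@dataclass
class RetrievedDoc:
    doc_id: str
    version: int


def classify_retrieval(gold_doc_id: str, gold_version: int,
                       top_k: list[RetrievedDoc], cited_doc_id: str | None) -> str:
    hits = [d for d in top_k if d.doc_id == gold_doc_id]
    if not hits:
        return "right_policy_not_in_top_k"
    if all(d.version < gold_version for d in hits):
        return "stale_policy_retrieved"          # only older copies made it into top-k
    if cited_doc_id is None:
        return "plausible_policy_hallucinated"   # answered without grounding in a retrieved doc
    if cited_doc_id != gold_doc_id:
        return "conflicting_policy_chosen"       # gold doc was retrieved but a different one was cited
    return "ok"
```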
τ-Voice: full-duplex audio agents
The other τ³ extension evaluates voice agents that handle real-time audio with full-duplex communication (both sides can speak and listen at once). It runs through real-time audio APIs and supports both half-duplex (turn-based) and full-duplex (simultaneous) evaluation.
Voice introduces failure modes text agents never see:
- Interruption handling. The user starts speaking mid-response. Does the agent stop, finish, or get confused?
- Disfluency tolerance. “Um, can you, uh, change my, sorry, I mean cancel my flight.” Text agents see this cleaned up; voice agents do not.
- Latency sensitivity. A 2-second pause that is invisible in text feels like a frozen system in voice.
- Audio noise robustness. Background traffic, other speakers, poor mic quality.
- Endpointing. Knowing when the user is done speaking versus just pausing.
Voice failures are also harder to debug. The trace does not include the audio of the misunderstanding, just the (often wrong) transcription. You are debugging downstream of a lossy compression step.
Why τ³-bench is the closest to production
If you are building or buying a production agent in 2026, τ³-bench measures the surface closest to what you actually ship. Not because τ³ tests something exotic, but because it tests the boring parts of production that prior benchmarks abstracted away:
- Retrieval over sprawling document stores
- Voice as a primary interface
- Policy that lives in actual policy documents, not system prompts
The progression: τ → τ² → τ³
Putting the three together gives a story about how agent evaluation has tracked production reality.
| Year | Benchmark | What it added | Deployment problem it caught up to |
|---|---|---|---|
| 2024 | τ-bench | Multi-turn user simulation, policy compliance, pass^k | “Agents that talk” |
| 2025 | τ²-bench | User-side tools, dual control, coordination failures | “Agents that delegate” |
| 2026 | τ³-bench | Multi-doc retrieval, full-duplex voice | “Agents that ship” |
In 2024, most “agents” in production were chatbots with tool access. By 2025, real coordination workflows became viable (the agent does X, the user does Y, the agent verifies). By 2026, voice agents and document-grounded agents started shipping at meaningful scale, exposing failure surfaces text-only benchmarks could not reach.
What’s still missing
τ³-bench is the closest of the three to production reality, but real deployments still have surfaces no public benchmark covers well:
- Multi-agent coordination. Two or more agents collaborating on a shared task. τ-bench has one agent talking to one user. Real production sometimes has a triage agent handing off to a specialist, or parallel agents racing to complete sub-tasks.
- Long-horizon memory. Tasks that span days, weeks, or months. τ-bench tasks complete in a single conversation. Real CRMs, project-management agents, and personal assistants need persistence across many sessions.
- Tool ecosystem coupling. Real agents call tools that call tools that call tools. The blast radius of a single bad decision is larger than τ-bench tasks reveal.
- Adversarial users. τ-bench users have goals and personas but they are cooperative. Real users sometimes try to jailbreak, exploit, or trick the agent. Red-team eval suites exist, but they are not integrated with the τ-style framework.
- Latency budgets and cost ceilings. The benchmark measures correctness, not whether the correct answer arrived in time and within budget. Production has both constraints.
Whether τ⁴-bench (or whatever is next) targets these is anyone’s guess. Multi-agent coordination feels like the most likely next step given how much industry attention has moved there.
Practical takeaways for your own evals
If you are building eval infrastructure for production agents in 2026, the τ-bench family suggests a layered approach.
Layer 1: τ-style core. Simulated user, state grading, pass^k, four-bucket fault taxonomy. Most of the value for a fraction of the effort.
Layer 2: τ²-style dual control (if your agent delegates). Model the user with a tool list. Score both database state and delegation correctness.
Layer 3: τ³-style knowledge and voice (if your agent has either). For knowledge, point the agent at a real document store, not a clean system prompt. For voice, evaluate end-to-end with real audio, not transcribed text.
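The Layer 1 consistency metric is the piece most teams skip, and it is only a few lines. A minimal sketch of pass^k (the probability that k independent trials of the same task all succeed), using the standard unbiased estimator analogous to pass@k:

```python
# pass^k: probability that k i.i.d. trials of a task all succeed, averaged
# over tasks. trials[task_id] is a list of booleans, one per attempt.
from math import comb


def pass_hat_k(trials: dict[str, list[bool]], k: int) -> float:
    per_task = []
    for results in trials.values():
        n, c = len(results), sum(results)
        if n < k:
            raise ValueError("need at least k trials per task")
        per_task.append(comb(c, k) / comb(n, k))
    return sum(per_task) / len(per_task)


# Example: 3/4 on one task and 4/4 on another looks fine at pass^1 (0.875)
# but noticeably worse at pass^4 (0.5).
print(pass_hat_k({"task_a": [True, True, True, False], "task_b": [True] * 4}, k=4))
```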
The most common mistake is treating τ-bench as a checkbox (“we ran it, we got 60%”). The benchmark itself matters less than the design principles. Borrow the principles. Adapt the infrastructure to your domain.
What the τ-bench family taught the field
Three takeaways that hold regardless of which version you use:
- Consistency, not capability, is the production wall. pass^k made this visible. Every team building agents should measure consistency, not just average success rate.
- State grading beats trajectory matching. It allows creative solutions and is cheaper to run.
- Eval design tracks deployment design. Each successor caught up to a production reality that already existed before the benchmark did. If you are shipping an agent, your eval suite is probably one generation behind your product. Close that gap.
The τ-bench family is the most consequential thread in agent evaluation since 2024. Not because the benchmarks themselves are perfect (the Opus loophole shows they are not), but because they shifted how the field thinks about what “good” means for an agent.
Capability got the field to demos. pass^k got it to products. The next jump (long-horizon memory, multi-agent coordination, latency and cost as first-class metrics) is the surface no public benchmark covers well yet, and it is the one your eval suite should be pointed at.
References
- Yao et al., 2024. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045
- Barres et al., 2025. τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv:2506.07982
- Sierra Research, 2026. τ³-Bench: Advancing agent evaluation to knowledge and voice. sierra.ai/blog/bench-advancing-agent-benchmarking-to-knowledge-and-voice
- Code: github.com/sierra-research/tau2-bench