Tag: benchmarks

All the articles with the tag "benchmarks".

Breaking Down Agent Evals (Part 3): τ²-bench and τ³-bench

Part 3 of 3. How τ²-bench introduced dual control by giving the user its own tools, what τ³-bench added with sprawling document retrieval and full-duplex voice, and what production agent eval still does not measure.

Published: 25 Mar, 2026 · agents / evals / benchmarks

Breaking Down Agent Evals (Part 2): τ-bench Deep Dive

Part 2 of 3. How τ-bench unified a simulated user, domain policies, and a real-world consequence model into one benchmark, why pass^k changed how the field talks about agent quality, and how its design principles transfer to your own eval suite.

Published: 20 Mar, 2026 · agents / evals / benchmarks