Tag: benchmarks
All the articles with the tag "benchmarks".
- Breaking Down Agent Evals (Part 3): τ²-bench and τ³-bench
  Part 3 of 3. How τ²-bench introduced dual control by giving the user its own tools, what τ³-bench added with sprawling document retrieval and full-duplex voice, and what production agent eval still does not measure.
- Breaking Down Agent Evals (Part 2): τ-bench Deep Dive
  Part 2 of 3. How τ-bench unified a simulated user, domain policies, and a real-world consequence model into one benchmark, why pass^k changed how the field talks about agent quality, and how its design principles transfer to your own eval suite.