Tag: evals

All the articles with the tag "evals".

LLMs playing Just One: Why Same-Model LLM Ensembles Mode-Collapse

Four Claude Haiku instances, each asked independently for a clue for 'toast', all reply 'bread'. Four Sonnets do it more often. Four Opuses do it even more often. I built a tiny benchmark using the board game Just One to measure when LLM ensembles collapse and what makes them stop. A mixed-family ensemble combined with an anti-correlation prompt reaches 3.25× the single-model baseline.

Published: 1 Apr, 2026
· llm / evals / ensembles

What an eval suite is, and how to build one

An eval suite is not one thing. It is a layered set of checks with different costs, latencies, and confidence levels. This post walks through what the layers are, how to build the dataset (the part most teams under-do), how grading actually works in practice, and how the whole thing wires into your CI.

Published: 30 Mar, 2026
· llm / evals / testing

Breaking Down Agent Evals (Part 3): τ²-bench and τ³-bench

Part 3 of 3. How τ²-bench introduced dual control by giving the user its own tools, what τ³-bench added with sprawling document retrieval and full-duplex voice, and what production agent eval still does not measure.

Published: 25 Mar, 2026
· agents / evals / benchmarks

Breaking Down Agent Evals (Part 2): τ-bench Deep Dive

Part 2 of 3. How τ-bench unified a simulated user, domain policies, and a real-world consequence model into one benchmark, why pass^k changed how the field talks about agent quality, and how its design principles transfer to your own eval suite.

Published: 20 Mar, 2026
· agents / evals / benchmarks

Breaking Down Agent Evals (Part 1B): Eval Calibration

A primer on eval calibration: what it means for your scoring pipeline to be trustworthy, the four levels (rubric, human-to-human, LLM-to-human, LLM-to-LLM), the common biases that turn a good-looking dashboard into a fiction, and how to read Cohen's kappa without the textbook. Built around small interactive applets.

Published: 15 Mar, 2026
· agents / evals / calibration

Breaking Down Agent Evals (Part 1A): Building the Eval Suite, Hands-On

The code companion to Part 1. The same five-step methodology, walked file by file: the toy agent, the eval-case schema, the JSONL dataset, an exact-match grader, an LLM judge, and the runner that ties it together and exits non-zero on regression.

Published: 12 Mar, 2026
· agents / evals / code

Breaking Down Agent Evals (Part 1): A Practitioner's Guide

Part 1 of a 3-part series. Why traces (not code) are the source of truth in agents, the three observability primitives, run types, the metrics that matter at each level, the pass^k reliability metric, a five-step methodology for building an eval suite, and a filter-funnel argument for why no single eval method is enough.

Published: 10 Mar, 2026
· agents / evals / observability