Part 1A of the agent-evals series. Part 1 covered the methodology in concept. This post walks through the same methodology as actual code: the files, what is in them, what the runner does. By the end you have a working skeleton to fork for your own agent.
Why a Part 1A
Part 1 laid out the five-step methodology (mine production, isolate trials, grade outcomes, read the transcripts, balance both directions) and the conceptual frame of evaluating at the run, trace, and thread levels. It was the what. This post is the how: a tiny suite for a tool-calling agent, structured so the example carries over from Part 1’s vocabulary. Less than 300 lines of code end to end. Runnable from the command line. Wireable into CI.
If you want the deeper take on layered eval design (the vibes → asserts → golden → trace-replays → end-to-end ladder, when to layer LLM-as-judge on top of deterministic checks, calibration protocols, anti-patterns), that lives in the standalone eval-suite post. This one stays focused on the artefact: what is on your laptop after a day of work.
The example app
The subject under test is a customer-support agent with two tools, get_order_status and cancel_order. It is small enough to fit in one file and has one deliberate weakness: the system prompt nudges it toward confirmation before cancelling but does not enforce it, so it sometimes skips the confirmation step on aggressive phrasings. The suite has to catch that.
# app/agent.py
import anthropic, json
from dataclasses import dataclass

@dataclass
class AgentResult:
    final_response: str
    tool_calls: list[dict]

TOOLS = [
    {"name": "get_order_status",
     "description": "Look up an order's current status.",
     "input_schema": {"type": "object",
                      "properties": {"order_id": {"type": "string"}},
                      "required": ["order_id"]}},
    {"name": "cancel_order",
     "description": "Cancel an order. Requires explicit user confirmation.",
     "input_schema": {"type": "object",
                      "properties": {"order_id": {"type": "string"},
                                     "confirmation": {"type": "boolean"}},
                      "required": ["order_id", "confirmation"]}},
]

MOCK_ORDERS = {"12345": {"status": "shipped", "eta": "2026-03-15"},
               "67890": {"status": "processing", "eta": "2026-03-20"}}

SYSTEM = "You are a customer-support agent. Be brief. Ask for explicit confirmation before cancelling."

def _exec(name: str, args: dict) -> str:
    # Mock tool execution: no real backend, just canned data.
    if name == "get_order_status":
        return json.dumps(MOCK_ORDERS.get(args["order_id"], {"error": "not found"}))
    if name == "cancel_order":
        return json.dumps({"cancelled": bool(args.get("confirmation"))})
    return json.dumps({"error": "unknown tool"})

def run_agent(user_message: str, model="claude-sonnet-4-5") -> AgentResult:
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": user_message}]
    tool_calls = []
    for _ in range(6):  # max turns
        r = client.messages.create(model=model, max_tokens=512, system=SYSTEM,
                                   tools=TOOLS, messages=messages)
        if r.stop_reason != "tool_use":
            # No tool call requested: the text blocks are the final answer.
            return AgentResult(final_response="".join(b.text for b in r.content if b.type == "text"),
                               tool_calls=tool_calls)
        messages.append({"role": "assistant", "content": r.content})
        tool_results = []
        for block in r.content:
            if block.type == "tool_use":
                tool_calls.append({"name": block.name, "args": dict(block.input)})
                tool_results.append({"type": "tool_result", "tool_use_id": block.id,
                                     "content": _exec(block.name, block.input)})
        messages.append({"role": "user", "content": tool_results})
    return AgentResult(final_response="(turn limit reached)", tool_calls=tool_calls)
That is the whole subject. The deliberate-weakness wording is in the system prompt: “ask for explicit confirmation” is a soft constraint, and the tool schema accepts confirmation: bool either way. The grader will catch the misuse.
The eval case schema
Every case is a JSON object with required fields (input, expected behaviour), optional metadata, and tags for grouping. The schema is deliberately small: anything you cannot define in a few fields, you do not yet know how to evaluate.
{"id": "case_001", "input": "where's my order #12345?",
"expected_tool_calls": [{"name": "get_order_status", "args": {"order_id": "12345"}}],
"expected_response_traits": ["mentions delivery date", "no false apology"],
"tags": ["happy_path", "lookup"], "difficulty": "easy",
"metadata": {"source": "trace_2026_03_09", "owner": "evals-team"}}
Required: id, input, and the expected outcome (some combination of expected_tool_calls and expected_response_traits). Optional but worth their weight: tags for slicing pass rates, difficulty for tracking the long tail, and metadata for provenance (which production trace did this come from, who owns it, when was it last reviewed).
The first suite has 6 to 8 cases across these categories:
- Happy path lookup
- Happy path cancellation with confirmation given
- Ambiguous request (“can you help with my order?”)
- Out-of-scope (“what’s the weather?”)
- Policy edge: cancel without confirmation should prompt first (this is the case that catches the agent’s weakness)
- Two-step: lookup, then cancel
- Adversarial: “cancel everything”
- Optional: typo-heavy input
Six is the minimum; eight is plenty for a v1 suite. The discipline of starting small and growing from production traces is the same as in Part 1’s mine-production step.
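To make that concrete, here is a sketch of what two of those categories might look like as lines in cases/dataset.jsonl. The IDs, traits, and tags are illustrative, not a canonical dataset:

{"id": "case_004", "input": "what's the weather?", "expected_tool_calls": [], "expected_response_traits": ["declines the out-of-scope request", "offers order-related help"], "tags": ["out_of_scope"], "difficulty": "easy"}
{"id": "case_005", "input": "cancel my order 12345", "expected_tool_calls": [], "expected_response_traits": ["asks for explicit confirmation before cancelling", "does not cancel yet"], "tags": ["policy_edge", "cancellation"], "difficulty": "hard"}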
Why JSONL
The default choice. Each case is one line, which makes it stream-friendly, diff-friendly in git, and trivially machine-readable.
| Format | Pros | Cons | When to pick |
|---|---|---|---|
| JSONL | one case per line, stream-friendly, diff-friendly, machine-readable | noisy for humans, no comments | default; most production suites land here |
| YAML | readable, comments allowed, multi-line strings clean | indentation traps, harder to stream, parser quirks | small suites, config-heavy cases |
| Python dicts | full language available, computed inputs easy | cases become code, hard to share, harder to validate | when cases genuinely need logic |
| CSV | spreadsheet-editable, low friction for non-engineers | breaks on nested fields, escaping nightmares | flat classification tasks only |
YAML wins for small suites with heavy commentary or repeated structure. Python dicts win when a case needs to compute something from another (parameterise across model names, generate prompts from templates). CSV is a trap for anything but flat classification. JSONL is the right default until you have a specific reason to switch.
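For the Python-dicts row, the win is parameterisation. A minimal sketch of what that looks like, assuming you want the same lookup case against two model names (the second name is an assumption for illustration):

# hypothetical: cases as Python dicts, parameterised across model names
MODELS = ["claude-sonnet-4-5", "claude-haiku-4-5"]  # second name assumed
CASES = [
    {"id": f"lookup_{model}", "model": model,
     "input": "where's my order #12345?",
     "expected_tool_calls": [{"name": "get_order_status", "args": {"order_id": "12345"}}]}
    for model in MODELS
]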
Folder layout
eval-suite/
├── app/
│ └── agent.py # subject under test
├── cases/
│ └── dataset.jsonl # eval cases
├── graders/
│ ├── tool_call_match.py # deterministic
│ └── llm_judge.py # LLM-as-judge
├── reports/
│ └── latest.json # written by runner
├── runner.py # entry point
└── README.md
Split by capability, not by data source and not by grader type. Cases and graders evolve together: every time you add a case, you decide which grader scores it, and that decision should be one place to look. Splitting by data source (production/, synthetic/, adversarial/) fragments that link and pushes you toward maintaining three separate suites that drift apart. Splitting by grader type (exact-match-cases/, judge-cases/) forces premature commitment: you have to decide up-front how each case is going to be graded, before you have seen the agent fail on it.
The single-grader-per-case rule is wrong too. A case can have an exact-match grader on tool calls and an LLM judge on the response text. The runner handles that.
Grader v1: exact match
Start with the simplest grader that works. Compare the agent’s tool calls against the expected list: same names, same key arguments, in the same order.
# graders/tool_call_match.py
from dataclasses import dataclass

@dataclass
class GradeResult:
    passed: bool
    score: float
    reason: str

def grade_tool_calls(actual: list[dict], expected: list[dict]) -> GradeResult:
    if len(actual) != len(expected):
        return GradeResult(False, 0.0,
                           f"call count mismatch: expected {len(expected)}, got {len(actual)}")
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a["name"] != e["name"]:
            return GradeResult(False, 0.0,
                               f"call {i}: expected {e['name']}, got {a['name']}")
        for key, value in e["args"].items():
            if a["args"].get(key) != value:
                return GradeResult(False, 0.0,
                                   f"call {i}: arg {key!r} expected {value!r}, got {a['args'].get(key)!r}")
    return GradeResult(True, 1.0, "all tool calls match")
Strict on tool name and on every key listed in the expected case. Lenient on extra arguments the agent might add (only the keys present in expected are compared). The return value is structured rather than a bare bool so the runner can show why a case failed without re-reading the entire trace.
This grader catches case 5 (the policy edge): the expected pattern there is to ask for confirmation first, with no tool calls on that turn, so an unprompted cancel_order call fails on call count or argument structure. The fail message says which count, tool name, or argument mismatched.
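A quick spot-check of that behaviour, as a hypothetical snippet run from the repo root:

# hypothetical spot-check of the case_005 failure mode
from graders.tool_call_match import grade_tool_calls

expected = []  # policy edge: the agent should ask for confirmation, not call a tool yet
actual = [{"name": "cancel_order", "args": {"order_id": "12345", "confirmation": True}}]

print(grade_tool_calls(actual, expected))
# GradeResult(passed=False, score=0.0, reason='call count mismatch: expected 0, got 1')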
Where exact match breaks
Case 3 is the ambiguous lookup: “can you help with my order?” There is no order ID, so the agent should ask a clarifying question rather than call any tool. The expected tool_calls list is empty.
The exact-match grader handles that fine. What it cannot handle is whether the clarifying question is good. An agent that responds with “Sure, what’s your order ID?” passes. An agent that responds with “I’m afraid I cannot assist with that request, please contact support” also has zero tool calls and so passes exact-match, but the response is wrong. The user did not get the answer they needed.
Or consider case 4, the out-of-scope question (“what’s the weather?”). Expected tool calls is again empty. Exact-match scores both responses identically:
- “I can only help with order-related questions. Is there an order I can look up for you?” ← good
- “It is currently 18 degrees and partly cloudy in San Francisco.” ← bad
Both have zero tool calls. Both pass.
The deterministic layer caught the policy edge in case 5 because the tool call was wrong. It cannot catch the response failures in cases 3 and 4 because the failure lives in the free text, not in a structured side-effect. Adding a second grader for response quality is the next move.
Grader v2: LLM judge
The LLM-as-judge layer scores response traits that the exact-match grader cannot reach. Tone, completeness against a list of expected traits, no-extraneous-content, refusal correctness on out-of-scope inputs. The prompt is small and the output is JSON so the runner can parse it without regex.
# graders/llm_judge.py
import anthropic, json, re
from dataclasses import dataclass

@dataclass
class GradeResult:
    passed: bool
    score: float
    reason: str

JUDGE_MODEL = "claude-opus-4-7"  # different family / size from the candidate to avoid self-grading

JUDGE_PROMPT = """You are evaluating a customer-support agent's response.
User message: {input}
Agent response: {response}
Expected traits: {traits}
Score the response 1 to 3:
1 = misses required traits, is harmful, or is off-topic.
2 = partial: some traits present, others missing.
3 = all expected traits present, tone professional, no extraneous content.
Respond ONLY as JSON: {{"score": <int 1-3>, "reasoning": "<one sentence>"}}"""

def _extract_json(text: str) -> dict:
    m = re.search(r"\{.*\}", text, re.S)
    return json.loads(m.group(0))

def grade_response(user_input: str, response: str, traits: list[str]) -> GradeResult:
    client = anthropic.Anthropic()
    prompt = JUDGE_PROMPT.format(input=user_input, response=response, traits=traits)
    out = client.messages.create(model=JUDGE_MODEL, max_tokens=200,
                                 messages=[{"role": "user", "content": prompt}])
    parsed = _extract_json(out.content[0].text)
    return GradeResult(passed=parsed["score"] >= 2,
                       score=parsed["score"] / 3.0,
                       reason=parsed["reasoning"])
Three gotchas worth naming. The judge model has its own variance: the same response scored 2 today might score 3 tomorrow on identical inputs. For production, run the judge k=3 times per case and average; for the demo, k=1 is fine. The judge is prompt-sensitive: rephrasing the rubric shifts scores systematically, so check the prompt into git the same way you check in the cases. And grading the candidate with a model from its own family creates a self-grading bias, so the judge here is Opus while the candidate is Sonnet. The deeper treatment of these gotchas (calibration against humans, kappa, the cross-family rule) is in the eval-suite post.
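If you do want the k=3 averaging, a minimal sketch, assuming it sits next to grade_response in graders/llm_judge.py (the helper name is mine):

# hypothetical wrapper: run the judge k times and average the scores
def grade_response_k(user_input: str, response: str,
                     traits: list[str], k: int = 3) -> GradeResult:
    grades = [grade_response(user_input, response, traits) for _ in range(k)]
    avg = sum(g.score for g in grades) / k
    return GradeResult(passed=avg >= 2 / 3,  # >= 2 on the 1-3 scale, since score = raw / 3
                       score=avg,
                       reason=" | ".join(g.reason for g in grades))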
The runner
The glue. Loads cases, calls the agent on each, applies both graders, aggregates pass rate, prints a compact table, writes a JSON report for CI, exits non-zero if the pass rate falls below threshold.
# runner.py
import json, sys
from app.agent import run_agent
from graders.tool_call_match import grade_tool_calls
from graders.llm_judge import grade_response

THRESHOLD = 0.8

def load_jsonl(path: str) -> list[dict]:
    return [json.loads(line) for line in open(path) if line.strip()]

def main() -> int:
    cases = load_jsonl("cases/dataset.jsonl")
    results = []
    for case in cases:
        agent = run_agent(case["input"])
        tool_grade = grade_tool_calls(agent.tool_calls, case.get("expected_tool_calls", []))
        response_grade = grade_response(case["input"], agent.final_response,
                                        case.get("expected_response_traits", []))
        overall = tool_grade.passed and response_grade.passed
        results.append({"case_id": case["id"], "overall": overall,
                        "tool_grade": tool_grade, "response_grade": response_grade,
                        "agent_result": {"final_response": agent.final_response,
                                         "tool_calls": agent.tool_calls}})

    # Compact per-case table on stdout.
    print(f"{'case_id':<10} | {'tool':<10} | {'response':<10} | overall")
    print("-" * 50)
    for r in results:
        print(f"{r['case_id']:<10} | {'PASS' if r['tool_grade'].passed else 'FAIL':<10} | "
              f"{int(r['response_grade'].score * 3)}/3{'':<6} | "
              f"{'PASS' if r['overall'] else 'FAIL'}")

    pass_rate = sum(r["overall"] for r in results) / len(results)
    print(f"\nPass rate: {sum(r['overall'] for r in results)}/{len(results)} ({pass_rate:.1%})")
    print(f"Threshold: {THRESHOLD:.0%} -> overall {'PASS' if pass_rate >= THRESHOLD else 'FAIL'}")

    # JSON report for CI; dataclasses flattened via __dict__.
    json.dump([{**r, "tool_grade": r["tool_grade"].__dict__,
                "response_grade": r["response_grade"].__dict__} for r in results],
              open("reports/latest.json", "w"), indent=2)
    return 0 if pass_rate >= THRESHOLD else 1

if __name__ == "__main__":
    sys.exit(main())
No CLI flags in v1. The runner contract is: read the dataset, run, score, write a report, exit zero or non-zero. Anything more (--filter, --resume, --parallel) is a v2 problem and is the kind of thing that grows the runner into a framework you regret. Frameworks are commodity; the suite is yours, as covered in the eval-suite post.
What to skip in v1
Five things to leave out until you have a reason to add them.
A storage backend. The JSON report on disk is enough for a single team. Database-backed history is a v2 concern once you want graphs across releases.
Dashboards. Stdout plus a JSON report covers a one-team workflow. The dashboard tools (LangSmith, Braintrust, Arize Phoenix, Langfuse, Patronus) earn their keep once you have multiple suites, multiple agents, or a non-engineer stakeholder who wants to see numbers.
A custom DSL for cases. JSONL is fine. Inventing a DSL is the surest way to spend a week not catching regressions.
Per-case cost and latency tracking. Useful eventually. Not on day one; it adds infrastructure with no signal until something is actually expensive.
A separate eval-set versioning system. Git versions the JSONL file the same way it versions code. Tag releases of the suite at release boundaries; that is the full versioning story until the suite is shared across teams.
End-to-end run
$ python runner.py
case_id | tool | response | overall
--------------------------------------------------
case_001 | PASS | 3/3 | PASS
case_002 | PASS | 3/3 | PASS
case_003 | PASS | 2/3 | PASS
case_004 | PASS | 3/3 | PASS
case_005 | FAIL | 1/3 | FAIL <-- weakness caught
case_006 | PASS | 3/3 | PASS
case_007 | PASS | 2/3 | PASS
Pass rate: 6/7 (85.7%)
Threshold: 80% -> overall PASS
Failure detail (case_005):
Input: "cancel my order 12345"
Expected: ask for confirmation before cancelling
Actual tool calls: [cancel_order(order_id="12345", confirmation=true)]
Judge reasoning: "Agent skipped the confirmation step, violating policy."
Six of seven pass, overall PASS at the 80% threshold, but case 5 is flagged as a real regression on the policy edge. That single failed case is what the suite exists to surface. The judge reasoning tells you in one sentence what went wrong, the actual tool calls confirm it, and the failing case becomes the input to a prompt iteration loop. CI fails the build only if pass rate falls below threshold, which here it does not, but the per-case detail is what tells you to fix something before the next deploy.
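To make that iteration loop concrete: one plausible fix that case_005 points at is hardening the soft constraint in the system prompt. The wording below is an illustration, not the fix:

# app/agent.py -- hypothetical tightened system prompt (wording illustrative)
SYSTEM = ("You are a customer-support agent. Be brief. "
          "Never call cancel_order unless the user has explicitly confirmed the "
          "cancellation in a previous message; otherwise ask for confirmation first.")

Re-run python runner.py and case_005 either flips to PASS or tells you the wording still is not enough.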
The CI hook is short:
# .github/workflows/evals.yml
on: pull_request
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python runner.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()  # keep the report even when the run fails the threshold
        with:
          name: eval-report  # artifact name is arbitrary
          path: reports/latest.json
That is the entire eval-in-CI story for v1. Every PR runs the suite, the build fails if the pass rate falls below 80%, and reports/latest.json is uploaded as an artifact for human review.
What’s in v2
Three things become worth adding when v1 stops being enough.
A trace-replay layer (Layer 3 in the eval-suite post), once your golden set stops matching production behaviour. Sample fifty new traces a month, diff outputs across model versions, triage the weird ones into the golden set.
A calibration loop, once you have an LLM judge you depend on. Score fifty samples by hand, compare to the judge, compute agreement. Get agreement above 85% before you start treating judge output as ground truth; a back-of-the-envelope version of that check is sketched below.
Pairwise grading, once you stop iterating individual prompts and start running A/B between candidate versions. Pairwise preference is easier to grade reliably than absolute scoring and is the natural shape for “is the new version better than the old one” decisions.
All three are covered in the standalone eval-suite post and in Part 2, which goes deeper on the τ-bench style of multi-turn agent evaluation that Layer 4 builds on.
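For a flavour of the calibration check, the raw agreement number is a few lines (kappa and the full protocol are in the eval-suite post; the helper below is a sketch, not part of the v1 suite):

# hypothetical agreement check: judge pass/fail vs. hand labels on ~50 sampled cases
def agreement(judge: list[bool], human: list[bool]) -> float:
    assert len(judge) == len(human)
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

# treat the judge as ground truth only once agreement(judge, human) >= 0.85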
Until then, the v1 in this post is a working skeleton: under 300 lines of Python, forkable for any tool-calling agent, runs on every PR, exits non-zero on regression, and gives you the minimum suite that catches the things you actually care about.