Our eval gate enforced a contract the tool never had, Workloft Research Note №60

§1 Grading your own agents

We run a nine-agent fleet on the Claude stack, and every night an LLM-judge panel scores what those agents shipped against a rubric. The rubrics are generated automatically, one per action, lifted from real examples of that action being performed. It is a nice idea: the grader learns what good looks like from the traffic, so nobody hand-writes a hundred rubrics.

We had just upgraded the whole thing. The old rubrics scored outcomes, a single "did it produce the right shape" check. That misses process failures, so we rebuilt them to judge four things: did the agent pick the right action, follow its contract, compose it correctly into the surrounding work, and show it understood why. Cleaner rubrics, sharper verdicts. We were pleased with ourselves for about a day.

§2 A contract that did not exist

Then we actually read what the grader was flagging. The single largest group of automatically filed fix tickets on our board, half of one whole cluster, was the same complaint repeated: a valid record failed because a "required" field was null. Except the field was not required. The tool in question, our task logger, takes exactly one mandatory argument, a title. Everything else (the owner, the stage, the tag, the priority) is optional.

The grader did not know that. It had been handed real examples, noticed that most of them happened to include those optional fields, and quietly promoted "usually present" to "must be present". From then on it failed every call that left one out. The gate was not too lenient, which is the usual worry with an automated judge. It was the opposite: strict about a contract the tool never had, and manufacturing failures against it.

This is what an LLM does when you ask it to write a rubric from examples: it turns frequency into law. If ninety per cent of your samples share a field, the model concludes the field is mandatory, because it has no other signal for what is required and what is merely common. The fix reads like a lesson you already knew about ordinary code. A grader is code. It has a contract, and it needs the real one, the actual argument list, not a contract reverse-engineered from a handful of examples. We changed generation to treat only a tool's genuine core payload as required and everything else as optional. The false flags stopped.
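The difference between the two contracts can be shown in a few lines. This is a minimal sketch, not our generator: `log_task` is a hypothetical stand-in for the task logger (only `title` mandatory, as described above), and `required_by_frequency`, `required_by_signature`, the 0.9 threshold, and the sample data are all invented for illustration.

```python
import inspect
from collections import Counter


# Hypothetical stand-in for the task logger: only `title` is mandatory.
def log_task(title, owner=None, stage=None, tag=None, priority=None):
    return {"title": title, "owner": owner, "stage": stage,
            "tag": tag, "priority": priority}


def required_by_frequency(samples, threshold=0.9):
    """Mimic rubric-from-examples: a field seen often enough becomes 'required'."""
    counts = Counter(k for s in samples for k, v in s.items() if v is not None)
    return {k for k, n in counts.items() if n / len(samples) >= threshold}


def required_by_signature(func):
    """Read the real contract: named parameters without a default are required."""
    params = inspect.signature(func).parameters.values()
    return {p.name for p in params
            if p.default is inspect.Parameter.empty
            and p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)}


# Illustrative samples: all ten carry an owner, nine of ten carry a stage.
samples = [log_task("t%d" % i, owner="a", stage="todo") for i in range(9)]
samples.append(log_task("t9", owner="a"))

print(sorted(required_by_frequency(samples)))   # title, owner and stage
print(sorted(required_by_signature(log_task)))  # title only
```

The frequency view marks `owner` and `stage` as required because the samples happen to share them; the signature view returns only `title`. Feeding the second set into rubric generation is the shape of the fix described above.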

§3 We nearly shipped the same bug

Here is the part that should worry anyone building evals. When we wrote tests for the new rubric, we wrote a case that asserted a null field should fail, ran it, watched it pass, and almost shipped on the strength of a green result. The test passed because our test oracle was wrong in exactly the same way the grader was, not because the rubric was right. We had assumed the field was required, so we wrote a test that demanded it, so the test agreed with the bug.

The result only flipped when we opened the tool's actual command definition and saw the argument was optional. The rubric, the tests, and the human who wrote both were all hallucinating the same contract, and the tests dutifully confirmed it.

When the grader and its tests are written against the same assumption, green means nothing. A test oracle can be as confidently wrong as the thing it tests. Check against the interface, not against what you think the interface is.

§4 The self-check that taught us its own blind spot

Auto-generated rubrics have a second problem: they are not deterministic. Regenerate one from slightly different samples and you get a different rubric, and quality wanders. One draft is well-calibrated; the next quietly reintroduces a false failure through a different axis. So a rubric you verified by hand is not the rubric you get next week. We wanted the grader to check itself.

So we built a smoke gate. Before a new rubric deploys, score it against the cluster's own recent successful outputs. If it kills a large share of work that is known to have succeeded, it is over-strict, so reject it and keep the old one. Building that gate taught us two things we did not want to learn.
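The core of such a gate is small. The sketch below is illustrative rather than the gate we run: `smoke_gate`, the record shape, the two toy rubrics, and the `max_kill_rate` of 0.2 are assumptions made for the example, and the choice to refuse deployment when there are no known-good samples is ours for this sketch only.

```python
def smoke_gate(candidate, known_good, max_kill_rate=0.2):
    """Return True if the candidate rubric may deploy.

    candidate: callable taking a record, returning True when it passes.
    known_good: recent outputs already known to have succeeded.
    max_kill_rate: assumed tolerance for failing known-good work.
    """
    if not known_good:
        return False  # nothing to check against, so do not deploy blind
    killed = sum(1 for record in known_good if not candidate(record))
    return killed / len(known_good) <= max_kill_rate


def lenient(record):
    # Only the genuinely required field is checked.
    return record.get("title") is not None


def over_strict(record):
    # Treats optional fields as mandatory.
    return all(record.get(f) is not None for f in ("title", "owner", "stage"))


known_good = [
    {"title": "a", "owner": "x", "stage": "todo"},
    {"title": "b", "owner": None, "stage": None},
    {"title": "c", "owner": "y", "stage": None},
    {"title": "d", "owner": "z", "stage": "done"},
]

print(smoke_gate(lenient, known_good))      # deploys
print(smoke_gate(over_strict, known_good))  # rejected, keeps the old rubric
```

Here the over-strict rubric fails two of four known-good records, well over the assumed tolerance, so the gate rejects it. The gate only works because this sample set happens to contain null optional fields.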

First, a check built from real traffic only catches the mistakes that traffic exercises. Our grader had been failing records with null fields, but current traffic fills those fields in. So the traffic-based check ran the over-strict rubric against samples that all had the fields populated, saw nothing killed, and waved it straight through. A self-check is only as good as the cases it actually runs, and live traffic is a biased sample of what can go wrong.
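One way to probe that blind spot is to derive extra cases from the traffic itself by nulling each optional field in turn. This is a sketch of the idea rather than something the note above claims we shipped; `null_variants`, `kill_rate`, the toy `over_strict` rubric, and the sample records are all hypothetical.

```python
OPTIONAL_FIELDS = ("owner", "stage", "tag", "priority")


def over_strict(record):
    # The buggy rubric: treats every commonly seen field as required.
    return all(record.get(f) is not None for f in ("title",) + OPTIONAL_FIELDS)


def null_variants(record, optional_fields=OPTIONAL_FIELDS):
    """Yield one still-valid variant per populated optional field, set to None."""
    for field in optional_fields:
        if record.get(field) is not None:
            variant = dict(record)
            variant[field] = None
            yield variant


def kill_rate(rubric, records):
    """Share of records the rubric fails."""
    records = list(records)
    return sum(1 for r in records if not rubric(r)) / len(records)


# Current traffic fills every field, so it never exercises the bug.
traffic = [
    {"title": "a", "owner": "x", "stage": "todo", "tag": "ops", "priority": 1},
    {"title": "b", "owner": "y", "stage": "done", "tag": "web", "priority": 2},
]
probes = [v for r in traffic for v in null_variants(r)]

print(kill_rate(over_strict, traffic))  # nothing killed on live traffic
print(kill_rate(over_strict, probes))   # every valid probe killed
```

Every probe is a call the tool would accept, since only the title is mandatory, yet the over-strict rubric fails all of them while passing the unmodified traffic. The probes depend on knowing which fields are optional, which again means reading the real contract.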

Second, and worse: our logs are lossy. The audit log truncates long values for storage. So a rubric that judged "is this field complete" started flagging titles the logger had clipped, not titles the agent had written badly. The grader was catching our logging pipeline, not our agents. And an earlier catch we had been quietly proud of, the new grader spotting a truncated record the old one missed, turned out to be one of these same false alarms. We had built a stricter judge and mistaken its noise for rigour.

The real fix was to make the rubric judge only what survives a lossy log: was the right action taken, did the call succeed, is the intent coherent. Not whether a stored string looks complete. Completeness is the logger's business, and the logger loses data on purpose.
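A small sketch of why the surface-form check misfires and what a log-safe check looks like. Everything here is assumed for illustration: the 20-character limit, the `audit_log` helper, the record fields `action` and `ok`, and the crude non-empty-title proxy for coherent intent, which in practice would be a judged question.

```python
MAX_LEN = 20  # assumed truncation limit; the real one is not stated above


def audit_log(record):
    """Store a copy of the record, clipping long strings as a lossy log would."""
    return {k: (v[:MAX_LEN] if isinstance(v, str) else v)
            for k, v in record.items()}


def looks_complete(logged):
    # Toy surface-form check: a "complete" title ends with a full stop.
    return logged["title"].endswith(".")


def survives_log(logged, expected_action):
    # Judge only what truncation cannot destroy: the action taken,
    # whether the call succeeded, and that some title was given.
    return (logged["action"] == expected_action
            and logged["ok"] is True
            and bool(logged["title"]))


written = {"action": "log_task", "ok": True,
           "title": "Rotate the staging database credentials."}
stored = audit_log(written)

print(looks_complete(written))           # the agent wrote a complete title
print(looks_complete(stored))            # the log clipped it, so this fails
print(survives_log(stored, "log_task"))  # unaffected by truncation
```

The completeness check gives different verdicts on the same action depending on whether it sees the original or the stored copy, which means it is measuring the logger. The second check gives the same verdict either way.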

§5 What this leaves you with

None of this is specific to us. If you grade LLM output with an LLM, three things carry over.

A grader is code, so pin it to the real interface. Give it the actual function signature and the genuinely required arguments, or it will invent a contract from whatever your examples happen to share. Frequency is not a specification.
Your test oracle can be as wrong as the thing it tests. When both are written by a model against the same assumption, a passing test proves the assumption is self-consistent, not correct. Verify against the interface.
A judge over lossy telemetry should grade intent and success, not surface form. If your logs drop data, a grader that scores completeness will mostly score your logging. Ask only what survives the pipeline.

The corollary we are still sitting with is the uncomfortable one. For a lot of routine actions that have already succeeded, there is very little an after-the-fact grader can honestly fail. A judge that fires anyway is not being rigorous; it is manufacturing work. The smoke gate we shipped is a coarse net: it reliably catches the egregiously over-strict rubric, and it cannot catch the subtle one. The honest verification is still a small set of hand-written cases per action, each built from the real contract. We would rather say that plainly than ship a green dashboard that means nothing, which is, after all, exactly the mistake we started with.

Methodology note. We run a nine-agent fleet on the Claude stack, so grading agent output is an operational need here, not a thought experiment. Our eval gate scores real shipped outputs nightly against auto-generated rubrics. The numbers in this piece are ours: the false-flag class described was the largest single group of automatically filed fix tickets on our board. We name the limitation of the fix because a grader you cannot trust is worse than no grader at all. The process-rubric framing was prompted by the SkillCoach paper (arXiv:2607.01874) on evaluating skill-use as a process rather than an outcome; the failure modes, and the fixes, are our own.
