Trajectories Write Tests

§1 The expensive bit is rubric writing, not evaluation

The fashionable critique of agentic AI is that nobody knows whether agents are getting better. Every shop has its own eval harness. Every harness depends on hand-written rubrics. Every rubric drifts behind what the agent is actually doing.

The mostly unspoken cost in any selection-gate setup is the rubric writer. Vera, our three-juror Panel of LLM Evaluators, will happily judge a candidate against any criteria string you give it. The criteria string is the bottleneck. Hand-write a sharp one and Vera does sharp work. Hand-write a vague one and Vera waves things through.

PhoneWorld, arXiv:2605.29486, is a mobile-agent paper at first glance. Its more interesting move is architectural. It treats real user trajectories as the raw material for two derived artefacts: a controllable environment that replays the same operating conditions, and an auto-generated verifier that checks new agent attempts against the original outcome. The verifier is not written by hand. It is extracted from the trajectory.

Pull that pattern off the phone and put it on Workloft's audit log. Every action the eight agents take is already recorded: agent, action, tool, arguments, response, success, cost, duration, timestamp. The corpus grows by hundreds of trajectories a day. If the PhoneWorld move generalises, that corpus is also a rubric writer.
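To make the record concrete, here is a minimal sketch of one audit-log entry as a Python dataclass. The field names are the ones listed above. The types, the units, and the JSON-lines encoding that parse_record expects are assumptions for illustration, not Workloft's actual schema.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class AuditRecord:
    """One logged agent action. Field names come from the text; types are assumed."""
    agent: str
    action: str
    tool: str
    arguments: dict[str, Any]
    response: Any
    success: bool
    cost: float          # unit not stated in the source
    duration: float      # unit not stated in the source; seconds assumed
    timestamp: datetime


def parse_record(line: str) -> AuditRecord:
    """Parse one JSON-encoded log line (the encoding itself is an assumption)."""
    raw = json.loads(line)
    return AuditRecord(
        agent=raw["agent"],
        action=raw["action"],
        tool=raw["tool"],
        arguments=raw["arguments"],
        response=raw["response"],
        success=bool(raw["success"]),
        cost=float(raw["cost"]),
        duration=float(raw["duration"]),
        timestamp=datetime.fromisoformat(raw["timestamp"]),
    )
```

A typed record is what lets the later stages group, sample and fingerprint trajectories without guessing at field names.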

§2 The shape of "usage writes the test"

Naive eval design starts at the wrong end. A human writes a rubric describing what good output looks like. The agent runs. The evaluator compares output to rubric. That order makes the rubric a guess about future work. It is brittle, because the agent's actual surface area shifts faster than the human's attention to it.

The PhoneWorld order is reversed. The agent runs first. The runtime captures a real trajectory. A downstream process clusters trajectories by task type, samples representative ones, and asks a model to derive the verifier: the property that distinguishes successful trajectories from failed ones in that cluster. The rubric is downstream of usage, not upstream of it.
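The reversed order can be sketched in three small steps: cluster, sample, build the derivation prompt. This is an illustration under stated assumptions, not the PhoneWorld implementation or the contents of vera/rubric_gen.py. The trajectory fields task_type, success and steps, the function names, and the prompt wording are all hypothetical.

```python
import random
from collections import defaultdict


def cluster_by_task(trajectories):
    """Group trajectories by an assumed 'task_type' field."""
    clusters = defaultdict(list)
    for t in trajectories:
        clusters[t["task_type"]].append(t)
    return dict(clusters)


def sample_representatives(cluster, k, seed=0):
    """Take up to k successes and up to k failures so the model sees both sides."""
    rng = random.Random(seed)
    wins = [t for t in cluster if t["success"]]
    losses = [t for t in cluster if not t["success"]]
    return (
        rng.sample(wins, min(k, len(wins))),
        rng.sample(losses, min(k, len(losses))),
    )


def derivation_prompt(task_type, wins, losses):
    """Build the text sent to the rubric-writing model (wording is illustrative)."""
    lines = [
        f"Task type: {task_type}",
        "State one specific, falsifiable property that holds for every",
        "SUCCESS trajectory below and fails for every FAILURE trajectory.",
    ]
    lines += [f"SUCCESS: {t['steps']}" for t in wins]
    lines += [f"FAILURE: {t['steps']}" for t in losses]
    return "\n".join(lines)
```

Sampling failures alongside successes matters: a model shown only successful trajectories has nothing to contrast them against.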

There is a discipline cost. The trajectories have to carry enough structure to be clusterable. The cluster axis has to map onto something an evaluator can verify. The derivation prompt has to be specific and falsifiable, not "did this look reasonable." But the discipline cost is paid once per pipeline, not once per task type. After that, the rubric set tracks the agent fleet automatically.

The substrate idea: verifier coverage becomes a function of usage, not a function of human attention. Whichever (agent, action) pairs the fleet exercises most, the rubrics for those cluster shapes get refreshed first. Long-tail actions get rubrics on the schedule the runtime can afford to compute them.
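One way to read "refreshed first" is as a ranked queue with a compute budget. The sketch below counts records per (agent, action) pair and splits the ranking into what gets regenerated now and what waits. The budget parameter and the function name are assumptions; the source does not say how the schedule is actually computed.

```python
from collections import Counter


def refresh_queue(records, budget):
    """Rank (agent, action) clusters by usage and split them at the budget.

    Returns (refresh_now, deferred). Ties are broken alphabetically so the
    ordering is deterministic.
    """
    usage = Counter((r["agent"], r["action"]) for r in records)
    ranked = sorted(usage, key=lambda key: (-usage[key], key))
    return ranked[:budget], ranked[budget:]
```

The deferred list is the long tail: it still gets rubrics, only later.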

§3 What this looks like in a regulated stack

For regulated buyers, the question that has been quietly skipped in most agent procurement is "how do you know the evaluator was current when it ran." A rubric stamped six months ago and a rubric stamped last night look identical in a screenshot. They are not equivalent under FCA SS1/23 §3.6 (ongoing monitoring), ICO AI Guidance §11 (controls evidence) or the substantive parts of UK GDPR Article 5 (accuracy, storage limitation).

Auto-generated rubrics make the freshness question explicit. Each rubric carries the cluster it was derived from, the sample count, the timestamp of derivation, and the model that wrote it. When a regulator or auditor asks "what evaluator passed this agent action on 14 March," the answer is a specific, dated artefact, not a static document.
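A minimal sketch of such a dated artefact, assuming rubrics are kept as an append-only list. The four provenance fields are the ones named above; the class name, the criteria field and the rubric_in_force helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Rubric:
    cluster: str          # cluster the rubric was derived from
    criteria: str         # the criteria string handed to the gate
    sample_count: int     # trajectories sampled for the derivation
    derived_at: datetime  # timestamp of derivation
    derived_by: str       # model that wrote the rubric


def rubric_in_force(rubrics, cluster, at):
    """Return the latest rubric for `cluster` derived at or before `at`, else None."""
    candidates = [
        r for r in rubrics
        if r.cluster == cluster and r.derived_at <= at
    ]
    return max(candidates, key=lambda r: r.derived_at, default=None)
```

Answering the auditor's question is then a lookup: pass the cluster of the action and the date of interest, and the function returns the rubric that was current at that moment, or None if the action ran without one.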

The control is no longer "did someone write a thoughtful policy." The control is "did the runtime have a current rubric for the cluster this action belonged to, and is the derivation trail visible." That is closer to how the rest of operational risk works in financial services and public sector technology. Procedures get derived, dated, reviewed and superseded. The same shape now applies to agent verifiers.

There is also a defensive lift. A defence based on "we trained our evaluator on real production trajectories of this exact task class" is materially stronger than "we wrote a rubric we thought was reasonable." The first has provenance. The second has confidence.

§4 Where the pattern gets dangerous

Three failure modes deserve flagging before any production deployment.

Trajectories without honest failure signal. If success is recorded by the agent itself, the rubric will learn what the agent calls success. PhoneWorld can ground-truth on observable phone state. Workloft has to be careful that its success field is not just "the API returned 200." Any cluster with degenerate success signal will yield a degenerate rubric. The fix is upstream: give the runtime independent signal about whether an action actually achieved its purpose, not just whether it completed.
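A cheap pre-check can catch the worst cases before any rubric is derived. The sketch below flags a cluster whose success field never varies, or whose success field simply mirrors the HTTP status. The status_code field is an assumption; the audit log as described does not list it.

```python
def success_signal_issues(cluster):
    """Return the reasons, if any, why a cluster's success field looks degenerate."""
    issues = []

    # A rubric needs both outcomes to learn what separates them.
    outcomes = {bool(r["success"]) for r in cluster}
    if len(outcomes) < 2:
        issues.append("no contrast: every record has the same success value")

    # If success is true exactly when the call returned 200, the field
    # carries no information beyond "the API call completed".
    if cluster and all(
        bool(r["success"]) == (r["status_code"] == 200) for r in cluster
    ):
        issues.append("success mirrors HTTP status: no independent signal")

    return issues
```

An empty result does not prove the signal is honest. It only means these two obvious degeneracies are absent.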

Collapsing distinct tasks under a shared action name. composio.googlesheets_batch_update covers updating the Loop board, the publishing ledger, and a Maggie campaign tracker. Those are different tasks with different right answers. Coarse clustering generates a rubric that is technically correct for all of them and useful for none. The remedy is a richer cluster key: an argument-shape fingerprint or a downstream pipeline tag, not just (agent, action). The cost is more clusters, more rubric calls, more bookkeeping. The benefit is rubrics that mean something.
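One possible argument-shape fingerprint, sketched under the assumption that arguments are JSON-like (string keys, nested dicts and lists). It keeps keys and type names, discards values, and hashes the result. The function names and the 12-character digest length are arbitrary choices for illustration.

```python
import hashlib
import json


def argument_shape(value):
    """Reduce a value to its structure: keys and type names, no contents."""
    if isinstance(value, dict):
        return {k: argument_shape(v) for k, v in sorted(value.items())}
    if isinstance(value, list):
        # Deduplicate element shapes so list length does not change the fingerprint.
        shapes = {json.dumps(argument_shape(v), sort_keys=True) for v in value}
        return sorted(shapes)
    return type(value).__name__


def cluster_key(agent, action, arguments):
    """Extend (agent, action) with a short hash of the argument shape."""
    shape = json.dumps(argument_shape(arguments), sort_keys=True)
    digest = hashlib.sha256(shape.encode("utf-8")).hexdigest()[:12]
    return (agent, action, digest)
```

Values never enter the fingerprint, so two sheets updated with identically shaped arguments still collapse into one cluster. That is the case where a downstream pipeline tag would be needed instead.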

Drift hidden by averages. If the rubric is regenerated from a rolling window, a slow degradation in agent behaviour can become normalised. The rubric writer reads the recent past and assumes it is the standard. Whatever the agent is doing now becomes the new baseline. This is the failure mode of any system that learns from its own outputs. The runtime needs at least one fixed reference point per cluster, either a frozen rubric or a hand-validated trajectory, against which each auto-derived version can be checked.
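The anchoring idea can be sketched as a disagreement check. Here a rubric is modelled as a function from a trajectory to a pass/fail verdict, and a regenerated rubric is accepted only if it agrees with the frozen one on a fixed set of hand-validated anchor trajectories. The function names, the trajectory fields and the 0.1 tolerance are assumptions for illustration.

```python
def anchor_disagreement(current_rubric, frozen_rubric, anchors):
    """Fraction of anchor trajectories on which the two rubrics give different verdicts."""
    if not anchors:
        raise ValueError("at least one anchor trajectory is required")
    differing = sum(1 for t in anchors if current_rubric(t) != frozen_rubric(t))
    return differing / len(anchors)


def accept_regenerated(current_rubric, frozen_rubric, anchors, tolerance=0.1):
    """Accept the regenerated rubric only if it stays within tolerance of the frozen one."""
    return anchor_disagreement(current_rubric, frozen_rubric, anchors) <= tolerance


# Illustrative rubrics: the frozen one demands a complete write,
# the drifted one has learned to accept any partial write.
def frozen(t):
    return t["rows_written"] == t["rows_expected"]


def drifted(t):
    return t["rows_written"] > 0


anchors = [
    {"rows_written": 10, "rows_expected": 10},
    {"rows_written": 4, "rows_expected": 10},
    {"rows_written": 0, "rows_expected": 10},
    {"rows_written": 7, "rows_expected": 7},
]
```

With these anchors the drifted rubric disagrees with the frozen one on the partial write only, one trajectory in four, which is enough to reject it at the default tolerance. A rejection is a prompt for human review, not proof that the new rubric is wrong: the task itself may legitimately have changed.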

None of these are fatal. They are the shape of work that has to land alongside the basic mechanism. The mechanism on its own is a clean substrate move. The deployment is a governed pipeline.

§5 What this is not

This pattern is not a replacement for the selection gate. Vera is still the gate. The rubric is the gate's criteria argument, and it is what makes the gate work. The gate itself (the three-juror panel, the cross-lineage diversity, the kill bias) is unchanged.

Nor is it a replacement for human policy. A regulated buyer's view of "what good looks like" still has to be expressed by humans somewhere. The auto-derived rubric is downstream of usage; it can capture how a successful action looks, not whether the action was the right one to perform in the first place. That decision belongs upstream, in the goal-setting and intent-mandating layer (in our stack, the AP2 mandate and the HITL approval gate).

And it is not a benchmark. Benchmark rubrics are deliberately fixed across runs so comparisons are valid. Auto-derived rubrics deliberately track the live system. They serve different purposes. A production evaluation stack needs both: frozen reference rubrics for benchmark-style comparisons, and current auto-derived rubrics for everyday gating.

The most useful framing for builders is that this is a runtime feature, not a research artefact. Treating it as a one-off paper implementation misses the substrate point. Treating it as a pipeline alongside the audit log, the eval panel and the publish ledger turns it into a control surface the rest of the stack can rely on.

The most useful framing for buyers is the procurement question. Do not ask only "what evaluator do you use." Ask how the evaluator's rubric is kept current, where the rubrics live, and what trail shows when the rubric for a class of action was last regenerated. The answer to that question is now a deployable feature, not a manifesto.

Methodology note. This Note takes PhoneWorld (arXiv:2605.29486) as a substrate paper, not a mobile-agent demo. Triggers: novel architecture (production usage yielding both environment and verifier as a co-product); non-duplicative (Vera covers gate-running but not rubric-keeping-current); regulated-buyer link (ICO AI Guidance, FCA SS1/23 monitoring expectations, GDPR storage limitation around audit data). The Workloft-side artefact is vera/rubric_gen.py, shipped 29 May 2026 as Ship №20.
