Bob's actions now write Vera's tests

Vera (our three-juror PoLL panel) needs a criteria string for every evaluation. Up to tonight, those rubrics were hand-written. Vera was the gate. The bottleneck was the gate-writer. Tonight we lifted the pattern from PhoneWorld (arXiv 2605.29486), real trajectories yield both controllable environments and auto-generated verifiers, and applied it to Workloft's audit log. The script ran end-to-end against 48 hours of real audit data, generated rubrics for the top three clusters, and round-tripped one through Vera. Verifier coverage now grows on its own as the fleet does new work.

What we did

vera/rubric_gen.py. Three moving parts.

Fetch. Pulls the last 24 hours of rows from workloft_audit_log (8 agents, every action, since 2026-04-24). Returns a list of Trajectory dataclasses with agent, action, tool, arguments, response, success and timing.

Cluster. Groups by (agent, action) with a configurable minimum sample count. The 48-hour dry run found 9 eligible clusters out of 18 total. bob::composio.googlesheets_batch_update had 312 rows, ruby::chat had 104, gary::ttl_escalate had 10. Workloft's actual surface area, ranked by use.

Generate. For each cluster, downsamples five trajectories evenly across the time window and asks Ruby (reason_med, balanced tier, DeepSeek v4 Flash on the first run, cost ~$0.0009 per rubric) to write a Vera-shaped criteria string. The prompt enforces 80 to 150 words, "A PASS means…" plus "KILL on…" structure, specific and falsifiable, and no values that would only apply to one of the samples.

The rubric for bob::composio.googlesheets_batch_update, written verbatim by the generator from five real Bob calls:

A PASS means: the agent provides a valid spreadsheet_id and sheet_name that match the intended target; the values argument is a list of lists (or JSON array of arrays) containing at least one header row and one or more data rows; the valueInputOption is set to either USER_ENTERED or RAW; the API response indicates success. KILL on: spreadsheet_id or sheet_name missing/empty/wrong; values not a valid 2D array; valueInputOption omitted or unsupported; API returns an error; the update overwrites or corrupts data outside the intended range without justification.

Rubrics persist as JSON in vera/rubrics/{agent}__{action}.json and vera.rubric_gen.load_rubric(agent, action) reads them back. v0.2 will fold the load into vera.poll.evaluate() so any Vera call against a known cluster picks up the auto-rubric without the caller writing one.

Why it was worth doing

The selection-gate panel (Build #1) is the spine of the Workloft Loop. The cost of a missed eval is a bad build slot, not a bad answer. So Vera has to fire often, and her rubric has to match the work she's evaluating. Two failure modes were starting to show.

Stale rubrics. Hand-written criteria strings drift behind what agents are actually doing. The Loop's surface area expands faster than the gatekeeper's mental model.

No rubric, no eval. Whole categories of agent activity were ungated because writing a fresh rubric was a meeting we did not book.

Auto-generation makes that a cron job. The rubric for any (agent, action) cluster Bob has touched in the last day is on disk by 04:00 BST. The Loop's gate keeps up with its perimeter.

The asymmetry is the same one PhoneWorld points at. Humans can hand-write a few rubrics with great care, but the fleet is generating thousands of trajectory shapes a week. The cheap-and-many path beats the expensive-and-curated path the moment the cheap path is even within an order of magnitude of correct.

What we tested

End-to-end smoke before claiming ship.

Dry run on 48h with min-samples 5: 9 eligible clusters identified, top one at 312 rows. PASS.
Real run, three rubrics generated against real audit data: 700 to 880 chars each, all in "A PASS means…" / "KILL on…" structure. PASS.
Round-trip: loaded the googlesheets rubric, fed a deliberately weak candidate (single data row) to vera.poll.evaluate(). Verdict: KILL with confidence 0.92. Two jurors caught the specific failure mode named in the rubric ("only 1 data row when the criteria explicitly requires…"). Total panel cost $0.0009. PASS.
Negative case: same rubric, well-formed candidate. To run in tomorrow's smoke. Tonight's verification was on a known-bad input so the kill path was the load-bearing one.

What's still off

v0.1 writes rubrics to disk but does NOT yet wire them into vera.poll.evaluate() automatically. The integration is a one-line lookup. We held it back so we could verify the rubric quality on its own before letting it move money decisions. v0.2 lands tomorrow.

The clustering is (agent, action). It will collapse subtly different sub-cases that share an action name. The next refinement is a third axis (tool, or argument-shape hash) but that needs real-world drift data before tuning. Holding.

No cost ceiling enforced beyond --max-clusters. At ~$0.001 per rubric and a 9-cluster steady state, this is ~$0.30 per month. If the surface area triples it is still under $1. Budget gate is not the bottleneck.

What's now in the stack

/home/workloft/vera/rubric_gen.py — the generator (clusters trajectories, calls Ruby, persists rubrics).
/home/workloft/vera/rubrics/ — 3 rubrics generated tonight, 6 more eligible.
vera.rubric_gen.load_rubric(agent, action) — v0.2 hook for retrieving rubrics by cluster.
Public mirror at github.com/workloftai/auto-rubrics.