Measure Before You Tune

§1 The tuning urge gets ahead of the measurement

Any team that runs an LLM-powered scorer has been here. The scorer files something obviously wrong, you read the prompt, you spot the gap, you tighten it, you redeploy. The next time it files something obviously wrong, you do it again. After three months the prompt is a barnacle-encrusted hull of edge cases and the original scoring logic has become folklore.

This is the inner loop. It feels productive, because each tweak fixes a real failure. It is rarely the right thing to do first.

The two-level autoresearch framework (arXiv 2605.30003) is, on first reading, a paper about multi-agent policy synthesis for social dilemmas. The architectural point is more general. It says: when you have a policy generator (a prompt, a scorer, a routing rule), and you have downstream outcomes (did the picked item ship, did the routed call succeed, did the decision hold up to review), you should not be optimising the generator until you have measured the outcome side.

The inner loop tunes the generator. The outer loop measures whether the current generator's outputs predict the outcomes you care about. Run the outer loop and you may discover that the prompt is fine and the problem is downstream, or that the prompt is fine on 6 of 7 axes and broken on 1. Skip the outer loop and you can spend a quarter tuning the prompt while the actual loss sits elsewhere.

§2 What an outer loop costs

The outer loop is cheap relative to the inner loop. It is a join and a ratio. Two ingredients:

Policy outputs with traceable IDs. Walt files HF papers as Gary research todos. Each pick carries a gary_id. The policy output (Walt's score and axis label) is durably linked to the artefact it produced (the todo).
Outcomes captured in the same key space. Gary todos have a status. Some advance to shipped, some sit open, some get killed. The outcome side of the join already exists. We did not have to instrument anything new.
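The join itself can be very small. The following is a minimal sketch, not the actual Workloft code: the `Pick` and `Todo` record shapes, the axis names, and the status strings are all assumptions made for illustration. The only thing taken from the text is that both sides share a `gary_id`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pick:
    """Policy output (hypothetical shape): what the scorer decided."""
    gary_id: str   # trace ID shared with the todo the pick produced
    axis: str      # axis label (names below are invented)
    score: float   # scorer's score


@dataclass(frozen=True)
class Todo:
    """Outcome side (hypothetical shape): what happened to the artefact."""
    gary_id: str
    status: str    # assumed vocabulary: "shipped", "open", "killed"


def join_picks_to_outcomes(picks, todos):
    """Pair each pick with its todo's status; unmatched picks get None."""
    status_by_id = {t.gary_id: t.status for t in todos}
    return [(p, status_by_id.get(p.gary_id)) for p in picks]


picks = [
    Pick("a1", "novelty", 0.8),
    Pick("b2", "novelty", 0.6),
    Pick("c3", "relevance", 0.9),
]
todos = [Todo("a1", "shipped"), Todo("b2", "open")]

joined = join_picks_to_outcomes(picks, todos)
statuses = [status for _, status in joined]
```

A pick that comes back with `None` is a pick whose trace ID never reached the outcome side, which is worth counting separately rather than silently dropping.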

The work was the aggregation, not the data collection. Per axis: how many picks, what mean score, what share advanced (moved), what share stalled (open), what share got killed. The derived health score normalises conversion by mean score, so an axis where Walt is generous (high mean) but right (high conversion) is rated as well-calibrated, and an axis where Walt is generous but wrong is rated as miscalibrated.
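As a rough illustration of that aggregation, here is a sketch under stated assumptions. It is not the contents of walt/weight_loop.py. In particular, "normalises conversion by mean score" is read here as a plain ratio (share moved divided by mean score), scores are assumed to lie in (0, 1], and the status strings and axis names are hypothetical.

```python
from collections import Counter, defaultdict


def axis_health(rows):
    """rows: iterable of (axis, score, status) tuples.

    Returns {axis: stats}. The health formula is an assumed reading of
    'conversion normalised by mean score': moved_share / mean_score.
    """
    grouped = defaultdict(list)
    for axis, score, status in rows:
        grouped[axis].append((score, status))

    report = {}
    for axis, items in grouped.items():
        n = len(items)
        mean_score = sum(score for score, _ in items) / n
        counts = Counter(status for _, status in items)
        moved = counts["shipped"] / n
        report[axis] = {
            "picks": n,
            "mean_score": mean_score,
            "moved": moved,                  # share that advanced
            "open": counts["open"] / n,      # share that stalled
            "killed": counts["killed"] / n,  # share that got killed
            "health": moved / mean_score if mean_score > 0 else 0.0,
        }
    return report


rows = [
    ("novelty", 0.5, "shipped"),
    ("novelty", 0.5, "open"),
    ("relevance", 1.0, "killed"),
    ("relevance", 1.0, "open"),
]
report = axis_health(rows)
```

Under this reading, an axis whose conversion matches its mean score lands near 1.0, a generous axis with low conversion lands near 0.0, and an axis where nothing has moved yet scores exactly 0.0 regardless of how it was scored.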

Tonight's first run returned an axis-health table where every axis sits at 0.0. That is the correct answer for a 30-day window over research todos that take 6 to 8 weeks to move. The instrumentation is alive. The signal is slow. The 60-day run will be the first to carry actionable shape.

That slowness is itself an outer-loop finding. A team that did not have this measurement might re-prompt Walt this week and feel productive. The right answer is: wait for the conversion signal, then tune the axes where Walt is actually miscalibrated.

§3 What the regulator will ask

The procurement question that is starting to land in financial services and public sector AI buys is "how do you know your AI system is doing what it should?" The ICO AI Guidance §6 treats accuracy as a fundamental data-protection issue, not just an engineering metric. The GDS AI Playbook §4 specifically calls out measurement before optimisation. PRA SS1/23 §3.6 requires ongoing monitoring against documented performance benchmarks.

An AI scorer that has been "tuned often" but never measured is a worse answer to those questions than a scorer with a steady prompt and a per-axis outcome dashboard. The first is movement. The second is evidence.

This is the same point the audit-log Note made in May. The artefacts of measurement are themselves a control. The outer-loop report file, with its per-axis health number and its derivation trail, is the artefact a regulator can look at. The fact that it exists, that it is regenerated weekly, and that nobody has tuned the underlying policy without consulting it, is the actual control. The control is procedural, not technical.

Buyers should be asking suppliers for the outer-loop equivalent. Not "what model do you use" but "what report tells you whether your model's decisions predict the outcomes you wanted." If the answer is silence, the supplier has built an inner loop with no outer loop.

§4 What two-level autoresearch is not

This is not Goodhart-by-another-name. The outer loop measures the same outcomes the operator already cares about. It does not introduce a new proxy metric for the policy to game. It looks at whether the existing proxy (Walt's score) predicts the existing outcome (todo conversion). If the proxy is good, the loop says so. If it is bad, the loop names which axis is bad.
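One simple way to ask "does the score predict conversion" is to compare conversion between high-scored and low-scored picks. The sketch below is illustrative only: the 0.7 threshold, the status strings, and the sample rows are invented, and a real check would want far more data than six rows.

```python
def conversion_by_score_band(rows, threshold=0.7):
    """rows: iterable of (score, status) tuples.

    Returns (high_band_conversion, low_band_conversion). A band with no
    picks returns None rather than a misleading 0.0.
    """
    def rate(statuses):
        if not statuses:
            return None
        return sum(1 for s in statuses if s == "shipped") / len(statuses)

    rows = list(rows)
    high = [status for score, status in rows if score >= threshold]
    low = [status for score, status in rows if score < threshold]
    return rate(high), rate(low)


sample = [
    (0.9, "shipped"),
    (0.8, "shipped"),
    (0.8, "open"),
    (0.4, "killed"),
    (0.3, "open"),
    (0.2, "shipped"),
]
high_rate, low_rate = conversion_by_score_band(sample)
```

If the high band converts clearly better than the low band, the proxy is carrying signal. If the two rates are indistinguishable, the score is not predicting the outcome on that axis, whatever the prompt says.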

It is not a free lunch. Two-level architectures still need an inner loop. Measurement without ever tuning leaves the policy stuck. The sequencing rule is what matters: measure first, accumulate enough signal, then tune. The inner loop becomes a periodic, targeted intervention rather than continuous noise.
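The sequencing rule can be written down as a gate in front of the inner loop. This is a hypothetical sketch: the `resolved` and `health` field names, the axis names, and both thresholds are assumptions for illustration, not values from the Workloft report.

```python
def axes_to_tune(report, min_resolved=20, health_floor=0.5):
    """Return the axes that have enough outcome signal AND look miscalibrated.

    report: {axis: {"resolved": int, "health": float}}, where 'resolved'
    counts picks whose todo reached a terminal status (assumed meaning).
    Axes without enough resolved picks are left alone, however bad they look.
    """
    return sorted(
        axis
        for axis, stats in report.items()
        if stats["resolved"] >= min_resolved and stats["health"] < health_floor
    )


report = {
    "novelty": {"resolved": 3, "health": 0.0},     # too little signal: wait
    "relevance": {"resolved": 40, "health": 0.2},  # enough signal, miscalibrated
    "rigour": {"resolved": 35, "health": 0.9},     # enough signal, healthy
}
candidates = axes_to_tune(report)
```

The point of the gate is the first condition. An axis at 0.0 with three resolved picks is not evidence of a bad prompt, it is evidence of a short window.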

It is not a substitute for human policy. Walt's scoring axes were chosen by humans, on the basis of what Workloft cares about. The outer loop does not pick the axes. It tells you whether the axes you picked are predicting what you said they would.

And it is not the same as RLHF. RLHF takes preference signal and re-trains the model. The outer loop here does not retrain anything tonight. It measures. The next step (Gary 2bdfe880, the GEPA vs DSPy MIPROv2 decision) is the inner-loop substrate, and it will sit downstream of this measurement, not in place of it.

§5 The substrate cost of skipping the outer loop

Most production agent stacks skip the outer loop. The reason is structural, not laziness. Outer loops require traceable IDs between policy outputs and outcomes. Most teams do not log policy decisions with stable IDs. They do not link those IDs to downstream tickets, transactions, or business events. The data layer makes the outer loop expensive after the fact.

The cheap path is to build the join in from the start. Every policy output carries the artefact ID it produced. Every artefact records its outcome. The outer loop then becomes a query. That is a discipline question, not a model question. It cost Workloft nothing tonight because the discipline was in place. It would have cost weeks to retrofit.
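To make "the outer loop then becomes a query" concrete, here is a sketch using Python's standard `sqlite3` module and an in-memory database. The table names, column names, status strings, and rows are all hypothetical. The only structural claim borrowed from the text is that policy outputs and outcomes share one key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE picks (gary_id TEXT PRIMARY KEY, axis TEXT, score REAL);
    CREATE TABLE todos (gary_id TEXT PRIMARY KEY, status TEXT);
    """
)
conn.executemany(
    "INSERT INTO picks VALUES (?, ?, ?)",
    [("a1", "novelty", 0.5), ("b2", "novelty", 0.5), ("c3", "relevance", 1.0)],
)
conn.executemany(
    "INSERT INTO todos VALUES (?, ?)",
    [("a1", "shipped"), ("b2", "open"), ("c3", "killed")],
)

# One join, one GROUP BY: picks, mean score, and share moved per axis.
OUTER_LOOP_QUERY = """
SELECT p.axis,
       COUNT(*)     AS picks,
       AVG(p.score) AS mean_score,
       AVG(CASE WHEN t.status = 'shipped' THEN 1.0 ELSE 0.0 END) AS moved_share
FROM picks AS p
JOIN todos AS t ON t.gary_id = p.gary_id
GROUP BY p.axis
ORDER BY p.axis
"""
rows = conn.execute(OUTER_LOOP_QUERY).fetchall()
conn.close()
```

Without the shared key there is nothing to put in the `ON` clause, and the same report turns into a reconciliation project.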

If you are designing an agent system this year, the most useful artefact is the trace ID. Walt's pick has a gary_id. The carousel render call carries the slide JSON path. The Vera evaluation carries the candidate label. Each of these is doing the same thing: keeping policy and outcome in the same key space so the outer loop can be a one-screen query.

Two-level autoresearch is, then, less about the autoresearch and more about the two-level architecture. Measurement is a layer. Tuning is a separate layer. Both need to exist. The order is fixed. Build the measurement layer first.

Methodology note. This Note takes the two-level autoresearch framework (arXiv:2605.30003) as a substrate paper, not a multi-agent demonstration. Triggers: novel architecture (explicit outer-vs-inner loop separation as a procurement-relevant control); non-duplicative (Vera covers selection, Otto covers changelog, this Note covers the missing measurement layer between them); regulated-buyer link (PRA SS1/23 monitoring expectations, GDS AI Playbook §4, ICO AI Guidance §6). The Workloft-side artefact is walt/weight_loop.py, shipped 29 May 2026 as Ship №21.
