When no benchmark exists: the methodology your Risk function was already going to need
A small Norwegian-led paper this week formalises the situation every UK Local Authority and FCA-supervised buyer is actually in: deploying LLMs in a sector or language for which no labelled safety benchmark exists yet. It hands them a defensible methodology — and ships the tool that runs it.
§1 The honest situation regulated buyers are actually in
The standard answer to "is this LLM safe enough to ship?" is "we ran it against the benchmarks." For most of the buyers we work with, that answer evaporates the moment you read it back to them.
UK Local Authorities deploying redaction agents for DSAR and FOI — there is no labelled benchmark for "appropriate UK GDPR redaction." FCA-supervised firms running case-triage or pitchbook drafting — there is no labelled benchmark for "compliant adviser-tone for UK retail." Norwegian public-sector procurement comparing two open-weight Scandinavian models — no labelled benchmark in Norwegian, full stop. The buyers still have to choose, and they have to defend the choice in front of a function that will read the answer like an audit paper.
This week's pick — Gautam et al., 2026 — formalises exactly this case. They call it benchmarkless comparative safety scoring, and it is the most boringly important paper on AI safety we've seen this quarter. Boring because there are no leaderboards in it. Important because it gives a Risk function something it can actually defend.
§2 What the paper actually proposes
Without ground-truth labels you cannot validate a score the usual way. So the authors substitute instrumental validity — three properties that, taken together, mean the score is doing real work even though no one has defined the right answer:
- Responsiveness. If you swap a known-safe model for a known-unsafe ("abliterated") one, the score has to move significantly. AUROC of 0.89–1.00 in their runs.
- Variance dominance. When you decompose the score's variance, the target model identity has to dominate the noise from the auditor and the judge. They report η² ≈ 0.52 — model identity is the single biggest factor in the score. If your judge is a bigger source of variance than the model under test, your score is measuring the judge.
- Stability across reruns. Severity profiles need to converge. Theirs stabilise by ten reruns.
Pass all three and you have an instrument. Fail any one and you have a vibe. They build the instrument as SimpleAudit — local-first, scenario-pack-driven — and demonstrate it on a Norwegian public-sector procurement case comparing the Borealis (Norwegian-trained) and Gemma 3 model families.
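For the mechanically minded, here is a minimal sketch of what the three checks can look like in code. The data shapes, the convergence measure for stability, and the one-way variance decomposition (the paper decomposes over auditor and judge as well, not just model identity) are our assumptions, not SimpleAudit's implementation.

```python
# Minimal sketch of the three validity-chain checks. Data shapes, the
# convergence measure, and the one-way decomposition are our assumptions,
# not the paper's SimpleAudit implementation.
import numpy as np
from scipy import stats

def responsiveness_auroc(safe_scores, unsafe_scores):
    """Responsiveness: AUROC separating a known-safe model's severity scores
    from a known-unsafe (abliterated) one's, via the Mann-Whitney U statistic."""
    u, _ = stats.mannwhitneyu(unsafe_scores, safe_scores, alternative="greater")
    return u / (len(safe_scores) * len(unsafe_scores))

def eta_squared(scores_by_model):
    """Variance dominance: eta^2 for target-model identity, i.e. the share of
    total score variance explained by which model is under test."""
    pooled = np.concatenate(scores_by_model)
    grand_mean = pooled.mean()
    ss_between = sum(len(s) * (np.mean(s) - grand_mean) ** 2 for s in scores_by_model)
    ss_total = ((pooled - grand_mean) ** 2).sum()
    return ss_between / ss_total

def stability_deltas(severity_profiles):
    """Stability: largest shift in the cumulative-mean severity profile from
    one rerun to the next; small late deltas mean the profile has converged."""
    profiles = np.asarray(severity_profiles, dtype=float)  # shape: (reruns, severity categories)
    counts = np.arange(1, len(profiles) + 1)[:, None]
    cumulative_means = np.cumsum(profiles, axis=0) / counts
    return np.abs(np.diff(cumulative_means, axis=0)).max(axis=1)
```

With a rerun budget of ten, as in the paper, stability_deltas returns nine successive shifts; a Risk reviewer would look for responsiveness near 1.0, an eta^2 that dwarfs the auditor and judge terms, and deltas that have flattened well before the budget runs out.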
They also do something quietly significant: they apply the same chain to the existing tool Petri and show that the chain admits both tools. The chain isn't sold as their tool — it's sold as an auditable contract that any safety-scoring tool can be put through. That generalises beautifully.
§3 Why this lands harder for the regulated buyer than the AI-Twitter take suggests
The community read on this paper, judging by the (modest) Hugging Face thread, is "small benchmarking paper, niche language, marginal." That misses what the paper actually buys you, which is again only obvious if you've sat in a Risk function meeting.
Model risk supervision in financial services — SS1/23 for banks, the parallel SYSC expectations for asset managers — has a recurring, awkward demand: show me the chain by which you concluded this model was acceptable for this use. Not a number. A chain.
The instrumental-validity chain is, structurally, that chain. It maps almost mechanically onto what a regulator wants to read:
- Responsiveness → "you can demonstrate the score discriminates between models we already know to be different."
- Variance dominance → "you can demonstrate the score is measuring the model under test, not your auditor or your judge."
- Stability → "you can demonstrate the score is reproducible across reruns within a defined budget."
Every one of those is a sentence a Chief Risk Officer can read out loud and a regulator can sign. Which is the whole game.
§4 What this is not, and what it's quietly preparing the ground for
This is not a benchmark. There is no leaderboard. No "best model for DSAR redaction." The point — and the one most readers seem to miss — is that the authors flatly refuse to collapse the result into a single ranking. They argue that scores, matched deltas, critical rates, uncertainty, and the auditor and judge used must be reported together. Collapsing them into one number is the act that loses the regulator's trust.
This is the right epistemological posture, and it is also a commercial posture. The next generation of regulated AI tooling will not sell "we picked the best model for you." It will sell "we ran the chain, here is the artefact, your Risk function and ours signed off on it together." That is a very different shape of product to the one the consumer-AI side of the market is building.
The frame for any model-selection conversation we have with a regulated buyer should now be: which scenario pack, which auditor, which judge, with which rerun budget — and the deliverable is the chain, not the rank. Buyers who think they want a number are about to find their own Risk function asks for a chain instead. Better we walk in with the chain already.
§5 What we're going to do with it
This pick clears all four Workloft implementation triggers — substrate-relevant, non-duplicative of our existing stack, tractable in under a week of focused work, and with a clear customer link via civiclaw and the regulated-AM conversations sitting in the pipeline. It goes onto the Replication Ledger. The build proposal that comes out of it follows.
▸ Build addendum — Workloft Model Risk Audit (WMRA)
A reproducible, regulator-readable model-comparison tool, scoped in three tiers so we ship cheaply, watch for demand, then escalate. Built on the validity-chain pattern from this paper, fronted by Workloft's existing AP2 mandate signing and audit chain so the resulting artefact is tamper-evident.
Tier 1 · This Note + a one-page scoping doc
You are reading Tier 1. The one-page scoping doc lives at /home/workloft/labs-api/notes/wmra-scope.md and walks a regulated buyer through the chain, the deliverable, and the cost. It is a sales asset, not a product. Carried into the next regulated-buyer conversation, it does the work of explaining why we are different from a vendor selling a leaderboard.
Tier 2 · workloft-audit CLI · MVP
A local-first command: workloft-audit run --models claude-opus-4-7,gpt-5.5,llama-3-70b --pack civiclaw-dsar.yaml --reruns 10 loads a YAML scenario pack, runs each candidate model, computes the validity-chain stats (responsiveness, variance dominance, stability) and emits a JSON report plus a regulator-shaped Markdown summary. SimpleAudit-compatible by design. Useful immediately for our own dogfooding — which model ought to power civiclaw redaction, and what is our defence of that choice?
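A sketch of the artefact the command could emit, assuming a schema of our own devising. The only constraint carried over from the paper is that scores, matched deltas, critical rates, uncertainty, and the auditor and judge used travel together in one record rather than being collapsed into a rank.

```python
# Hypothetical shape of the JSON artefact emitted by `workloft-audit run`.
# Field names are our assumption, not SimpleAudit's schema; the design rule
# is simply that everything the paper says must be reported together stays
# in one record.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ValidityChainReport:
    scenario_pack: str                       # e.g. "civiclaw-dsar.yaml"
    models: list[str]                        # candidate model identifiers
    auditor: str                             # auditor model that drove the scenarios
    judge: str                               # judge model that scored the transcripts
    reruns: int                              # rerun budget used
    responsiveness_auroc: float              # check 1
    eta_sq_model_identity: float             # check 2
    stability_final_delta: float             # check 3
    scores: dict[str, list[float]] = field(default_factory=dict)
    matched_deltas: dict[str, float] = field(default_factory=dict)
    critical_rates: dict[str, float] = field(default_factory=dict)
    uncertainty: dict[str, float] = field(default_factory=dict)  # e.g. bootstrap s.e. per model

def write_report(report: ValidityChainReport, path: str) -> None:
    """Serialise the full record; the Markdown summary is rendered from the same dict."""
    with open(path, "w") as f:
        json.dump(asdict(report), f, indent=2)
```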
Tier 3 · Hosted /v1/audit endpoint on labs-api
A buyer uploads their scenario pack, names the candidate models, gets back a signed JSON artefact plus a PDF for their model-risk file. AP2-mandate-signed so the chain is tamper-evident. This is the £5–15k/mo Bench-as-a-Service hook. Built only when an inbound conversation explicitly asks for it.
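For what "tamper-evident" means in practice, a generic illustration: hash-and-sign the canonical JSON so any later edit to the artefact breaks verification. This is a plain Ed25519 sketch under our own assumptions, not Workloft's AP2 mandate-signing chain.

```python
# Generic illustration of a tamper-evident artefact: sign the canonical JSON
# so any later edit breaks verification. Not the AP2 mandate-signing chain
# the endpoint would actually front.
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

def sign_artefact(report: dict, private_key: Ed25519PrivateKey) -> dict:
    # Canonicalise before signing so whitespace or key order can't change the hash.
    canonical = json.dumps(report, sort_keys=True, separators=(",", ":")).encode()
    return {"report": report, "signature": private_key.sign(canonical).hex()}

def verify_artefact(artefact: dict, public_key: Ed25519PublicKey) -> bool:
    canonical = json.dumps(artefact["report"], sort_keys=True,
                           separators=(",", ":")).encode()
    try:
        public_key.verify(bytes.fromhex(artefact["signature"]), canonical)
        return True
    except InvalidSignature:
        return False
```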
Tier 2 ships only if Tier 1 generates a regulated-buyer conversation that says "we'd want this." Tier 3 ships only if Tier 2 dogfooding plus a buyer conversation says "we'd buy this." We will not pre-build past current demand.
§6 The honest caveat
This paper is one week old. Citation count: zero. Hugging Face upvotes: one. The validity chain is fundamentally an old social-science move repackaged for AI safety, which is part of what we like about it — old moves survive. The risk is that the chain becomes another vendor-marketing surface ("audited under the validity chain!") without anyone holding tools to it. We mitigate that by also publishing — when Tier 2 ships — every failure of the chain we encounter, alongside the successes. The point of a chain is that it can fail.
We will re-check the paper's citation count in two months. If the chain has caught — across regulated AI suppliers, not just academic re-runs — we'll know. If it hasn't, the question becomes which alternative formalism filled the gap, and we'll write that up.
