Workloft Labs

▸ REPLICATION LEDGER

A paper we've cited but never reproduced is a paper we've trusted second-hand. The ledger tracks which Workloft Labs picks we've actually rebuilt ourselves — what worked, what didn't, what we found in the gaps the authors didn't write down. Published openly because "we read it" isn't the same as "we ran it".

Paper	Started	Status	Notes
ARIS — adversarial executor + reviewer harness Lab Bench №01 · Note №01	2026-05-07	▸ IN PROGRESS	Reproducing the executor + reviewer pair on a Claude Opus 4.7 / GPT-5.5 split. Targeting an FCA-shape conversation about who decides when the argument ends.
When No Benchmark Exists — instrumental-validity chain for LLM safety Lab Bench №02 · Note №02 · WMRA Tier-1	2026-05-08	✓ SCOPED	Note №02 published. Tier-1 deliverable for the Workloft Model Risk Audit build is the Note + scoping doc. Tier-2 CLI ships only on inbound regulated-buyer signal.
BRIGHT-Pro — reasoning-intensive retrieval Lab Bench №01 · cited in civiclaw DSAR work	2026-05-08	✓ SCOPED	Mapped to civiclaw DSAR retrieval substrate. Replication in week 2: Norwegian-language pack first, UK GDPR pack second.
Direct Corpus Interaction — agentic search without embeddings Lab Bench №03 · Note №03 · civiclaw skills/foi/corpus.py	2026-05-09	✓ SHIPPED	Built a tool layer (list, grep, read, snippet) over a council document root for civiclaw FOI. No embedding store, no copy. 11/11 unit tests green. Every call is audit-chained, so the agent's path through the corpus is itself an EU AI Act Art. 12 record.
TrustFall — agentic CLI supply-chain audit Maggie Intel · Note №04 · own-repo audit	2026-05-09	✓ SHIPPED	Audited Workloft's exposure to the disclosed agentic-CLI project-trust path: no `.mcp.json` or project `.claude/settings.json` in any active repo (civiclaw, conexus, workloft-site). Live risk reduces to "developer accepts trust dialog on an unvetted clone" — covered by behavioural mitigation. Note №04 publishes the regulated-buyer procurement framing alongside the audit.
Next pick — TBD on Friday's bench	—	○ QUEUED	A paper a week from the bench gets put through our hands. If we can't run it, we say so.

      ▸ Why a ledger?
      Most AI commentary stops at "I read this paper." A regulated buyer can't deploy something we haven't run. The ledger is our forcing function: we reproduce at least one paper a week ourselves and publish the result whether it works or not. Failed replications are scored as honestly as wins.
    

▸ THE 9 AXES

The rubric. Every paper that comes through Walt (our arXiv scanner) is scored 0–10 on each axis. Final pick threshold: 7+ on at least one axis AND ≥ ●●○ REG FIT. Published openly because rubrics that don't survive sunlight aren't rubrics.

Agent infra

Multi-agent runtimes, orchestration, control loops, agent OS primitives. Does the paper change how an agent stack is built, not just how a single model is prompted?

substrate

Tool use · MCP

How agents discover, invoke, and validate tools — including MCP server design, schema typing, and authorisation surfaces.

substrate

RAG · memory

Retrieval, long-context, episodic and semantic memory architectures. Reasoning-intensive retrieval rather than vanilla vector search.

substrate

Audit · provenance

Cryptographic evidence of agent action — mandate signing (AP2), append-only chains, verifiable historic state for an audit committee.

governance

Governance · model risk

Alignment, refusal calibration, FCA SS1/23 model risk concerns. Anything a Risk function would point at when reading the methods section.

governance

Sovereignty · routing

Jurisdictional routing, on-prem viability, EU AI Act / UK DPA disclosures, sovereign-private model paths. Does this paper survive a "no US calls" deployment?

governance

Eval · calibration

Uncertainty estimation, confidence calibration, regulator-readable signals. A model that says "87%" and is right ≈ 87% of the time matters more than a SOTA single number.

research

Cost · efficiency

Distillation, KV-cache discipline, sub-billion-param viability, token economics. Substrate that doesn't scale to a council's budget isn't substrate.

research

Replicability

Open code, open weights, runs on a single machine, README that builds. We weight this hard — claims you cannot reproduce are claims you cannot deploy.

research
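
The pick threshold stated above the axes can be written down in a few lines. This is a minimal sketch, not Walt's actual scoring code: the axis keys and the `is_pick` function are hypothetical names, and we assume REG FIT is a three-dot scale on which ●●○ means 2 of 3.

```python
# Hypothetical axis keys, one per axis in the published rubric.
AXES = (
    "agent_infra", "tool_use_mcp", "rag_memory",
    "audit_provenance", "governance_model_risk", "sovereignty_routing",
    "eval_calibration", "cost_efficiency", "replicability",
)

# Assumption: REG FIT is counted in dots from 0 to 3, so "●●○" is 2.
MIN_REG_FIT = 2


def is_pick(scores: dict[str, int], reg_fit: int) -> bool:
    """Return True when a paper clears the published pick threshold."""
    missing = set(AXES) - set(scores)
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    if any(not 0 <= scores[axis] <= 10 for axis in AXES):
        raise ValueError("axis scores must be in the range 0..10")
    # 7+ on at least one axis AND at least ●●○ REG FIT.
    return max(scores[axis] for axis in AXES) >= 7 and reg_fit >= MIN_REG_FIT
```

A paper scoring 8 on replicability and 3 everywhere else is a pick at two dots of REG FIT and is not a pick at one dot.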

      ▸ Why publish the rubric?
      A scoring system that lives only in a prompt is rumour. Publishing forces it to be defensible — and lets buyers sanity-check our picks against their own risk shape. If you read a Workloft Research Note and disagree with the score, you can challenge the axis, not just the conclusion.
    

▸ THE LANGUAGE

What you read papers in.

attention(Q,K,V) = softmax(QKᵀ/√d)V

∇_θ J(θ) = E[∇_θ log π_θ(a|s) · A(s,a)]

P(y|x) = softmax(Wh + b)

L_KL = Σ p log(p/q)

H(X) = -Σ p log₂ p

argmax_a Q(s,a)

f_θ(x) → ŷ

∂L/∂w = δ · x

Why Workloft Labs

Most consultancies sell whatever's trending. We don't. The reason is simple: in regulated environments — UK Local Authorities, FCA-supervised asset managers, councils handling DSAR and FOI traffic — what's hot on Hugging Face this week is almost never what your Risk function will sign off on next quarter.

Labs is where we read the academic frontier closely, replicate the substrate-level papers, and publish notes on what it actually means for governed agent infrastructure. We pay for the citation graph so you don't have to chase noise.

What lives here

Today's Lab Bench — five test tubes for the day's top arxiv picks, scored, citation-graphed, hover-readable.
Workloft Research Notes — weekly 1,000-word essays on what one paper means for substrate work. Coming soon.
Replications — papers we've reproduced ourselves, with our findings published. Coming soon.
Open Problems — questions we're working on at the substrate level: audit chains, mandate primitives, sovereign routing.

▸ WHO WE READ

The 10 researchers we follow personally. Names matter — the list itself is a statement of taste before any output. We check Semantic Scholar for new work daily.

▸ WORKLOFT RESEARCH NOTES

One paper, one regulated lens, ~1,000 words. Strong opinions, weakly held.

NOTE №01 · 2026-05-07

ARIS: the executor-reviewer pattern that the regulated AM was already going to need

An open-source research harness landed on Hugging Face this week with 82 upvotes. It pairs an executor LLM with an adversarial reviewer LLM. It also describes — almost word for word — the substrate pattern that an FCA-supervised asset manager will need before any agent goes live in fund accounting…


NOTE №02 · 2026-05-08

When no benchmark exists: the methodology your Risk function was already going to need

A Norwegian-led paper this week formalises the situation every UK Local Authority and FCA-supervised buyer is actually in: deploying LLMs in a sector or language for which no labelled safety benchmark yet exists. It hands them a defensible instrumental-validity chain — and ships the tool that runs it. Includes the Tier-1 build addendum for Workloft Model Risk Audit (WMRA)…


▸ OPEN PROBLEMS WE'RE WORKING ON

Questions at the substrate layer that don't have published answers yet. We're working on these. If you're a researcher with a take — or a buyer who's hit one of these in production — drop us a line.

№01 · STATUS: ACTIVE

Provable agent authorisation under FCA SS1/23

How do we cryptographically evidence which agent took which action, with which mandate, in a way that satisfies SS1/23 model risk requirements at the level of an asset manager's Risk function and an audit committee?

SUBSTRATE · AP2 · AUDIT CHAIN
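
One building block for this problem is an append-only hash chain over agent actions. The sketch below is illustrative only: the record fields (`agent_id`, `action`, `mandate_id`, `ts`) and the function names are our assumptions, not the AP2 format or any shipped Workloft schema. It gives tamper evidence and nothing more; proving who authorised an entry still needs a signature over each record, which is left out here.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    # Sorted keys and fixed separators so the same record always hashes the same.
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_entry(chain: list[dict], agent_id: str, action: str,
                 mandate_id: str, ts: str) -> dict:
    """Append a record that commits to the hash of the previous record."""
    prev = chain[-1]["hash"] if chain else GENESIS
    body = {"agent_id": agent_id, "action": action,
            "mandate_id": mandate_id, "ts": ts, "prev": prev}
    entry = {**body, "hash": _digest(body)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash and link; any edited or reordered entry fails."""
    prev = GENESIS
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True
```

Editing any earlier record changes its hash, which breaks the `prev` link of every record after it, so `verify_chain` returns `False`.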

№02 · STATUS: ACTIVE

Adversarial executor-reviewer convergence policies

When does the argument between an executor LLM and a reviewer LLM end? Who decides — and how do we evidence that decision policy itself to a regulator? ARIS exposes the orchestrator as configurable; we're working out the policy shape that survives an FCA conversation.

AGENT INFRA · GOVERNANCE
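
To make "who decides when the argument ends" concrete, here is a minimal sketch of one possible policy: stop when the reviewer accepts or when a round budget runs out, and record which of the two happened. It is not ARIS's orchestrator API. The `execute` and `review` callables, the `Outcome` type, and the `max_rounds` default are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Outcome:
    draft: str
    rounds: int
    stop_reason: str  # "reviewer_accepted" or "max_rounds"
    transcript: list[tuple[str, str]] = field(default_factory=list)


def run_until_converged(
    execute: Callable[[str, str], str],              # (task, feedback) -> draft
    review: Callable[[str, str], tuple[bool, str]],  # (task, draft) -> (accepted, feedback)
    task: str,
    max_rounds: int = 3,
) -> Outcome:
    """Alternate executor and reviewer until acceptance or the round budget."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = ""
    transcript: list[tuple[str, str]] = []
    for round_no in range(1, max_rounds + 1):
        draft = execute(task, feedback)
        accepted, feedback = review(task, draft)
        transcript.append((draft, feedback))  # evidence of every exchange
        if accepted:
            return Outcome(draft, round_no, "reviewer_accepted", transcript)
    # Budget exhausted: return the last draft and say why the loop stopped.
    return Outcome(draft, max_rounds, "max_rounds", transcript)
```

The point of the explicit `stop_reason` and `transcript` is that the termination decision becomes a recorded fact rather than an implicit property of the loop.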

№03 · STATUS: SCOPING

Multi-tenant audit chains without cross-tenant leakage

A single append-only audit log serving N regulated entities, where no client's metadata, timing patterns, or call shapes can be inferred from another's. Hash trees alone aren't enough — the side-channels matter.

AUDIT · MULTI-TENANT · DPA

№04 · STATUS: ACTIVE · CIVICLAW

Calibrated redaction confidence under UK GDPR

When a redaction model says "I'm 87% confident this string is a name," what does the 87% mean to a Local Authority's DPO? We need calibration that transforms model uncertainty into a regulator-readable signal — and a triage workflow that surfaces low-confidence cases for human review without drowning the team.

DSAR · CALIBRATION · UK GDPR
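
One standard way to check what "87%" means is expected calibration error (ECE): bucket predictions by stated confidence and compare each bucket's average confidence with its observed accuracy. The sketch below is a plain-Python illustration, not civiclaw code. The function names, the ten equal-width bins, and the 0.9 review threshold are assumptions chosen for the example.

```python
def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """Weighted mean gap between stated confidence and observed accuracy."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must be in [0, 1]")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


def needs_human_review(confidence: float, threshold: float = 0.9) -> bool:
    """Route low-confidence redactions to a person (threshold is illustrative)."""
    return confidence < threshold
```

A model that states 0.87 on 100 strings and is right on 87 of them scores an ECE of roughly zero. The same model being wrong on all of them scores roughly 0.87.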

№05 · STATUS: SCOPING

Sovereign vs cross-border agent routing

When an agent's reasoning chain crosses jurisdictional lines (UK → US → back), what disclosure does it trigger under the EU AI Act, UK DPA 2018, and emerging FCA expectations on operational resilience? Ruby (our model router) makes the routing visible; the open question is which disclosures it should mint automatically.

SOVEREIGNTY · ROUTING · EU AI ACT

№06 · STATUS: PARKED

Verifiable agent provenance for AP2 mandates at scale

AP2 V0.1 ships did:web identity + eddsa-jcs signing for individual mandates. What's the rotation, revocation and historic-verification model when a regulated firm needs to answer "did this agent have authority to act on this mandate as of 14:32 UTC last Tuesday?" — six months after the action?

AP2 · KEY ROTATION · PROVENANCE
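
The historic-verification question reduces, at its simplest, to a lookup over key validity windows. The sketch below assumes a hypothetical `KeyRecord` history kept per agent; it is not part of AP2 and performs no signature or did:web resolution. It only answers whether a given key was in force at a given instant.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class KeyRecord:
    key_id: str
    valid_from: datetime                    # timezone-aware, inclusive
    valid_until: Optional[datetime] = None  # exclusive; None means still active


def key_valid_at(history: list[KeyRecord], key_id: str, when: datetime) -> bool:
    """Was `key_id` in force at `when`, according to the recorded history?"""
    for rec in history:
        if rec.key_id != key_id:
            continue
        started = rec.valid_from <= when
        not_ended = rec.valid_until is None or when < rec.valid_until
        if started and not_ended:
            return True
    return False


# Usage: key-1 was rotated out on 1 March; key-2 has been active since.
UTC = timezone.utc
history = [
    KeyRecord("key-1", datetime(2026, 1, 1, tzinfo=UTC), datetime(2026, 3, 1, tzinfo=UTC)),
    KeyRecord("key-2", datetime(2026, 3, 1, tzinfo=UTC)),
]
print(key_valid_at(history, "key-1", datetime(2026, 2, 10, 14, 32, tzinfo=UTC)))
```

The hard parts of the open problem sit outside this lookup: the history itself has to be tamper-evident, and a revocation discovered late has to be distinguishable from a rotation that was planned.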

№07 · STATUS: NOTE №04

Procurement-grade agentic CLI defaults for regulated dev pipelines

All four major agentic coding CLIs default to project-trust on a single Enter keypress (TrustFall, Adversa.AI 2026). What does a default-deny posture look like for an FCA SS1/23 §3.4 supply-chain control or an ICO DPIA, and which of the four can be configured into it without breaking the productivity that justified adopting them?

SUPPLY CHAIN · MCP · SS1/23

      Working on something adjacent? alfred@workloft.ai
    

▸ THE PIPE

How a paper becomes a Workloft signal.

Ingest

Walt scans cs.AI + cs.LG + stat.ML on arXiv every morning and cross-references Hugging Face Daily Papers for community-vetted picks.

Score

Gemini 2.5 Flash scores each paper against Workloft's research axes: agent infra, MCP, RAG, agentic RL, vision-bulk, governance.

Graph

Top picks go through Semantic Scholar — citation count + influential citations. Distinguishes academic gravity from Twitter buzz.

Synthesise

Weekly Workloft Research Note: one paper, 1,000 words, opinion held strongly, defended in public. Forcing function for depth.
