Workloft

WORKLOFT LABS

Substrate before spectacle.

The research arm of Workloft. We track the AI frontier daily — 70+ papers a week filtered by Walt, citation-graphed against Semantic Scholar — and publish what to actually build for governed agent infrastructure. Less buzz. More lineage.

▸ NEW · LABS API + MCP — get a free key
Papers screened today: 71
Above threshold: 5
Workloft agents: 7
Citations tracked
TODAY'S LAB BENCH
Five test tubes. Five papers. Top arXiv picks for today, scored against the Workloft research axes. Live from Walt's pipeline, refreshed on page load.
REG FIT = "would this clear an FCA Risk review, a UK GDPR DPIA, or a Local Authority procurement audit?"
●●● strong fit · ●●○ moderate · ●○○ low / academic-only. The column nobody else in AI research publishes — and exactly the one regulated buyers need.
▸ THE PLANE WE SELECT ON
Hugging Face buzz on the x-axis. Regulated-buyer fit on the y-axis. Today's five picks plotted. Top-right is the only quadrant a regulated asset manager (AM) is allowed to deploy from, and the only one where alpha and consensus disagree productively.
[Figure: today's five picks plotted by Hugging Face buzz (x-axis, 0 to 100) against REG FIT (y-axis, ●○○ to ●●●).
Quadrants: UNDERRATED · ALPHA (low buzz, regulator-ready) · FRONTIER YOU CAN SHIP (high buzz, high reg fit) · ACADEMIC NOISE (low buzz, low reg fit) · SHINY · CAN'T SHIP (high buzz, fails risk review).
Picks: Skills-Coach (tool-use, 3↑) · RL post-training framework (agent-infra, 2↑) · OpenSearch-VL (tool-use, 12↑) · BRIGHT-Pro (RAG, 12↑, ●●●) · ARIS ⭐ (agent-infra, 82↑, ●●●).]
▸ Read it like this. Most consultancies pick from the right column (high buzz). Most academics pick from the top row (regulator-ready). Workloft Labs only signals out of the top-right quadrant — and tracks the top-left for alpha buyers haven't priced in yet.
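A minimal sketch of how a pick could be placed on this plane. The cut-offs are assumptions, since the page does not say where the quadrant lines sit: buzz is split at the midpoint of the 0 to 100 axis, and ●●○ or better counts as regulator-ready. The `Pick` type and `quadrant` function are hypothetical names, not part of the Labs API.

```python
from dataclasses import dataclass

# Assumed cut-offs: the page does not state where the quadrant lines sit.
BUZZ_MIDPOINT = 50      # x-axis runs 0 to 100
REG_FIT_MIN_DOTS = 2    # two dots (moderate fit) or better counts as regulator-ready


@dataclass(frozen=True)
class Pick:
    name: str
    buzz: int       # Hugging Face buzz on the 0 to 100 axis
    reg_fit: int    # REG FIT dots, 1 to 3


def quadrant(pick: Pick) -> str:
    """Return the quadrant label for a pick on the buzz / REG FIT plane."""
    high_buzz = pick.buzz >= BUZZ_MIDPOINT
    reg_ready = pick.reg_fit >= REG_FIT_MIN_DOTS
    if reg_ready:
        return "FRONTIER YOU CAN SHIP" if high_buzz else "UNDERRATED · ALPHA"
    return "SHINY · CAN'T SHIP" if high_buzz else "ACADEMIC NOISE"
```

Under these assumptions, and treating the upvote count as the buzz value, `quadrant(Pick("ARIS", 82, 3))` lands top-right and `quadrant(Pick("BRIGHT-Pro", 12, 3))` lands top-left.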
▸ REPLICATION LEDGER
A paper we've cited but never reproduced is a paper we've trusted second-hand. The ledger tracks which Workloft Labs picks we've actually rebuilt ourselves — what worked, what didn't, what we found in the gaps the authors didn't write down. Published openly because "we read it" isn't the same as "we ran it".
ARIS: adversarial executor + reviewer harness
Lab Bench №01 · Note №01
Started: 2026-05-07
Status: ▸ IN PROGRESS
Notes: Reproducing the executor + reviewer pair on a Claude Opus 4.7 / GPT-5.5 split. Targeting an FCA-shape conversation about who decides when the argument ends.

When No Benchmark Exists: instrumental-validity chain for LLM safety
Lab Bench №02 · Note №02 · WMRA Tier-1
Started: 2026-05-08
Status: ✓ SCOPED
Notes: Note №02 published. Tier-1 deliverable for the Workloft Model Risk Audit build is the Note + scoping doc. Tier-2 CLI ships only on inbound regulated-buyer signal.

BRIGHT-Pro: reasoning-intensive retrieval
Lab Bench №01 · cited in civiclaw DSAR work
Started: 2026-05-08
Status: ✓ SCOPED
Notes: Mapped to civiclaw DSAR retrieval substrate. Replication in week 2: Norwegian-language pack first, UK GDPR pack second.

Direct Corpus Interaction: agentic search without embeddings
Lab Bench №03 · Note №03 · civiclaw skills/foi/corpus.py
Started: 2026-05-09
Status: ✓ SHIPPED
Notes: Built a tool layer (list, grep, read, snippet) over a council document root for civiclaw FOI. No embedding store, no copy. 11/11 unit tests green. Every call is audit-chained: the agent's path through the corpus is itself an EU AI Act Art. 12 record.

TrustFall: agentic CLI supply-chain audit
Maggie Intel · Note №04 · own-repo audit
Started: 2026-05-09
Status: ✓ SHIPPED
Notes: Audited Workloft's exposure to the disclosed agentic-CLI project-trust path: no .mcp.json or project .claude/settings.json in any active repo (civiclaw, conexus, workloft-site). Live risk reduces to "developer accepts trust dialog on an unvetted clone", which is covered by behavioural mitigation. Note №04 publishes the regulated-buyer procurement framing alongside the audit.

Next pick: TBD on Friday's bench
Status: ○ QUEUED
Notes: A paper a week from the bench gets put through our hands. If we can't run it, we say so.
▸ Why a ledger? Most AI commentary stops at "I read this paper." A regulated buyer can't deploy something we haven't run. The ledger is our forcing function — at least one paper a week is reproduced ourselves and the result is published whether it works or not. Failed replications are scored as honestly as wins.
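The Direct Corpus Interaction entry in the ledger describes a tool layer over a document root where every call is audit-chained. The sketch below shows one way that shape could look, and is not the civiclaw implementation: the `AuditedCorpus` class, its method names and the SHA-256 chaining scheme are all assumptions made for illustration, and only three of the four tools (list, grep, read) are sketched.

```python
import hashlib
import json
from pathlib import Path

GENESIS = "0" * 64  # assumed starting value for the hash chain


class AuditedCorpus:
    """Read-only tools over a document root; every call is hash-chained."""

    def __init__(self, root: Path):
        self.root = Path(root).resolve()
        self.chain = []          # append-only list of audit records
        self._prev = GENESIS

    def _record(self, tool: str, args: dict) -> None:
        # Each record commits to the previous record's hash.
        body = json.dumps({"tool": tool, "args": args, "prev": self._prev},
                          sort_keys=True)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        self.chain.append({"tool": tool, "args": args,
                           "prev": self._prev, "hash": digest})
        self._prev = digest

    def list_files(self):
        self._record("list", {})
        return sorted(str(p.relative_to(self.root))
                      for p in self.root.rglob("*") if p.is_file())

    def read(self, rel_path: str) -> str:
        # The attempt is logged before the path check, so refused reads are on record.
        self._record("read", {"path": rel_path})
        target = (self.root / rel_path).resolve()
        if self.root not in target.parents:
            raise ValueError("path escapes the document root")
        return target.read_text(encoding="utf-8")

    def grep(self, needle: str):
        self._record("grep", {"needle": needle})
        hits = []
        for p in sorted(self.root.rglob("*")):
            if not p.is_file():
                continue
            text = p.read_text(encoding="utf-8", errors="ignore")
            for line_no, line in enumerate(text.splitlines(), 1):
                if needle in line:
                    hits.append((str(p.relative_to(self.root)), line_no, line))
        return hits


def verify(chain) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev = GENESIS
    for rec in chain:
        body = json.dumps({"tool": rec["tool"], "args": rec["args"], "prev": prev},
                          sort_keys=True)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if rec["prev"] != prev or rec["hash"] != digest:
            return False
        prev = rec["hash"]
    return True
```

Because the files are read in place, nothing is copied or embedded, and `verify(corpus.chain)` lets a reviewer confirm after the fact that the recorded path through the corpus has not been altered.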
▸ THE 9 AXES
The rubric. Every paper through Walt is scored 0–10 on each axis. Final pick threshold: 7+ on at least one axis AND ≥ ●●○ REG FIT. Published openly because rubrics that don't survive sunlight aren't rubrics.
A1 · Agent infra (substrate)
Multi-agent runtimes, orchestration, control loops, agent OS primitives. Does the paper change how an agent stack is built, not just how a single model is prompted?

A2 · Tool use · MCP (substrate)
How agents discover, invoke, and validate tools, including MCP server design, schema typing, and authorisation surfaces.

A3 · RAG · memory (substrate)
Retrieval, long-context, episodic and semantic memory architectures. Reasoning-intensive retrieval rather than vanilla vector search.

A4 · Audit · provenance (governance)
Cryptographic evidence of agent action: mandate signing (AP2), append-only chains, verifiable historic state for an audit committee.

A5 · Governance · model risk (governance)
Alignment, refusal calibration, FCA SS1/23 model risk concerns. Anything a Risk function would point at when reading the methods section.

A6 · Sovereignty · routing (governance)
Jurisdictional routing, on-prem viability, EU AI Act / UK DPA disclosures, sovereign-private model paths. Does this paper survive a "no US calls" deployment?

A7 · Eval · calibration (research)
Uncertainty estimation, confidence calibration, regulator-readable signals. A model that says "87%" and is right ≈ 87% of the time matters more than a SOTA single number.

A8 · Cost · efficiency (research)
Distillation, KV-cache discipline, sub-billion-param viability, token economics. Substrate that doesn't scale to a council's budget isn't substrate.

A9 · Replicability (research)
Open code, open weights, runs on a single machine, README that builds. We weight this hard: claims you cannot reproduce are claims you cannot deploy.
▸ Why publish the rubric? A scoring system that lives only in a prompt is rumour. Publishing forces it to be defensible — and lets buyers sanity-check our picks against their own risk shape. If you read a Workloft Research Note and disagree with the score, you can challenge the axis, not just the conclusion.
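The pick threshold stated in the rubric (7+ on at least one axis AND ≥ ●●○ REG FIT) can be written down as a small check. This is a sketch, not Walt's scoring code: the `is_pick` function and the encoding of REG FIT as a dot count from 1 to 3 are assumptions.

```python
AXES = ("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9")

MIN_AXIS_SCORE = 7     # "7+ on at least one axis"
MIN_REG_FIT_DOTS = 2   # two dots or better (assumed encoding: 1 to 3 dots)


def is_pick(scores: dict, reg_fit_dots: int) -> bool:
    """Return True when a paper clears the published pick threshold."""
    if set(scores) != set(AXES):
        raise ValueError("expected one score for each of the nine axes")
    if any(not 0 <= s <= 10 for s in scores.values()):
        raise ValueError("axis scores run from 0 to 10")
    return max(scores.values()) >= MIN_AXIS_SCORE and reg_fit_dots >= MIN_REG_FIT_DOTS
```

A paper scoring 9 on A4 but only ●○○ on REG FIT is rejected, which is the point of the second condition: a strong axis score alone is not enough.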
▸ THE LANGUAGE
What you read papers in. Hover for vibes.
attention(Q,K,V) = softmax(QKᵀ/√d)V
∇_θ J(θ) = E[∇_θ log π_θ(a|s) · A(s,a)]
P(y|x) = softmax(Wh + b)
L_KL = Σ p log(p/q)
H(X) = -Σ p log₂ p
argmax_a Q(s,a)
f_θ(x) → ŷ
∂L/∂w = δ · x
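Three of the formulas above (softmax, the KL term and the entropy H(X)) are short enough to compute by hand. A minimal standard-library sketch follows; the function names are illustrative, and `kl_divergence` assumes q is non-zero wherever p is non-zero.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(p):
    """H(X) = -sum p log2 p, in bits; zero-probability terms contribute 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """sum p log(p/q), in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

A fair coin has `entropy_bits([0.5, 0.5]) == 1.0`, and the KL divergence of any distribution from itself is 0.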

Why Workloft Labs

Most consultancies sell whatever's trending. We don't. The reason is simple: in regulated environments — UK Local Authorities, FCA-supervised asset managers, councils handling DSAR and FOI traffic — what's hot on Hugging Face this week is almost never what your Risk function will sign off on next quarter.

Labs is where we read the academic frontier closely, replicate the substrate-level papers, and publish notes on what it actually means for governed agent infrastructure. We pay for the citation graph so you don't have to chase noise.

What lives here

▸ WHO WE READ
The 10 researchers we follow personally. Names matter — the list itself is a statement of taste before any output. We check Semantic Scholar for new work daily.
▸ WORKLOFT RESEARCH NOTES
One paper, one regulated lens, ~1,000 words. Strong opinions, weakly held.
NOTE №01 · 2026-05-07

ARIS: the executor-reviewer pattern that the regulated AM was already going to need

An open-source research harness landed on Hugging Face this week with 82 upvotes. It pairs an executor LLM with an adversarial reviewer LLM. It also describes — almost word for word — the substrate pattern that an FCA-supervised asset manager will need before any agent goes live in fund accounting…

NOTE №02 · 2026-05-08

When no benchmark exists: the methodology your Risk function was already going to need

A Norwegian-led paper this week formalises the situation every UK Local Authority and FCA-supervised buyer is actually in: deploying LLMs in a sector or language for which no labelled safety benchmark yet exists. It hands them a defensible instrumental-validity chain — and ships the tool that runs it. Includes the Tier-1 build addendum for Workloft Model Risk Audit (WMRA)…

▸ OPEN PROBLEMS WE'RE WORKING ON
Questions at the substrate layer that don't have published answers yet. We're working on these. If you're a researcher with a take — or a buyer who's hit one of these in production — drop us a line.
№01 · STATUS: ACTIVE

Provable agent authorisation under FCA SS1/23

How do we cryptographically evidence which agent took which action, with which mandate, in a way that satisfies SS1/23 model risk requirements at the level of an asset manager's Risk function and an audit committee?

SUBSTRATE · AP2 · AUDIT CHAIN

№02 · STATUS: ACTIVE

Adversarial executor-reviewer convergence policies

When does the argument between an executor LLM and a reviewer LLM end? Who decides — and how do we evidence that decision policy itself to a regulator? ARIS exposes the orchestrator as configurable; we're working out the policy shape that survives an FCA conversation.

AGENT INFRA · GOVERNANCE
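One candidate policy shape, sketched under assumptions: the loop ends when the reviewer approves or when a round cap is hit, and the reason for stopping is recorded alongside the transcript so the decision itself can be evidenced. This is not the ARIS API. The `run_debate` function, the `Verdict` and `Outcome` types and the default of three rounds are hypothetical, and the executor and reviewer are plain callables standing in for LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    approved: bool
    critique: str = ""


@dataclass
class Outcome:
    draft: str
    stop_reason: str    # "reviewer_approved" or "max_rounds"
    rounds: int
    transcript: List[Tuple[str, str]] = field(default_factory=list)


def run_debate(execute: Callable[[str, str], str],
               review: Callable[[str], Verdict],
               task: str,
               max_rounds: int = 3) -> Outcome:
    """Alternate executor and reviewer until approval or the round cap."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    transcript: List[Tuple[str, str]] = []
    critique = ""
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = execute(task, critique)       # executor sees the last critique
        verdict = review(draft)
        transcript.append((draft, verdict.critique))
        if verdict.approved:
            return Outcome(draft, "reviewer_approved", round_no, transcript)
        critique = verdict.critique
    # The cap, not the reviewer, ended the argument; that is recorded explicitly.
    return Outcome(draft, "max_rounds", max_rounds, transcript)
```

Keeping `stop_reason` as a first-class field means a reviewer of the log can tell an approved output apart from one that simply ran out of rounds.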

№03 · STATUS: SCOPING

Multi-tenant audit chains without cross-tenant leakage

A single append-only audit log serving N regulated entities, where no client's metadata, timing patterns, or call shapes can be inferred from another's. Hash trees alone aren't enough — the side-channels matter.

AUDIT · MULTI-TENANT · DPA

№04 · STATUS: ACTIVE · CIVICLAW

Calibrated redaction confidence under UK GDPR

When a redaction model says "I'm 87% confident this string is a name," what does the 87% mean to a Local Authority's DPO? We need calibration that transforms model uncertainty into a regulator-readable signal — and a triage workflow that surfaces low-confidence cases for human review without drowning the team.

DSAR · CALIBRATION · UK GDPR
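To make "what does the 87% mean" concrete, the sketch below computes expected calibration error (ECE), a standard measure of the gap between stated confidence and observed accuracy, and pairs it with a simple triage rule. The bin count, the 0.95 auto-redaction threshold and the function names are assumptions for illustration, not civiclaw settings.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin."""
    total = len(confidences)
    if total == 0 or total != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # confidence 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


def triage(confidence, auto_threshold=0.95):
    """Route a redaction decision; the threshold is an assumed example value."""
    return "auto_redact" if confidence >= auto_threshold else "human_review"
```

A model that reports 0.9 on four strings and is right on three has an ECE of 0.15. The triage threshold only means something to a DPO once that gap has been measured and driven down.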

№05 · STATUS: SCOPING

Sovereign vs cross-border agent routing

When an agent's reasoning chain crosses jurisdictional lines (UK → US → back), what disclosure does it trigger under the EU AI Act, the UK DPA 2018, and emerging FCA expectations on operational resilience? Ruby (our model router) makes the routing visible; the question is what disclosures it should mint automatically.

SOVEREIGNTY · ROUTING · EU AI ACT

№06 · STATUS: PARKED

Verifiable agent provenance for AP2 mandates at scale

AP2 V0.1 ships did:web identity + eddsa-jcs signing for individual mandates. What's the rotation, revocation and historic-verification model when a regulated firm needs to answer "did this agent have authority to act on this mandate as of 14:32 UTC last Tuesday?" — six months after the action?

AP2 · KEY ROTATION · PROVENANCE
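The historic-verification question reduces, at minimum, to a point-in-time lookup over key validity intervals. The sketch below shows that lookup only. It is not the AP2 data model: the `KeyRecord` fields and the `key_was_valid` function are hypothetical, and signature verification and the integrity of the history itself are out of scope.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class KeyRecord:
    agent_id: str
    key_id: str
    valid_from: datetime
    revoked_at: Optional[datetime] = None   # None means never revoked


def key_was_valid(history: Iterable[KeyRecord], agent_id: str,
                  key_id: str, at: datetime) -> bool:
    """True if the key was active for this agent at the given instant."""
    for rec in history:
        if rec.agent_id != agent_id or rec.key_id != key_id:
            continue
        not_yet_revoked = rec.revoked_at is None or at < rec.revoked_at
        if rec.valid_from <= at and not_yet_revoked:
            return True
    return False
```

The lookup is only as trustworthy as the history it reads, so in practice the rotation and revocation records would themselves need to sit in an append-only, verifiable store.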

№07 · STATUS: NOTE №04

Procurement-grade agentic CLI defaults for regulated dev pipelines

All four major agentic coding CLIs default to project-trust on a single Enter keypress (TrustFall, Adversa.AI 2026). What does a default-deny posture look like for an FCA SS1/23 §3.4 supply-chain control or an ICO DPIA, and which of the four can be configured into it without breaking the productivity that justified adopting them?

SUPPLY CHAIN · MCP · SS1/23

Working on something adjacent? alfred@workloft.ai
▸ THE PIPE
How a paper becomes a Workloft signal.
01 · Ingest
Walt scans cs.AI + cs.LG + stat.ML on arXiv every morning and cross-references Hugging Face Daily Papers for community-vetted picks.

02 · Score
Gemini 2.5 Flash scores each paper against Workloft's research axes: agent infra, MCP, RAG, agentic RL, vision-bulk, governance.

03 · Graph
Top picks go through Semantic Scholar: citation count plus influential citations. This distinguishes academic gravity from Twitter buzz.

04 · Synthesise
Weekly Workloft Research Note: one paper, 1,000 words, opinion held strongly, defended in public. Forcing function for depth.
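The four stages can be sketched as a single function with the external calls injected. Everything here is illustrative: `run_pipe`, the `Paper` fields, the single overall score and the default threshold and cut-off are assumptions, and the `score` and `cite_lookup` callables stand in for the model scoring and Semantic Scholar lookups rather than calling any real API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Paper:
    title: str
    score: float = 0.0               # 0 to 10, filled in by the scoring stage
    citations: int = 0
    influential_citations: int = 0


def run_pipe(papers: Iterable[Paper],
             score: Callable[[Paper], float],
             cite_lookup: Callable[[Paper], Tuple[int, int]],
             threshold: float = 7.0,
             top_n: int = 5) -> List[Paper]:
    """Ingest, score, filter, then enrich the top picks with citation data."""
    above = []
    for paper in papers:                          # 01 Ingest: caller supplies the feed
        paper.score = score(paper)                # 02 Score: injected scorer
        if paper.score >= threshold:
            above.append(paper)
    top = sorted(above, key=lambda p: p.score, reverse=True)[:top_n]
    for paper in top:                             # 03 Graph: injected citation lookup
        paper.citations, paper.influential_citations = cite_lookup(paper)
    return top                                    # 04 Synthesise: the Note is written by hand
```

Citation lookups run only on papers that clear the threshold, which matches the order of the stages: the graph step is applied to top picks, not to the whole morning feed.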