The research arm of Workloft. We track the AI frontier daily — 70+ papers a week filtered by Walt, citation-graphed against Semantic Scholar — and publish what to actually build for governed agent infrastructure. Less buzz. More lineage.
| Paper | Started | Status | Notes |
|---|---|---|---|
| ARIS: adversarial executor + reviewer harness (Lab Bench №01 · Note №01) | 2026-05-07 | ▸ IN PROGRESS | Reproducing the executor + reviewer pair on a Claude Opus 4.7 / GPT-5.5 split. Targeting an FCA-shape conversation about who decides when the argument ends. |
| When No Benchmark Exists: instrumental-validity chain for LLM safety (Lab Bench №02 · Note №02 · WMRA Tier-1) | 2026-05-08 | ✓ SCOPED | Note №02 published. Tier-1 deliverable for the Workloft Model Risk Audit build is the Note + scoping doc. Tier-2 CLI ships only on inbound regulated-buyer signal. |
| BRIGHT-Pro: reasoning-intensive retrieval (Lab Bench №01 · cited in civiclaw DSAR work) | 2026-05-08 | ✓ SCOPED | Mapped to civiclaw DSAR retrieval substrate. Replication in week 2: Norwegian-language pack first, UK GDPR pack second. |
| Direct Corpus Interaction: agentic search without embeddings (Lab Bench №03 · Note №03 · civiclaw skills/foi/corpus.py) | 2026-05-09 | ✓ SHIPPED | Built a tool layer (list, grep, read, snippet) over a council document root for civiclaw FOI. No embedding store, no copy. 11/11 unit tests green. Every call is audit-chained: the agent's path through the corpus is itself an EU AI Act Art. 12 record. |
| TrustFall: agentic CLI supply-chain audit (Maggie Intel · Note №04 · own-repo audit) | 2026-05-09 | ✓ SHIPPED | Audited Workloft's exposure to the disclosed agentic-CLI project-trust path: no .mcp.json or project .claude/settings.json in any active repo (civiclaw, conexus, workloft-site). Live risk reduces to "developer accepts trust dialog on an unvetted clone", which is covered by behavioural mitigation. Note №04 publishes the regulated-buyer procurement framing alongside the audit. |
| Next pick — TBD on Friday's bench | — | ○ QUEUED | A paper a week from the bench gets put through our hands. If we can't run it, we say so. |
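
The Direct Corpus Interaction row describes a tool layer (list, grep, read, snippet) whose calls are audit-chained. A minimal sketch of that shape follows. It is not the code in civiclaw's corpus.py: the class name `AuditedCorpus`, the `list_files` method name, the record fields and the SHA-256 hash chain are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def _digest(body):
    # Hash a canonical JSON form so verification is reproducible.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditedCorpus:
    """Hypothetical read-only tools over a document root, with a hash-chained call log."""

    def __init__(self, root):
        self.root = Path(root).resolve()
        self.chain = []  # audit records, oldest first

    def _record(self, tool, args):
        # Each record commits to the previous record's hash (assumed scheme).
        prev = self.chain[-1]["hash"] if self.chain else "0" * 64
        body = {"seq": len(self.chain), "tool": tool, "args": args, "prev": prev}
        self.chain.append({**body, "hash": _digest(body)})

    def _resolve(self, rel):
        path = (self.root / rel).resolve()
        if not path.is_relative_to(self.root):  # requires Python 3.9+
            raise ValueError("path escapes document root")
        return path

    def _files(self):
        return sorted(p for p in self.root.rglob("*") if p.is_file())

    def list_files(self):
        self._record("list", {})
        return [p.relative_to(self.root).as_posix() for p in self._files()]

    def read(self, rel):
        # The call is logged before the path check, so rejected calls are recorded too.
        self._record("read", {"path": rel})
        return self._resolve(rel).read_text(encoding="utf-8")

    def grep(self, needle):
        self._record("grep", {"needle": needle})
        hits = []
        for p in self._files():
            text = p.read_text(encoding="utf-8", errors="replace")
            for n, line in enumerate(text.splitlines(), 1):
                if needle in line:
                    hits.append((p.relative_to(self.root).as_posix(), n))
        return hits

    def snippet(self, rel, line, context=1):
        self._record("snippet", {"path": rel, "line": line, "context": context})
        lines = self._resolve(rel).read_text(encoding="utf-8").splitlines()
        return lines[max(0, line - 1 - context): line + context]


def verify_chain(chain):
    """Recompute every hash; any edited, dropped or reordered record fails."""
    prev = "0" * 64
    for i, rec in enumerate(chain):
        body = {k: rec[k] for k in ("seq", "tool", "args", "prev")}
        if rec["seq"] != i or rec["prev"] != prev or rec["hash"] != _digest(body):
            return False
        prev = rec["hash"]
    return True
```

Files are read in place, so there is no embedding store and no copy of the corpus. The chain only shows that the log is internally consistent. Anchoring it to an external timestamp or signature would be a separate step.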
Most consultancies sell whatever's trending. We don't. The reason is simple: in regulated environments — UK Local Authorities, FCA-supervised asset managers, councils handling DSAR and FOI traffic — what's hot on Hugging Face this week is almost never what your Risk function will sign off on next quarter.
Labs is where we read the academic frontier closely, replicate the substrate-level papers, and publish notes on what they actually mean for governed agent infrastructure. We pay for the citation graph so you don't have to chase noise.
An open-source research harness landed on Hugging Face this week with 82 upvotes. It pairs an executor LLM with an adversarial reviewer LLM. It also describes — almost word for word — the substrate pattern that an FCA-supervised asset manager will need before any agent goes live in fund accounting…
A Norwegian-led paper this week formalises the situation every UK Local Authority and FCA-supervised buyer is actually in: deploying LLMs in a sector or language for which no labelled safety benchmark yet exists. It hands them a defensible instrumental-validity chain — and ships the tool that runs it. Includes the Tier-1 build addendum for Workloft Model Risk Audit (WMRA)…
How do we cryptographically evidence which agent took which action, with which mandate, in a way that satisfies SS1/23 model risk requirements at the level of an asset manager's Risk function and an audit committee?
SUBSTRATE · AP2 · AUDIT CHAIN
When does the argument between an executor LLM and a reviewer LLM end? Who decides — and how do we evidence that decision policy itself to a regulator? ARIS exposes the orchestrator as configurable; we're working out the policy shape that survives an FCA conversation.
AGENT INFRA · GOVERNANCE
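
One candidate shape for that decision policy, sketched under loud assumptions: the argument ends when the reviewer accepts, or at a hard round cap that escalates to a human, and every continue/stop decision is written down with its reason. This is not the ARIS orchestrator's API. `run_debate`, `Decision`, the `"accept"`/`"revise"` verdicts and the callable signatures are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    round: int
    verdict: str  # what the reviewer said this round (assumed: "accept" or "revise")
    reason: str   # why the orchestrator continued or stopped


def run_debate(task, executor, reviewer, max_rounds=3):
    """Run executor/reviewer rounds and return (draft, outcome, decision log).

    executor(task, feedback) -> draft                 (hypothetical signature)
    reviewer(task, draft)    -> (verdict, feedback)   (hypothetical signature)
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    log = []
    feedback = ""
    for rnd in range(1, max_rounds + 1):
        draft = executor(task, feedback)
        verdict, feedback = reviewer(task, draft)
        if verdict == "accept":
            log.append(Decision(rnd, verdict, "stopped: reviewer accepted"))
            return draft, "accepted", log
        if rnd == max_rounds:
            # The cap guarantees termination; an unresolved argument goes to a human.
            log.append(Decision(rnd, verdict, "stopped: round cap reached, escalate to human"))
            return draft, "escalated", log
        log.append(Decision(rnd, verdict, "continued: reviewer requested revision"))
```

The decision log is the part a regulator would read: it records who ended the argument and on what rule, separately from the drafts themselves.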
A single append-only audit log serving N regulated entities, where no client's metadata, timing patterns, or call shapes can be inferred from another's. Hash trees alone aren't enough — the side-channels matter.
AUDIT · MULTI-TENANT · DPA
When a redaction model says "I'm 87% confident this string is a name," what does the 87% mean to a Local Authority's DPO? We need calibration that transforms model uncertainty into a regulator-readable signal — and a triage workflow that surfaces low-confidence cases for human review without drowning the team.
DSAR · CALIBRATION · UK GDPR
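
One way to make "87%" readable is a reliability table: of the strings the model scored in a given band, what fraction did a human reviewer confirm as names? The sketch below computes that table and the standard expected calibration error over equal-width bins, then splits new cases into three queues. The function names and both thresholds (`keep_below`, `redact_above`) are assumptions for illustration, not values we have validated.

```python
def reliability_table(scores, is_name, n_bins=10):
    """Rows of (bin_low, bin_high, count, mean_score, observed_rate) for non-empty bins.

    scores:  model probabilities that a string is a name, in [0, 1]
    is_name: 1 if a human reviewer confirmed a name, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, is_name):
        bins[min(int(s * n_bins), n_bins - 1)].append((s, y))
    rows = []
    for i, b in enumerate(bins):
        if b:
            mean_score = sum(s for s, _ in b) / len(b)
            observed = sum(y for _, y in b) / len(b)
            rows.append((i / n_bins, (i + 1) / n_bins, len(b), mean_score, observed))
    return rows


def expected_calibration_error(scores, is_name, n_bins=10):
    """Count-weighted average gap between stated confidence and observed rate."""
    total = len(scores)
    return sum(n / total * abs(mean - rate)
               for _, _, n, mean, rate in reliability_table(scores, is_name, n_bins))


def triage(items, keep_below=0.05, redact_above=0.95):
    """Split (text, score) pairs into redact / keep / human-review queues.

    Both thresholds are hypothetical; in practice they would be set from the
    reliability table and the review team's capacity.
    """
    queues = {"redact": [], "review": [], "keep": []}
    for text, score in items:
        if score >= redact_above:
            queues["redact"].append(text)
        elif score <= keep_below:
            queues["keep"].append(text)
        else:
            queues["review"].append(text)
    return queues
```

If four strings are each scored 0.87 and three turn out to be names, the table row reads "scored about 87%, right 75% of the time", which is a statement a DPO can act on.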
When an agent's reasoning chain crosses jurisdictional lines (UK → US → back), what disclosure does it trigger under EU AI Act, UK DPA 2018, and emerging FCA expectations on operational resilience? Ruby (our model router) makes the routing visible — the question is what disclosures it should mint automatically.
SOVEREIGNTY · ROUTING · EU AI ACT
AP2 V0.1 ships did:web identity + eddsa-jcs signing for individual mandates. What's the rotation, revocation and historic-verification model when a regulated firm needs to answer "did this agent have authority to act on this mandate as of 14:32 UTC last Tuesday?" — six months after the action?
AP2 · KEY ROTATION · PROVENANCE
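
The historic-verification half of that question reduces to an as-of lookup: given a key history, was this key in force for this agent at time T? The sketch below shows only that lookup, with an assumed record shape (`KeyRecord` and its fields are hypothetical, not part of AP2). It does not verify signatures, and it presumes the action's timestamp comes from a source the verifier trusts.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class KeyRecord:
    """Hypothetical key-history entry; field names are illustrative only."""
    key_id: str
    agent: str
    valid_from: datetime
    valid_until: Optional[datetime] = None  # set when the key is rotated out
    revoked_at: Optional[datetime] = None   # set if the key is revoked


def key_was_valid(records, agent, key_id, at):
    """True if key_id was in force for agent at time `at` (timezone-aware)."""
    for rec in records:
        if rec.agent != agent or rec.key_id != key_id:
            continue
        if at < rec.valid_from:
            return False
        if rec.valid_until is not None and at >= rec.valid_until:
            return False
        # Assumption: revocation only applies from revoked_at onward. A compromise
        # policy might instead invalidate earlier signatures as well.
        if rec.revoked_at is not None and at >= rec.revoked_at:
            return False
        return True
    return False  # unknown key: deny by default
```

Answering the question six months later only works if the key history itself is append-only and retained, which is why rotation and the audit chain are one problem, not two.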
All four major agentic coding CLIs default to project-trust on a single Enter keypress (TrustFall, Adversa.AI 2026). What does a default-deny posture look like for an FCA SS1/23 §3.4 supply-chain control or an ICO DPIA, and which of the four can be configured into it without breaking the productivity that justified adopting them?
SUPPLY CHAIN · MCP · SS1/23
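
One small, checkable piece of a default-deny posture is a pre-flight scan for the two project-level files named in our own TrustFall audit: `.mcp.json` and `.claude/settings.json`. The sketch below looks only at each repository's top level and only for those two paths. The function name is hypothetical, and a real control would need to cover whatever files each CLI actually honours.

```python
from pathlib import Path

# The two project-level files named in the audit; other CLIs may use other paths.
TRUST_FILES = (".mcp.json", ".claude/settings.json")


def find_project_trust_files(repo_roots):
    """Map each repo root to the trust-relevant files present at its top level.

    Repos with no findings are omitted, so an empty dict means a clean scan.
    """
    findings = {}
    for root in repo_roots:
        root = Path(root)
        hits = [name for name in TRUST_FILES if (root / name).is_file()]
        if hits:
            findings[str(root)] = hits
    return findings
```

Run before opening an unvetted clone in an agentic CLI, a non-empty result is a reason to stop and review rather than press Enter.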
Walt scans cs.AI + cs.LG + stat.ML on arXiv every morning. Cross-references Hugging Face Daily Papers for community-vetted picks.
Gemini 2.5 Flash scores each paper against Workloft's research axes: agent infra, MCP, RAG, agentic RL, vision-bulk, governance.
Top picks go through Semantic Scholar — citation count + influential citations. Distinguishes academic gravity from Twitter buzz.
Weekly Workloft Research Note: one paper, 1,000 words, opinion held strongly, defended in public. Forcing function for depth.
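
The citation step in the pipeline above can be sketched as a simple ranking. `citationCount` and `influentialCitationCount` are field names the Semantic Scholar Graph API returns for a paper. Everything else here is an assumption: the function names, the weight of 5.0 on influential citations, and the linear score itself are illustrative, not Walt's actual scoring.

```python
def gravity(paper, influential_weight=5.0):
    """Hypothetical 'academic gravity' score from Semantic Scholar paper fields."""
    return (paper.get("citationCount", 0)
            + influential_weight * paper.get("influentialCitationCount", 0))


def shortlist(papers, top_n=3, influential_weight=5.0):
    """Return the top_n papers by gravity, highest first."""
    ranked = sorted(papers, key=lambda p: gravity(p, influential_weight), reverse=True)
    return ranked[:top_n]


# Made-up records for illustration; these are not real papers or real counts.
example = [
    {"title": "A", "citationCount": 100, "influentialCitationCount": 2},
    {"title": "B", "citationCount": 40, "influentialCitationCount": 20},
    {"title": "C", "citationCount": 5, "influentialCitationCount": 0},
]
top_two = [p["title"] for p in shortlist(example, top_n=2)]
```

Weighting influential citations above raw counts is the design choice that separates a paper others build on from one that is merely mentioned.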