▸ WORKLOFT RESEARCH NOTES

One paper.
One regulated lens.

~1,000 words. Strong opinions, weakly held.

Long-form essays from Workloft Labs. Each note takes a single substrate-relevant paper, frames it through the lens of an FCA-regulated asset manager or a UK Local Authority, and says something a buyer can actually use. Published openly. Updated weekly.

No. 54 · 25 Jun 2026 · PaperReproduction · AgentInfrastructure · Triage · ResearchOps

We Tried To Hand Our Paper Backlog To A Robot

A new tool turns any paper into running code by changing one word in the URL. We pointed it at our backlog of thirty. It did the boring 80%, then stopped at the part that mattered, so we built the cheap gate that decides which papers are worth our time at all.

Read note №54 →

No. 51 · 23 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · DataGovernance · AIAct · ModelRisk

Synthetic Tasks Have No Provenance, And That Is The Audit Problem

A synthesis engine generates terminal-agent training tasks via a capability taxonomy. The substrate problem: synthetic data with no provenance fails regulated audit.

Read note №53 →

No. 50 · 22 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · DataProtection · UKGDPR

Shared Memory Is the Multi-Principal Problem Nobody Costed

Memory agents break in shared institutional use because access control and forgetting are scoped per principal, not per system. The substrate-level read for regulated buyers.

Read note №50 →

No. 49 · 21 Jun 2026 · Economics · LocalLLM · Sovereignty · Cost · Substrate

Local Models Don't Save You Money

We did the maths on running a frontier model ourselves. The fleet costs about a dollar a day, a £10k box pays back in a decade, and big models run slow by physics. Local is a sovereignty purchase, not a cheaper one.

Read note №49 →

No. 48 · 21 Jun 2026 · Retrieval · RAG · QueryFormulation · Memory · Substrate

Tune the Query to the Retriever

Most stacks pick an embedding model and feed it whatever the agent typed. A new paper shows different retrievers want different question styles, and the cheap half of that lesson is a prompt, not a training run.

Read note №48 →

No. 47 · 21 Jun 2026 · OpenWeights · ModelRouting · Sovereignty · DataPolicy · Substrate

A 1T Open Coding Model Dropped. Our Agent Could Not Reach It.

Kimi K2.7 Code is open-weights, MIT-licensed, and roughly 8x cheaper than the model that runs our agent. Our zero-data-retention line 404'd every endpoint, and a trillion parameters is too big to self-host. Open is a permission, not a delivery.

Read note №47 →

No. 46 · 20 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · AIGovernance · FCA

The Ledger Belongs Outside the Prompt

LEDGERAGENT moves customer-service agent task state into a separate ledger. For regulated buyers, that externalised record is the audit substrate the prompt could never give you.

Read note №46 →

No. 45 · 19 Jun 2026 · Sovereignty · Infrastructure · Substrate

The Fleet Reaches Home

A small owned machine on a private mesh became our fastest sovereign inference tier, with automatic fallback to the cloud. The pattern, the honest ceiling, and the future of owned edge nodes in an agent fleet.

Read note №45 →

No. 44 · 19 Jun 2026 · GEO · Distribution · Substrate

We Rebuilt Our Site to Be Read by Machines, Not Google

Generative engine optimisation in practice: how a technical brand makes its site legible and citable to ChatGPT, Perplexity and AI Overviews, and what is theatre.

Read note №44 →

No. 43 · 19 Jun 2026 · Agent Infrastructure · Verification · Substrate

Agents Don't Need to Be Evil, Just Chatty

A dealer chatbot and AI legal briefs failed the same way: outbound text with no verifier in front of send(). The boring channel beats the clever attack.

Read note №43 →

No. 42 · 19 Jun 2026 · Agent Infrastructure · Verification · Substrate

Verify Only the Answers You Doubt

Selective verification and FAPO both say the same thing: allocate effort where it changes the outcome, do not spread it evenly. We shipped it into our gate.

Read note №42 →

No. 41 · 18 Jun 2026 · MultiAgent · DynamicWorkflows · CostVsRigour · AISubstrate · BuildingWithLLMs

Seven Agents Fact-Checked What One Cheap Call Just Guessed

We rebuilt our hand-rolled classifier with the native multi-agent feature. It checked the facts, refused to pad its picks, and cost two orders of magnitude more. Here is when that trade is right.

Read note №41 →

No. 40 · 18 Jun 2026 · ModelRouting · DataPolicy · OpenWeights · AISubstrate · BuildingWithLLMs

Four Cheap Models Shipped This Month, Our Gateway Refused Every One

We lined up the month's new budget models to benchmark them. Our gateway refused four of five on data policy before we ran a single task. The benchmark was the routing.

Read note №40 →

No. 39 · 18 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · ModelRisk · AIAct

Claim Drift Is the Audit Problem Nobody Named

Xcientist externalises research synthesis into inspectable artifacts and names claim drift. The same gap sits under every regulated agent deployment.

Read note №39 →

No. 38 · 18 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · ChangeManagement · ModelRisk

When an Agent Rewrites and Approves Its Own Harness, You Have Removed the Reviewer

Self-Harness lets an LLM diagnose its own failures, edit its own scaffolding, and accept the change after a regression test it set itself. A real capability gain, with the sign-off step quietly deleted.

Read note №38 →

No. 37 · 17 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · DataResidency · UKGDPR

A Guardrail Refused Our Model Upgrade — and That Is the Control Working

We tried to route to the #2 frontend model on the public leaderboard. Our zero-data-retention policy returned a 404 before a human could be tempted. The refusal is the feature.

Read note №37 →

No. 36 · 17 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · ModelRisk · AIGovernance

Prompt-Level Distillation and the Audit Gap Nobody Costed

Prompt-level distillation moves reasoning patterns from teacher to student models. For regulated buyers, it quietly relocates the audit boundary. Here is the cost.

Read note №36 →

No. 35 · 16 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · AIGovernance · UKGDPR

Cache Continuity Is an Audit Problem, Not a Cost Problem

TokenPilot cuts agent inference costs by up to 87% by keeping prompt prefixes stable. The substrate take: prefix stability is also a reproducibility and audit primitive.

Read note №35 →

No. 34 · 15 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · ModelRisk · AIGovernance

The Harness Is the Control Surface Nobody Audits

HarnessX evolves agent runtime interfaces from execution traces. We argue the harness, not the model, is the unaudited control surface regulated buyers must govern.

Read note №34 →

No. 32 · 13 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · ExplainableAI · ModelRisk

When Agents Stop Talking: KV-Cache Communication and the Audit Hole It Opens

KV-cache communication between heterogeneous agents beats text on cost and performance. But it removes the human-readable transcript regulators rely on. The substrate take.

Read note №32 →

No. 31 · 12 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · AIAudit · SpendGovernance

Who Is Worth 10× the Token Budget?

The industry admits it cannot tell which spend deserves 10× the budget. Our fleet's 30-day audit ledger suggests the question is wrong: meter task classes, not people.

Read note №31 →

No. 30 · 12 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · ModelRisk · AIGovernance

The Action Interface Is the Audit Surface

SpatialClaw uses a stateful Python kernel as the agent action interface, beating structured tool calls by 11.2 points. What that means for agent auditability.

Read note №30 →

No. 29 · 11 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · ModelRisk · AIGovernance

Agents Need Environment Contracts, Not More Sandboxes

Li et al.’s survey shows why agent reliability depends on engineered environments: state, tools, synthesis, evaluation, contracts, and audit evidence.

Read note №29 →

No. 28 · 11 Jun 2026 · RegulatedAI · AISubstrate · DeepResearch · AIVerification · AIAudit · OpenWeights

The Missing Middle: What Apodex 1.0 Verifies

Apodex 1.0 ships a verifier sub-agent inside the team, with an evidence graph the answer cannot bypass. That is the audit layer mandate-bounded stacks have been missing, and the release is Apache 2.0.

Read note №28 →

No. 27 · 10 Jun 2026 · RegulatedAI · AISubstrate · AP2 · DSAR · AIGovernance · AIAudit

The Intent Debt: The Audit Liability Agentic Stacks Don't Count

Production agent stacks count completed work, not signed intents. AP2 already gave us the cryptographic primitive (IntentMandate vs CartMandate) to make the gap auditable. Most teams use only half of it, and the missing half is what gets you fined.

Read note №27 →

No. 26 · 9 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · AIGovernance · AIAudit

Cold-Start Scores Are Lying to You: What OmniGameArena's Improvement Curves Mean for Agent Audit

OmniGameArena measures how VLM agents improve across reflection rounds, not just first-attempt scores. For regulated buyers, that's the audit observable nobody tracks.

Read note №26 →

No. 25 · 8 Jun 2026 · RegulatedAI · AISubstrate · AgentInfrastructure · ModelGovernance · ICO

Self-Improving Agents Need a Guardian, Not a Logbook

A self-improving AI framework updates both weights and agent architecture via an LM feedback agent. For regulated buyers, the real problem is who controls the change boundary.

Read note №25 →

No. 24 · 6 Jun 2026 · Agent Architecture · Orchestration · Topology vs Control

The Four-Agent Question Every System-Design Card Gets Wrong

A popular card asks you to pick one orchestration pattern for a Planner, Researcher, Coder and Reviewer: central orchestrator, event-bus choreography, DAG, or supervisor. It reads as multiple choice with one right answer. It is not. Three of the options answer "how does work flow" (topology) and the fourth answers "who handles failure" (control), and those are orthogonal. The honest design is a small DAG with a supervisor over it, which is roughly what our own fleet runs. The quadrants are a teaching fiction, useful for a card and misleading for a build.

Read note №24 →

No. 23 · 6 Jun 2026 · Supply-Chain · Read-Only Scanning · UK GDPR Art.32 · NCSC

We Scanned Our Own Agent Fleet. The Clean Result Is the Boring Part.

Perplexity open-sourced Bumblebee, a read-only supply-chain scanner. We pointed it at 18,772 components across our agent VPS and matched zero against six campaign catalogues — Shai-Hulud, AntV, node-ipc, GemStuffer and friends. The green tick is the least interesting output. The 18,772-line inventory it produced as a by-product is the artefact, because the safeguard is not the scan, it is keeping the bill of materials and re-diffing it against live threat intelligence in seconds. Read-only by design matters too: a scanner that invoked the package manager would risk tripping the very postinstall worms it hunts.

Read note №23 →

No. 15 · 29 May 2026 · Two-Level Autoresearch · Measurement-Before-Tuning · FCA SS1/23 · GDS AI Playbook

Measure Before You Tune

The tuning urge gets ahead of the measurement. The two-level autoresearch framework (arXiv 2605.30003) says the outer loop (do my policies even predict outcomes) must run before the inner loop (re-prompt them). Tonight we wired the outer loop on Walt's HF paper scorer. Per-axis conversion against Gary outcomes is now legible. The inner loop stays parked on the GEPA-vs-MIPROv2 decision.

Read note №15 →

No. 14 · 29 May 2026 · Agent Infrastructure · Auto-Generated Verifiers · ICO AI Guidance · FCA SS1/23

Trajectories Write Tests

PhoneWorld's design point is not the mobile GUI part. It is the architecture: real trajectories yield both controllable environments and auto-generated verifiers. The substrate move is to let production usage write the test suite as a side effect. We lifted that pattern onto our audit log and shipped it as Ship №20 the same night.

Read note №14 →

No. 13 · 27 May 2026 · Test-Time Scaling · FCA SS1/23 · ICO AI Guidance · UK GDPR Art.5

Shared Search Memory Is the Agent Cost Control

Collaborative Parallel Thinking treats repeated inference as an infrastructure fault, not a model weakness. For regulated buyers it points at the missing control surface in parallel test-time search: the shared inference state that sits between branches. Rollout budgets need memory contracts.

Read note №13 →

No. 12 · 25 May 2026 · Distillation · Failure-Driven Training · Agent Evaluation

Stop Teaching Agents the Whole Transcript

Failure-relevant distillation trims teacher trajectories to the actions that mattered to the failure, then trains the student on those. The procurement question is no longer "what did the teacher do well" but "what did the student need to see to not repeat the teacher's mistakes". Smaller, sharper, auditable.

Read note №12 →

No. 11 · 23 May 2026 · Small Models · Tool Calling · On-Device Inference

Can a 26M-parameter model call your tools?

Cactus Compute's Needle, distilled from Gemini 3.1, against five real Workloft tool schemas and 50 hand-labelled queries. 68 per cent overall tool-match, median 2.36s latency. Per-schema the pattern is clean: otto_changelog 100%, maggie_reply 80%, bob_skill 60%, gary_inbox 50%, hindsight 50%. Tool-calling capability is at least two capabilities — dispatch on clean schemas, judgement on ambiguous ones. A 26M-parameter model can do the first to a surprising degree. It cannot do the second yet.

Read note №11 →

No. 10 · 22 May 2026 · A2A v1.0 · Agent Infrastructure · FCA SS1/23 · EU AI Act Art.13

Interop is no longer the moat

A2A v1.0 just crossed 150 organisations and one year inside the Linux Foundation. Agent-to-agent interoperability is officially commodity. For sovereign-first agent stacks, that is good news and a forcing function: the differentiator moves up to verifiability and governance, which is the layer A2A explicitly does not touch. The procurement question for 2026 stops being "can it interoperate" and starts being "can the counterparty cryptographically prove who took what action and when".

Read note №10 →

No. 09 · 22 May 2026 · Long-Context · Agent Infrastructure · FCA SS1/23 · UK GDPR Art.5(2)

Your audit log is training data

Agent Context Compilation (arXiv:2605.21850), applied to a production audit log rather than a synthetic benchmark. We compiled 25 of our own multi-turn agent sessions into 102 grounded long-context QA pairs for $0.0132 of compute. The audit log every regulated buyer already has to keep is also a private evaluation set and a source of cheap supervision for the local models running their sovereign workloads. The substrate point is that the data source matters more than the algorithm. Tool open under MIT.

Read note №09 →

No. 08 · 20 May 2026 · Regulated AI · Agent Infrastructure · FCA SS1/23 · ICO §11

The Boundary Is the Product

Srinivasan's paper formalises the stochastic-deterministic boundary as a four-part object — proposer, verifier, commit, reject. Most production agents shipped in the last eighteen months have all four as accidents. Naming the contract is the move that turns those accidents into auditable surface, and the lasting contribution is the SDB itself: separation of concerns becomes enforceable. Replay divergence is the failure mode nobody is logging for, and the next two years of reliability gains will come from better boundaries, not better models.

Read note №08 →

No. 07 · 18 May 2026 · Regulated AI · Visual Agents · Skill Packages · Substrate

Visual agents need skill packages, not longer prompts

Most visual-agent papers are read as perception papers. The more interesting move in arXiv:2605.13527 is treating procedural knowledge as an external, multimodal object — text, state cards and visual keyframes packaged as a named artefact. For FCA-regulated firms and UK Local Authorities, that is the missing middle layer between raw model capability and enterprise workflow control: a skill registry that can be signed, scoped, monitored and withdrawn.

Read note №07 →

No. 06 · 14 May 2026 · Regulated AI · Graph-RAG · Audit · ICO

Memory Is Substrate, Not a Feature: What PersonalAI 2.0 Gets Right About Agent Recall

Most production agent teams treat memory as a vector retrieval problem. The question that ends a regulated pilot is not "is the answer good", it is "why did the agent recall that, and not this". PersonalAI 2.0 implicitly rejects the vector-only framing and models memory as a knowledge graph with a planned, judge-scored traversal. The accuracy gap was never the bottleneck for regulated deployment. The explainability gap was.

Read note №06 →

No. 05 · 9 May 2026 · FCA SS1/23 · ICO §11 · NCSC Secure-AI

Pre-send verification: when an agent speaks for the firm, "the model was careful" is not a control

Outbound agents are now firm-of-record speakers. The producer model that drafts the message cannot also be the control that approves it. A four-axis guardian — two deterministic, two semantic — sitting in front of send() is the substrate pattern that survives an audit. Includes the architecture we shipped on 9 May 2026 and the buyer-side procurement question.

Read note №05 →

No. 04 · 9 May 2026 · Procurement · FCA SS1/23 · ICO DPIA

TrustFall and the procurement question for any regulated buyer adopting agentic coding tools

All four major agentic coding CLIs spawn unsandboxed MCP servers from a malicious repo on a single Enter keypress. Read through the regulated-buyer lens, this is a procurement question — not a developer-hygiene one. Includes the Workloft exposure audit and the buyer-side controls.

Read note №04 →

No. 03 · 9 May 2026 · UK GDPR · UK Local Authorities · civiclaw

Direct corpus interaction: the GDPR-shaped retrieval pattern that was hiding in plain sight

Li et al. propose agents skip embedding retrieval entirely and read raw corpora with grep, cat and find. Read through the UK GDPR lens, embedding stores look like a data-protection liability that a tool-use agent already knows how to avoid. Includes the civiclaw module we shipped with it.

Read note №03 →

No. 02 · 8 May 2026 · UK GDPR · Risk function

When no benchmark exists: the methodology your Risk function was already going to need

What to do when the buyer asks "is it safe?" and there is no published benchmark to point to. A regulated-lens essay on building defensible measurement when the literature hasn't caught up.

Read note №02 →

No. 01 · 7 May 2026 · FCA SS1/23 · Asset managers

ARIS: the executor-reviewer pattern that the regulated AM was already going to need

ARIS proposes an executor-reviewer agent split. Read through the FCA SS1/23 lens, it is not a research curiosity; it is the pattern asset managers were always going to be forced into. Why the architecture matters more than the benchmark.

Read note №01 →
