One paper.
One regulated lens.
Long-form essays from Workloft Labs. Each note takes a single substrate-relevant paper, frames it through the lens of an FCA-regulated asset manager or a UK Local Authority, and says something a buyer can actually use. Published openly. Updated weekly.
We Tried To Hand Our Paper Backlog To A Robot
A new tool turns any paper into running code by changing one word in the URL. We pointed it at our backlog of thirty. It did the boring 80%, then stopped at the part that mattered, so we built the cheap gate that decides which papers are worth our time at all.
Read note №54 →
Synthetic Tasks Have No Provenance, And That Is The Audit Problem
A synthesis engine generates terminal-agent training tasks via a capability taxonomy. The substrate problem: synthetic data with no provenance fails regulated audit.
Read note №53 →
Shared Memory Is the Multi-Principal Problem Nobody Costed
Memory agents break in shared institutional use because access control and forgetting are scoped per principal, not per system. The substrate-level read for regulated buyers.
Read note №50 →
Local Models Don't Save You Money
We did the maths on running a frontier model ourselves. The fleet costs about a dollar a day, a £10k box pays back in a decade, and big models run slow by physics. Local is a sovereignty purchase, not a cheaper one.
Read note №49 →
Tune the Query to the Retriever
Most stacks pick an embedding model and feed it whatever the agent typed. A new paper shows different retrievers want different question styles, and the cheap half of that lesson is a prompt, not a training run.
Read note №48 →
A 1T Open Coding Model Dropped. Our Agent Could Not Reach It.
Kimi K2.7 Code is open-weights, MIT-licensed, and roughly 8x cheaper than the model that runs our agent. Our zero-data-retention line 404'd every endpoint, and a trillion parameters is too big to self-host. Open is a permission, not a delivery.
Read note №47 →
The Ledger Belongs Outside the Prompt
LEDGERAGENT moves customer-service agent task state into a separate ledger. For regulated buyers, that externalised record is the audit substrate the prompt could never give you.
Read note №46 →
The Fleet Reaches Home
A small owned machine on a private mesh became our fastest sovereign inference tier, with automatic fallback to the cloud. The pattern, the honest ceiling, and the future of owned edge nodes in an agent fleet.
Read note №45 →
We Rebuilt Our Site to Be Read by Machines, Not Google
Generative engine optimisation in practice: how a technical brand makes its site legible and citable to ChatGPT, Perplexity and AI Overviews, and what is theatre.
Read note №44 →
Agents Don't Need to Be Evil, Just Chatty
A dealer chatbot and AI legal briefs failed the same way: outbound text with no verifier in front of send(). The boring channel beats the clever attack.
Read note №43 →
Verify Only the Answers You Doubt
Selective verification and FAPO both say the same thing: attribute effort to where it changes the outcome, do not spread it evenly. We shipped it into our gate.
Read note №42 →
Seven Agents Fact-Checked What One Cheap Call Just Guessed
We rebuilt our hand-rolled classifier with the native multi-agent feature. It checked the facts, refused to pad its picks, and cost two orders of magnitude more. Here is when that trade is right.
Read note №41 →
Four Cheap Models Shipped This Month, Our Gateway Refused Every One
We lined up the month's new budget models to benchmark them. Our gateway refused four of five on data policy before we ran a single task. The benchmark was the routing.
Read note №40 →
Claim Drift Is the Audit Problem Nobody Named
Xcientist externalises research synthesis into inspectable artifacts and names claim drift. The same gap sits under every regulated agent deployment.
Read note №39 →
When an Agent Rewrites and Approves Its Own Harness, You Have Removed the Reviewer
Self-Harness lets an LLM diagnose its own failures, edit its own scaffolding, and accept the change after a regression test it set itself. A real capability gain, with the sign-off step quietly deleted.
Read note №38 →
A Guardrail Refused Our Model Upgrade — and That Is the Control Working
We tried to route to the #2 frontend model on the public leaderboard. Our zero-data-retention policy returned a 404 before a human could be tempted. The refusal is the feature.
Read note №37 →
Prompt-Level Distillation and the Audit Gap Nobody Costed
Prompt-level distillation moves reasoning patterns from teacher to student models. For regulated buyers, it quietly relocates the audit boundary. Here is the cost.
Read note №36 →
Cache Continuity Is an Audit Problem, Not a Cost Problem
TokenPilot cuts agent inference costs by up to 87% by keeping prompt prefixes stable. The substrate take: prefix stability is also a reproducibility and audit primitive.
Read note №35 →
The Harness Is the Control Surface Nobody Audits
HarnessX evolves agent runtime interfaces from execution traces. We argue the harness, not the model, is the unaudited control surface regulated buyers must govern.
Read note №34 →
When Agents Stop Talking: KV-Cache Communication and the Audit Hole It Opens
KV-cache communication between heterogeneous agents beats text on cost and performance. But it removes the human-readable transcript regulators rely on. The substrate take.
Read note №32 →
Who Is Worth 10× the Token Budget?
The industry admits it cannot tell which spend deserves 10× the budget. Our fleet's 30-day audit ledger suggests the question is wrong: meter task classes, not people.
Read note №31 →
The Action Interface Is the Audit Surface
SpatialClaw uses a stateful Python kernel as the agent action interface, beating structured tool calls by 11.2 points. What that means for agent auditability.
Read note №30 →
Agents Need Environment Contracts, Not More Sandboxes
Li et al.’s survey shows why agent reliability depends on engineered environments: state, tools, synthesis, evaluation, contracts, and audit evidence.
Read note №29 →
The Missing Middle: What Apodex 1.0 Verifies
Apodex 1.0 ships a verifier sub-agent inside the team, with an evidence graph the answer cannot bypass. That is the audit layer mandate-bounded stacks have been missing, and the release is Apache 2.0.
Read note №28 →
The Intent Debt: The Audit Liability Agentic Stacks Don't Count
Production agent stacks count completed work, not signed intents. AP2 already gave us the cryptographic primitive (IntentMandate vs CartMandate) to make the gap auditable. Most teams use only half of it, and the missing half is what gets you fined.
Read note №27 →
Cold-Start Scores Are Lying to You: What OmniGameArena's Improvement Curves Mean for Agent Audit
OmniGameArena measures how VLM agents improve across reflection rounds, not just first-attempt scores. For regulated buyers, that's the audit observable nobody tracks.
Read note №26 →
Self-Improving Agents Need a Guardian, Not a Logbook
A self-improving AI framework updates both weights and agent architecture via an LM feedback agent. For regulated buyers, the real problem is who controls the change boundary.
Read note №25 →
The Four-Agent Question Every System-Design Card Gets Wrong
A popular card asks you to pick one orchestration pattern for a Planner, Researcher, Coder and Reviewer: central orchestrator, event-bus choreography, DAG, or supervisor. It reads as multiple choice with one right answer. It is not. Three of the options answer "how does work flow" (topology) and the fourth answers "who handles failure" (control), and those are orthogonal. The honest design is a small DAG with a supervisor over it, which is roughly what our own fleet runs. The quadrants are a teaching fiction, useful for a card and misleading for a build.
Read note №24 →
We Scanned Our Own Agent Fleet. The Clean Result Is the Boring Part.
Perplexity open-sourced Bumblebee, a read-only supply-chain scanner. We pointed it at 18,772 components across our agent VPS and matched zero against six campaign catalogues — Shai-Hulud, AntV, node-ipc, GemStuffer and friends. The green tick is the least interesting output. The 18,772-line inventory it produced as a by-product is the artefact, because the safeguard is not the scan, it is keeping the bill of materials and re-diffing it against live threat intelligence in seconds. Read-only by design matters too: a scanner that invoked the package manager would risk tripping the very postinstall worms it hunts.
Read note №23 →
Measure Before You Tune
The tuning urge gets ahead of the measurement. The two-level autoresearch framework (arXiv 2605.30003) says the outer loop (do my policies even predict outcomes?) must run before the inner loop (re-prompt them). Tonight we wired the outer loop on Walt's HF paper scorer. Per-axis conversion against Gary outcomes is now legible. The inner loop stays parked on the GEPA-vs-MIPROv2 decision.
Read note №15 →
Trajectories Write Tests
PhoneWorld's design point is not the mobile GUI part. It is the architecture: real trajectories yield both controllable environments and auto-generated verifiers. The substrate move is to let production usage write the test suite as a side effect. We lifted that pattern onto our audit log and shipped it as Ship №20 the same night.
Read note №14 →
Shared Search Memory Is the Agent Cost Control
Collaborative Parallel Thinking treats repeated inference as an infrastructure fault, not a model weakness. For regulated buyers it points at the missing control surface in parallel test-time search: the shared inference state that sits between branches. Rollout budgets need memory contracts.
Read note №13 →
Stop Teaching Agents the Whole Transcript
Failure-relevant distillation trims teacher trajectories to the actions that mattered to the failure, then trains the student on those. The procurement question is no longer "what did the teacher do well" but "what did the student need to see to not repeat the teacher's mistakes". Smaller, sharper, auditable.
Read note №12 →
Can a 26M-parameter model call your tools?
Cactus Compute's Needle, distilled from Gemini 3.1, against five real Workloft tool schemas and 50 hand-labelled queries. 68 per cent overall tool-match, median 2.36s latency. Per-schema the pattern is clean: otto_changelog 100%, maggie_reply 80%, bob_skill 60%, gary_inbox 50%, hindsight 50%. Tool-calling capability is at least two capabilities — dispatch on clean schemas, judgement on ambiguous ones. A 26M-parameter model can do the first to a surprising degree. It cannot do the second yet.
Read note №11 →
Interop is no longer the moat
A2A v1.0 just crossed 150 organisations and one year inside the Linux Foundation. Agent-to-agent interoperability is officially commodity. For sovereign-first agent stacks, that is good news and a forcing function: the differentiator moves up to verifiability and governance, which is the layer A2A explicitly does not touch. The procurement question for 2026 stops being "can it interoperate" and starts being "can the counterparty cryptographically prove who took what action and when".
Read note №10 →
Your audit log is training data
Agent Context Compilation (arXiv:2605.21850), applied to a production audit log rather than a synthetic benchmark. We compiled 25 of our own multi-turn agent sessions into 102 grounded long-context QA pairs for $0.0132 of compute. The audit log every regulated buyer already has to keep is also a private evaluation set and a source of cheap supervision for the local models running their sovereign workloads. The substrate point is that the data source matters more than the algorithm. The tool is released under the MIT licence.
Read note №09 →
The Boundary Is the Product
Srinivasan's paper formalises the stochastic-deterministic boundary as a four-part object — proposer, verifier, commit, reject. Most production agents shipped in the last eighteen months have all four as accidents. Naming the contract is the move that turns those accidents into auditable surface, and the lasting contribution is the SDB itself: separation of concerns becomes enforceable. Replay divergence is the failure mode nobody is logging for, and the next two years of reliability gains will come from better boundaries, not better models.
Read note №08 →
Visual agents need skill packages, not longer prompts
Most visual-agent papers are read as perception papers. The more interesting move in arXiv:2605.13527 is treating procedural knowledge as an external, multimodal object — text, state cards and visual keyframes packaged as a named artefact. For FCA-regulated firms and UK Local Authorities, that is the missing middle layer between raw model capability and enterprise workflow control: a skill registry that can be signed, scoped, monitored and withdrawn.
Read note №07 →
Memory Is Substrate, Not a Feature: What PersonalAI 2.0 Gets Right About Agent Recall
Most production agent teams treat memory as a vector retrieval problem. The question that ends a regulated pilot is not "is the answer good", it is "why did the agent recall that, and not this". PersonalAI 2.0 implicitly rejects the vector-only framing and treats memory as a knowledge graph with a planned, judge-scored traversal. The accuracy gap was never the bottleneck for regulated deployment. The explainability gap was.
Read note №06 →
Pre-send verification: when an agent speaks for the firm, "the model was careful" is not a control
Outbound agents are now firm-of-record speakers. The producer model that drafts the message cannot also be the control that approves it. A four-axis guardian — two deterministic, two semantic — sitting in front of send() is the substrate pattern that survives an audit. Includes the architecture we shipped on 9 May 2026 and the buyer-side procurement question.
TrustFall and the procurement question for any regulated buyer adopting agentic coding tools
All four major agentic coding CLIs spawn unsandboxed MCP servers from a malicious repo on a single Enter keypress. Read through the regulated-buyer lens, this is a procurement question — not a developer-hygiene one. Includes the Workloft exposure audit and the buyer-side controls.
Read note №04 →
Direct corpus interaction: the GDPR-shaped retrieval pattern that was hiding in plain sight
Li et al. propose agents skip embedding retrieval entirely and read raw corpora with grep, cat and find. Read through the UK GDPR lens, embedding stores look like a data-protection liability that a tool-use agent already knows how to avoid. Includes the civiclaw module we shipped with it.
Read note №03 →
When no benchmark exists: the methodology your Risk function was already going to need
What to do when the buyer asks "is it safe?" and there is no published benchmark to point to. A regulated-lens essay on building defensible measurement when the literature hasn't caught up.
Read note №02 →
ARIS: the executor-reviewer pattern that the regulated AM was already going to need
ARIS proposes an executor-reviewer agent split. Read through the FCA SS1/23 lens, it's not a research curiosity; it's the pattern asset managers were always going to be forced into. Why the architecture matters more than the benchmark.
Read note №01 →