The Ledger Belongs Outside the Prompt

§1 The state was always there, just in the wrong place

LEDGERAGENT proposes something that sounds modest and is actually the most important thing anyone building customer-service agents can do this year: it takes the task state out of the model's context window and writes it to a separate ledger. The agent reads from and writes to that ledger as it moves through a multi-step interaction, calling tools and checking policy along the way.

Read that as a benchmark story and you miss it. The interesting claim is not that policy adherence goes up. The interesting claim is where the state now lives. For everyone selling agents into FCA-regulated firms, UK Local Authorities and NHS bodies, that relocation is the whole game. The prompt has never been an audit record. The ledger can be.

§2 Why "state in the prompt" was a compliance dead end

Until now, an LLM agent's working memory of a task (what the customer asked, what the policy says, which step it is on, what it has already done) has lived inside the rolling context window. That is a terrible place to keep anything you later have to defend to a regulator.

Context is reconstructed every turn. It gets truncated, summarised, re-ordered and occasionally dropped when the window fills. When a customer disputes a decision and an FCA-regulated firm has to show, under the PRA's SS1/23 model risk expectations where those apply, exactly what the agent knew and what rule it applied at the moment of action, "it was in the context somewhere" is not an answer. You cannot diff a context window across turns. You cannot replay it deterministically. You cannot show the ICO, under its AI guidance on explainability, a clean before-and-after of the decision state.

LEDGERAGENT's separation fixes the category error. The task state becomes a first-class, addressable object that exists independently of whatever the model happened to be attending to. The producer of the action (the model) is no longer also the sole keeper of the record. That separation, producer of an action versus the durable record of it, is exactly the separation of concerns the ICO leans on in its §11 reasoning about contestable automated decisions.

§3 The substrate take: a ledger is an audit log that finally tracks decisions, not tokens

Most agent observability tooling logs the wrong layer. It captures token-level traces, prompt strings and tool-call JSON. That is forensics for an engineer debugging a flaky run. It is not a record a compliance officer can act on, or one that can answer an FOI request or an EIR disclosure, because it is shaped like the runtime, not like the decision.

An externalised task ledger is different in kind. It is structured around the decision: the policy clause invoked, the state transition, the tool action authorised, the value of the relevant fields at that moment. That is the granularity a UK Local Authority needs when a resident appeals a benefits or housing-adjacent automated outcome, and the council has to produce, under the Public Sector Equality Duty in the Equality Act 2010 and UK GDPR Article 22, a coherent account of how the decision was reached.
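
To make "structured around the decision" concrete, here is a minimal sketch of what one ledger entry might look like. It is not LEDGERAGENT's schema, which the summary does not give; every field name, state name and clause identifier below is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class LedgerEntry:
    """One decision-shaped record. All field names are hypothetical."""
    task_id: str
    step: int
    policy_clause: str   # the clause the agent relied on
    from_state: str      # task state before the transition
    to_state: str        # task state after the transition
    tool_action: str     # the tool call this entry covers
    fields: dict         # relevant field values at decision time
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Sorted keys give a stable serialisation that can be diffed.
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only; no real policy or customer data.
entry = LedgerEntry(
    task_id="T-1001",
    step=3,
    policy_clause="refunds.4.2",
    from_state="refund_requested",
    to_state="refund_eligible",
    tool_action="lookup_order",
    fields={"order_age_days": 12, "amount_gbp": 40.0},
)
print(entry.to_json())
```

The point of the shape is that two entries for the same task can be compared field by field, which a pair of context windows cannot.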

So the right way to think about LEDGERAGENT is not "a trick that improves policy adherence by a few points". It is the beginning of a runtime where the audit artefact is generated as a by-product of the agent doing its job, rather than reconstructed afterwards from logs that were never meant to bear that weight. If the ledger is the working memory, the ledger is also the evidence. Those two things being the same object is the property regulated buyers should be insisting on.

There is a design consequence here that the paper hints at but does not press. Once state is external, you can put a guard between the agent and the ledger. The agent proposes a state transition; a separate, deterministic checker validates it against policy before it is committed. The model never gets to silently mark a refund as approved. That is the same posture we have argued for elsewhere on Workloft Labs: the thing that produces an action should not be the thing that certifies it.
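
The guard described above can be sketched in a few lines. This is our illustration of the pattern, not something taken from the paper: the transition table, the refund limit and the class names are all assumptions, and a real policy checker would be considerably richer.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical policy: the only permitted state transitions,
# plus an assumed ceiling for refunds approved without a human.
ALLOWED = {
    ("refund_requested", "refund_eligible"),
    ("refund_eligible", "refund_approved"),
    ("refund_requested", "refund_denied"),
}
MAX_AUTO_REFUND_GBP = 100.0


@dataclass(frozen=True)
class Proposal:
    from_state: str
    to_state: str
    amount_gbp: float


class GuardedLedger:
    """The agent proposes; a deterministic check decides what is committed."""

    def __init__(self, initial_state: str):
        self.state = initial_state
        self.committed: List[Proposal] = []
        self.rejected: List[Tuple[Proposal, str]] = []

    def check(self, p: Proposal) -> Optional[str]:
        """Return a rejection reason, or None if the proposal is allowed."""
        if p.from_state != self.state:
            return "stale from_state"
        if (p.from_state, p.to_state) not in ALLOWED:
            return "transition not permitted"
        if p.to_state == "refund_approved" and p.amount_gbp > MAX_AUTO_REFUND_GBP:
            return "amount exceeds automatic limit"
        return None

    def propose(self, p: Proposal) -> bool:
        reason = self.check(p)
        if reason is not None:
            # Rejections are kept too: a refused attempt is also evidence.
            self.rejected.append((p, reason))
            return False
        self.committed.append(p)
        self.state = p.to_state
        return True


ledger = GuardedLedger("refund_requested")
# The agent tries to jump straight to approval; the guard refuses.
skipped = ledger.propose(Proposal("refund_requested", "refund_approved", 40.0))
# The permitted path goes through the eligibility step first.
step_one = ledger.propose(Proposal("refund_requested", "refund_eligible", 40.0))
step_two = ledger.propose(Proposal("refund_eligible", "refund_approved", 40.0))
```

Here the first proposal is rejected and recorded with its reason, and the two that follow the permitted path are committed. The model can still propose anything it likes; it simply cannot commit it.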

§4 What the paper does not solve

The ledger is only as trustworthy as the writes into it. LEDGERAGENT moves state out of the context window, but the model is still the one deciding what to write. A confident, wrong agent will write a confident, wrong entry, and now that error has a clean, durable, audit-shaped home. Externalising state does not make the state correct; it makes the state legible. Those are different properties and a regulated buyer needs both.

The paper, on the summary available, is framed around policy adherence rather than tamper-evidence. A ledger you can quietly rewrite is not an audit record. For anyone deploying this against SS1/23 or ICO expectations, the open questions are: is the ledger append-only, is it signed, who can edit it, and can you prove an entry existed at time T and was not altered. None of that is in the method as described, and all of it is the difference between a debugging aid and a defensible record.
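
As a rough illustration of what "append-only and signed" could mean in practice, the sketch below chains each entry to the hash of the previous one and tags it with an HMAC. This is our assumption about one possible design, not part of the method as described. It uses a shared secret rather than a public-key signature, and it does not by itself prove that an entry existed at time T; that would need the chain head to be anchored somewhere outside the operator's control.

```python
import hashlib
import hmac
import json
from typing import List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


class ChainedLedger:
    """Hash-chained, HMAC-tagged log. Names here are hypothetical."""

    def __init__(self, key: bytes):
        self._key = key
        self._entries: List[dict] = []

    def _digest(self, prev: str, payload: dict) -> str:
        body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True)
        return hashlib.sha256(body.encode()).hexdigest()

    def _sign(self, digest: str) -> str:
        return hmac.new(self._key, digest.encode(), hashlib.sha256).hexdigest()

    def append(self, payload: dict) -> dict:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        digest = self._digest(prev, payload)
        record = {
            "prev": prev,
            "payload": payload,
            "hash": digest,
            "sig": self._sign(digest),
        }
        self._entries.append(record)
        return record

    def verify(self) -> bool:
        """Recompute every hash and tag; any edit breaks the chain."""
        prev = GENESIS
        for r in self._entries:
            digest = self._digest(prev, r["payload"])
            if r["prev"] != prev or r["hash"] != digest:
                return False
            if not hmac.compare_digest(r["sig"], self._sign(digest)):
                return False
            prev = digest
        return True


ledger = ChainedLedger(key=b"demo-key-not-for-production")
ledger.append({"step": 1, "to_state": "refund_requested"})
ledger.append({"step": 2, "to_state": "refund_denied"})
intact = ledger.verify()

# Simulate a quiet rewrite of an earlier entry.
ledger._entries[1]["payload"]["to_state"] = "refund_approved"
after_tamper = ledger.verify()
```

In this sketch `intact` is true and `after_tamper` is false: the rewritten payload no longer matches its stored hash. Who holds the key, and who can replace the whole chain, remain the questions that matter.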

Finally, there is no evidence yet on adversarial pressure. Customer-service flows attract people trying to talk an agent into a refund it should not give. Does the ledger constrain the agent under that pressure, or does it just record the agent being talked round? That is the experiment we would want to see before recommending this pattern for any flow where money or eligibility changes hands.

Methodology note. This Note takes LEDGERAGENT (arXiv:2606.20529) as a substrate signal rather than a benchmark result. Triggers: substrate-relevant (it relocates agent task state from the context window to a durable external object, which is an audit-layer change, not a model change); non-duplicative (most agent observability work logs tokens and tool calls, not decision state); regulated-buyer link (PRA SS1/23 model risk for in-scope FCA-regulated firms, ICO explainability §11, UK GDPR Article 22 contestability for customer-service automation). Forthcoming: a Workloft test of append-only, signed task ledgers under adversarial customer-service pressure, measuring whether externalised state survives contest and tamper checks.
