Workloft
▸ WORKLOFT RESEARCH NOTE №09 · 22 MAY 2026

Your audit log is training data

Agent Context Compilation (arXiv:2605.21850), applied to a production audit log rather than a synthetic benchmark. 25 agent trajectories, 102 grounded long-context QA pairs, $0.0132 of compute. Open source.

REG FIT ●●● · STRONG · APPLIES TO PRA SS1/23 §3.5, ICO AI GUIDANCE §11, UK GDPR ART.5(2), NCSC SECURE AI SYSTEM DEVELOPMENT

§1 The interesting move is not the method, it is the source

This week's arXiv:2605.21850 proposes Agent Context Compilation, or ACC. The method takes multi-turn agent trajectories and distils them into structured question and answer pairs. Each pair has the property that answering it correctly requires combining information from non-adjacent turns. The pairs are then used as direct supervision for long-context reasoning, rather than asking a model to discover the structure by reading raw transcripts.

That is a clean technical contribution and worth reading on its own terms. But it leaves a more interesting question open. Where do the trajectories come from?

The paper assumes the answer is a benchmark sandbox: instrument an agent, give it tasks, collect what comes out. That is a respectable choice for an academic result. It is the wrong choice if you actually run agents in production.

If you have been operating multi-agent infrastructure for any length of time, you already have trajectories. They are sitting inside whatever append-only ledger you keep for compliance, debugging or cost accounting. You do not need to generate them. You need to compile them. The substrate point is that the data source matters more than the algorithm.

§2 What sits inside a production audit log

We run eight agents at Workloft. Every action any of them takes writes a row to one Postgres table called workloft_audit_log. The row records the agent, the tool, the arguments, the response, the cost, the duration, and a session_id that ties consecutive turns together. The table is append-only, enforced at the database level. We built it for FCA-style auditability and for cost accounting. It was not built as a research asset.
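A minimal sketch of what database-level append-only enforcement can look like. It is illustrative only: the column names are guesses based on the fields listed above, and SQLite triggers stand in for whatever mechanism the production Postgres table uses, so that the example runs without a server.

```python
import sqlite3

# Hypothetical schema: column names are assumptions derived from the fields
# described in the text, not the real workloft_audit_log definition.
SCHEMA = """
CREATE TABLE workloft_audit_log (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id  TEXT    NOT NULL,
    agent       TEXT    NOT NULL,
    tool        TEXT    NOT NULL,
    arguments   TEXT    NOT NULL,
    response    TEXT    NOT NULL,
    cost_usd    REAL    NOT NULL,
    duration_ms INTEGER NOT NULL,
    created_at  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER audit_no_update BEFORE UPDATE ON workloft_audit_log
BEGIN
    SELECT RAISE(ABORT, 'audit log is append-only');
END;
CREATE TRIGGER audit_no_delete BEFORE DELETE ON workloft_audit_log
BEGIN
    SELECT RAISE(ABORT, 'audit log is append-only');
END;
"""


def open_audit_db(path=":memory:"):
    """Create the table plus the triggers that reject UPDATE and DELETE."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def log_action(conn, session_id, agent, tool, arguments, response,
               cost_usd, duration_ms):
    """Append one row; INSERT is the only write the triggers allow."""
    conn.execute(
        "INSERT INTO workloft_audit_log "
        "(session_id, agent, tool, arguments, response, cost_usd, duration_ms) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (session_id, agent, tool, arguments, response, cost_usd, duration_ms),
    )
    conn.commit()
```

With this in place, any UPDATE or DELETE that touches a row raises a `sqlite3.DatabaseError`, so the guarantee lives in the database rather than in application code. In Postgres the same effect is usually reached with triggers or by withholding UPDATE and DELETE privileges from the writing role.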

At the time of writing, that table holds 9,170 rows across 170 distinct sessions. A typical session is short, at around six turns, but the distribution is heavily skewed: 9,170 rows over 170 sessions is a raw average of roughly 54 rows per session. The long tail is what matters: ten sessions have more than ten turns each, three sessions have more than thirty, and the longest single session in the table has 2,426 turns. The longest sessions are slow-running operational loops such as our Gmail-to-todos pipeline, our Composio tool exerciser, and our daily arXiv scoring sweep.
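Summary figures of this kind can be recomputed from the session_id column alone. The sketch below is one way to do it, assuming one audit row per turn; the function name and the thresholds are illustrative, not part of trajectory-compiler.

```python
from collections import Counter
from statistics import mean, median


def session_length_stats(session_ids, thresholds=(10, 30)):
    """Summarise turns per session from an iterable of session_id values.

    Assumes one row per turn, so the number of rows sharing a session_id
    is that session's length.
    """
    lengths = sorted(Counter(session_ids).values())
    if not lengths:
        raise ValueError("no rows supplied")
    return {
        "rows": sum(lengths),
        "sessions": len(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "longest": lengths[-1],
        # How many sessions are strictly longer than each threshold.
        "longer_than": {t: sum(1 for n in lengths if n > t) for t in thresholds},
    }
```

Reporting the median next to the mean matters here: one very long operational loop is enough to pull the mean far away from what a typical session looks like.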

These are not synthetic. They are the everyday operating record of agents that draft posts, triage email, file todos, run safety scans, score papers, run our outbound cadences, and answer the phone. Each session is, in effect, a small private benchmark on a task we genuinely care about. None of them existed to be a benchmark, which is part of the point.

This pattern is not particular to us. Any team running agents seriously already has the same artefact. Regulators in our market expect it to exist. The ICO's accountability principle, the PRA's SS1/23 on model risk management, and the NCSC's secure AI development guidance all push toward something audit-log-shaped. Production teams either have it or shortly will. We argue they should be looking at it as research material as well as evidence material.

§3 What we actually compiled

We built a small open-source tool, trajectory-compiler, that does three things end to end. It extracts trajectories from the audit log by grouping rows under a session and ordering them in time. It compiles each trajectory into one to five QA pairs by asking a cheap model, Gemini 2.5 Flash by default, to find questions whose answers require at least two non-adjacent supporting turns. It stores the resulting pairs in a local SQLite file by default, with an optional mirror into a shared Supabase table for team use.
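The extraction step and the non-adjacency requirement can be sketched in a few lines. This is not the trajectory-compiler source: the row keys (session_id, created_at, agent, tool, response), the 1-based turn numbering, and the function names are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def extract_trajectories(rows):
    """Group audit rows by session and order each session's turns in time.

    rows: iterable of dicts with (assumed) keys session_id, created_at,
    agent, tool and response. created_at must sort chronologically,
    for example ISO 8601 strings.
    """
    ordered = sorted(rows, key=itemgetter("session_id", "created_at"))
    return {
        session_id: list(turns)
        for session_id, turns in groupby(ordered, key=itemgetter("session_id"))
    }


def render_trajectory(turns):
    """Render turns as numbered lines so a model can cite them as #001, #002, ..."""
    return "\n".join(
        f"#{i:03d} [{t['agent']}/{t['tool']}] {t['response']}"
        for i, t in enumerate(turns, start=1)
    )


def has_non_adjacent_support(supporting_turns):
    """True if at least two cited turns are not neighbours.

    For distinct sorted integers, some pair is non-adjacent exactly when
    the first and last cited turns are more than one apart.
    """
    turns = sorted(set(supporting_turns))
    return len(turns) >= 2 and turns[-1] - turns[0] > 1
```

A pair citing turns 2, 11, 12, 13 and 25 passes the check; a pair citing only turns 4 and 5 does not, because answering it needs nothing beyond two consecutive turns.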

On the v0 run, against our own audit log, the compiler produced 102 QA pairs from 25 sessions for a total compute cost of $0.0132, or roughly $0.00013 per pair.

One representative pair, distilled from a 30-turn session of our own agent Bob:

Q. What is the status of the Twilio voice setup, and what specific credentials are still missing?

A. Setup incomplete. TWILIO_SID, TWILIO_TOKEN and the phone number are not provided. No KYC completion evidence. No credentials present on the VPS. Only a console password in /home/workloft/secrets/twilio-account.txt.

Supporting turns: #002, #011, #012, #013, #025.

The answer is correct against the underlying trajectory, grounded in the trajectory text, and would have been hard to produce without traversing five non-adjacent turns. That is the supervision pattern the ACC paper argues is missing from current long-context training pipelines. The novelty here is not that the pattern exists; it is that you can synthesise it for under two pence using a model that runs on someone else's hardware.

§4 Why a regulated buyer should care

For FCA-regulated firms running AI assistants over case files, claims systems or servicing portals, two compliance pressures pull in the same direction. SS1/23 expects model risk management proportionate to the use case. UK GDPR Article 5(2) expects accountability for decisions involving personal data. The natural artefact for both is an append-only operational log of what the agent did and why.

If you already keep that log for compliance, you also have a private long-context benchmark that no public dataset can match. Every model upgrade you evaluate can be tested directly against the work your agents actually do. Public benchmarks know nothing about your firm. Your audit log does.

For UK Local Authorities, the same logic applies under a different statutory frame. Accountability under UK GDPR, the ICO's explainability expectations and FOIA-sensitive recordkeeping all push toward operational logs of agent actions. A council deploying an agent against a benefits system, housing case or safeguarding referral has the same reason to keep the log, and gains the same opportunity to use it as a private evaluation set. The cost of compliance compounds into capability. That is unusual.

For the local model question, the calculation is even cleaner. We run Qwen 2.5 7B locally on the VPS for sovereign workloads. The QA pairs distilled from our own trajectories are exactly the supervision signal that local model needs in order to behave like the larger models on the work we actually give it. At $0.00013 per pair, the budget for instruction-tuning data is no longer the limiting factor.

§5 The audit log has to exist first

This method has a precondition that is often skipped in agent procurement: the buyer must already have an append-only audit log of agent activity. Without that, ACC against your own work is not possible, and you fall back to synthetic trajectory generation, which is closer to the academic regime the original paper assumes.

This matters when councils, asset managers or central government buyers evaluate agent stacks. The right order of work is audit log first, agents second, automation third, governance fourth, evaluation fifth. Vendors that ship the agent without shipping the audit log give the buyer a capability they cannot govern, cannot evaluate against their own work, and cannot improve over time. The compiler in this Note assumes you have done the first piece. If you have, the rest is small.

One concrete test: if you are choosing between two otherwise comparable agent vendors, ask each of them to point you at the schema of the operational ledger they would leave you with. The vendor with a believable answer can be made to support this pattern. The vendor without one cannot.

§6 What we are not claiming

The pairs are not yet verified. We filter on having at least two supporting turn references in the LLM's output, but we do not yet validate that those cited turns actually contain the answer. A verifier pass that scores each pair against the underlying trajectory is the obvious next iteration and the one we will write first.
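To make the gap concrete, here is a deliberately crude stand-in for a verifier: a lexical check of how much of an answer's vocabulary actually appears in the turns it cites. It is not the planned verifier pass and uses no model. The tokenisation rule, the 1-based turn indexing and the function names are assumptions for illustration only.

```python
import re

# Keeps identifiers such as TWILIO_SID and file paths in one piece.
_TOKEN = re.compile(r"[A-Za-z0-9_./-]+")


def tokens(text):
    """Lower-cased content tokens longer than three characters."""
    stripped = (t.strip("./-").lower() for t in _TOKEN.findall(text))
    return {t for t in stripped if len(t) > 3}


def support_score(answer, turns, supporting_turns):
    """Fraction of the answer's tokens that occur in the cited turns.

    turns: list of turn texts in trajectory order.
    supporting_turns: 1-based turn numbers cited by the model; numbers
    outside the trajectory are ignored.
    """
    cited = [turns[i - 1] for i in supporting_turns if 1 <= i <= len(turns)]
    answer_tokens = tokens(answer)
    if not answer_tokens or not cited:
        return 0.0
    cited_tokens = set().union(*(tokens(t) for t in cited))
    return len(answer_tokens & cited_tokens) / len(answer_tokens)
```

A low score flags a pair whose citations probably do not contain the answer. A high score proves little, since paraphrase and negation are invisible to token overlap, which is why a model-based verifier is still needed.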

The dataset is small. 102 pairs is enough to demonstrate the method and produce this Note. It is not enough to retrain anything serious. The cheap unit economics make scaling to tens of thousands of pairs trivially affordable; the work to actually do that, on the full 9,170 audit rows we have today, is the next run.
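The unit economics follow from the v0 totals. As a rough guide only, and assuming cost scales linearly with the number of pairs (longer trajectories will cost more per call), the arithmetic is:

```latex
\[
\frac{\$0.0132}{102\ \text{pairs}} \approx \$0.000129\ \text{per pair},
\qquad
10{,}000\ \text{pairs} \times \$0.000129 \approx \$1.29
\]
```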

The compiler does not redact the trajectories before sending them to an external model. Our own audit log is internal and the Gemini endpoint we use is acceptable for that workload, but anyone applying this method to a log containing personal or commercially sensitive data needs to add redaction at the compilation boundary, not at the conversation layer downstream.
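A redaction step at the compilation boundary might look like the sketch below, applied to each turn before any text leaves the machine. The two rules are illustrative assumptions and are nowhere near complete for regulated data; a real deployment would need rules matched to the identifiers that actually occur in its own log.

```python
import re

# Illustrative rules only (assumptions, not a vetted PII or secret detector).
# Each entry is (compiled pattern, replacement).
RULES = [
    # Email addresses.
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    # key=value or key: value where the key name looks like a credential.
    (
        re.compile(r"(?i)\b(\w*(?:token|secret|password|key)\w*)\s*[=:]\s*\S+"),
        r"\1=[SECRET]",
    ),
]


def redact(text):
    """Apply every rule in order and return the redacted text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


def redact_trajectory(turns):
    """Redact each turn before the trajectory is sent to an external model."""
    return [redact(turn) for turn in turns]
```

Redacting here, rather than after the model has responded, means the external endpoint never sees the raw values, and the stored QA pairs inherit the redaction automatically.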

The technique is also not a substitute for the original ACC training regime. We are producing supervised examples; the paper trains models against them. The link between the two is exactly the kind of cross-organisational work that an academic group and an operations-heavy shop are well placed to do together. We are not the academic group.

§7 What is next

Three concrete steps follow this Note. We will add a verifier pass, using a stronger model, that scores each pair against its cited supporting turns. We will add a redaction layer at the compilation boundary so the tool is safe to point at regulated data. And we will run the compiler against the full audit log on a daily schedule, so the dataset grows with our actual operating volume.

We would also like to hear from other production teams. If you keep an append-only agent audit log and would be willing to compare notes on QA-pair quality across organisations, please get in touch. We suspect the inter-organisation diversity is interesting in its own right.

The wider point is the one we keep coming back to in Workloft Labs. Substrate beats spectacle. The agent your buyers care about is the one whose record they can read, whose history they can evaluate, and whose future they can shape with the work it has already done. ACC, applied to a production audit log, is a small piece of that loop. The code is open. The data source is sitting in your database already.


Methodology note. This Note pairs arXiv:2605.21850 (Agent Context Compilation) with an internal v0 run against our own workloft_audit_log table. Compilation used Gemini 2.5 Flash via our internal model router (Ruby), with a system prompt that requires at least two non-adjacent supporting turns per pair. Output was validated against a fixed JSON schema and discarded otherwise. The full v0 dataset contains 102 pairs from 25 sessions, at a total compute cost of $0.0132. The tool is open source under MIT at github.com/workloftai/trajectory-compiler. Forthcoming: a verifier pass that scores each pair against its cited turns, plus a redaction layer for regulated trajectories.
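For readers who want a feel for the validate-or-discard step, here is a standard-library sketch. The actual schema is not reproduced in this Note, so the shape assumed below (a top-level "pairs" list whose items carry "question", "answer" and "supporting_turns") is hypothetical, as are the function names.

```python
import json


def _valid_pair(pair):
    """Check one QA pair against the assumed shape."""
    if not isinstance(pair, dict):
        return False
    question = pair.get("question")
    answer = pair.get("answer")
    turns = pair.get("supporting_turns")
    return bool(
        isinstance(question, str) and question.strip()
        and isinstance(answer, str) and answer.strip()
        and isinstance(turns, list)
        and len(turns) >= 2  # at least two supporting turn references
        # bool is a subclass of int in Python, so exclude it explicitly.
        and all(isinstance(t, int) and not isinstance(t, bool) for t in turns)
    )


def parse_pairs(raw, max_pairs=5):
    """Return the pairs if the model output is valid, otherwise an empty list.

    Anything that is not valid JSON, not the expected shape, or outside the
    one-to-five pair range is discarded wholesale rather than repaired.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    pairs = data.get("pairs") if isinstance(data, dict) else None
    if not isinstance(pairs, list) or not 1 <= len(pairs) <= max_pairs:
        return []
    if not all(_valid_pair(p) for p in pairs):
        return []
    return pairs
```

Discarding a whole response on any violation is the conservative choice: it costs a fraction of a cent to retry, whereas a partially repaired response risks admitting a malformed pair into the dataset.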