We shipped a small tool today that turns our own agent audit log into structured long-context QA pairs. Inspired by Agent Context Compilation (arXiv:2605.21850), but starting from a real production audit log rather than a synthetic benchmark run. 102 pairs out of 25 trajectories, for $0.0132. Open under MIT.
What we did
Eight Workloft agents have been writing every action they take
into one append-only Postgres table called
workloft_audit_log. Each row is a turn. A
session_id ties consecutive turns together. Some
sessions are short. Some run thousands of turns long. The table
now holds 9,170 rows across 170 distinct sessions, and nobody
had been learning anything from it.
trajectory-compiler is the small tool that changes
that. It extracts trajectories by session_id,
compiles each into one to five structured QA pairs by asking a
cheap model to find questions whose answers require combining at
least two non-adjacent turns, and stores the pairs in SQLite by
default, with an optional mirror into a shared Supabase table.
On the v0 run against our own log: 25 source trajectories (8 turns to 2,426 turns long), 102 QA pairs, $0.0132 in total compute cost (around $0.00013 per pair), with a type mix of 61 per cent composition, 28 per cent factoid, 9 per cent temporal, 3 per cent reasoning.
A representative pair, distilled from a 30-turn Bob session:
Q. What is the status of the Twilio voice setup, and what specific credentials are still missing?
A. Setup incomplete. TWILIO_SID, TWILIO_TOKEN and the phone number are not provided. No KYC completion evidence. No credentials present on the VPS. Only a console password in
/home/workloft/secrets/twilio-account.txt.Supporting turns: #002, #011, #012, #013, #025.
The answer is correct, grounded in the trajectory itself, and requires combining five non-adjacent turns. That is exactly the supervision signal the ACC paper argues is missing from current long-context training pipelines.
Why it was worth doing
Three concrete reasons. First, a free internal evaluation set: every model upgrade we consider gets asked the same question. Can it still answer correctly from our own historical trajectories? Public benchmarks know nothing about how our agents work. The audit log does.
Second, instruction-tuning data for the local Qwen 2.5 7B we run on the VPS for sovereign workloads. At $0.00013 per pair, a few quid buys us tens of thousands of grounded examples that look exactly like the work the local model is asked to do.
Third, a way to grade memory recall. Our Hindsight memory layer is supposed to surface useful context across sessions. The pairs let us test that directly: ask the question, check whether retrieval brought back the supporting turns.
What's still off
The compiler does not redact trajectories before sending them to the LLM. Our log is internal and we trust the endpoint we use, but if your log carries personal data or regulated commercial information, redaction needs to happen at the source.
The pairs are not yet scored. We filter for at least two supporting turns, but we do not yet verify that those turns actually contain the answer. A verifier pass is the next iteration, and the obvious thing to write next.
The dataset is small. A hundred pairs is enough to write this article. It is not enough to retrain anything. Production use needs a few orders of magnitude more, which the cheap compute makes possible but we have not done yet.
Where to find it
Code at github.com/workloftai/trajectory-compiler under MIT. Full method note at Workloft Labs Research Note №09.
The interesting bit is not the tool. It is the framing: the audit log your buyers and regulators already expect you to keep is also the cheapest source of supervised long-context training data you can possibly find. Most production teams already have it. None of them are using it that way yet.