Synthetic Tasks Have No Provenance, And That Is The Audit Problem

§1 The angle: where did this task come from, and who can prove it

The paper describes a synthesis engine that mints high-quality terminal-agent tasks. It builds a multi-dimensional capability taxonomy, runs evidence-guided research to ground each task, then distils the output into a training set that produces measurable gains when you train an LLM on it. As an engineering artefact it is genuinely good work. The taxonomy is the interesting bit: rather than scraping whatever terminal transcripts happen to exist, the authors generate tasks deliberately across a capability grid, which fixes the coverage problem that plagues harvested datasets.

But here is the thing nobody training agents wants to hear. A synthetic task has no birth certificate. When a human writes a terminal task, or you scrape one from a real session, there is a chain you can walk back: who wrote it, what system it ran against, what the expected output was and why. A task minted by a synthesis engine has none of that unless you build the provenance machinery deliberately, and this paper, like almost every synthetic-data paper before it, treats provenance as out of scope. For a regulated-AI buyer that is not a footnote. That is the whole game.

§2 The argument: training-data provenance is becoming a runtime audit requirement

Workloft sits at the substrate layer, the bit underneath the agent that has to survive an FCA review or an ICO information request. From that seat, the question a synthesis engine raises is not "does the dataset improve benchmark scores" (it does, that is the paper's point). The question is: when your terminal agent does something expensive or wrong in production, can you trace the behaviour back to the training task that taught it?

With harvested data you can, painfully, attempt this. With a synthesis engine the lineage runs synthesis-engine to taxonomy-cell to generated-task to model-weights, and at no point in that chain does the paper carry forward a stable identifier you could cite in an audit. The taxonomy is the one place this could have been solved cheaply. Every task is born into a capability cell, so every task could carry a cell-id, a generation seed, the evidence documents that grounded it, and a hash. That would give you a queryable map from "the agent tried to rm -rf a mounted volume" back to "these 40 synthesised tasks in the destructive-filesystem-operations cell, generated on this date, grounded on these sources".
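A minimal sketch of what such a per-task record could look like, assuming nothing about the paper's actual data format. The `TaskLineage` class, its field names, the `make_lineage` and `index_by_cell` helpers and the cell names are all hypothetical, chosen only to mirror the four items listed above (cell-id, seed, grounding evidence, hash).

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskLineage:
    """Hypothetical per-task lineage record; not a format defined by the paper."""
    task_id: str
    cell_id: str              # taxonomy cell the task was generated into
    seed: int                 # generation seed
    generated_on: str         # ISO date string supplied by the caller
    evidence_sources: tuple   # identifiers of the grounding documents
    content_hash: str         # SHA-256 of the task text


def make_lineage(task_text, cell_id, seed, generated_on, evidence_sources):
    """Build a lineage record whose id is derived from the cell and the content hash."""
    digest = hashlib.sha256(task_text.encode("utf-8")).hexdigest()
    return TaskLineage(
        task_id=f"{cell_id}:{digest[:12]}",
        cell_id=cell_id,
        seed=seed,
        generated_on=generated_on,
        evidence_sources=tuple(evidence_sources),
        content_hash=digest,
    )


def index_by_cell(records):
    """Group records by taxonomy cell so an incident can be walked back to its tasks."""
    index = defaultdict(list)
    for record in records:
        index[record.cell_id].append(record)
    return dict(index)
```

With records like these, the audit question becomes a lookup: `index_by_cell(records)["destructive-filesystem-operations"]` returns every synthesised task in that cell, each with its seed, date and grounding sources attached.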

None of that is exotic. It is the same separation-of-concerns discipline that the ICO's AI guidance §11 expects around data lineage, and that PRA SS1/23 §3.5 expects around model risk traceability. The synthesis engine does not produce it because the engine is built to maximise training gain, not auditability. That is the recurring substrate failure: the producer of the artefact is never the party who has to defend it.

Consider what "evidence-guided research" means for liability. The engine pulls grounding evidence to make each task realistic. If that evidence includes copyrighted documentation, internal-looking configs, or anything resembling personal data, it is now baked into your training distribution with no record of where it came from. A regulated buyer who trains on this set inherits a provenance gap they did not create and cannot close after the fact. You cannot un-train a task whose origin you never recorded.

§3What this means for anyone building on synthetic agent data

The practical move is to refuse synthetic training data that arrives without a manifest. If a synthesis engine cannot emit, per task, the taxonomy cell, the generation parameters, the grounding sources, and a content hash, then the dataset is unauditable by construction and you should treat it as such. This is not a counsel of perfection. It is the minimum that lets you answer a regulator's "why did the model do that" with something other than a shrug.
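The refusal rule can be sketched as an intake check. This is an illustration under stated assumptions, not a standard: the manifest is assumed to arrive as a list of dicts, and the field names in `REQUIRED_FIELDS` are hypothetical labels for the four items named above.

```python
# Hypothetical field names for the four per-task items a manifest must carry.
REQUIRED_FIELDS = ("cell_id", "generation_params", "grounding_sources", "content_hash")


def missing_fields(entry):
    """Return the required fields that are absent or empty in one manifest entry."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]


def audit_manifest(entries):
    """Split manifest entries into accepted ones and (index, gaps) pairs for rejects."""
    accepted, rejected = [], []
    for position, entry in enumerate(entries):
        gaps = missing_fields(entry)
        if gaps:
            rejected.append((position, gaps))
        else:
            accepted.append(entry)
    return accepted, rejected
```

Treating an empty value as missing is a deliberate choice in this sketch: a task that lists zero grounding sources is as unauditable as one that omits the field.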

For terminal agents specifically the stakes are higher than for chat. A terminal agent runs commands against real systems. A badly-grounded synthesised task that normalises a dangerous command pattern does not produce a bad sentence; it produces a model more willing to execute destructive operations. The capability taxonomy makes this worse before it makes it better: by deliberately covering the destructive cells of the grid for completeness, you are training the model on more dangerous-command examples than a harvested set would contain, and you want to know exactly which examples those were.

The good news is that the taxonomy structure makes the fix tractable. A capability grid is already a provenance schema waiting to be used. Workloft's position is that any synthesis engine destined for regulated training should emit a per-task lineage record keyed on the cell-id, and that this record should travel with the weights into the runtime, so that a runtime guard can reason about which capabilities a model was trained on and flag when it acts outside them.
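As a rough sketch of the runtime side, a guard could hold the set of cell-ids that appear in the lineage records and flag any command that falls outside them. Everything here is an assumption for illustration: the `COMMAND_CELLS` lookup is a toy stand-in for a real command classifier, and `CapabilityGuard` is not a Workloft or paper API.

```python
import shlex

# Toy, hypothetical mapping from an executable name to a taxonomy cell.
# A real guard would need a far richer classifier than a lookup table.
COMMAND_CELLS = {
    "ls": "filesystem-read",
    "cat": "filesystem-read",
    "rm": "destructive-filesystem-operations",
    "curl": "network-access",
}


def classify(command):
    """Map a shell command to a capability cell, or "unknown" if unrecognised."""
    tokens = shlex.split(command)
    if not tokens:
        return "unknown"
    return COMMAND_CELLS.get(tokens[0], "unknown")


class CapabilityGuard:
    """Flags commands whose cell is not among the cells the model was trained on."""

    def __init__(self, trained_cells):
        self.trained_cells = frozenset(trained_cells)

    def check(self, command):
        cell = classify(command)
        return {
            "command": command,
            "cell": cell,
            "in_scope": cell in self.trained_cells,
        }
```

For example, `CapabilityGuard({"filesystem-read"}).check("rm -rf /mnt/data")` reports the cell as `destructive-filesystem-operations` with `in_scope` set to `False`, which is the signal a runtime policy would act on.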

§4 What the paper does not solve

To be fair to the authors, provenance is genuinely not what they set out to do, and the paper is honest about being a data-quality and training-gain result. It does not claim audit fitness. The gains are real and the taxonomy is a contribution. But three things are missing for a regulated buyer. First, no per-task lineage: the distilled dataset is delivered as training fuel, not as an auditable record. Second, no handling of grounding-evidence provenance, so the copyright and personal-data exposure baked in by "evidence-guided research" is undocumented. Third, no carry-through to runtime: the capability taxonomy that structures training never becomes a structure the running agent can be held to. Until a synthesis engine treats the manifest as a first-class output rather than a discarded by-product, regulated buyers should assume the data is unauditable and price that in.

Methodology note. This Note takes the terminal-agent task synthesis engine paper (arXiv:2606.22883) as a strong engineering result with a provenance-shaped hole. Triggers: substrate-relevant (training-data lineage is becoming a runtime audit requirement, not a research nicety); non-duplicative (the synthetic-data discourse fixates on quality and gain, almost never on per-task auditability); regulated-buyer link (PRA SS1/23 model-risk traceability, ICO AI guidance §11 data lineage, EU AI Act Art. 10 data governance). The Workloft-side angle is that a capability taxonomy is already a provenance schema, so the fix is cheap if treated as first-class. Forthcoming: a Workloft reference manifest format for synthetic agent tasks that carries cell-id lineage into runtime guards.
