Claim Drift Is the Audit Problem Nobody Named

§1 The angle

Most papers about AI scientists are spectacle. They show a model that wrote a paper, ran an experiment, and produced a result, and they ask you to be impressed by the artifact at the end. Wang et al. do the opposite, and that is why this one matters at the substrate layer. Their system, Xcientist, is barely about the science. It is about the bookkeeping of the science: the literature evidence, the idea states, the implementation plans, the ablation records, the repair traces. They make all of it persistent, inspectable, and governed by contracts.

The interesting move is that they name a failure mode the rest of the field has been quietly tolerating. They call it claim drift: the point at which the runnable artifact no longer supports the mechanism that was originally claimed. The model said it was doing X. The code that survives does Y. The result looks fine. Nobody can tell the difference from the output alone.

If you build agent infrastructure for regulated buyers, you have already met claim drift. You just did not have a word for it.

§2 What the harness actually does

Xcientist treats research as a chain of artifacts rather than a single forward pass. A mechanism gets proposed, grounded against evidence, turned into an implementation plan, executed, tested through ablations, and then revised within bounds. Each of those steps leaves a record, and the records are linked. The reasoning that connects prior evidence to generated idea to experiment to final claim is pulled out of the model's hidden inference and written down where you can look at it.
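To make the idea of linked records concrete, here is a minimal sketch of an artifact ledger. It is not Xcientist's implementation: the `Artifact` and `Ledger` names, the `kind` labels, and the hash-derived ids are all assumptions made for illustration. It shows only the core property described above: every record points at the records it was derived from, so a final claim can be walked back to its evidence.

```python
from dataclasses import dataclass
import hashlib
import json


@dataclass(frozen=True)
class Artifact:
    """One persisted step in the chain (hypothetical schema)."""
    kind: str             # e.g. "evidence", "idea", "plan", "run", "claim"
    content: str
    parents: tuple = ()   # ids of the artifacts this one was derived from

    @property
    def id(self) -> str:
        # Content-derived id, so a record cannot change without its id changing.
        payload = json.dumps([self.kind, self.content, list(self.parents)])
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


class Ledger:
    """In-memory store of artifacts, keyed by id."""

    def __init__(self):
        self._records = {}

    def add(self, kind, content, parents=()):
        artifact = Artifact(kind, content, tuple(p.id for p in parents))
        self._records[artifact.id] = artifact
        return artifact

    def trace(self, artifact):
        """Walk back from an artifact to everything it depends on."""
        seen, stack, order = set(), [artifact.id], []
        while stack:
            artifact_id = stack.pop()
            if artifact_id in seen:
                continue
            seen.add(artifact_id)
            record = self._records[artifact_id]
            order.append(record)
            stack.extend(record.parents)
        return order


# Toy chain; the contents are placeholders, not results from the paper.
ledger = Ledger()
evidence = ledger.add("evidence", "prior work reports effect E")
idea = ledger.add("idea", "mechanism M should reproduce E", [evidence])
plan = ledger.add("plan", "implement M as a single module", [idea])
run = ledger.add("run", "experiment executed, metrics stored", [plan])
claim = ledger.add("claim", "M accounts for the improvement", [run])

kinds = [a.kind for a in ledger.trace(claim)]
print(kinds)
```

Because each id is derived from the record's content and its parents, editing an upstream record silently is not possible in this sketch: the edit produces a different id and the downstream links no longer resolve to it.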

Wang et al. demonstrate this across three domains: training-free memory systems, graph-structured traffic forecasting, and multi-scale physics-informed neural networks. The claim is not that Xcientist produces better science. It is that Xcientist produces attributable science. You can trace the trajectory from problem formulation to mechanism design to validation and back, and you can see where a claim and its evidence part company.

The contract framing is the part substrate builders should sit up for. A contract here is a checkable promise about what an artifact is supposed to support. The producer of a result is not the same thing as the process that validates whether the result still means what it said. That separation, producer from guardian, is the entire reason this is worth a Note.
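A small sketch of that separation, under stated assumptions: the `Contract` type, the `producer` and `guardian` functions, the ablation key names, and the 0.05 threshold are all hypothetical, and the metric values are toy numbers rather than anything reported by Wang et al. The point is only that the check lives outside the producer and is evaluated against what was actually run.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Contract:
    """A checkable promise about what an artifact is supposed to support."""
    claim: str
    check: Callable[[Dict[str, float]], bool]


def producer() -> Dict[str, float]:
    # Stand-in for the process that runs the experiment and its ablation.
    # Toy values: removing the mechanism barely changes the metric.
    return {"full": 0.81, "without_mechanism": 0.80}


def guardian(contract: Contract, results: Dict[str, float]) -> str:
    # The guardian never sees the producer's explanation, only the results.
    return "supported" if contract.check(results) else "claim drift"


contract = Contract(
    claim="The mechanism accounts for at least 0.05 of the metric",
    check=lambda r: r["full"] - r["without_mechanism"] >= 0.05,
)

verdict = guardian(contract, producer())
print(contract.claim, "->", verdict)
```

In this toy run the headline metric looks fine, but the ablation shows the claimed mechanism contributes almost nothing, so the guardian reports claim drift. A producer grading its own output would have no reason to notice.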

§3 Why this is the same problem as a regulated agent in production

Swap the vocabulary. An AI scientist proposes a mechanism, runs it, and claims a result. An agent in a regulated firm makes a decision, takes an action, and produces a justification. In both cases the dangerous gap is identical: the explanation the system gives and the thing the system actually did can drift apart while every output keeps looking plausible.

This is precisely the territory of FCA SS1/23 on model risk management, where the regulator expects firms to evidence not just outcomes but the process that produced them, including validation independent of the producer. It is the territory of ICO guidance on AI and data protection §11, where you owe data subjects an explanation that genuinely reflects the logic applied, not a post-hoc story the model generated to sound reasonable. A justification that no longer matches the action is not an explanation. It is claim drift wearing a compliance badge.

Most agent frameworks in the wild today have no defence against this. They log the prompt, log the output, maybe log a chain of thought, and call it observability. But a chain of thought is generated text, not a contract. There is nothing that checks whether the recorded reasoning actually governed the action that followed. Wang et al. are pointing at the missing layer: persistent artifacts with checkable links between what was claimed and what was run.

For a UK Local Authority deploying an agent to triage housing applications, or an FCA-regulated firm running an agent over creditworthiness signals, the question an auditor will eventually ask is not "what did the agent decide". It is "can you show that the decision was made for the reason the record states". Without an externalised, contract-governed trail, the honest answer is no.
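One way to picture a check that could begin to answer the auditor's question is a replay: take the rule the record says was applied, re-run it on the recorded inputs, and compare the outcome with the action actually taken. The sketch below assumes a hypothetical `DecisionRecord` schema, a hypothetical rule registry, and an invented `need_score` field and threshold. None of it comes from the paper or from any regulator's guidance.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class DecisionRecord:
    """What the agent logged about one decision (hypothetical schema)."""
    inputs: Dict[str, int]
    stated_rule: str   # name of the rule the record says was applied
    action: str        # what the agent actually did


# Registry of rules maintained outside the agent. The rule and its
# threshold of 70 are invented for illustration.
RULES: Dict[str, Callable[[Dict[str, int]], str]] = {
    "priority_if_high_need": lambda x: (
        "priority" if x["need_score"] >= 70 else "standard"
    ),
}


def verify(record: DecisionRecord) -> bool:
    """Replay the stated rule on the recorded inputs and compare outcomes."""
    rule = RULES.get(record.stated_rule)
    if rule is None:
        return False  # the record cites a rule nobody can check
    return rule(record.inputs) == record.action


ok = DecisionRecord({"need_score": 82}, "priority_if_high_need", "priority")
drifted = DecisionRecord({"need_score": 40}, "priority_if_high_need", "priority")

print(verify(ok), verify(drifted))
```

A replay of this kind establishes consistency, not causation: it shows the action is what the stated rule would have produced, not that the rule is what drove the agent. It is a floor for the audit trail rather than the whole of it.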

§4 The substrate take they will not hand you

Here is the argument the abstract does not make explicitly. Evaluating AI systems by their final artifacts is the original sin of the whole field, and it scales straight into compliance failure. Benchmarks reward outputs. Regulators ask about process. Those are different axes, and the gap between them is exactly where claim drift lives.

What Xcientist gets right, and what almost no production agent runtime gets right, is that accountability has to be a property of the substrate, not a feature of the model. You cannot prompt a model into being attributable. You cannot fine-tune away claim drift, because claim drift is not a model error. It is a structural absence: nothing in the system is responsible for holding the claim and the artifact together over time.

The contract pattern Wang et al. use, separating the thing that produces from the thing that validates, is the same pattern we keep arguing for in pre-send verification. A guardian process that is structurally independent of the producer, checking a specific promise, is the only thing that survives an audit. The producer cannot mark its own homework, and a generated explanation is the producer marking its own homework.

§5 What the paper does not solve

Xcientist is a research harness, not a production runtime. It runs in a setting where experiments are repeatable, latency is irrelevant, and the artifacts are scientific code. A regulated agent operates under real-time constraints, against changing data, and often with no clean ablation to fall back on. The contracts in the paper are checkable because the domain is bounded. In a housing-triage agent, the equivalent contract has to be defined against messy policy, protected characteristics, and statutory duties, and nobody has shown that those contracts are cheap to write or stable over time.

The paper also does not quantify the overhead. Externalising every idea state, ablation, and repair trace is not free, and Wang et al. do not tell us what it costs in compute or human review time to maintain the trail at production scale. "Bounded revision" is doing a lot of work in their framing, and the bounds are set by researchers who already know the domain.

And the harness still trusts the artifacts it records. It catches the case where the runnable code no longer supports the claim. It does not, on its own, catch the case where the recorded evidence was wrong from the start. Attributability is necessary for accountability. It is not sufficient. But naming claim drift, and building a layer that can see it, is more than the rest of the AI-scientist literature has managed, and that is worth borrowing.

Methodology note. This Note takes Wang et al.'s Xcientist (arXiv:2606.18874) as a substrate document dressed as an AI-scientist paper. Triggers: substrate-relevant (it externalises reasoning into inspectable, contract-governed artifacts, exactly the layer production agent runtimes lack); non-duplicative (the field evaluates AI scientists on final artifacts, almost nobody names claim drift); regulated-buyer link (FCA SS1/23 independent validation, ICO §11 genuine explanation, EU AI Act Art.13 transparency, all of which fail under claim drift). Forthcoming: a Workloft teardown of how pre-send verification implements the producer-guardian separation for housing and creditworthiness agents under real-time constraints.
