Cold-Start Scores Are Lying to You: What OmniGameArena's Improvement Curves Mean for Agent Audit

§1 The benchmark is about games. The observable is about audit.

Lin et al. built OmniGameArena (arXiv:2606.09826) to fix a parochial problem: VLM game benchmarks report one first-attempt score per agent-game pair, only test solo play, and can't compare commercial VLMs, open-weight VLMs and specialised game policies on the same footing. The suite is twelve new Unreal Engine 5 games behind unified action interfaces: seven solo, three PvP, two co-op. Fine. If you stop reading at the leaderboard, you've missed the part that matters for anyone running agents under a regulator.

The interesting contribution is the Improvement Dynamics Curve (IDC). Instead of a single number, IDC runs a tool-using reflector LLM that autonomously refines a bounded skill prompt across multiple rounds, then measures two things: how the score moves round over round, and how the refined skill behaves on held-out task variants. They report cold-start scores for twelve agents and IDC for the four best.
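To make the shape of that loop concrete, here is a minimal sketch of an IDC-style harness. It is our reading of the description above, not Lin et al.'s code: the callables `play`, `play_held_out` and `reflect`, the `MAX_SKILL_CHARS` bound and the choice to score held-out variants after every round are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical bound: the paper is described only as using a "bounded" skill prompt.
MAX_SKILL_CHARS = 2000


def run_idc(
    initial_skill: str,
    play: Callable[[str], float],           # assumed: score on the tasks the reflector sees
    play_held_out: Callable[[str], float],  # assumed: score on variants it never sees
    reflect: Callable[[str, float], str],   # assumed: reflector proposes a revised skill
    rounds: int,
) -> List[Tuple[int, float, float]]:
    """Return (round, in_loop_score, held_out_score) tuples; round 0 is the cold start."""
    skill = initial_skill[:MAX_SKILL_CHARS]
    curve: List[Tuple[int, float, float]] = []
    for r in range(rounds + 1):
        score = play(skill)
        curve.append((r, score, play_held_out(skill)))
        if r < rounds:
            # Enforce the bound on every rewrite, not just the first prompt.
            skill = reflect(skill, score)[:MAX_SKILL_CHARS]
    return curve
```

The first tuple is the cold-start number a leaderboard would print; everything after it is what the leaderboard leaves out.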

Strip away the games. What Lin et al. have actually instrumented is the gap between what an agent does on first contact and what it does after it has been allowed to reason about its own failures. That gap is the single most under-measured property in every regulated agent deployment we look at.

§2 Why a first-attempt score is the wrong unit for compliance

Most agent evaluations today, and most procurement teams' readings of them, are static snapshots. You run the agent against a test set, you get a number, you put the number in the model card, and you treat that number as the system's behaviour. The PRA's SS1/23 model risk management expectations and the ICO's guidance on AI both assume you can describe how a system behaves and monitor it over time. A cold-start score describes a system that no longer exists the moment it starts operating.

Production agents do not sit at their first-attempt behaviour. They retry. They reflect. They get retrieval context, scratchpad memory, reflexion loops, tool feedback. A modern agent harness is a feedback loop by construction. So the number you signed off on in procurement is the number for an agent whose feedback loop has been deliberately switched off. IDC measures the agent you actually deploy.

This matters in two directions, and Lin et al. surface both. First: an agent that scores poorly cold-start but climbs steeply under reflection is genuinely more capable than its leaderboard rank suggests. Second, and this is the dangerous one for regulated buyers: the held-out variant test. An agent can refine its bounded skill prompt to ace the rounds it has seen, and then that learned skill can fail to transfer. That is overfitting to the reflection loop. In a game it costs you a level. In a regulated workflow it means the agent has learned to pass the cases it was tuned against and has no honest handle on the cases it hasn't.

§3 The two-curve view is what an audit trail should record

Here is the substrate take. OmniGameArena gives you, per agent-game pair, three observables instead of one: the cold-start score, the improvement trajectory across reflection rounds, and the held-out transfer. We'd argue that the second and third are the ones a regulated deployment needs to log, and almost nobody logs them.

Think about what an FCA-regulated firm or a Local Authority actually has to demonstrate. Not just "the system scored X on a test". They have to show that the behaviour they validated is the behaviour in production, and that the system has not quietly drifted because its reflection loop learned something the validators never saw. The IDC framing maps onto that requirement directly. The improvement curve is a record of how the agent's behaviour changes when it is allowed to self-modify its prompt. The held-out variant gap is a measure of how much of that change is real competence versus memorised pattern.

If you run a reflection-based or self-improving agent and you cannot produce those two curves, your audit position is that you signed off on attempt one and have no documented account of everything the agent did after. That is the position most teams are in right now. The bounded skill prompt that Lin et al.'s reflector rewrites round over round is, in production terms, an unversioned, self-editing configuration artefact. NCSC guidance on AI security and the ICO's accountability principle both want you to know what your system's controlling configuration is and who changed it. A reflector LLM rewriting its own skill prompt is a change you currently can't attribute to anyone.
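One way to make that artefact versioned and attributable is an append-only, hash-chained log of prompt versions. The sketch below is an illustration under our own assumptions, not something the paper or any regulator prescribes: `PromptVersion`, `append_version`, `verify_chain` and the actor labels are hypothetical names.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PromptVersion:
    index: int
    actor: str       # who made the change, e.g. "reflector:round-3" or a human id
    prompt: str      # the full controlling prompt after the change
    prev_hash: str   # digest of the previous version; "" for the first entry
    digest: str      # SHA-256 over index, actor, prompt and prev_hash


def _digest(index: int, actor: str, prompt: str, prev_hash: str) -> str:
    payload = json.dumps([index, actor, prompt, prev_hash], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_version(log: List[PromptVersion], actor: str, prompt: str) -> PromptVersion:
    """Append a new prompt version linked to the digest of the previous one."""
    prev_hash = log[-1].digest if log else ""
    index = len(log)
    version = PromptVersion(index, actor, prompt, prev_hash,
                            _digest(index, actor, prompt, prev_hash))
    log.append(version)
    return version


def verify_chain(log: List[PromptVersion]) -> bool:
    """Return True only if every entry is in order, linked, and unmodified."""
    prev = ""
    for i, v in enumerate(log):
        if v.index != i or v.prev_hash != prev:
            return False
        if v.digest != _digest(v.index, v.actor, v.prompt, v.prev_hash):
            return False
        prev = v.digest
    return True
```

The chain only proves that the recorded history has not been rewritten after the fact. Whether each edit should have been allowed is a separate question, and it belongs to whoever observes the reflector.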

Producer is not the same as guardian. The reflector that refines the skill is the producer. Nothing in the IDC harness audits the reflector. That separation, an independent observer of the self-improvement process, is the piece the substrate layer is missing.

§4 What this means if you are building agent infrastructure

Concretely: if you are building a runtime for agents that reflect, retry or self-edit, you should be capturing an improvement-dynamics record as a first-class artefact, not a debugging side-effect. Three columns: starting behaviour, the trajectory of each self-modification with the diff to the controlling prompt, and a periodic held-out evaluation that the agent has never tuned against. Lin et al. have shown the third one is where the lie hides. An agent that looks like it's getting better round over round can be getting better only at the rounds it can see.
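A minimal sketch of what such a record could look like, using only the standard library. The class and field names (`ImprovementDynamicsRecord`, `SelfModification`, `log_modification`) are ours, chosen for illustration; the three columns are the ones named above.

```python
import difflib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SelfModification:
    round_index: int
    prompt_diff: str                        # unified diff of the controlling prompt
    in_loop_score: float                    # score on the tasks the agent tunes against
    held_out_score: Optional[float] = None  # None if no held-out run was made this round


@dataclass
class ImprovementDynamicsRecord:
    agent_id: str
    cold_start_score: float                 # column 1: starting behaviour
    modifications: List[SelfModification] = field(default_factory=list)  # columns 2 and 3

    def log_modification(self, old_prompt: str, new_prompt: str,
                         in_loop_score: float,
                         held_out_score: Optional[float] = None) -> SelfModification:
        """Record one self-edit together with the diff that produced it."""
        diff = "\n".join(difflib.unified_diff(
            old_prompt.splitlines(), new_prompt.splitlines(),
            fromfile="skill_prompt.before", tofile="skill_prompt.after",
            lineterm=""))
        mod = SelfModification(len(self.modifications) + 1, diff,
                               in_loop_score, held_out_score)
        self.modifications.append(mod)
        return mod
```

Storing the diff rather than only the new prompt is a deliberate choice in this sketch: a reviewer reading the trail wants to see what changed between two scores, not reconstruct it.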

The held-out test is the cheap control and it is the one teams skip, because reflection loops are usually evaluated on the same distribution they train against within a session. Borrow OmniGameArena's discipline: every time the agent refines its skill, re-test on variants it has not touched, and log the gap. If the gap widens while the in-loop score climbs, your agent is overfitting to its own reflection and your audit trail should say so loudly.
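The check described here is small enough to sketch in a few lines. The function names and the `tolerance` parameter are hypothetical, and defining the gap as in-loop score minus held-out score is our assumption rather than a definition taken from the paper.

```python
from typing import List, Sequence


def transfer_gaps(in_loop: Sequence[float], held_out: Sequence[float]) -> List[float]:
    """Per-round gap between the in-loop score and the held-out score."""
    if len(in_loop) != len(held_out):
        raise ValueError("need one held-out score per in-loop score")
    return [a - b for a, b in zip(in_loop, held_out)]


def overfitting_rounds(in_loop: Sequence[float], held_out: Sequence[float],
                       tolerance: float = 0.0) -> List[int]:
    """Rounds where the in-loop score rose while the gap widened by more than tolerance."""
    gaps = transfer_gaps(in_loop, held_out)
    flagged = []
    for r in range(1, len(gaps)):
        climbed = in_loop[r] > in_loop[r - 1]
        widened = gaps[r] - gaps[r - 1] > tolerance
        if climbed and widened:
            flagged.append(r)
    return flagged
```

In practice `tolerance` would need to be set against the run-to-run noise of the evaluation, otherwise ordinary variance will trip the flag.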

§5 What the paper does not solve

OmniGameArena is a benchmark of games, and the authors are honest that it is. The reflector LLM, the bounded skill prompt and the held-out variants are all defined inside game environments with clean reward signals and unified action interfaces. Regulated workflows have none of that: the reward is ambiguous, the action space is open, and "held-out variant" is not a thing you can synthesise by tweaking a UE5 level. So the IDC machinery does not transfer as-is. What transfers is the observable, the insistence that you measure the trajectory and the transfer, not the snapshot.

The paper also does not audit its own reflector. IDC measures the outcome of self-improvement; it does not constrain or verify the reflector's edits to the skill prompt. For a game that's fine. For a regulated agent, an unsupervised LLM rewriting the controlling configuration of another LLM, with no independent check, is exactly the thing a governance regime is supposed to prevent. And IDC reports curves for only the four top agents under reflection, so the dynamics for the long tail of weaker agents, the ones more likely to overfit their reflection loop, are unmeasured. The principle is right. The coverage and the guardrail are the work that's left.

Methodology note. This Note takes OmniGameArena (arXiv:2606.09826) as a game benchmark whose real contribution is an audit observable in disguise. Triggers: substrate-relevant (the Improvement Dynamics Curve measures how self-reflecting agents change after first contact, the exact gap production governance ignores); non-duplicative (we read past the leaderboard to the IDC harness and its held-out variant test); regulated-buyer link (PRA SS1/23 and ICO accountability both assume you can describe and monitor behaviour over time, which a cold-start score cannot). Forthcoming: a Workloft pattern for logging improvement-dynamics records as first-class audit artefacts, with an independent reflector-edit observer.