Agents Need Environment Contracts, Not More Sandboxes

§1 The agent is not the unit of control

The useful provocation in Li, Jin, Men, Hao, Zhu and Wang’s survey is that the agent is no longer the whole story. The paper is not trying to win a model leaderboard. It treats the environment around the model as an engineered object: something that can be modelled, synthesised, evaluated, reused and changed. That is the part most production discussions still under-specify.

For regulated buyers, this matters more than the next agent demo. A bank using an agent to investigate complaints, a council using one to triage service requests, or a healthcare infrastructure provider using one to manage operational runbooks is not only buying a model. It is buying a defined world for that model to act inside. That world determines what the agent can observe, which tools it can call, which records it can amend, how success is measured, how failures are replayed, and which traces survive for audit.

The survey’s structure is helpful because it refuses to treat environments as incidental test harnesses. It studies an environment engineering lifecycle: environment modelling, automated synthesis, evaluation, and application. It also organises representative environments through eight attributes and eight domains, then looks at how environments and agents co-evolve. In plainer language, the paper says the room is part of the system. If the room changes, the agent changes, even when the model weights do not.

That is the substrate-level point. The compliance question is not simply whether a model is accurate in isolation. It is whether the buyer can prove what operating conditions produced a recommendation, an action, or a failed attempt. The object that needs a contract is the agentic environment.

§2 Sandboxes hide the thing auditors need to see

Most teams already have something they call a sandbox. It is usually a place where the agent can try tool calls without damaging production data. That is necessary, but it is too thin. A sandbox says where experimentation happens. It does not necessarily say what the world contains, how it is generated, what can change, who approved the change, or which behaviours count as failure.

An environment contract is stricter. It should specify the state schema the agent sees, the tool surface it can use, the permission boundaries on those tools, the data sources and fixtures available during a run, the reset behaviour after a run, the scoring or review criteria, the log retention rules, and the version identifier for every dependency that can alter behaviour. Without those elements, a buyer cannot reliably replay a disputed decision or compare two agent versions.
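As a minimal sketch of what such a contract could look like as a versioned record, the Python below maps the listed elements onto fields and derives a fingerprint from them. The class name, field names and example values are all hypothetical; the survey does not define a schema, and a real contract would need far more detail per field.

```python
from dataclasses import asdict, dataclass, replace
import hashlib
import json


@dataclass(frozen=True)
class EnvironmentContract:
    """Illustrative contract record; every field name is an assumption."""
    state_schema: str        # version of the state schema the agent observes
    tools: tuple             # names of the tools the agent may call
    permissions: dict        # tool name -> list of permitted operations
    fixtures: str            # version of data sources and fixtures for a run
    reset_policy: str        # how state is restored after a run
    scoring_policy: str      # version of the scoring or review criteria
    log_retention_days: int  # how long run traces are kept
    dependencies: dict       # dependency name -> pinned version


def fingerprint(contract: EnvironmentContract) -> str:
    """Hash a canonical JSON form so any change to the contract is detectable."""
    canonical = json.dumps(asdict(contract), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical example values for a complaints-handling environment.
baseline = EnvironmentContract(
    state_schema="complaints-v3",
    tools=("lookup_case", "draft_reply"),
    permissions={"lookup_case": ["read"], "draft_reply": ["write_draft"]},
    fixtures="fixtures-a",
    reset_policy="restore-snapshot",
    scoring_policy="review-rubric-v2",
    log_retention_days=365,
    dependencies={"case_api": "1.4.0"},
)

# Widening one permission produces a different fingerprint.
widened = replace(
    baseline,
    permissions={"lookup_case": ["read", "amend"], "draft_reply": ["write_draft"]},
)
```

Storing the fingerprint with every run gives a reviewer one value to compare when asking whether two runs happened in the same world. It shows that something changed, not what changed or whether the change was approved.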

This is where the paper’s modelling lens becomes practical. Environment modelling is not just academic classification. It is a way to name the variables that would otherwise sit inside notebooks, hidden prompts, tool wrappers and one-off evaluation scripts. In a regulated deployment, those variables become part of the control evidence. If the environment gives the agent access to a customer record API, a payments workflow, a document store or a case-management system, that access is not background context. It is part of the risk-bearing system.

For FCA-regulated firms, this links directly to model risk management. The material question is whether the firm can explain and control the conditions under which a model-driven process operates. For UK GDPR accountability, the point is similar: a controller cannot show appropriate governance over an automated or assisted process if the process depends on an undocumented environment. For public bodies, the same issue appears through records management, auditability, Freedom of Information Act (FOIA) obligations and operational resilience.

The wrong procurement question is: which agent performs best in our benchmark? The better question is: which agent-environment pair can we specify, freeze, test, monitor, replay and amend under change control?

§3 Synthesis is where governance can move out of sight

Li et al. split automated environment synthesis into two broad paradigms: symbolic synthesis and neural synthesis. That distinction is more important than it first appears. Symbolic synthesis uses explicit structures, rules, scripts, grammars or formalised generators. Neural synthesis uses learned systems, including LLMs, to generate environments, tasks or variations.

Symbolic synthesis has an obvious governance advantage. It is easier to inspect the generator, define constraints, reproduce a case, and explain why a task appeared. If a claims-handling simulation produces a particular customer scenario, the buyer can point to the rule set or generator configuration that produced it. That does not make the environment correct, but it gives reviewers something concrete to challenge.
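To make the reproducibility point concrete, here is a small sketch of a rule-based generator in Python. The rule set, function names and seed are invented for illustration and are not taken from the survey. The sketch enumerates every combination the rules allow, then draws a seeded sample, so the same rules and seed yield the same task set.

```python
import hashlib
import itertools
import json
import random

# Hypothetical rule set for a claims-handling simulation.
RULES = {
    "claim_type": ["theft", "water_damage", "travel_delay"],
    "channel": ["phone", "email", "web_form"],
    "customer_flag": ["none", "vulnerable", "repeat_claimant"],
}


def generate_tasks(rules: dict, seed: int, count: int) -> list:
    """Enumerate every rule combination, then draw a seeded sample."""
    keys = sorted(rules)
    combos = [
        dict(zip(keys, values))
        for values in itertools.product(*(rules[k] for k in keys))
    ]
    rng = random.Random(seed)  # private RNG, independent of global random state
    return rng.sample(combos, count)


def task_set_digest(tasks: list) -> str:
    """Digest of a task set, suitable for recording next to evaluation results."""
    payload = json.dumps(tasks, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


first = generate_tasks(RULES, seed=7, count=10)
second = generate_tasks(RULES, seed=7, count=10)
print(first == second)  # True
```

One caveat: Python does not guarantee that seeded sampling gives identical results across interpreter versions, so the interpreter version belongs in the pinned dependencies alongside the rules and the seed.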

Neural synthesis is attractive for the opposite reason. It can produce broader, messier and more human-like variation. That may expose an agent to cases that a hand-written generator would miss. It may help create training or evaluation conditions that are less brittle. But it also moves part of the design process into another model. The environment is no longer only the place where the agent acts. It may itself be generated by a probabilistic system whose outputs need provenance, filtering and review.

That shift matters because environment synthesis can change the evidence base. If a supplier says an agent passed a large generated evaluation set, the buyer needs to know how the set was produced. Was the task distribution fixed? Were difficult cases added after earlier failures? Did the generator have access to the agent’s previous trajectories? Were unsafe, trivial or duplicated cases filtered? Could the evaluation set be regenerated exactly?

The survey does not pretend there is one answer. Its value is to make the synthesis layer visible. Once visible, it can be governed. A regulated buyer should require separate identifiers for the agent version, environment version, task generator version, fixture data version and scoring policy. Otherwise, a reported improvement may simply mean the world became easier, narrower or differently sampled.
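A rough sketch of how those separate identifiers could be used: before treating a score change as an agent improvement, check whether anything other than the agent version moved. The class and function names below are hypothetical, and the version strings are placeholders.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalManifest:
    """Illustrative identifiers attached to one reported evaluation result."""
    agent_version: str
    environment_version: str
    task_generator_version: str
    fixture_data_version: str
    scoring_policy_version: str


def confounders(before: EvalManifest, after: EvalManifest) -> list:
    """Return the non-agent identifiers that differ between two evaluations."""
    return [
        f.name
        for f in fields(EvalManifest)
        if f.name != "agent_version"
        and getattr(before, f.name) != getattr(after, f.name)
    ]


old = EvalManifest("agent-1.0", "env-3", "gen-2", "fix-5", "score-1")
new = EvalManifest("agent-1.1", "env-3", "gen-3", "fix-5", "score-1")

print(confounders(old, new))  # ['task_generator_version']
```

An empty list means the two results were produced under the same recorded conditions. A non-empty list does not prove the improvement is spurious, only that it cannot be attributed to the agent alone.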

§4 Evaluation must include the room, not only the actor

The paper’s discussion of agent-environment co-evolution is the part that should make infrastructure teams pay attention. Li et al. describe agent evolution in dynamic environments through four complementary pathways: memory-centric experience evolution, orchestration-centric workflow evolution, trajectory-centric offline evolution, and exploration-centric online evolution. Put simply, agents can improve or change through remembered experience, workflow redesign, offline use of trajectories, or online exploration.

Each pathway complicates evaluation. A memory-centric agent may behave differently after prior exposure to similar cases. An orchestration-centric agent may improve because the sequence of tools around it has changed. A trajectory-centric agent may be trained or adapted from historical runs. An exploration-centric agent may learn from interaction while the environment changes in response. In all four cases, the environment is not a neutral background. It is part of how capability is produced.

That means evaluation needs two ledgers, not one. The first ledger records the agent: model, prompts, policies, memory settings, tool routing and guard conditions. The second records the environment: tasks, data, simulators, APIs, permissions, scoring rules, reset behaviour, observation limits and change history. If only the first ledger exists, audit is partial by design.
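The two-ledger idea can be sketched as a completeness check on a single run record. The field sets below simply restate the items listed above as identifiers; the record layout and function name are assumptions for illustration.

```python
# Ledger entries named in the text, written as hypothetical field names.
AGENT_LEDGER_FIELDS = {
    "model", "prompts", "policies",
    "memory_settings", "tool_routing", "guard_conditions",
}
ENVIRONMENT_LEDGER_FIELDS = {
    "tasks", "data", "simulators", "apis", "permissions", "scoring_rules",
    "reset_behaviour", "observation_limits", "change_history",
}


def audit_gaps(run_record: dict) -> dict:
    """List the entries each ledger is missing for one run record."""
    agent = run_record.get("agent", {})
    environment = run_record.get("environment", {})
    return {
        "agent": sorted(AGENT_LEDGER_FIELDS - set(agent)),
        "environment": sorted(ENVIRONMENT_LEDGER_FIELDS - set(environment)),
    }


# A record with a full agent ledger and no environment ledger at all.
agent_only = {"agent": {name: "recorded" for name in AGENT_LEDGER_FIELDS}}
gaps = audit_gaps(agent_only)

print(gaps["agent"])             # []
print(len(gaps["environment"]))  # 9
```

A record like `agent_only` is the "partial by design" case: everything about the actor is captured and nothing about the room.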

There is a common failure mode here. A team runs an agent evaluation in a rich internal test setting, then deploys into a production environment with different data freshness, different tool latency, different permissions and different error responses. The model has not changed, but the operating conditions have. The resulting failure is then attributed to the model when the cause is often environment drift.

The answer is not to freeze every environment forever. Production systems change. Regulations change. APIs change. Operational policies change. The answer is to treat environment change as first-class change. If a new tool is exposed to the agent, if a case-management schema changes, if a retrieval source is added, or if a task generator is updated, that should trigger evaluation and a recorded approval path. The paper gives a vocabulary for that discipline.
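Treating environment change as first-class change could start with something as simple as a diff between the approved snapshot and the proposed one. The sketch below assumes snapshots are flat dictionaries with hypothetical keys; a real implementation would also need to record who approved the change and which evaluation run covered it.

```python
def environment_changes(approved: dict, proposed: dict) -> dict:
    """Describe keys added, removed or modified between two snapshots."""
    approved_keys, proposed_keys = set(approved), set(proposed)
    return {
        "added": sorted(proposed_keys - approved_keys),
        "removed": sorted(approved_keys - proposed_keys),
        "modified": sorted(
            key for key in approved_keys & proposed_keys
            if approved[key] != proposed[key]
        ),
    }


def needs_reevaluation(changes: dict) -> bool:
    """Any difference at all is enough to trigger evaluation and approval."""
    return any(changes.values())


# Hypothetical snapshots: one new tool exposed, one retrieval source added.
approved = {
    "tools": ["lookup_case", "draft_reply"],
    "case_schema": "v3",
    "retrieval_sources": ["policy_docs"],
    "task_generator": "gen-2",
}
proposed = dict(
    approved,
    tools=["lookup_case", "draft_reply", "issue_refund"],
    retrieval_sources=["policy_docs", "complaint_archive"],
)

changes = environment_changes(approved, proposed)
print(changes)
# {'added': [], 'removed': [], 'modified': ['retrieval_sources', 'tools']}
print(needs_reevaluation(changes))  # True
```

The same comparison, run between the test environment and the production environment, gives a first indication of the drift described earlier.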

§5 Environment-as-a-Service is a procurement question

Li et al. identify Environment-as-a-Service as a future direction. It is easy to see why. Teams do not want to build every simulator, task generator, evaluation harness and multi-agent test setting themselves. A service that supplies maintained environments could accelerate agent development, especially where realistic interaction is expensive to construct.

For regulated buyers, however, Environment-as-a-Service should be treated as more than developer convenience. It is a supplier dependency that can alter assurance evidence. If a third party changes the environment distribution, scoring policy, simulator behaviour or tool mocks, the buyer’s prior evaluation may no longer mean what it claimed to mean. A vendor can change the room underneath the actor.

Procurement needs to catch that. Contracts should require versioned environment releases, change notices, exportable logs, reproducible evaluation sets where possible, documented synthesis methods, data boundary statements, retention controls, and a clear route to replay historical runs. If neural synthesis is used, the supplier should disclose the generator model class, prompt or policy controls where available, filtering process and review arrangements. If synthetic data is derived from real operational material, privacy review is not optional.

Multi-agent environments raise the bar again. Once several agents interact, the environment includes shared state, message channels, roles, incentives, collision rules and dispute handling. A failure may come from one agent, from another agent, or from the rules that allowed the two to affect the same object. That is not a model-only question. It is a systems governance question.

The buyer’s internal owner should therefore not be only the AI lead. Security, risk, legal, records management, data protection and service operations all have claims on the environment contract. The model is only one part of the controlled process.

§6 What the paper does not solve

This is a survey, not a standard. It does not provide a deployable environment contract schema. It does not tell a council, bank or hospital which environment attributes must be mandatory for a given use case. It does not resolve liability between a model provider, an environment provider and the buyer operating the agent. Nor does it give a simple way to compare symbolic and neural synthesis across every domain.

Those omissions are not flaws so much as a marker of where the field now is. The research community is beginning to name the environment as an engineered artefact. The practitioner community now has to turn that into boring controls: specifications, versioning, logging, test packs, approval paths and incident replay. Boring is the point. Boring is how agents become governable.

The paper also leaves a hard tension unresolved. Dynamic environments are useful because they can produce richer learning and evaluation. Regulated assurance, however, often requires stable evidence. The more an environment adapts, the more carefully its adaptation must be bounded and recorded. A learning environment without governance is not automatically advanced. It may simply be unrepeatable.

The best reading of Li et al. is therefore not that every organisation should rush to buy or build more elaborate agent worlds. It is that the environment has become part of the product surface and part of the audit surface. If you cannot describe the world your agent is acting in, you cannot credibly govern the agent.

Methodology note. This Note takes Li et al. (arXiv:2606.12191) as a substrate paper rather than a survey roundup. Triggers: substrate-relevant, because it treats environments as engineered runtime objects; non-duplicative, because it shifts attention from agent benchmarks to modelling, synthesis and evaluation conditions; regulated-buyer link, because FCA-regulated firms, councils and healthcare infrastructure buyers need replayable evidence for agent behaviour. The Workloft-side angle is the environment contract: a versioned control object for state, tools, task generation, scoring and logs. Forthcoming: a practical schema for environment change control in agent evaluations.