The Recovery Gap: Why GUI Agents Fail the Second Time

§1 The success rate hides the only number that matters

Almost every GUI agent paper leads with a headline pass rate: the agent completed N% of tasks on OSWorld. The implicit claim is that a higher number means a more deployable agent. For a regulated buyer (a UK Local Authority automating a benefits-eligibility workflow, say, or an FCA-regulated firm running an agent across an internal claims console), that number says almost nothing useful. It tells you what happens when the world behaves. It says nothing about what happens when it does not.

The work behind GUI-RobustEval and Robustness-driven Trajectory Synthesis (arXiv:2605.29447) makes the right move: it stops asking whether the agent can do the task once and starts asking whether it can recover when a step goes wrong. That reframing is the whole story. A pop-up appears that the agent did not expect. A button moves. A page loads slowly and the click lands on the wrong element. The interesting question is not whether the agent ever errs, because it always will. The question is what it does on the step after the error.

§2 All-Pass@4 is a robustness metric wearing a benchmark's clothes

The metric to watch here is All-Pass@4: the agent has to succeed on all four runs of the same task, not just one. This is a deceptively sharp instrument. A single-shot pass rate rewards an agent that gets lucky once. All-Pass@4 punishes the agent that succeeds when conditions are clean and falls apart when they are not. The gap between those two numbers is, in effect, a measurement of how brittle the agent is. For anyone building agent infrastructure for compliance-bound work, that gap is the figure you should be putting in front of your risk committee.
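To make the gap concrete, here is a minimal sketch of how the two numbers could be computed from repeated runs. It assumes each task was run exactly k times and recorded as pass/fail; the function name `pass_rates`, the key `brittleness_gap`, and the task names are illustrative, not taken from the paper, and the paper's exact scoring protocol may differ.

```python
from typing import Dict, List


def pass_rates(results: Dict[str, List[bool]], k: int = 4) -> Dict[str, float]:
    """Summarise k repeated runs per task.

    `results` maps a task id to the pass/fail outcome of each of its k runs.
    """
    if not results:
        raise ValueError("no tasks supplied")
    for task, runs in results.items():
        if len(runs) != k:
            raise ValueError(f"task {task!r} has {len(runs)} runs, expected {k}")

    n = len(results)
    # Average single-run pass rate across tasks (the usual headline number).
    mean_pass = sum(sum(runs) / k for runs in results.values()) / n
    # Fraction of tasks that passed on every one of the k runs.
    all_pass = sum(all(runs) for runs in results.values()) / n
    return {
        "mean_pass": mean_pass,
        "all_pass": all_pass,
        # Hypothetical name for the spread discussed in the text.
        "brittleness_gap": mean_pass - all_pass,
    }


# Made-up outcomes for four hypothetical tasks, four runs each.
example = {
    "update_address": [True, True, True, True],
    "submit_claim": [True, False, True, True],
    "close_case": [False, False, False, False],
    "upload_evidence": [True, True, False, True],
}
rates = pass_rates(example)
print(rates)
```

On this toy data the headline pass rate is 0.625 while the all-runs rate is 0.25: only one task in four is one the agent can be relied on to complete every time.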

Why does this matter at the substrate layer rather than the model layer? Because recovery is not a property the model has on its own. It is a property of the whole runtime: how errors are surfaced back to the agent, whether the agent can see that its last action did nothing, whether the environment gives it a clean signal of failure or a silent no-op. The paper's Robustness-driven Trajectory Synthesis works by generating training trajectories that deliberately include errors and recoveries, teaching the agent what a correction looks like. That is a model-side fix. But the deployment-side version of the same problem lands squarely on whoever runs the agent in production.

If your runtime cannot reliably tell the agent 'that click did nothing, the form did not submit', the agent cannot recover from a failure it cannot perceive. The synthesised recovery skill is wasted on a substrate that hides the error. So the paper's contribution is real, and the operational lesson sits one layer below where the paper operates.
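One way a runtime could surface a silent no-op is to compare what the agent can observe before and after each action. The sketch below is an assumption-laden illustration: `execute_with_feedback`, `ActionResult`, and the toy `FakeForm` environment are hypothetical names, and the observation is taken to be a string such as a serialised accessibility tree.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ActionResult:
    action: str
    state_changed: bool
    message: str  # text fed back to the agent on its next step


def _fingerprint(observation: str) -> str:
    # Hash the observation so two snapshots can be compared cheaply.
    return hashlib.sha256(observation.encode("utf-8")).hexdigest()


def execute_with_feedback(
    action: str,
    perform: Callable[[str], None],
    observe: Callable[[], str],
) -> ActionResult:
    """Run an action and report whether the observable state changed."""
    before = _fingerprint(observe())
    perform(action)
    after = _fingerprint(observe())
    changed = before != after
    message = (
        "state changed after action"
        if changed
        else "no observable change: the action may have been a silent no-op"
    )
    return ActionResult(action, changed, message)


class FakeForm:
    """Toy stand-in for a GUI form whose submit button may be disabled."""

    def __init__(self, submit_enabled: bool) -> None:
        self.submit_enabled = submit_enabled
        self.submitted = False

    def perform(self, action: str) -> None:
        if action == "click_submit" and self.submit_enabled:
            self.submitted = True

    def observe(self) -> str:
        return f"submitted={self.submitted}"


blocked = FakeForm(submit_enabled=False)
blocked_result = execute_with_feedback("click_submit", blocked.perform, blocked.observe)

working = FakeForm(submit_enabled=True)
working_result = execute_with_feedback("click_submit", working.perform, working.observe)

print(blocked_result.message)
print(working_result.message)
```

A raw hash comparison is crude: a clock or animation on screen would register as a change. A production runtime would likely compare only the parts of the state the action was meant to affect.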

§3 What recovery looks like under audit

For a regulated buyer the recovery question is not abstract. Under the ICO's guidance on AI and data protection, you need to be able to explain a decision and demonstrate the process that produced it. An agent that fails silently, retries blindly, or carries a corrupted state forward is an agent whose decision trail you cannot reconstruct. Under the PRA's SS1/23 on model risk management, you are expected to understand a model's behaviour across the conditions it will actually meet, including degraded ones. A success rate measured under clean conditions does not satisfy that.

The practical translation: every recovery is a state transition that has to be logged, and every failure-then-recovery sequence is a richer audit artefact than a clean pass. The agent that hit a pop-up, recognised the failed action, and corrected course produces an evidence trail that a regulator can read. The agent that got it right first time produces almost nothing to inspect. Counterintuitively, the recoverable agent is the more auditable one, because its corrections are visible.
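As a rough illustration of "every recovery is a state transition that has to be logged", the sketch below records each step as a structured entry and pulls out the failure-then-recovery spans an auditor would want to read. The `Transition` schema, the outcome labels, and the function names are assumptions made for this example; timestamps and actor identity are omitted to keep it short, and a real audit log would need them.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Transition:
    step: int
    action: str
    outcome: str  # one of "ok", "failed", "recovered" (illustrative labels)
    error_class: Optional[str] = None


def to_jsonl(log: List[Transition]) -> str:
    """Serialise the log as one JSON object per line."""
    return "\n".join(json.dumps(asdict(t), sort_keys=True) for t in log)


def recovery_sequences(log: List[Transition]) -> List[Tuple[int, int]]:
    """Return (failed_step, recovered_step) pairs found in the log.

    A failure with no later "recovered" entry is left out of the result;
    a fuller version would report those unresolved failures separately.
    """
    pairs: List[Tuple[int, int]] = []
    open_failure: Optional[int] = None
    for t in log:
        if t.outcome == "failed" and open_failure is None:
            open_failure = t.step
        elif t.outcome == "recovered" and open_failure is not None:
            pairs.append((open_failure, t.step))
            open_failure = None
    return pairs


# Made-up trace: the agent hits a pop-up, dismisses it, then resubmits.
trace = [
    Transition(1, "open_form", "ok"),
    Transition(2, "click_submit", "failed", "unexpected_popup"),
    Transition(3, "dismiss_popup", "ok"),
    Transition(4, "click_submit", "recovered"),
]

print(to_jsonl(trace))
print(recovery_sequences(trace))
```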

This is where the GUI-RobustEval framing earns its place in a procurement conversation. When you evaluate a vendor's GUI agent, the question is not 'what is your OSWorld score'. It is 'show me your All-Pass@N spread, and show me the trace from a run where the agent hit an error and recovered'. If the vendor cannot produce that trace, their agent either never errs (it does) or it cannot see when it errs (the dangerous case).

§4 The recovery skill has to live somewhere durable

There is a deeper structural point. Robustness-driven Trajectory Synthesis bakes recovery behaviour into the model weights. That makes the recovery skill implicit and non-inspectable: you cannot read the model and see what it will do when a particular error class appears. For a one-off agent that may be acceptable. For an agent doing repeated regulated work, you want recovery policies that are explicit, versioned, and auditable, not buried in a checkpoint you re-train every quarter.

This is the separation-of-concerns argument again. The model can be good at recognising and correcting errors. The policy that says 'on a failed financial-form submission, do not retry; halt and escalate to a human' should not live in the weights. It should live in the runtime, where it can be reviewed, changed, and shown to an auditor. The paper improves the model's instinct. The substrate has to supply the rules.
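A runtime-side policy of this kind can be as simple as a versioned lookup table that the runtime consults before allowing any retry. The sketch below is one possible shape, not a prescribed design: the class names, the error-class strings, the version label, and the choice to halt on any unknown error class are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Response(Enum):
    RETRY = "retry"
    HALT_AND_ESCALATE = "halt_and_escalate"


@dataclass(frozen=True)
class Rule:
    response: Response
    max_retries: int = 0


@dataclass(frozen=True)
class RecoveryPolicy:
    version: str  # reviewed and changed like any other controlled document
    rules: Dict[str, Rule]

    def decide(self, error_class: str, attempts_so_far: int) -> Response:
        """Return what the runtime permits for this error at this attempt."""
        rule = self.rules.get(error_class)
        # Assumption: unknown error classes fail closed and go to a human.
        if rule is None:
            return Response.HALT_AND_ESCALATE
        if rule.response is Response.RETRY and attempts_so_far < rule.max_retries:
            return Response.RETRY
        return Response.HALT_AND_ESCALATE


# Hypothetical policy; the error-class names are invented for this example.
POLICY = RecoveryPolicy(
    version="1.0.0",
    rules={
        "unexpected_popup": Rule(Response.RETRY, max_retries=2),
        "failed_financial_form_submission": Rule(Response.HALT_AND_ESCALATE),
    },
)

print(POLICY.decide("unexpected_popup", attempts_so_far=0))
print(POLICY.decide("failed_financial_form_submission", attempts_so_far=0))
```

The model still decides how to recover; the table only decides whether it is allowed to try. Because the table is plain data with a version string, it can be diffed, reviewed, and shown to an auditor without reference to any checkpoint.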

§5 What the paper does not solve

GUI-RobustEval measures recovery; it does not tell you which recoveries are safe to attempt autonomously. In a regulated workflow, some errors should trigger a retry and others should trigger a hard stop. A benefits agent that recovers gracefully from a mistyped postcode is welcome. A benefits agent that recovers gracefully from a failed identity check, by trying again until it gets through, is a control failure. The paper's metric treats all recoveries as good. Production cannot.

The work also operates on OSWorld and synthetic trajectories, not on the bespoke internal consoles most regulated buyers actually run. The error distribution on a council's legacy case-management system is nothing like that of the open desktop environments the paper evaluates on. And because the work is model-side, it gives no account of how recovery interacts with logging, rollback, or human handover, which is exactly the part a deployer has to build. The contribution is genuine and the direction is right. The robustness number is one most vendors are not yet reporting, and buyers should start asking for it. But a recoverable agent is not a safe agent. It is a more honest starting point.

Methodology note. This Note takes GUI-RobustEval and Robustness-driven Trajectory Synthesis (arXiv:2605.29447) as a robustness story dressed as a benchmark paper. Triggers: substrate-relevant (error recovery is a runtime property, not just a model one); non-duplicative (we have not covered GUI-agent recovery before, and most coverage fixates on pass rates); regulated-buyer link (UK councils and FCA-regulated firms automating internal consoles, where recovery behaviour is the audit artefact under SS1/23 and ICO AI guidance). Forthcoming: a follow-up on encoding explicit, versioned recovery-and-escalation policies in the runtime rather than in model weights, with a worked council-workflow example.