Visual agents need skill packages, not longer prompts

§1 The interesting move is packaging, not perception

Most visual-agent papers are read as perception papers. Can the model understand the screenshot? Can it find the button? Can it follow the instruction? Those questions are useful, but they are not what regulated buyers should be watching most closely.

arXiv:2605.13527 is more interesting because it treats procedural knowledge as an external object. The paper proposes multimodal procedural knowledge frameworks in which a visual agent can use reusable skill packages. Those packages combine text, state cards and visual keyframes. A trajectory-to-skill generator turns prior demonstrations into structured skills. A branch-loaded multimodal skill agent then decides, at runtime, which part of the skill to use under the current visual state.

That sounds like an agent capability story. It is really an infrastructure story. The important shift is from skill as something latent inside the model, or smeared across a long prompt, to skill as a package that can be named, stored, selected, inspected and replaced.

For a local authority, FCA-regulated firm, healthcare supplier or education infrastructure provider, that distinction matters. A model that has vaguely learned how to use a claims portal is hard to govern. A package called submit-supporting-evidence-v3, containing a task description, state conditions and visual keyframes for the relevant screens, is not automatically safe, but it is at least a thing that governance can attach to. It can have an owner, a version, an approval record, a test suite, a revocation path and an audit trail.

This is the substrate lesson from the paper: production visual agents will not be governed by asking whether the base model is generally good at GUIs. They will be governed by the quality of the procedural artefacts they are allowed to load.

§2 Prompts are the wrong unit of reuse

The current habit in agent design is to make reuse look like prompt reuse. Put more instructions in the system prompt. Store previous examples in memory. Retrieve a relevant plan. Ask the model to imitate the successful run. That can work in demos, but it gives buyers a poor control surface.

A prompt is usually too loose to be an operational artefact. It mixes policy, task description, examples, fallback rules, tone and tool hints. When something goes wrong, the reviewer has to ask whether the fault sits in the instruction, the model, the retrieved context, the tool output, the screen state or the hidden interaction between all of them.

The paper's proposed structure separates some of that muddle. Text expresses the procedure. State cards describe conditions under which parts of the skill apply. Visual keyframes anchor the skill to what the agent should see. The branch-loaded agent does not merely read a recipe; it conditions on the current screen and chooses the relevant branch.

That is a better unit of reuse because visual work is not linear. A human doing the same back-office task knows that a portal may open on the dashboard, a warning modal, a locked record, a search result or a partially completed form. A useful agent skill must encode those variations. If the only representation is a paragraph saying "click submit after uploading the file", the procedure fails as soon as the page is not exactly where the prompt expected it to be.

State-conditioned skill packages make the branching explicit. That does not remove the need for model judgement, but it gives the runtime something more concrete than memory. It also gives auditors a way to ask better questions. Which state card matched? Which keyframe was used as evidence? Which branch fired? Was the selected branch within the approved package version? Those questions are difficult to answer when the whole skill is an unstructured prompt fragment.
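
The audit questions above suggest what a branch decision record might contain. The following is a minimal Python sketch, not the paper's implementation: `StateCard`, `Branch`, `BranchDecision`, `select_branch` and every field name are hypothetical, and state matching is reduced to a dictionary comparison where a real agent would condition on the screenshot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StateCard:
    """Hypothetical declarative condition under which a branch applies."""
    card_id: str
    required: dict  # observed-state keys and the values they must hold


@dataclass(frozen=True)
class Branch:
    branch_id: str
    state_card: StateCard
    keyframe_id: str  # reference to the visual keyframe backing this branch
    steps: tuple


@dataclass(frozen=True)
class BranchDecision:
    """What a reviewer might want logged for each runtime choice."""
    package_id: str
    package_version: str
    branch_id: Optional[str]
    state_card_id: Optional[str]
    keyframe_id: Optional[str]


def select_branch(package_id, package_version, branches, observed_state):
    """Return the first branch whose state card matches, plus a decision record."""
    for branch in branches:
        card = branch.state_card
        if all(observed_state.get(k) == v for k, v in card.required.items()):
            return branch, BranchDecision(
                package_id, package_version,
                branch.branch_id, card.card_id, branch.keyframe_id,
            )
    # No card matched: record that explicitly instead of guessing a branch.
    return None, BranchDecision(package_id, package_version, None, None, None)


# Illustrative data only; identifiers are invented for this example.
branches = [
    Branch("dismiss-warning", StateCard("sc-modal", {"modal": "warning"}),
           "kf-02", ("click_close",)),
    Branch("upload-file", StateCard("sc-form", {"page": "evidence_form", "modal": None}),
           "kf-05", ("attach", "submit")),
]
branch, decision = select_branch(
    "submit-supporting-evidence", "v3", branches,
    {"page": "evidence_form", "modal": None},
)
print(decision)
```

The point of the design is that the "no match" case produces a record too, so a reviewer can see when the runtime declined to act as well as when it acted.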

The paper is not a compliance paper, and it does not claim to solve auditability for regulated deployment. But its representation points towards the missing middle layer between raw model capability and enterprise workflow control.

§3 The regulated-buyer version is a skill registry

If this idea is taken seriously, the next production object is not just the visual agent. It is the skill registry around the visual agent.

A regulated deployment should treat each reusable visual skill as a controlled artefact. The registry should record where the skill came from, which trajectories generated it, which model or generator produced the package, who reviewed it, which systems it may operate in, which data classes may appear in its keyframes, and which user roles may call it. It should also record negative scope: screens it must not act on, transaction types it must not complete, and states that require human review.
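
As a rough illustration of the record described above, a registry entry could be modelled as follows. This is a sketch under assumptions: `RegistryEntry`, `check_call`, the field names and the three-way `deny` / `review` / `allow` result are all hypothetical and are not drawn from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class RegistryEntry:
    """Hypothetical registry record for one reusable visual skill."""
    skill_id: str
    version: str
    source_trajectories: list   # identifiers of the demonstrations used
    generator: str              # model or generator that produced the package
    reviewer: str
    allowed_systems: set
    allowed_data_classes: set   # data classes permitted in keyframes
    allowed_roles: set
    # Negative scope: things the skill must never do, or must escalate.
    forbidden_screens: set = field(default_factory=set)
    forbidden_transactions: set = field(default_factory=set)
    human_review_states: set = field(default_factory=set)


def check_call(entry, role, system, screen, transaction, state):
    """Return 'deny', 'review' or 'allow' for a proposed invocation."""
    if role not in entry.allowed_roles or system not in entry.allowed_systems:
        return "deny"
    if screen in entry.forbidden_screens or transaction in entry.forbidden_transactions:
        return "deny"
    if state in entry.human_review_states:
        return "review"
    return "allow"


# Illustrative values only.
entry = RegistryEntry(
    skill_id="submit-supporting-evidence",
    version="v3",
    source_trajectories=["traj-001", "traj-002"],
    generator="trajectory-to-skill-generator",
    reviewer="ops-reviewer",
    allowed_systems={"claims-portal"},
    allowed_data_classes={"synthetic"},
    allowed_roles={"claims-handler"},
    forbidden_transactions={"payment_release"},
    human_review_states={"locked_record"},
)
print(check_call(entry, "claims-handler", "claims-portal",
                 "evidence_form", "upload", "open_record"))
```

Negative scope is checked before the human-review states, so a forbidden transaction is denied outright and never reaches a reviewer as something to approve.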

That sounds bureaucratic only if we pretend visual agents are toys. In production, they will touch claims systems, CRM records, case files, education management systems, procurement portals and clinical administration tools. The difference between "read a record" and "submit a change" is not a UX nuance. It is a permissions boundary.

The paper's skill-package design gives substrate builders a cleaner place to enforce that boundary. A runtime can refuse to load an unsigned package. It can restrict packages to particular applications. It can require a human approval gate before branches that submit, delete, disclose or transfer. It can log the package identifier and branch decision alongside the screenshot hash and tool action. It can quarantine packages whose keyframes no longer match the live UI after a vendor update.
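
A minimal sketch of such load-time and action-time checks is given below. It is illustrative only: the paper does not define a signing scheme, so HMAC-SHA256 from the Python standard library stands in for whatever signature mechanism a deployment would actually use (most likely asymmetric), and `sign_package`, `load_package`, `authorise_action` and the manifest keys are hypothetical names.

```python
import hashlib
import hmac
import json

# Action categories assumed to need a human approval gate.
HIGH_IMPACT = {"submit", "delete", "disclose", "transfer"}


def sign_package(manifest: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the manifest."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def load_package(manifest, signature, key, application):
    """Refuse packages that are unsigned, tampered with, or out of scope."""
    if not signature:
        raise PermissionError("unsigned package")
    if not hmac.compare_digest(sign_package(manifest, key), signature):
        raise PermissionError("signature mismatch")
    if application not in manifest["allowed_applications"]:
        raise PermissionError("package not approved for this application")
    return manifest


def authorise_action(manifest, branch_id, action, screenshot: bytes,
                     human_approved=False):
    """Gate high-impact actions and return a log record for the decision."""
    allowed = action not in HIGH_IMPACT or human_approved
    return {
        "package_id": manifest["package_id"],
        "branch_id": branch_id,
        "action": action,
        "screenshot_sha256": hashlib.sha256(screenshot).hexdigest(),
        "allowed": allowed,
    }
```

In this sketch the log record is produced whether or not the action is allowed, so refusals leave the same trail as completed actions.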

For FCA-regulated firms, this maps directly onto model risk management and operational resilience concerns. If an AI assistant is completing regulated servicing tasks through a GUI, the firm needs more than a transcript. It needs evidence of controlled operation: what capability was invoked, why that capability was considered applicable, and how exceptions were handled.

For public bodies, the same pattern matters under UK GDPR accountability, ICO explainability expectations and FOIA-sensitive record keeping. If an agent changes a benefits case, updates a housing record or routes a safeguarding referral, the authority needs to know which procedural object drove the action. It should not have to reverse-engineer the answer from a screenshot video and a model trace.

The paper's multimodal package format is not yet that governance layer. But it is compatible with it in a way that ordinary prompt reuse is not. A package can be signed. A package can be diffed. A package can be deprecated. A package can be denied at runtime.

§4 Keyframes create evidence, and risk

The visual keyframe part of the paper is especially important. Text-only procedures are brittle because many GUI decisions are visual. A button's label, a banner, a disabled field, a selected tab or a warning icon can change what the agent should do. Keyframes give the agent grounded visual references rather than relying on language alone.

That grounding is also a data-governance problem. Keyframes may contain personal data, commercially sensitive layouts, supplier names, internal workflow labels or security-relevant interface details. A skill generated from trajectories may accidentally preserve information that should never leave a controlled environment. If the package is shared across teams or tenants, the risk grows.

Substrate builders therefore need redaction and minimisation at the package layer, not only at the conversation layer. A keyframe should be treated as evidence and as data. It needs provenance, retention rules and access control. If it contains a real customer record from a demonstration, that is not harmless training residue. It is part of the operational artefact.

There is also a drift problem. Visual agents depend on screens that change. A portal redesign, A/B test, new cookie banner, revised warning label or accessibility update can make an approved keyframe stale. A runtime that loads visual skill packages should be able to detect low visual confidence and move to a safe state, rather than forcing a match because the package exists.

This is where the paper's state cards and keyframes should be read together. The state card is the declarative claim about when a branch applies. The keyframe is supporting visual evidence. A governed runtime should require both to be sufficiently satisfied before allowing high-impact actions.
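
The "both must be satisfied" rule can be sketched in a few lines. This assumes the runtime exposes a match score in [0, 1] for the state card and for the keyframe; the function name, the scores and the threshold values are illustrative placeholders, not figures from the paper.

```python
def gate_high_impact(state_card_score, keyframe_score,
                     state_threshold=0.9, keyframe_threshold=0.8):
    """Allow a high-impact branch only when both signals clear their thresholds.

    Scores are assumed to lie in [0, 1]; the thresholds are illustrative.
    """
    if state_card_score >= state_threshold and keyframe_score >= keyframe_threshold:
        return "proceed"
    # Either signal alone is insufficient: fall back to a safe state
    # (for example, pause and hand off) rather than forcing a match.
    return "safe_state"
```

A stale keyframe after a vendor update would show up here as a low `keyframe_score`, which sends the run to the safe state even if the state card still appears to match.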

That is a different mental model from autonomous web browsing. It is closer to controlled procedure execution under uncertain perception.

§5 What the paper does not solve

The paper gives a useful representation for external reusable skills, but it does not give regulated buyers a finished control plane.

Judging from the public summary, it does not establish the audit schema needed for procurement or assurance. It does not define how a skill package should be signed, how approvals should be recorded, how package versions should be retired, or how runtime branch choices should be exposed to a compliance reviewer. Those are not minor implementation details. They are the difference between a clever agent and a deployable one.

It also does not remove the need for policy separation. The same component that generates a skill from a trajectory should not be the only component allowed to approve that skill for production use. Generation, review, permissioning and execution should be separate concerns. Otherwise an error in the demonstration path can become an authorised operational habit.
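
One way to make that separation concrete is to refuse approval when the approver is the component that generated the package. The sketch below is a simplification under assumptions: the `status` lifecycle, the `generated_by` field and `approve_for_production` are hypothetical, and a real control would sit in an identity and access layer, not in a single function.

```python
def approve_for_production(package: dict, approver: str) -> dict:
    """Return an approved copy, refusing self-approval by the generator."""
    if package.get("status") != "reviewed":
        raise ValueError("package must be reviewed before approval")
    if approver == package["generated_by"]:
        raise PermissionError("generator may not approve its own package")
    # Return a new dict so the reviewed record is left unchanged.
    return {**package, "status": "approved", "approved_by": approver}
```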

Nor does the approach make visual grounding safe by itself. A keyframe can be misleading. A state card can be underspecified. A branch can be correct for one tenant and wrong for another. A GUI can expose hidden consequences behind an ordinary-looking button. Visual resemblance is not authorisation.

The paper should therefore be read as a substrate prompt, not a deployment recipe. It shows that agent skills can be lifted out of model memory and represented as multimodal packages. The next question for builders is whether those packages can be governed like operational software: registered, reviewed, scoped, monitored and withdrawn.

That is the point Workloft Labs will keep pressing. The future of visual agents in regulated environments is not simply better screenshot reasoning. It is the boring machinery around reusable skills: registries, policies, attestations, drift checks, event logs and refusal paths. Without that machinery, skill reuse becomes another way to make opaque automation travel faster than the controls around it.

Methodology note. This Note takes arXiv:2605.13527, a paper on multimodal procedural knowledge for visual agents, as a substrate signal rather than a benchmark story. Triggers: substrate-relevant, because reusable text, state-card and keyframe packages are governable runtime artefacts; non-duplicative, because the angle is skill-package control rather than GUI performance; regulated-buyer link, because FCA firms, local authorities, healthcare suppliers and education platforms need evidence for agent actions in real systems. Forthcoming: a Workloft-side pattern for signed visual-skill registries and runtime branch logging.