§1 The interface is the thing, not the model
Most of the conversation about agentic spatial reasoning is about the model. Can the VLM see depth? Can it count objects? Cho et al. (arXiv:2606.13673) make a quieter and more useful claim: the bottleneck is not the model's perception but the interface through which the agent invokes its tools. Change the interface and average accuracy rises by 11.2 points across 20 benchmarks, with no retraining, on the same six backbones from two model families.
That is a substrate finding dressed up as a benchmark result. SpatialClaw does not add a smarter VLM or a new perception module. It changes how the agent is allowed to call the modules it already has. The lesson generalises well beyond 3D geometry, and it matters for anyone running agents under audit obligations.
§2 Single-pass code versus structured tool calls, and why neither was working
The paper sets up two existing designs as the things it is beating. The first is single-pass code execution: the agent writes one block of code up front that encodes its whole analysis strategy, then runs it. The problem is obvious once stated. The agent commits to a plan before it has seen any intermediate result. If the first perception call returns something surprising, there is no room to adapt; the plan is already written.
The second is the structured tool-call interface, the JSON-schema function-calling pattern that has become the default in production agent frameworks. Here the agent picks one named tool per turn, fills in typed arguments, gets a typed result, repeats. This is the design most regulated buyers are actually deploying today, because it is the one the major vendors ship and the one that produces clean, parseable logs. Cho et al. find it too rigid for open-ended composition: you cannot freely combine operations, hold intermediate state, or tailor the analysis to a task that the schema designer did not anticipate.
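To make the pattern concrete, here is a minimal sketch of a structured tool-call dispatcher. Everything named in it is an assumption for illustration: the `TOOLS` registry, the `dispatch` function, and the stub `detect_objects` are hypothetical and come from neither the paper nor any vendor's API. The argument is called `cls` because `class` is a reserved word in Python.

```python
import json
from typing import Any, Callable

def detect_objects(frame: int, cls: str) -> list[dict[str, Any]]:
    """Hypothetical stub standing in for a real perception module."""
    return [{"frame": frame, "class": cls, "box": [10, 20, 50, 60]}]

# The schema is the contract: tool name -> (callable, argument types).
TOOLS: dict[str, tuple[Callable[..., Any], dict[str, type]]] = {
    "detect_objects": (detect_objects, {"frame": int, "cls": str}),
}

def dispatch(call_json: str) -> dict[str, Any]:
    """Validate one tool call against its schema, run it, return a log record."""
    call = json.loads(call_json)
    if call["tool"] not in TOOLS:
        raise KeyError(f"unknown tool: {call['tool']}")
    fn, schema = TOOLS[call["tool"]]
    args = call["args"]
    if set(args) != set(schema):
        raise ValueError(f"expected arguments {sorted(schema)}, got {sorted(args)}")
    for name, expected in schema.items():
        if not isinstance(args[name], expected):
            raise TypeError(f"{name} must be {expected.__name__}")
    # The returned record is both the agent's observation and the audit entry.
    return {"tool": call["tool"], "args": args, "result": fn(**args)}

record = dispatch('{"tool": "detect_objects", "args": {"frame": 3, "cls": "vehicle"}}')
print(json.dumps(record))
```

The rigidity is visible in the validation step: anything the registry does not name, or any argument of the wrong type, is rejected before it runs. That is the same property that stops the agent composing operations the schema designer did not foresee.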
SpatialClaw splits the difference and lands somewhere more interesting. It keeps a stateful Python kernel pre-loaded with the input frames and a suite of perception and geometry primitives. The agent writes one executable cell per step, conditioned on every prior output, text and visual. It is iterative like a tool-call loop, but the unit of action is arbitrary code, not a fixed function signature. The agent can compose primitives in ways nobody specified in advance.
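The loop described above can be sketched in a few lines. This is not SpatialClaw's implementation: the `Kernel` class and the `detect` primitive are hypothetical, the two cells are hard-coded where a real agent would have the model write them, visual outputs are left out, and `exec` runs without any sandbox.

```python
import contextlib
import io

def detect(frame):
    """Hypothetical perception primitive: returns (x0, y0, x1, y1) boxes."""
    return [(0, 0, 4, 2), (1, 1, 9, 3)]

class Kernel:
    """A persistent namespace that executes one code cell per step."""

    def __init__(self, preload):
        self.ns = dict(preload)  # frames and primitives, loaded once

    def run_cell(self, code):
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, self.ns)  # unsandboxed: illustration only
        return buf.getvalue()    # text output fed back to the agent

kernel = Kernel({"frames": ["frame-0", "frame-1"], "detect": detect})

# Step 1: the agent looks at what detection returns.
out1 = kernel.run_cell("boxes = detect(frames[0])\nprint(len(boxes))")

# Step 2: written after seeing out1, reusing `boxes` left behind by step 1.
out2 = kernel.run_cell("widths = [b[2] - b[0] for b in boxes]\nprint(max(widths))")

print(out1, out2)
```

The second cell is an operation no schema listed in advance, and it works only because `boxes` survived from the first. That persistence is the composability the paper credits, and it is also the state an auditor would later need to reconstruct.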
§3 The bit the regulated buyer should not skim past
Here is the uncomfortable consequence. The structured tool-call interface is rigid for a reason, and that reason is auditability. When an agent emits detect_objects(frame=3, class="vehicle") and gets back a typed result, your audit log has a clean record of intent and outcome. You can replay it, you can diff it, you can show the ICO exactly what the system asked for and what it received. The schema is the contract, and the contract is the evidence.
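Replaying such a log is mechanically simple, which is the point. The sketch below assumes a JSON-lines log with `tool`, `args` and `result` fields and a deterministic stub tool. The log format, the `replay` function and the stub are all hypothetical, and a non-deterministic tool would need recorded seeds or tolerances instead of exact equality.

```python
import json

def detect_objects(frame, cls):
    """Hypothetical deterministic stub for a perception tool."""
    return [{"frame": frame, "class": cls}]

TOOLS = {"detect_objects": detect_objects}

def replay(log_lines):
    """Re-run each logged call; return the indices whose result no longer matches."""
    mismatches = []
    for i, line in enumerate(log_lines):
        rec = json.loads(line)
        fresh = TOOLS[rec["tool"]](**rec["args"])
        if fresh != rec["result"]:
            mismatches.append(i)
    return mismatches

good_log = [
    json.dumps({"tool": "detect_objects",
                "args": {"frame": 3, "cls": "vehicle"},
                "result": [{"frame": 3, "class": "vehicle"}]}),
]
# Same call, but the recorded result has been altered after the fact.
tampered_log = [
    json.dumps({"tool": "detect_objects",
                "args": {"frame": 3, "cls": "vehicle"},
                "result": [{"frame": 3, "class": "pedestrian"}]}),
]

print(replay(good_log), replay(tampered_log))
```

Each record is self-contained: the call can be re-run from the log line alone, with no reference to anything that happened earlier in the session.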
Code-as-action throws that away. When the agent writes an arbitrary Python cell against a stateful kernel, the audit question changes from "which tool did it call" to "what did this code do, and what state did it mutate." Those are not the same question and they are not equally answerable. A stateful kernel means the result of step 7 depends on hidden variables set in step 3. The flexibility that buys +11.2 points is the same flexibility that makes the decision trail harder to reconstruct after the fact.
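A toy example shows why logging the cell text alone is not enough. The cells and variable names here are invented for illustration; the only claim is about how a shared namespace behaves.

```python
def run(cells):
    """Execute cells in order against one shared namespace; return `answer`."""
    ns = {}
    for cell in cells:
        exec(cell, ns)
    return ns["answer"]

final_cell = "answer = count * scale"

# Two sessions that end with the identical final cell.
session_a = ["count = 3", "scale = 2", final_cell]
session_b = ["count = 3", "scale = 2", "scale = scale * 10", final_cell]

result_a = run(session_a)
result_b = run(session_b)
print(result_a, result_b)
```

The final cell is byte-for-byte the same in both sessions, yet the answers differ because an earlier cell rewrote `scale`. An audit record that captures only the last action cannot distinguish the two.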
For a PRA-regulated firm under SS1/23, the model risk management expectation is that you can explain and reproduce a decision. For a Local Authority making an automated determination, ICO guidance on explainability (the AI guidance, §11 onwards) expects you to articulate what the system did in terms a data subject can challenge. "It wrote some Python and the kernel state evolved" is not an explanation you want to defend at appeal. The SpatialClaw gains are real; the point is that the interface choice that produces them hands the audit layer work that the structured interface was quietly doing for you.
This is not an argument against code-as-action. It is an argument that the action interface is the audit surface, and most teams pick the interface for capability and discover the audit consequences in production. If you adopt a code-execution interface, you have to instrument the kernel: log every cell, snapshot state between steps, capture the full execution trace including what got read and written, not just what the agent intended. That is a substrate build, and it is not optional for a regulated deployment. The paper gives you the capability case. It does not give you the instrumentation, because that was never its job.
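As a rough sketch of what that instrumentation might look like, the wrapper below logs every cell with its output, any error, a digest of kernel state before and after, and the names read and written. The `AuditedKernel` class and its record fields are hypothetical. The assumptions are deliberately loud: state is snapshotted via `repr`, which is lossy for large or opaque objects; reads are approximated statically from the cell's syntax tree, so attribute access and dynamic lookups are missed; and nothing here sandboxes the code.

```python
import ast
import contextlib
import hashlib
import io
import json
import traceback

def snapshot(ns):
    """repr-based view of user variables (lossy; an assumption of this sketch)."""
    return {k: repr(v) for k, v in ns.items() if not k.startswith("__")}

def digest(snap):
    return hashlib.sha256(json.dumps(snap, sort_keys=True).encode("utf-8")).hexdigest()

class AuditedKernel:
    def __init__(self, preload=None):
        self.ns = dict(preload or {})
        self.log = []

    def run_cell(self, code):
        before = snapshot(self.ns)
        buf, error, reads = io.StringIO(), None, []
        try:
            tree = ast.parse(code)
            # Static approximation: existing names the cell loads.
            reads = sorted({n.id for n in ast.walk(tree)
                            if isinstance(n, ast.Name)
                            and isinstance(n.ctx, ast.Load)
                            and n.id in before})
            with contextlib.redirect_stdout(buf):
                exec(compile(tree, "<cell>", "exec"), self.ns)
        except Exception:
            error = traceback.format_exc()
        after = snapshot(self.ns)
        record = {
            "step": len(self.log),
            "code": code,
            "stdout": buf.getvalue(),
            "error": error,
            "reads": reads,
            "writes": sorted(k for k in after if before.get(k) != after[k]),
            "deleted": sorted(k for k in before if k not in after),
            "state_before": digest(before),
            "state_after": digest(after),
        }
        self.log.append(record)
        return record

kernel = AuditedKernel({"frames": [1, 2, 3]})
r1 = kernel.run_cell("n = len(frames)")
r2 = kernel.run_cell("n = n + 1\nprint(n)")
r3 = kernel.run_cell("1 / 0")
print(json.dumps(kernel.log, indent=2))
```

Chaining `state_after` of one step to `state_before` of the next gives a cheap integrity check on the trail: a gap means state changed outside a logged cell. A production version would need real state serialisation and isolation, which this sketch does not attempt.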
§4 Why training-free matters more than the headline number
The detail worth holding onto is that SpatialClaw is training-free and works across six backbones without model-specific adaptation. That tells you the gain lives in the harness, not the weights. For a buyer, that is the good news and the warning in one. Good news: you can improve agent capability by changing the runtime, not by retraining a model you cannot retrain anyway. Warning: capability improvements that live in the harness are exactly the ones your existing model-governance process will miss, because nothing about the model changed. Your model card is identical. Your risk assessment, if it is pinned to the model, never fires.
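One way to make a harness change visible to governance is to fingerprint the deployment rather than the model. The sketch below is an assumption about how a team might do that, not something the paper proposes; the function name, the configuration keys and the model identifier are all hypothetical.

```python
import hashlib
import json

def deployment_fingerprint(model_id, harness):
    """Hash the model identifier together with the harness configuration."""
    payload = json.dumps({"model": model_id, "harness": harness}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

MODEL = "vlm-backbone-v1"  # hypothetical identifier; unchanged throughout

before = deployment_fingerprint(
    MODEL, {"interface": "structured_tool_call", "primitives": ["detect_objects"]}
)
after = deployment_fingerprint(
    MODEL, {"interface": "stateful_code_kernel", "primitives": ["detect_objects"]}
)

# A review pinned to MODEL alone sees no change; one pinned to the
# fingerprint is triggered by the interface swap.
needs_review = before != after
print(needs_review)
```

The design choice is simply to treat the action interface as part of the governed artefact, so that swapping it re-opens the risk assessment even when the weights are untouched.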
The substrate keeps eating the model's lunch. The thing that changed the outcome here is not on any leaderboard you are tracking. It is the shape of the loop the agent runs inside. If your governance only inspects the model, you are auditing the wrong layer.
§5 What the paper does not solve
SpatialClaw is evaluated on accuracy across 20 spatial benchmarks. It is not evaluated on, and makes no claims about, the reproducibility, logging, or explainability of the code-as-action trace. The paper does not address what happens when the agent writes code that errors, loops, or mutates kernel state in ways that corrupt later steps; the stateful kernel is presented as an asset, and its failure modes are out of scope. There is no treatment of sandboxing or the security surface of executing model-generated Python in production, which is a live concern the moment this leaves a benchmark harness. And it says nothing about the cost of the per-step VLM calls, which matters for anyone pricing a deployment. The capability case is well made. The substrate case (runtime safety, audit, reproducibility) is left entirely to whoever ships it.
