The Verifier Is the Environment

A new survey on building environments for AI agents (arXiv:2606.12191) names two ways to generate training tasks automatically, symbolic and neural, and says each needs its own evaluation. It quantifies neither. So we reproduced the core on one cleanly-checkable task and measured it. The result that surprised us: neither method can tell you how hard the tasks it makes actually are. The thing that fixes that is not a smarter generator. It is a cheap verifier sitting inside the loop.

What we did

We picked the Countdown numbers puzzle: four numbers (1 to 9) and a target, solvable by combining a subset with + - * / to hit the target. It is small enough that a brute-force solver gives ground truth in milliseconds, so we always know if a generated task is solvable and exactly how many operations the shortest solution needs.

Then we built three synthesisers, all asked for the same nominal difficulty (three operations), 30 tasks each. Symbolic: plant a three-step solution, read the puzzle off it. Neural: a real language model (Claude, one pass, no calculator, no checking) asked to author 30 three-step puzzles. Verified symbolic: the same symbolic generator, but every candidate is filtered by the brute-force oracle and kept only if its true difficulty is exactly three.

Why it was worth doing

The numbers tell a clean story. Validity (is the task even solvable): symbolic 100%, neural 93% (the model emitted two impossible targets). But difficulty calibration, the fraction whose true minimum solution actually needed three operations, was the real find: symbolic just 33%, neural just 7%. Both were asked for three-step puzzles. Almost none were.

Symbolic construction guarantees a task is solvable, but planting a three-step solution only sets an upper bound, because a shorter subset solution usually exists. The model was worse: it systematically thought its puzzles were harder than they were, with 22 of 30 solvable in two steps. An LLM cannot price the difficulty of a puzzle it just invented without doing the search itself.

Adding the verifier to the loop took the usable rate (solvable and not trivial) to 100%, with calibration at 100% too. The only cost was over-generation: it drew 165 candidates to keep 30, about 5.5 times. When the verifier is cheap, that is nothing.

What's still off

This is one domain, n=30 per method, one model in one pass. It is a faithful spike, not a benchmark, and the exact percentages will not transfer. The mechanism does: construction bounds difficulty rather than setting it, LLMs misprice the difficulty of their own output, and a verifier fixes both. We are not claiming more than we measured.

What's now in the stack

A reusable lesson for anything we build that trains or evaluates agents: budget for the verifier first. The synthesiser, symbolic or LLM, decides diversity and cost. The verifier decides whether the tasks are worth training on at all. The full reproduction (oracle, three synthesisers, the frozen neural batch, and a one-command rerun) is on our GitHub mirror.