Verify Only the Answers You Doubt

§1 The waste both papers attack

Two papers landed this week that look unrelated. One is about optimising multi-step LLM pipelines. The other is about when to make a model double-check its own answer. Read them together and they are making the same argument, and it is an argument about waste.

The waste is uniform effort. We build pipelines that spend the same care on every prompt in the chain, and reasoning systems that spend the same verification on every answer, whether or not the spend buys anything. Both papers say the same thing: stop spreading effort evenly. Attribute it. Spend where it bites, skip where it does not.

§2 Selective verification: think again only when it helps

Think Again or Think Longer? (Dip, Zhou and Zhang, arXiv:2606.19808) treats test-time verification as a control knob rather than a default. Their observation is blunt: extra reasoning is not uniformly valuable. It can repair a failed answer, but it can also waste compute on an answer that was already right, or worse, flip a correct answer to a wrong one. So they put a controller in front of the verifier that decides, per answer, whether to verify at all.
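The gating idea can be sketched in a few lines. This is a minimal illustration, not the paper's controller: the `confidence` field, the `threshold` value and the `verify` callable are all assumptions introduced here, and the authors may use quite different signals to decide when to verify.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # hypothetical score in [0, 1]; how it is estimated is assumed


def selective_verify(
    answer: Answer,
    verify: Callable[[Answer], Answer],
    threshold: float = 0.8,  # illustrative value, not taken from the paper
) -> tuple[Answer, bool]:
    """Run the verifier only when confidence falls below the threshold.

    Returns the final answer and a flag saying whether verification ran.
    """
    if answer.confidence >= threshold:
        # Skip: re-checking a likely-correct answer costs tokens and
        # risks flipping it to a wrong one.
        return answer, False
    return verify(answer), True


def toy_verifier(answer: Answer) -> Answer:
    # Stand-in for a real second pass over the answer.
    return Answer(answer.text + " (checked)", 0.9)


confident, ran_on_confident = selective_verify(Answer("42", 0.95), toy_verifier)
doubtful, ran_on_doubtful = selective_verify(Answer("41", 0.40), toy_verifier)
```

The flag matters as much as the answer: logging how often verification actually runs is what lets you compare a policy like this against always-on verification.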

The numbers are the point. On a grade-school maths set, their policy verified just 3.0% of examples, cut verification tokens by 91.2% against always-on verification, and still nudged accuracy up, from 93.4% to 94.5%. On a harder maths set, selective verification beat always-verifying on accuracy (76.3% against 75.5%) while spending 26.8% fewer post-generation tokens, and it roughly halved the rate of harmful answer changes. Always verifying is not just expensive. On some tasks it actively makes things worse.

§3 FAPO: fix the bottleneck, not every prompt

FAPO (Kassianik et al., arXiv:2606.19605) makes the structural version of the same move. It lets a coding agent optimise a multi-step pipeline by inspecting the intermediate steps, diagnosing where the chain actually fails, and proposing scoped changes. The discipline is the interesting bit. It tries prompt edits first, and only changes the chain structure when attribution shows a structural bottleneck that a prompt cannot fix. It does not rewrite everything. It finds the step that is costing you and works there.
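A toy version of "attribute, then scope the change" might look like the sketch below. It is an assumption-laden stand-in rather than FAPO itself: the trace format, the step names, the "first failing step" heuristic and the `max_prompt_edits` budget are all hypothetical, and using exhausted prompt edits as a proxy for "a prompt cannot fix this" is a simplification of whatever attribution the paper performs.

```python
from collections import Counter
from typing import Optional

# One trace per pipeline run: an ordered list of (step_name, succeeded).
Trace = list[tuple[str, bool]]


def attribute_failures(traces: list[Trace]) -> Counter:
    """Count, per step, how often it was the first step to fail in a run."""
    first_failures: Counter = Counter()
    for trace in traces:
        for step, ok in trace:
            if not ok:
                first_failures[step] += 1
                break  # later failures in the same run are downstream noise
    return first_failures


def pick_bottleneck(traces: list[Trace]) -> Optional[str]:
    """Return the step blamed most often, or None if no run failed."""
    counts = attribute_failures(traces)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def next_action(step: str, prompt_edits_tried: dict[str, int],
                max_prompt_edits: int = 2) -> tuple[str, str]:
    """Prefer a prompt edit; fall back to restructuring once edits run out."""
    if prompt_edits_tried.get(step, 0) < max_prompt_edits:
        return ("edit_prompt", step)
    return ("restructure", step)


traces = [
    [("retrieve", True), ("classify", False)],
    [("retrieve", True), ("classify", False), ("format", False)],
    [("retrieve", False)],
]
bottleneck = pick_bottleneck(traces)
```

The point of the sketch is the ordering: nothing is changed until a step has been blamed, and the cheaper intervention is tried before the structural one.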

It pays off. The authors report beating GEPA, a strong prompt-optimisation baseline, in 15 of 18 model-benchmark comparisons, with double-digit mean gains, including on a security task that maps real CVEs to weakness categories. The lesson is not that a coding agent can tune prompts. It is that uniform prompt tuning misses bottlenecks that live in the shape of the pipeline, and you only find them by attributing failure to a step.

§4 Why this lands for us

We run a three-juror panel as our quality gate. Every candidate answer that climbs our model router's escalation ladder gets judged by three models from different lineages. The panel is good, but it is not cheap, and we were paying it on every rung, including the easy ones where the answer was obviously fine or obviously weak.

So we did the selective thing to ourselves. This week the router learned to put one cheap juror in front of the panel as a screen. A confident pass from the screen accepts the answer outright. A confident reject escalates to the next tier outright. Only when that cheap signal is genuinely unsure do all three jurors convene. An easy answer now clears in one model call instead of three, a clearly weak one escalates in one, and the full panel fires only in the ambiguous middle band where it actually earns its cost. The paper's result of verifying 3% of cases is the same shape as what the screen does to our panel traffic.
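The saving is easy to put in arithmetic. The sketch below counts model calls only, and it assumes the screen is an extra call on top of the three-juror panel rather than one of the three; `p_unsure`, the share of answers that land in the ambiguous band, is a hypothetical input, not a figure we have measured here.

```python
def expected_calls(p_unsure: float, panel_size: int = 3) -> float:
    """Expected model calls per answer with a one-juror screen in front.

    Every answer costs one screen call; only the unsure share also pays
    for the full panel. Assumes the screen is not one of the panel jurors.
    """
    if not 0.0 <= p_unsure <= 1.0:
        raise ValueError("p_unsure must be a probability in [0, 1]")
    return 1 + p_unsure * panel_size


def break_even_unsure_rate(panel_size: int = 3) -> float:
    """Unsure rate at which screening costs as many calls as always paneling."""
    return (panel_size - 1) / panel_size


# Illustrative only: if a quarter of answers were ambiguous.
calls_at_quarter = expected_calls(0.25)
```

Under these assumptions the screen wins on call count whenever fewer than two thirds of answers are ambiguous. Because the screen juror is cheaper than a panel juror, a comparison in tokens or money would favour it further.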

The honest caveat is the one the paper implies and we feel. A cheap screen has worse calibration than the full panel, so a confident-but-wrong screen can wave through a weak answer. We hold the line by setting the bar for acting on the screen alone high, and by always convening the full panel on the final rung, where there is nowhere left to escalate. Selective does not mean careless. It means the expensive check is reserved for the cases that are actually in doubt.
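Put together, the gate described above reduces to a small decision function. This is a sketch, not our production router: the `screen_score` scale, the `accept_at` and `reject_at` thresholds and the rung numbering are illustrative assumptions.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"      # take the answer on the screen's word
    ESCALATE = "escalate"  # send the prompt to the next tier
    PANEL = "panel"        # convene all three jurors


def route(
    screen_score: float,      # hypothetical: screen's confidence the answer is good, in [0, 1]
    rung: int,                # current rung on the escalation ladder, 0-based
    last_rung: int,           # index of the final rung
    accept_at: float = 0.95,  # illustrative high bar for acting on the screen alone
    reject_at: float = 0.05,  # illustrative low bar for rejecting on the screen alone
) -> Action:
    """Decide what the gate does with one candidate answer."""
    if rung >= last_rung:
        # Nowhere left to escalate, so the screen never decides alone here.
        return Action.PANEL
    if screen_score >= accept_at:
        return Action.ACCEPT
    if screen_score <= reject_at:
        return Action.ESCALATE
    # Ambiguous middle band: this is where the panel earns its cost.
    return Action.PANEL


easy = route(0.99, rung=0, last_rung=2)
weak = route(0.01, rung=0, last_rung=2)
unsure = route(0.50, rung=0, last_rung=2)
final = route(0.99, rung=2, last_rung=2)
```

Keeping the two thresholds far apart is the calibration hedge: the wider the middle band, the less a confident-but-wrong screen can decide on its own, at the price of convening the panel more often.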

§5 The pattern worth stealing

Put the two papers in one sentence and it is a design rule: effort should be attributed to where it changes the outcome, not spread evenly to feel thorough. Verifying every answer feels rigorous and is often waste. Tuning every prompt feels diligent and often misses the one step that is broken. The skill is building the cheap signal, an uncertainty estimate or a failure attribution, that tells you where to spend the dear effort, and then trusting it enough to skip the rest. Most stacks do not have that signal yet. The ones that build it will run at a fraction of the cost and, on the evidence here, often do better.

Methodology note. This Note reads two June 2026 papers together: Dip, Zhou and Zhang's selective verification (arXiv:2606.19808) and Kassianik et al.'s FAPO (arXiv:2606.19605). We picked them because both formalise the same underlying move, attributing effort rather than spreading it, and because we could test the claim directly: we shipped the selective-verification pattern into our own model router's escalation gate this week. Numbers quoted are the authors' own.
