Workloft
▸ WORKLOFT RESEARCH NOTE №41 · 18 JUNE 2026

Seven Agents Fact-Checked What One Cheap Call Just Guessed

We rebuilt our hand-rolled classifier with the native multi-agent feature. It went and checked the facts, refused to pad its picks, and cost two orders of magnitude more. Here is when that trade is right.

THE EXPERIMENT ●●● · 1 CHEAP CALL vs 7 AGENTS · 155,425 TOKENS · RIGHT TOOL FOR THE RIGHT JOB

§1The frontier shipped our workaround

Anthropic shipped dynamic workflows this month, a way for Claude Code to spawn a crew of subagents that fan out, do the work, and check each other before reporting back. Read the description and it is the thing we have been building by hand all year: a coordinator handing tasks to a panel, and a reviewer pulling it together. One engineer in the write-up put it plainly, he spent weeks on his version, and a single release made it look clumsy. We had the same feeling, because we run on Claude Code and we have a pile of hand-rolled multi-agent flows. So we did the obvious test. We took one of ours and rebuilt it with the native feature to see what actually changes.

§2The before: one cheap call

The flow we picked is our daily Hacker News triage. Every morning it pulls the front page and decides, per story, whether we should build something off it, write about it, or skip it. The hand-rolled version is 194 lines of plain Python and, at its heart, a single call to a cheap model that scores all the stories in one go. It costs a fraction of a penny, runs in a couple of seconds, and gives a serviceable answer.

It has one limit baked in: it cannot leave the building. It judges each story off the headline and a one-line hook, because a cheap batched call has no way to go and check anything. It is a fast, blind, cheap guess, and for a morning triage that a human skims anyway, a fast cheap guess is exactly enough.

§3The after: seven agents that went and checked

The native version is a 35-line script. It fans out one agent per story in parallel, each rendering its own verdict, then a synthesis agent ranks the picks. We ran it on the same six stories. Seven agents, 155,425 tokens, nine tool calls, forty-nine seconds.

The difference was not the format, it was that the agents went and checked. Faced with "GLM-5.2 is the new leading open-weights model", the cheap call would have restated the headline and moved on. The agent did not. It went out, pulled the actual Artificial Analysis ranking (51, fourth overall, behind Fable 5, Opus 4.8 and GPT-5.5), cross-checked a second source, applied our own rule about not restating leaderboard noise, and produced a verdict with a real reason: not "new top model" but "a near-frontier open-weight model with a 1M context, close enough to shift a router's default tier." Then the synthesis agent did the thing you never quite trust a model to do. Of six stories it picked one as worth writing about and flatly refused to pad the list to three. "I am not padding," it said, in those words.

§4The honest reckoning

So the native crew was unmistakably better. It also cost two orders of magnitude more. The cheap call is fractions of a penny; the crew burned 155,000 tokens to triage six links. For a funnel that runs every morning and exists only to surface candidates a human looks at anyway, that is mad. The cheap call is the right tool, and we are not replacing it.

But we have a second job that is nothing like a daily funnel. Once a fortnight we pick the three stories for our show, and those picks have to be defensible on camera, genuinely the strongest, and not embarrassingly wrong about a fact. That is precisely the job where "the agents went and checked, and refused to pad" is worth two orders of magnitude. Same flow, same idea, two completely different answers to "is the expensive version worth it", decided entirely by the stakes.

§5Permission not to migrate

The honest lesson is the one the frontier keeps teaching and builders keep forgetting. A feature being more powerful is not a reason to put it everywhere. Dynamic workflows did not make our cheap classifier obsolete, it gave us a premium tier to reach for when the stakes justify it, and left the cheap tier exactly where it was. The discipline the feature actually demands is not "use the shiny thing", it is knowing when a hundred times more rigour is worth a hundred times the bill, and when it is just a bigger bill.

So our daily triage stays one cheap call, and the fortnightly show pick is moving to the crew. And the part of us that spent the year hand-rolling the orchestration is not annoyed it got shipped natively. It is relieved, because now we only pay for the rigour on the days that actually need it.


Methodology note. This is a first-person write-up of an experiment we ran on 18 June 2026. The flow is our daily Hacker News triage. The hand-rolled version is 194 lines of Python with one batched call to a cheap model. The native version is a 35-line dynamic-workflow script using Claude Code's multi-agent feature: six verdict agents in parallel plus one synthesis agent, run on the same six front-page stories. Recorded usage on the native run: seven agents, 155,425 tokens, nine tool calls, forty-nine seconds. The GLM-5.2 verdict agent independently pulled the Artificial Analysis Intelligence Index ranking and a second source before judging; the cheap call has no tools and cannot. Triggers: substrate-relevant (when native multi-agent orchestration is worth its cost is a live build decision, not a vibe); non-duplicative (the coverage of dynamic workflows is all "it is powerful", none of it on the cost line or when not to use it); broad-builder (anyone on Claude Code now has this exact choice). Sits beside today's Note №38 on independent review and Note №40 on what our gateway will actually route. Forthcoming: moving the fortnightly show-story pick onto the native crew while the daily triage stays one cheap call.