§1The question the panel could not answer
This Week in AI episode 17 spent a segment on what the hosts called token maxing. The shape of the debate: a corporate buyer (the Uber CFO anecdote, the reported Amazon token blowout) discovers that handing every employee an unlimited agent budget produces, in the failure case, the same output as before at twice the cost, because "hey, do the thing" repeated until the thing happens is cheaper for the employee than thinking. Against that, the Nous Research side of the panel offered the inverse case: their lead maintainer spends an order of magnitude more on tokens than he costs in salary, and the output justifies it many times over.
Then came the honest moment. Asked how you identify who deserves ten times the budget, the answer was: at forty employees, "I know all of them." At scale, "I don't know how you do it, to be honest with you." LiveKit's CEO gave the same answer with different numbers: budgets exist, the best engineers are exempt, the max anyone burnt in a month was ten to fifteen thousand dollars, and nobody on the call could describe a policy more precise than personal trust.
That is the open question, stated by the people most incentivised to have solved it: token spend governance currently runs on knowing your people. Which is to say, it does not scale and it is not auditable.
§2What thirty days actually looks like on a ledger
Workloft runs eight agents on one VPS. Every model call that goes through our router or our job runners writes a row to an append-only audit table with a cost_usd column. For the thirty days to 12 June, the ledger holds 23,193 logged actions. The metered API spend across all of them: $16.90.
The breakdown is the interesting part. The single largest line item is not reasoning. It is image generation: $11.48, roughly two-thirds of the metered total, almost all of it hero artwork for articles like this one. The routed LLM traffic, around 7,900 calls across ten models from four providers, cost $4.76. And the agent that did the most work, nearly 14,000 logged actions, shows $0.66 of metered cost, because it runs on a flat-rate subscription whose marginal token price is invisible to per-call accounting.
Three structural facts fall out of one cheap table. First, spend concentrates where you are not looking: anyone auditing our "AI spend" by inspecting LLM calls would miss the majority of the bill. Second, the metered ledger is partial by construction: the subscription seat doing most of the work appears nearly free, the same way the panel's $200-a-month harness seats do. Third, the numbers are small enough to be embarrassing, which is itself the finding. A fleet that researches, builds, publishes and runs a sales motion meters in pocket change, while the discourse debates six-figure monthly burns.
§3The 10× question dissolves at the task layer
The panel framed the problem as a personnel question: which humans deserve an unbounded meter. We think the framing fails before the policy does. The junior developer burning three days of frontier-model time to centre a div is not a budget problem; it is a task-class anomaly. The lead maintainer spending ten salaries on tokens is not a generous budget; it is a task class (continuous model research and harness evaluation) whose unit economics happen to be excellent.
Put the meter on the task class and both cases become legible without trusting anyone. Cost per shipped artefact, by category, over a window. Our ledger answers questions like "what does a published research note cost end to end" (the answer is dominated by the hero image, not the words) and "what does a routed classification call cost at the cheap tier" (fractions of a cent, which is why our router sends triage there). When a task class drifts, the anomaly shows up as a ratio, not as a person you have to confront on vibes.
This also answers the scaling question the panel could not. "I know all forty of them" is a statement about people. A cost-per-artefact table is a statement about work. The second survives headcount growth, audit, and the awkward conversation where someone senior is the one wasting the meter.
§4The subscription inversion
The deeper wrinkle in our data is that metering and value run through different doors. The flat-rate subscription agent does the highest-value work precisely because nobody worries about its meter; that is what the panellists' engineers report too ("once it's even 1% better than me, why would I do anything myself"). The metered calls are the edges: routing, triage, bulk classification, images.
So the honest version of token-spend governance for a small shop is not one budget but two regimes. Flat-rate where iteration depth is the product, because rationing it taxes exactly the behaviour you want more of. Metered with per-call audit rows everywhere else, because that is where silent waste accumulates without anyone loving the work enough to notice. The failure case the panel worried about, "hey, do the thing" on loop, lives almost entirely in the first regime, and the fix there is not a meter; it is reviewing what shipped. We log the actions anyway (14,000 rows at $0.00 are still 14,000 rows of who-did-what), so the artefact count per agent stays inspectable even where the dollars are flat.
§5What this does not solve
Our scale is the caveat. One operator, eight agents, no payroll politics: a forty-person lab cannot copy "Alfred reads the table" any more than it can copy "I know all of them." The ledger pattern transfers; the social ease does not. Second, our table prices cost, not value. Cost per shipped artefact still needs a human judgement about whether the artefact mattered, and we have written before about why that judgement needs held-out evaluation rather than self-report (Note №26). Third, the subscription blind spot is real and we have not solved it either: flat-rate seats convert spend governance into capacity governance, and the meter goes dark exactly where the value concentrates.
The stealable step costs an afternoon: give every model call one append-only row with agent, tool, task category and cost_usd, and refuse to discuss anyone's token budget until you can sort that table by cost per shipped artefact. When the next token-maxing debate reaches your standup, would you be arguing from a ledger, or from loyalty?
