Local Models Don't Save You Money

§1The instinct, and the arithmetic

Every few weeks someone sends me a video: run a frontier model on your own hardware, stop paying the API, unlimited intelligence on your desk. The instinct is right and I have had it too. This week I nearly acted on it. Then I did the arithmetic, and the arithmetic says the opposite of what the videos say. Running models locally does not save a small shop money. It buys something else entirely, and most of the people paying for it have not noticed which thing they are buying.

§2The number that ends most of the argument

Start with the bill, because almost nobody does. The metered model spend across our whole agent fleet, eight agents doing real work every day, runs at about a dollar a day. That is not a typo. The cheap, high-volume work, classification, scoring, routing, drafting, goes to small models that cost fractions of a cent a call, so it adds up to roughly the price of a coffee a week.

Against that, a maxed local box is five to ten thousand pounds. Even if a local model displaced the entire metered bill, which it cannot, the payback is measured in decades, and the box depreciates and needs babysitting the whole time. The one genuinely expensive part of our stack is the premium model doing the hard interactive building, and that is a deliberate quality choice you would not hand to a slower local model anyway. So the thing local is supposed to save you is already nearly free, and the thing that actually costs money is the thing you would never move.

§3Fitting a model is not running it

Say you ignore the bill and buy the box anyway. Here is the part the desk-superintelligence videos quietly skate over: fitting a model is not running it. Unified memory capacity lets you load a four-hundred-billion-parameter model. It says nothing about speed. Speed is set by memory bandwidth, and a top-end desktop machine moves memory at roughly eight hundred gigabytes a second, where a datacentre GPU moves it at three terabytes a second and beyond.

So yes, you can load a frontier-sized model on your desk, and then watch it generate at a few tokens a second. That is fine for an overnight loop and painful for anything you sit and wait on. It is not a generation you can wait out either. It is physics. The biggest models on local hardware are slow by construction, today and in three years.

Capacity tells you the model fits. Bandwidth tells you whether you will enjoy using it. The videos quote the first number and never the second.

§4Rent the frontier, route the cheap stuff

If the goal is to run the big models, renting wins on both axes that matter. Ten thousand pounds of rental is something like a thousand hours of an eight-GPU box that runs those models many times faster than any desktop will, and you pay only when you use it. Owning is the worst of both: you pay up front, it runs slow, and it is obsolete when the next generation lands.

And if the goal is simply to spend less, the real levers are not hardware at all. They are software, and they are nearly free. Route every task to the cheapest model that clears the quality bar. Cache the repeated context instead of paying for it on every call. Use the batch endpoints for anything that is not urgent. Screen cheaply first and only escalate the expensive model when the cheap signal is genuinely unsure. Those compound, they need no capital, and they take money out of the bill today. A box takes money out of your account today and gives it back over a decade.

§5So what does local actually buy

None of this means never run a model yourself. It means be honest about what you are buying, because there is a real thing here and it is not cost. Local buys sovereignty. The data never leaves the box, which matters enormously the moment you are handling anything you cannot legally or commercially hand to someone else's cloud. It buys resilience, an open model you control still answers when the premium API is throttled, refused, or down, and we have watched all three happen in a single month. And it buys a claim you can make with a straight face, that you run a near-frontier model entirely on infrastructure the customer controls, which is worth more than the hardware if you sell to anyone who cares about that.

Those are real, and they are worth paying for. Just pay for them knowing that is the purchase. Local does not make the work cheaper. It makes it yours. Decide which one you actually need, and do not let a video sell you the wrong one.

Methodology note. First-person write-up of a real buy-versus-rent decision we worked through on 21 June 2026, prompted by the current wave of "run a frontier model on your desk" videos. The fleet figure is our own metered model spend read off our audit log (the recent run of logged actions totalled about a dollar across roughly a day); it excludes the premium subscription model that does our heavy interactive work, which is deliberately a quality spend, not a metered one. Bandwidth figures are approximate and directional (top-end desktop unified memory around 800GB/s, datacentre GPU memory above 3TB/s). Rental and box prices are ballpark, drawn from current GPU-cloud listings and top-config desktop pricing, not exact quotes. The point does not hinge on the decimals.

§1The instinct, and the arithmetic

§2The number that ends most of the argument

§3Fitting a model is not running it

§4Rent the frontier, route the cheap stuff

§5So what does local actually buy

▸ Related