The £10 Test Before the £6,699 Mac

§1The £6,699 question

A mate sent me a screenshot this week: a 14-inch MacBook Pro, 128GB of memory, sat in his basket at £6,699. The argument for buying one is everywhere right now. Memory prices are climbing, the pitch goes, so grab the biggest box you can while you still can, because soon you will run powerful AI models on your own machine and never pay a cloud provider again. Own the hardware, own your future.

It is a tidy story. It is also a £6,699 bet, and the thing about bets that size is that you can usually test the premise for the price of a takeaway. So before anyone reached for a card, we tested it.

§2What ten pounds buys you

The bet has one load-bearing assumption: that an open model you run yourself is good enough, and cheap enough, to beat paying a cloud provider. That is a measurable claim, so we measured it.

We rented a GPU by the second on RunPod, stood up an open 7-billion-parameter model (Qwen 2.5), and pointed it at a real job we already had answers for. The task: read a short item and return structured JSON scoring it, marked against a set of ground-truth labels so nothing depends on a human or a second model judging the output. Then we ran the exact same 150 tasks through Gemini 2.5 Flash, one of the cheap cloud models we already use day to day. Same prompts, same scoring, side by side. Total spend, including the credit we had to load to start: under a tenner. Time from sign-up to results: about an afternoon.
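The scoring side of a test like this is small enough to sketch. What follows is a minimal illustration, not our actual harness: the field names `score` and `label` and the `threshold` cut-off are assumptions made up for the example. It marks raw model output against ground-truth labels without a human or a second model in the loop.

```python
import json

def score_run(outputs, truths, threshold=3):
    """Mark raw model outputs against ground-truth labels.

    outputs:   raw strings returned by the model, one per task
    truths:    dicts with hypothetical keys "score" (int) and "label" (str)
    threshold: hypothetical cut-off; a score >= threshold means "yes"
    Returns the fraction of tasks passing each check.
    """
    counts = {"valid_json": 0, "binary": 0, "exact": 0, "within_one": 0, "label": 0}
    for raw, truth in zip(outputs, truths):
        try:
            pred = json.loads(raw)
        except ValueError:
            continue  # unparseable output fails every check
        counts["valid_json"] += 1
        try:
            score = int(pred["score"])
            label = pred["label"]
        except (KeyError, TypeError, ValueError):
            continue  # parsed, but missing the fields we mark
        counts["binary"] += (score >= threshold) == (truth["score"] >= threshold)
        counts["exact"] += score == truth["score"]
        counts["within_one"] += abs(score - truth["score"]) <= 1
        counts["label"] += label == truth["label"]
    return {name: hits / len(truths) for name, hits in counts.items()}
```

Because every check is a plain comparison against a stored label, the same function can mark both models' outputs and the numbers stay comparable.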

A few weeks ago we made this case from first principles, with back-of-envelope arithmetic on owning the box. This is the same question settled with a benchmark instead of a sum.

§3What the numbers said

The open model did better than I expected, and it still lost.

Metric	Gemini 2.5 Flash (cloud)	Qwen 2.5 7B (rented GPU)
Valid JSON	99.3%	100%
Yes/no call right	85.3%	82.0%
Exact score	52.7%	46.0%
Harder label right	68.7%	56.0%
Cost per 1,000 calls	about 6 cents (estimated)	billed as wall-clock GPU time
Speed per call	0.9s	1.6s

On the call that matters most, the simple yes/no decision, the open 7B landed 82% against the cloud model's 85%. Three points back. Genuinely respectable for a small model on rented hardware, and it actually edged the cloud model on producing perfectly formed JSON every single time. But on the finer-grained scoring it was clearly weaker, and it was slower.

Then the part that ends the argument. Gemini 2.5 Flash cost an estimated six cents, a few pence, per thousand calls. Not per call. At that price the "save money by going local" case does not just lose, it inverts. You would be spending thousands on hardware to run a model that is slightly worse than one that already costs close to nothing.
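The inversion is easy to check with arithmetic. This is a rough sketch under stated assumptions: the six cents are treated as US cents, the exchange rate and daily call volume are placeholders to swap for your own, and electricity, resale value and everything else the machine is good for are ignored.

```python
def breakeven_calls(hardware_gbp, cloud_usd_per_1000, usd_per_gbp):
    """Number of cloud calls the hardware price would pay for."""
    hardware_usd = hardware_gbp * usd_per_gbp
    return hardware_usd / cloud_usd_per_1000 * 1000

def years_to_breakeven(calls, calls_per_day):
    """How long that many calls lasts at a steady daily volume."""
    return calls / calls_per_day / 365

# 6699 and 0.06 come from the article; 1.25 and 10_000 are placeholders.
calls = breakeven_calls(hardware_gbp=6699, cloud_usd_per_1000=0.06, usd_per_gbp=1.25)
years = years_to_breakeven(calls, calls_per_day=10_000)
print(f"{calls:,.0f} calls, about {years:.0f} years at 10,000 calls a day")
```

At the placeholder rate that comes to roughly 140 million calls, or several decades of work at ten thousand calls a day, before the hardware has paid for itself on API fees alone.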

§4What this test did not prove

Here is the honest limit, because the argument falls apart without it. We benchmarked a 7B model on a single 24GB GPU. We did not benchmark a 70-billion-parameter model on 128GB of unified memory, which is the actual thing that expensive Mac unlocks. So nobody should read this as "local AI is pointless." We did not test that machine's real job, and a 70B model running on it would very likely do more than our little 7B did.

What the test does show is the economics of the layer we could measure, and we think that logic travels up the stack. A bigger local model needs more hardware and more power to run a thing the cloud will also run, more cheaply, next year, with no capital outlay and nothing to maintain. The cloud models get cheaper and better on the same clock as the open ones. Owning the hardware does not win that race. It opts out of it.

§5The rule worth keeping

There are only three reasons to buy your own AI hardware: it is cheaper, it is better, or the data is not allowed to leave a machine you control. This test took the first two off the table for our work. The cloud option was both cheaper and better. That leaves the third, owning where your data lives, and it is a real reason, but it is not a free one, and it is not the reason most people are reaching for their card. They are reaching for "cheaper and better," and the numbers point the other way.

So buy the machine if you want the machine. It is a lovely bit of kit and you will use it for a hundred things. Just do not buy it as an AI investment on a premise you never tested, because the test costs ten pounds and an afternoon and the purchase costs six thousand six hundred and ninety-nine. Run the cheap experiment before the expensive commitment. That is the whole lesson, and it is older than AI.

Methodology note. The benchmark ran 150 examples from an internal classifier holdout with ground-truth labels, scoring JSON validity, exact and within-one score accuracy, the binary threshold decision, and a secondary label, plus an estimated blended cost per thousand tasks (a character-based token estimate, for relative comparison, not a billed figure). Models: Gemini 2.5 Flash via its API; Qwen 2.5 7B Instruct self-served on one 24GB RunPod serverless GPU in the EU, reached through our own router as an OpenAI-compatible endpoint. The endpoint was torn down after the run. Costs beyond the cloud API are wall-clock GPU time, not per-token. The comparison is one task family (structured classification), not a general capability claim.
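For anyone repeating the cost side, the two kinds of figure can be sketched as below. This is an illustration, not our billing code: the four-characters-per-token ratio is a common rule of thumb rather than a measured value, and every price is a parameter you supply, not a quoted rate.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token count from character length (rule-of-thumb ratio, an assumption)."""
    return len(text) / chars_per_token

def api_cost_per_1000(prompts, completions, in_price_per_m, out_price_per_m):
    """Estimated blended API cost per 1,000 tasks.

    Prices are per million tokens, in whatever currency you pass in.
    """
    in_tokens = sum(estimate_tokens(p) for p in prompts)
    out_tokens = sum(estimate_tokens(c) for c in completions)
    total = in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m
    return total / len(prompts) * 1000

def gpu_cost_per_1000(seconds_per_call, gpu_price_per_hour):
    """Wall-clock GPU cost per 1,000 calls run one after another.

    Ignores batching, cold starts and idle time, so treat it as a rough guide.
    """
    return seconds_per_call * 1000 / 3600 * gpu_price_per_hour
```

At the 1.6 seconds per call we measured, a thousand sequential calls take a little under half a GPU-hour, so the self-hosted cost per thousand is a bit under half of whatever hourly rate the GPU is billed at.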
