A popular tip on r/LocalLLaMA says: on Intel hybrid chips, never set
llama.cpp's --threads to your full core count, pin it to the
fast cores. We wondered whether the deeper lesson holds on plain server
silicon with no efficiency cores at all. It does, and harder than we
expected. On our shared 12-vCPU box the sweet spot is 8 threads. Asking
for all 12 dropped prompt processing 3.4 times and collapsed token
generation by 267 times, down to half a token a second. That last number
is reproducible to two decimal places.
What we did
We built llama.cpp (CPU backend, build 5f04dc7) on a 12-vCPU
AMD EPYC-Genoa cloud VPS, no hyper-threading, and ran llama-bench
against Qwen2.5-0.5B in Q4_K_M. We swept --threads from 1 to
12, measuring prompt processing (the compute-bound prefill) and token
generation (the part bound by memory bandwidth and per-token
synchronisation) separately, three repetitions each. Then we re-ran the
decisive 8-versus-12 comparison to confirm it was not a fluke.
The numbers
| threads | prompt t/s | generation t/s |
|---|---|---|
| 1 | 110.8 | 34.3 |
| 2 | 206.3 | 49.8 |
| 4 | 366.9 | 84.5 |
| 6 | 502.0 | 108.2 |
| 8 | 581 to 658 | 125 to 134 |
| 10 | 546 (±110) | 72 (±12) |
| 12 | 173 to 185 | 0.50 (±0.01) |
Three things fall out of that table. The peak is at 8 threads, two thirds of the vCPUs, not at the core count. Going to all 12 cores is not a small tax, it is a cliff: prompt processing more than halves and generation stops working at any usable speed. And the zone above the peak is unstable, not a gentle slope. At 10 threads the variance explodes (±110 on prompt processing); a stray later run at 16 threads bounced back to ~550. The danger is not just that all-cores is slow, it is that the whole region above the sweet spot is unpredictable, and 12, the obvious default, lands on the worst case.
Why it happens
There are no efficiency cores here, so the original Intel explanation does not apply. The cause is contention for a shared, virtualised host with zero scheduling headroom. Token generation has to synchronise every thread at every single token. When you ask for as many threads as there are vCPUs, any moment the hypervisor steals a core for the host or a neighbour stalls the entire barrier, and that per-token tax compounds into a half-token-a-second crawl. Leave a couple of cores free for the OS and the host, and the stall disappears. Prompt processing batches its work, so it only bruises (3.4x) rather than collapsing.
What's still off
This is one model on one machine. The exact sweet spot will move with the
model size, the quantisation and, above all, the host: a dedicated bare
metal box with no noisy neighbours will behave far more kindly at full
core count. The point is not "always use 8". The point is that the
friendly-looking default of --threads = number of cores is a
landmine on shared cloud hardware, and the only safe move is to sweep it
once on the box you actually run on.
What's now in the stack
For our own CPU inference on shared VPS hardware we now pin
--threads to roughly two thirds of the vCPUs and confirm with
a quick llama-bench sweep before trusting any local model in a
loop. The full table and the one-line reproduction command live in the
GitHub mirror
so anyone on similar hardware can check their own box in two minutes.