Can a 26M-parameter model call your tools?

§1 What Cactus actually shipped

This week, Cactus Compute open-sourced Needle, a 26M-parameter language model distilled from Gemini 3.1. Cactus describes it as a Simple Attention Network: an eight-layer decoder with d=512, eight heads, four key/value heads, an 8,192-token BPE tokenizer, and tied embeddings between the input layer and the tool-call linear head. Weights, training data, and inference runtime are all public. Cactus also published throughput claims of 6,000 tokens/sec prefill and 1,200 tokens/sec decode on Cactus, the on-device runtime it ships alongside the model.

The claim that matters for production agents is narrower than "this is a small frontier model". It is "this is a small model that can call your tools". The Cactus README is explicit: Needle is a tool-call distillation target, not a general-purpose chat model. The weights are positioned for on-device, latency-sensitive, dispatch-style work. Pick a tool from the schema. Fill in the arguments. Return JSON. Nothing else.

That is a respectable claim and an interesting one. It is also one we could test directly against the actual tool schemas we already operate in production. So we did.

§2 Why our schemas, not theirs

Every public tool-calling benchmark we have read in the last six months has the same problem. The schemas are clean. The queries are short. The expected outputs are unambiguous. They tell you whether a model can call a generic search_web tool when asked "what is the weather in Paris". They do not tell you whether a model can decide, against your real dispatch graph, that "PayPal: Your invoice has been paid" is a noise email and not a todo, or that "Maggie keeps sending the same email twice" is a debugging request rather than a brainstorm.

For Workloft, the dispatch points that genuinely matter sit at the boundaries of five agents. Gary classifies inbound email into todo, follow-up, or noise. Maggie's reply classifier sorts incoming responses into interested, follow-up-later, or disinterested. Otto sorts shipped changes into a small fixed set of changelog categories. Bob's skill router routes free-form requests to the right skill: research, debug, brainstorm, ship. Hindsight, the memory layer, decides whether free-form text should be retained, recalled, forgotten, or ignored.
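The dispatch spaces described above can be written down as plain sets of tool names, which is roughly what a tool-call model is asked to choose from. This is a minimal sketch, not Workloft's actual code. It assumes each option is exposed as its own tool and that the model returns JSON shaped like `{"tool": ..., "arguments": {...}}`. Both are assumptions for illustration, as are the names `TOOLS_BY_SCHEMA` and `parse_tool_call`. Otto's categories are not named in this post, so otto_changelog is left out rather than guessed.

```python
import json
from typing import Optional

# Tool names taken from the agent descriptions above. otto_changelog is
# omitted because its categories are not listed in this post.
TOOLS_BY_SCHEMA = {
    "gary_inbox": {"todo", "follow-up", "noise"},
    "maggie_reply": {"interested", "follow-up-later", "disinterested"},
    "bob_skill": {"research", "debug", "brainstorm", "ship"},
    "hindsight": {"retain", "recall", "forget", "ignore"},
}


def parse_tool_call(raw: str, schema: str) -> Optional[str]:
    """Return the tool name if `raw` is JSON naming a tool in `schema`, else None.

    The {"tool": ..., "arguments": ...} output shape is an assumption of this
    sketch, not a documented Needle format.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # empty or malformed output counts as no tool call
    if not isinstance(call, dict):
        return None
    tool = call.get("tool")
    if isinstance(tool, str) and tool in TOOLS_BY_SCHEMA[schema]:
        return tool
    return None  # a tool name outside the schema is treated as a miss
```

Treating malformed output and out-of-schema names the same way as "no tool call" keeps the downstream scoring simple: every query ends up with either one known tool name or nothing.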

Those are the only five places in our stack where a small model would actually do useful work. So we built a 50-query evaluation against those five schemas, ten queries each, hand-labelled by the operator (Alfred) for the right tool and the right primary argument. Every query is the kind of input a real agent in our stack would actually see this month.
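The shape of the evaluation set can be sketched as a small record type plus a balance check. The field names (`schema`, `query`, `expected_tool`, `expected_arg`) and the helper `check_balanced` are hypothetical, chosen to mirror the description above rather than copied from the real eval script.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

SCHEMAS = ("gary_inbox", "maggie_reply", "otto_changelog", "bob_skill", "hindsight")


@dataclass(frozen=True)
class EvalCase:
    schema: str         # which of the five tool schemas the query targets
    query: str          # the raw input an agent would see
    expected_tool: str  # operator-labelled tool name
    expected_arg: str   # operator-labelled primary argument


def check_balanced(cases: List[EvalCase], per_schema: int = 10) -> None:
    """Raise ValueError unless every schema has exactly `per_schema` cases."""
    counts = Counter(case.schema for case in cases)
    unknown = set(counts) - set(SCHEMAS)
    if unknown:
        raise ValueError(f"unknown schemas: {sorted(unknown)}")
    for schema in SCHEMAS:
        if counts[schema] != per_schema:
            raise ValueError(
                f"{schema}: expected {per_schema} cases, got {counts[schema]}"
            )
```

Running a check like this before scoring guards against a schema quietly ending up with nine or eleven queries, which would skew the per-schema percentages.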

§3 The numbers

Across all 50 queries, Needle picked the right tool 34 times. 68 per cent. Median latency was 2.36 seconds per query, with the 26M-parameter model running locally from the PyTorch weights on a CPU VPS. The pass bar in the experiment design was 80 per cent tool selection plus 80 per cent primary-argument selection, deliberately tight. At 68 per cent on tool selection alone, Needle does not clear it overall.
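A median latency figure of this kind can be collected with a few lines of timing code. This is a sketch under assumptions: `call_model` stands in for whatever function wraps the model, and the real eval script may measure differently.

```python
import statistics
import time
from typing import Callable, Iterable


def median_latency(call_model: Callable[[str], str], queries: Iterable[str]) -> float:
    """Return the median wall-clock seconds per query for `call_model`.

    Raises statistics.StatisticsError if `queries` is empty.
    """
    timings = []
    for query in queries:
        start = time.perf_counter()
        call_model(query)  # output is discarded; only elapsed time is kept
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

The median is used rather than the mean so that one slow first call, for example while the weights load, does not dominate the figure.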

Two schemas do reach the bar on tool selection, as the per-schema breakdown shows:

otto_changelog: 10/10 (100%). Tight enum (six categories), clear semantic distinctions between them, low ambiguity. Needle is perfect.
maggie_reply: 8/10 (80%). Just at the bar. The two failures are both lukewarm replies ("not the right time for us") which Needle reads as disinterested when they are really follow-up-later.
bob_skill: 6/10 (60%). Four research queries are returned as no-tool-call ("what is the latest on the EU AI Act enforcement timeline" gets an empty answer). One debug query is misrouted to brainstorm.
gary_inbox: 5/10 (50%). Newsletters and receipts that should be ignored get turned into todos. The model defaults to action over inaction.
hindsight: 5/10 (50%). Statements of preference ("Remember I'm a Tottenham fan") get classified as forget rather than retain. Half the recall queries return no tool call at all.
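The arithmetic behind the breakdown is simple enough to check by hand, but a short script makes it repeatable. The counts are the ones reported above. The function name `summarise` and the constant names are illustrative.

```python
from typing import Dict, List, Tuple

# Correct tool selections per schema, out of ten queries each, as reported above.
CORRECT = {
    "otto_changelog": 10,
    "maggie_reply": 8,
    "bob_skill": 6,
    "gary_inbox": 5,
    "hindsight": 5,
}
QUERIES_PER_SCHEMA = 10
PASS_BAR_PCT = 80  # tool-selection bar from the experiment design


def summarise(correct: Dict[str, int], n: int, bar_pct: int) -> Tuple[float, List[str]]:
    """Return (overall accuracy, schemas at or above the bar)."""
    overall = sum(correct.values()) / (n * len(correct))
    # Integer comparison avoids floating-point edge cases exactly at the bar.
    passing = [name for name, hits in correct.items() if hits * 100 >= bar_pct * n]
    return overall, passing


overall, passing = summarise(CORRECT, QUERIES_PER_SCHEMA, PASS_BAR_PCT)
print(f"overall: {overall:.0%}")
print("at or above bar:", passing)
```

The comparison is "at or above", so a schema sitting exactly at 80 per cent counts as reaching the bar.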

There is a pattern. Schemas where the right answer is a clean partition of the input space, with a small number of well-separated options, score very well. Schemas where the right answer requires reading sentiment, recognising rhetorical hedging, or distinguishing "act now" from "do nothing" score poorly. Needle is good at picking. It is not good at reading the gap between what was said and what was meant.

§4 What this means for production agents

Two things become visible from this kind of test that are invisible in clean benchmarks.

The first is that "tool-calling capability" is not one capability. It is at least two. There is dispatch (which tool, against a clear schema, given a clear input) and there is judgement (which tool, given that the input is ambiguous, hedged, social, or noisy). A 26M-parameter distilled model can do the first to a surprising degree. It cannot do the second yet. That is a useful boundary for an operator to be able to see, and it is not visible without testing against your own work.

The second is that the schemas that pass are the ones that have already been disciplined by the operator. Otto's six categories are tight because we have argued about them. Gary's four (todo, follow-up, ignore, schedule) are loose because we have not. The model passes the eval where the human has done the work of designing the dispatch space well. The model fails the eval where the human has been lazy about it. That is not a bug in the model. It is a signal about which schemas are ready for delegation and which are not.

§5 Where this matters for buyers

For UK Local Authorities exploring on-device inference for any data-sensitive workload, this kind of test is exactly the procurement-grade artefact that public benchmarks fail to provide. Whether a 26M-parameter model can route the council's actual referral or FOI intake correctly is not derivable from MMLU or GSM8K. It needs to be tested against the schemas the council actually operates.

For FCA-regulated firms considering small-model dispatch on case files, the calculation is similar. The model that wins is the one that handles the firm's real intake taxonomy, not the one that wins on Hugging Face leaderboards. The fact that a 26M-parameter model can pass otto_changelog perfectly is good news. The fact that it fails gary_inbox is good news too, because it points at exactly which dispatch points the firm still needs to sharpen.

For a one-person shop like Workloft, the practical answer is that we will run Needle as the dispatch layer for otto, the only schema where it clears the 80 per cent bar with room to spare today. Maggie's reply classifier lands exactly on the bar, so we will keep it on a larger model until the schema is tightened or the eval is reworked. The other three are not yet candidates.
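That routing decision can be sketched as an allow-list: only schemas that passed the eval go to the small model, and everything else stays where it is. The backend names `"needle"` and `"larger-model"` and the `dispatch` function are hypothetical. Sending the three non-candidate schemas to the larger model is an assumption made for this sketch.

```python
from typing import Callable, Mapping

# Schemas cleared for the small model. Only otto_changelog qualifies today.
NEEDLE_READY = frozenset({"otto_changelog"})


def dispatch(schema: str, query: str, backends: Mapping[str, Callable[[str], str]]) -> str:
    """Send `query` to the small model only for schemas that passed the eval."""
    backend = "needle" if schema in NEEDLE_READY else "larger-model"
    return backends[backend](query)
```

Promoting another schema later is then a one-line change to `NEEDLE_READY`, made after a re-run of the eval rather than by editing the routing logic.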

§6 What we are not claiming

50 queries is small. At ten queries per schema, a single query moves a schema's score by ten percentage points, so a larger evaluation would shift the per-schema numbers around the edges. We are not claiming a definitive rank ordering of small models against our schemas; we are claiming a workable shape of where this particular model lands against this particular set of dispatch points on a particular day.
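One way to put a number on "small" is a confidence interval around each score. The sketch below uses the Wilson score interval, a standard choice for binomial proportions at small sample sizes. It is our illustration of the caveat, not part of the original eval, and it treats the queries as independent draws, which hand-picked queries only approximate.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)
```

For the figures in this post, `wilson_interval(34, 50)` comes out at roughly 0.54 to 0.79, and `wilson_interval(5, 10)` at roughly 0.24 to 0.76. The overall result sits below the 80 per cent bar even at the top of its range. A ten-query schema score is too wide to rank schemas against each other with any confidence.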

The argument is not that Needle is good or bad. The argument is that "tool-calling capability" is a bundle, that the bundle behaves differently on real schemas than on benchmark schemas, and that the only honest way to know which part of the bundle a given model gives you is to test against the work you actually want to dispatch.

We have also not tested the on-device runtime claim. Cactus says 1,200 tokens/sec decode on their inference engine. Our test ran against the standard PyTorch weights on a CPU VPS. The throughput claim deserves its own benchmark on actual on-device hardware. That is the next iteration.

§7 What is next

Three concrete steps follow. We will run the same evaluation against three more small open models (TinyLlama, SmolLM2, Phi-4-mini) on the same five schemas, so the result becomes comparative rather than absolute. We will tighten the schemas that scored low (especially gary_inbox and hindsight) and re-run, so we can see whether the failures are model failures or schema failures. And we will re-run on real Cactus on-device hardware to test the throughput claim under the conditions it was designed for.

The wider Labs argument is the one we keep coming back to. Substrate beats spectacle. A 26M-parameter model that runs locally and scores 10/10 on otto is more useful to an operator than a frontier model that matches that score and costs eighty times more per call. The work of building agents in production is the work of finding the smallest model that handles each dispatch point well enough, and the only way to do that is to evaluate against your own work, not someone else's benchmark.

Methodology note. Needle weights pulled from Cactus-Compute/needle on Hugging Face. Inference run against the standard PyTorch weights on a CPU VPS, not the on-device Cactus runtime that Cactus benchmarks against. 50 hand-labelled queries across five Workloft tool schemas (gary_inbox, maggie_reply, otto_changelog, bob_skill, hindsight), ten queries per schema. Each query labelled by the operator (Alfred) with the expected tool name plus the expected primary argument. Match counted only when the tool name matched exactly, so the figures above are tool-selection accuracy. Full eval script and results JSON live at github.com/workloftai/loop-pilot (forthcoming: separate workloftai/needle-eval repo once we add the comparative runs). Forthcoming: re-run on Cactus on-device hardware to test the throughput claim; comparative eval against TinyLlama, SmolLM2, Phi-4-mini.
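The exact-match rule in the methodology note can be sketched as follows. The `Outcome` record and `tool_accuracy` function are hypothetical names for illustration. The published eval script is the reference for what was actually run.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Outcome:
    expected_tool: str             # operator-labelled tool name
    predicted_tool: Optional[str]  # None when the model returned no tool call


def tool_accuracy(outcomes: List[Outcome]) -> float:
    """Fraction of outcomes whose predicted tool name equals the label exactly."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    # Exact string equality: no case folding, no fuzzy matching, and a
    # missing tool call (None) never counts as a hit.
    hits = sum(o.predicted_tool == o.expected_tool for o in outcomes)
    return hits / len(outcomes)
```

Under this rule a no-tool-call response is scored the same as a wrong tool, which is why empty answers show up as failures in the per-schema counts.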