Workloft Labs · Pillar guide

Model Routing and Orchestration

Routing across providers with failover, sovereign fallback, cost-aware tiers, and the discipline to spend the expensive model only where it changes the answer.

Using one model for every task costs you at both ends: you overpay on the easy work and underperform on the hard work. Routing is the substrate that fixes it. Pick the right model per job, fail over when a provider dies or refuses, drop to local inference when the data is sensitive, and escalate only when the cheap path is not good enough. These notes are how we think about it, including the times our own gateway said no before a human could be tempted.
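A minimal sketch of that policy, not our gateway's actual code. The names `Route`, `ProviderError`, `run` and the `good_enough` check are all hypothetical, and the sketch assumes routes are listed cheapest first and that sensitive data may only touch routes marked local.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ProviderError(Exception):
    """Hypothetical error: the provider is down or refused the request."""


@dataclass
class Route:
    name: str
    call: Callable[[str], str]  # hypothetical client: prompt in, text out
    local: bool = False         # True if inference stays on our own hardware


def run(prompt: str, sensitive: bool, routes: list[Route],
        good_enough: Callable[[str], bool]) -> Optional[tuple[str, str]]:
    """Try routes in cost order and return (route name, answer), or None.

    Assumes `routes` is sorted cheapest first.
    """
    # Sensitive data is only eligible for local inference.
    eligible = [r for r in routes if r.local or not sensitive]
    last = None
    for r in eligible:
        try:
            answer = r.call(prompt)
        except ProviderError:
            continue  # fail over to the next route
        last = (r.name, answer)
        if good_enough(answer):
            return last  # stop at the cheapest acceptable answer
        # otherwise escalate to the next, more expensive route
    # Best effort from the last route that answered, or None when no
    # route was eligible or reachable: the correct route can be no route.
    return last
```

Returning `None` rather than quietly relaxing the data policy is the design choice here: a refusal stays visible to the caller.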

Routing, failover and refusal

A router that cannot fail over is a single point of failure with extra steps. And sometimes the correct route is no route at all.

Four Cheap Models Shipped This Month, Our Gateway Refused Every One
We lined up the month's new budget models to benchmark them. Our gateway refused four of five on data policy before we ran a single task. The benchmark was the routing.

A Guardrail Refused Our Model Upgrade — and That Is the Control Working
We tried to route to the #2 frontend model on the public leaderboard. Our zero-data-retention policy returned a 404 before a human could be tempted. The refusal is the feature.

Interop is no longer the moat
A2A v1.0 just crossed 150 organisations and one year under the Linux Foundation. Agent-to-agent interoperability is officially commodity. For sovereign-first stacks, the moat has moved up to verifiability and governance.

The mandate is the moat
Google is donating its Agent Payments Protocol to the FIDO Alliance and layering Universal Commerce Protocol on top. For regulated buyers, the mandate, not the cart, is the substrate that matters.

Spending compute where it pays

Effort should be attributed to where it changes the outcome, not spread evenly to feel thorough. Most stacks do the opposite.

Who Is Worth 10× the Token Budget?
The industry admits it cannot tell which spend deserves 10× the budget. Our fleet's 30-day audit ledger suggests the question is wrong: meter task classes, not people.

The Four-Agent Question Every System-Design Card Gets Wrong
A popular system-design card asks you to pick one orchestration pattern for a four-agent pipeline. It is really two questions wearing one hat: topology and control.

Shared Search Memory Is the Agent Cost Control
CPT turns parallel test-time search into shared inference state, exposing why regulated AI buyers should care about inference cost, latency and auditability.

Verify Only the Answers You Doubt
Selective verification and FAPO both say the same thing: attribute effort to where it changes the outcome, do not spread it evenly. We shipped it into our gate.
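As a rough illustration of verifying only the answers you doubt, here is a minimal sketch. It assumes each answer arrives with a confidence score in [0, 1] and that a separate, more expensive `verify` callable exists. Both are assumptions for illustration, as are the function name and the 0.8 threshold. It is not the gate described in the note, and it is not FAPO.

```python
from typing import Callable


def selective_verify(answers: list[tuple[str, float]],
                     verify: Callable[[str], bool],
                     threshold: float = 0.8):
    """Run the expensive verifier only on answers below the threshold.

    Returns (results, spent): results is a list of (text, accepted) pairs,
    spent is the number of verifier calls actually made.
    """
    results = []
    spent = 0
    for text, confidence in answers:
        if confidence >= threshold:
            # Trusted as-is: no verification budget spent here.
            results.append((text, True))
        else:
            # Doubted: spend one verifier call and keep its verdict.
            spent += 1
            results.append((text, verify(text)))
    return results, spent
```

The `spent` counter is the number worth watching: it shows how much verification budget went to the doubted answers instead of being spread evenly across all of them.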

Smaller, cheaper, distilled

Most tasks do not need a frontier model. The interesting question is which ones do, and what you quietly give up moving the rest down.

Can a 26M-parameter model call your tools?
We benchmarked Needle, a 26M-parameter Simple Attention Network distilled from Gemini 3.1, against five real Workloft tool schemas. 50 hand-labelled queries. 68 per cent overall, with a clear pattern: narrow schemas pass, nuanced ones fail.

Stop Teaching Agents the Whole Transcript
HINT-SD shows why long-horizon agent training should distil failure-relevant actions, not every token in a polished trajectory, for auditable AI operations.

Prompt-Level Distillation and the Audit Gap Nobody Costed
Prompt-level distillation moves reasoning patterns from teacher to student models. For regulated buyers, it quietly relocates the audit boundary. Here is the cost.

Workloft is a one-person AI engineering studio. We publish what we learn building agent systems in the open.