Ruby learned routing: a bandit that stops overpaying

Our model router has always obeyed the caller. Ask for the premium tier, it buys the premium tier, every time, whether or not the cheap one would have answered just as well. We put a small learner on top of it that watches outcomes and stops paying for the dear tier when the cheap one keeps getting the job done. It is live, it is additive, and on our most expensive category the gap it can close is roughly seventeen-fold.

What we did

Ruby, our router, maps a job category (classify, extract, reason_hard, code, and so on) to a tier (cheap, balanced, premium) and picks the first reachable model in that tier. It is deterministic and correct. It is also blind: it never notices that a classify answer is identical on the cheap tier and the balanced one, so it keeps paying balanced rates forever.

The new piece, ruby/learned_router.py, is an epsilon-greedy contextual bandit. The context is the category, the actions are the three tiers, and the reward is blunt: one point if the call succeeded, minus a penalty for what it cost. It samples every tier once (optimism under uncertainty), then mostly exploits the highest-reward tier while still exploring a sliver of the time so it can notice if a tier starts to slip.

We lifted the shape from a recent paper on reinforcement-learned orchestration of expert skills (arXiv:2605.22177). The "skills" in our case are the tiers, the policy is the bandit, the reward is success-minus-cost. We deliberately did not reach for a policy network: three actions and a noisy signal is a bandit problem, not a deep-RL one. Matching the method to the size of the problem is the honest move.

Why it was worth doing

The cost spread between tiers is not small. On reason_hard at a typical two-thousand-in, five-hundred-out call, the premium tier (Opus 4.8) runs about $0.0225 a call and the cheap tier (DeepSeek V4 Pro) about $0.0013. That is roughly seventeen times. On code the spread is wider still, north of fifty times, because cheap there is a flash model. Anywhere a caller has been defaulting to premium out of caution and the cheap tier actually answers, a learned downshift claws back up to about ninety-four per cent of that call's cost.

In the live A/B over a small battery of classify, extract and reason tasks, the policy explored all three tiers and then settled on the cheap tier for every one, because on easy work every tier passed and the cost penalty broke the tie downward. That is exactly the behaviour we wanted: do not buy the expensive answer when the cheap answer is the same answer.

What's still off

The reward in this version is success-minus-cost, where "success" means the call returned without erroring. It does not yet judge whether the answer was good. That is fine for categories where any tier is competent and the only question is price, but it would happily downshift a hard category whose cheap tier returns confident rubbish. The quality signal already exists elsewhere in our stack (the Vera panel that gates escalation), and wiring its verdict in as the reward is the obvious next step. Until then this is opt-in: ruby.chat is untouched and every existing caller keeps the deterministic router.

The dollar saving on a trivial battery is also noise: the tasks are too cheap to separate. The mechanism is proven and the per-call spread is real, but the headline saving only shows on a workload that genuinely leans on the premium tier. That is the workload we will point it at next.

What's now in the stack

learned_router.chat_learned(messages, category=..., default_tier=...) — opt-in wrapper that picks the tier by policy and feeds the outcome back as reward.
learned_router.suggest_tier(category) and record(...) — the bandit surface, if you want the decision without the call.
learned_router.ab_compare(tasks) — A/B harness reporting cost and tier mix for static versus learned routing.
python3 learned_router.py stats — per-category tier rewards and the tier the policy would currently exploit.
State persists to ruby/learned_router_state.json; ruby.chat and the static router are unchanged.