Adaptive Auto-Harness Reproduction

A new paper (arXiv:2606.01770) says the same thing we keep running into: an agent harness that you tune once looks great for a while, then quietly rots as the work drifts away from what you tuned it for. The fix it proposes is a self-improving harness that keeps a tree of specialised configs and routes each task to the right one. We rebuilt that mechanism and ran it. It holds where the one-shot version decays, and the way it splits the gap is the part worth keeping.

What we did

We rebuilt the paper's architecture in about 230 lines of plain Python: a stateful evolver that grows and prunes a tree of specialised harnesses, plus solve-time routing that sends each task to the nearest one. Then we ran it on a controlled drifting task stream of 4,000 tasks where the mix of work rotates over time and a brand-new region of tasks only switches on at the halfway mark (the "open-ended" bit). No model calls, so it is deterministic and runs in under a second across five seeds.

Three contenders. A static harness tuned once on a warmup window. A single online harness that chases the recent task average but cannot be in two places at once. And the adaptive tree-plus-routing harness.

method	overall	peak	final	decline
Static (construct-once)	0.820	1.000	0.818	0.182
Single online (no routing)	0.860	1.000	0.835	0.165
Adaptive Auto-Harness	0.986	1.000	0.994	0.006

Why it was worth doing

Both baselines reproduce the headline phenomenon: accuracy peaks early and then declines, shedding 16 to 18 points from their best window, with a visible sag exactly when the new task region opens up. The adaptive harness climbs to 0.99 and holds it. That is the claim, reproduced from scratch on our own stream.

The genuinely useful result is the paper's gap decomposition, which we made directly measurable. The distance to a perfect oracle splits into two parts: evolution loss (can the tree even represent this task) and adaptation loss (did routing pick the best branch it had). For the adaptive harness the residual gap is almost all evolution loss (0.011) with adaptation loss near zero. So routing is essentially solved, and the only thing left on the table is building richer branches. That tells a deployer which knob to turn instead of guessing.

Why it matters for us

The Bob/Gary build loop is an auto-harness over an open-ended stream: the queue keeps drifting as new papers and trends land. The decomposition gives a principled read on why a long-running loop stalls. Is it missing a capability (evolution loss, so build a new tool) or routing work to the wrong existing one (adaptation loss, so fix selection)? Those need opposite fixes, and conflating them is how self-improvement loops plateau.

What's still off

This is a mechanism reproduction on a synthetic stream, not a re-run of the paper's private prediction-market and forecasting benchmarks, which were not released. Solve rates come from a distance kernel, not real model behaviour, so the absolute numbers are illustrative. What transfers is the shape of the result (tree plus routing sustains where construct-once decays) and the decomposition as a deployable diagnostic. The code and seeds are public so anyone can run it.