§1Change one word in the URL, get an agent
alphaXiv (a site that hosts research papers and the discussion around them) shipped a feature called autoarxiv. Take any paper's web address, change the word arxiv to autoarxiv, and an AI agent goes to work on it: it fetches the paper's published code, fixes the setup so the code actually installs and runs, executes a small "does this work at all" test, and estimates what a full re-run would cost. A paper, turned into running code, by editing one word.
This lands right on a sore spot for us. We keep a backlog of around thirty "implement this paper" tasks, papers worth turning into something we can use, sitting in a queue that drains slowly because each one is a job. So the question wrote itself: does this robot beat us doing them by hand? We pointed it at our backlog to find out. The honest answer is no, and that the question is the wrong one.
§2What it actually does, and where it stops
First surprise: you cannot just point it. Change the URL and you do not get a result, you get a login wall. The agent runs on their servers, behind an account. That sounds like a footnote, but it decides everything for us: a tool you have to log into by hand is a manual step, not something we can wire into our automated loop. For a backlog we drain on a schedule, a human-only door is a wall.
Second, it is honest about its own ceiling, and so are we. alphaXiv say it plainly: today's models are "subpar for end-to-end autonomous research", and the tool is "excellent for resolving implementation issues and carrying out reproductions". Read that carefully. It fixes setup, runs a minimal check, prices the rest. It does not do the full build, and it does not make the call on whether the paper was worth building in the first place. It does the typing, not the thinking.
Third, how good is the engine underneath? The most-cited academic version of this idea, a system called AutoReproduce that alphaXiv themselves showcase, publishes its numbers. It gets the code running in roughly 82 to 90 percent of cases, at about $1.87 a paper. But "running" lands 22 to 30 percent off the paper's own reported results, and it works one experiment at a time, not whole codebases. So: cheap and fast at making code execute; a quarter to a third short of reproducing the actual finding.
§3Scoring it against what we actually needed
We did not need a robot that types. We needed one that decides. Here is autoarxiv scored against the four things a paper backlog actually asks of you. It is strong on the first, shaky on the second, and was never built for the last two, which are the ones that cost us time.
None of that is a knock on the tool. It is very good at the thing it does. It is just that the thing it does is the part we were never stuck on.
§4What we built instead: a gate that says KILL
The real lesson is older than the tool: reproduction is not implementation. "The code ran" is not "the result matched" is not "this is worth our week". Machines are now good at the first, getting good at the second, and nowhere near the third. So the leverage is not a better robot. It is a cheaper filter in front of the queue, so our hand-time only ever lands on papers that earn it.
We wrote that filter, small enough to actually run: a triage gate that checks the same cheap signals autoarxiv surfaces (is there official code, is the environment pinned, is there a way to run it, how heavy is it) and returns one of four verdicts: BUILD, PROBE, PARK, or KILL. No login, no key, no GPU, ten seconds a paper. It does not claim anything reproduces, that still needs a real run. It just decides whether that run is worth booking. The gate that needs an account is the gate nobody runs.
The result on our own backlog was bracing. Most of those thirty papers were never worth a full build, and now we have a one-word verdict that says so out loud instead of a vague guilt that we should get to them. autoarxiv did not replace the hand-build. It replaced the part of the hand-build that was never the point, and in doing so it made the case for the cheapest line of code we wrote this week: the one that says KILL.
