We tried to hand our paper backlog to a robot, Workloft Research Note №54

§1 Change one word in the URL, get an agent

alphaXiv (a site that hosts research papers and the discussion around them) shipped a feature called autoarxiv. Take any paper's web address, change the word arxiv to autoarxiv, and an AI agent goes to work on it: it fetches the paper's published code, fixes the setup so the code actually installs and runs, executes a small "does this work at all" test, and estimates what a full re-run would cost. A paper, turned into running code, by editing one word.
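The one-word edit is simple enough to sketch. The snippet below is only an illustration of the string rewrite described above: the function name `to_autoarxiv` is ours, and the resulting host is just what you get by following the "change arxiv to autoarxiv" instruction literally, not an address we have verified against alphaXiv's documentation.

```python
from urllib.parse import urlsplit, urlunsplit


def to_autoarxiv(url: str) -> str:
    """Swap the 'arxiv' host label for 'autoarxiv', leaving the rest of the URL alone.

    Hypothetical helper: it only performs the rewrite, it does not contact any server.
    """
    parts = urlsplit(url)
    labels = parts.netloc.split(".")
    if not any(label.lower() == "arxiv" for label in labels):
        raise ValueError(f"not an arxiv URL: {url}")
    # Replace whole host labels only, so an already rewritten
    # 'autoarxiv' host is rejected above rather than mangled.
    new_host = ".".join(
        "autoarxiv" if label.lower() == "arxiv" else label for label in labels
    )
    return urlunsplit(parts._replace(netloc=new_host))


print(to_autoarxiv("https://arxiv.org/abs/2505.20662"))
# prints https://autoarxiv.org/abs/2505.20662
```

Rewriting the host label rather than doing a blind string replace keeps paper IDs and query strings untouched.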

This lands right on a sore spot for us. We keep a backlog of around thirty "implement this paper" tasks, papers worth turning into something we can use, sitting in a queue that drains slowly because each one is a job. So the question wrote itself: does this robot beat us doing them by hand? We pointed it at our backlog to find out. The honest answer is no, and the question itself turns out to be the wrong one.

§2 What it actually does, and where it stops

First surprise: you cannot just point it. Change the URL and you do not get a result, you get a login wall. The agent runs on their servers, behind an account. That sounds like a footnote, but it decides everything for us: a tool you have to log into by hand is a manual step, not something we can wire into our automated loop. For a backlog we drain on a schedule, a human-only door is a wall.

Second, it is honest about its own ceiling, and so are we. alphaXiv say it plainly: today's models are "subpar for end-to-end autonomous research", and the tool is "excellent for resolving implementation issues and carrying out reproductions". Read that carefully. It fixes setup, runs a minimal check, prices the rest. It does not do the full build, and it does not make the call on whether the paper was worth building in the first place. It does the typing, not the thinking.

Third, how good is the engine underneath? The most-cited academic version of this idea, a system called AutoReproduce that alphaXiv themselves showcase, publishes its numbers. It gets the code running in roughly 82 to 90 percent of cases, at about $1.87 per experiment. But "running" lands 22 to 30 percent off the paper's own reported results, and it works one experiment at a time, not whole codebases. So: cheap and fast at making code execute; roughly a fifth to a third short of reproducing the actual finding.
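To put those figures next to a backlog of our size, here is a back-of-envelope sketch. It rests on two assumptions that are ours, not AutoReproduce's: that each paper needs exactly one experiment, and that the benchmark's executable rate would carry over to our papers unchanged. The function and constant names are illustrative.

```python
BACKLOG_SIZE = 30                 # "around thirty" papers in our queue
COST_PER_EXPERIMENT_USD = 1.87    # AutoReproduce's reported cost
EXECUTABLE_RATE = (0.82, 0.90)    # reported range for "the code runs"


def backlog_estimate(n_papers, cost_per_run, rate_range):
    """Return (total cost, low and high count of papers expected to execute)."""
    total = n_papers * cost_per_run
    low, high = (n_papers * rate for rate in rate_range)
    return total, low, high


total, low, high = backlog_estimate(
    BACKLOG_SIZE, COST_PER_EXPERIMENT_USD, EXECUTABLE_RATE
)
print(f"one pass over the backlog: ${total:.2f}")
print(f"papers expected to execute: {low:.1f} to {high:.1f}")
# prints:
# one pass over the backlog: $56.10
# papers expected to execute: 24.6 to 27.0
```

Under those assumptions the whole queue costs less than a team lunch to get executing, which is why the cost of running code is not the bottleneck.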

§3 Scoring it against what we actually needed

We did not need a robot that types. We needed one that decides. Here is autoarxiv scored against the four things a paper backlog actually asks of you. It is strong on the first, shaky on the second, and was never built for the last two, which are the ones that cost us time.

PASS: Get the code running. This is the real win. Going by the engine class's published numbers, roughly 82 to 90 percent of the time it resolves the setup mess (mismatched versions, missing dependencies, the usual rot) and gets a published repo to execute. That is hours of fiddly work, gone, for about two dollars.

PARTIAL: Match the paper's numbers. Running is not matching. The engine class lands 22 to 30 percent off the reported result, one experiment at a time. Good enough to see if a method is plausible. Not good enough to trust the claim, or to build a product on it.

FAIL: Decide which papers are worth it. Not its job, and it does not pretend otherwise. The expensive part of our backlog was never the code. It was judging which of thirty papers deserved a build at all. The robot hands that decision straight back to you.

FAIL: Slot into an automated loop. Gated behind a login, run on someone else's servers. Brilliant as a thing a human opens in a browser. Useless as a step in a pipeline that runs without one.

None of that is a knock on the tool. It is very good at the thing it does. It is just that the thing it does is the part we were never stuck on.

§4 What we built instead: a gate that says KILL

The real lesson is older than the tool: reproduction is not implementation. "The code ran" is not "the result matched" is not "this is worth our week". Machines are now good at the first, getting good at the second, and nowhere near the third. So the leverage is not a better robot. It is a cheaper filter in front of the queue, so our hand-time only ever lands on papers that earn it.

We wrote that filter, small enough to actually run: a triage gate that checks the same cheap signals autoarxiv surfaces (is there official code, is the environment pinned, is there a way to run it, how heavy is it) and returns one of four verdicts: BUILD, PROBE, PARK, or KILL. No login, no key, no GPU, ten seconds a paper. It does not claim anything reproduces; that still needs a real run. It just decides whether that run is worth booking. The gate that needs an account is the gate nobody runs.
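A minimal sketch of what such a gate could look like follows. It is not our shipped code: the four signals and four verdicts come from the description above, but the field names, the ordering of the rules, and the `HEAVY_GPU_HOURS` threshold are assumptions made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    BUILD = "BUILD"
    PROBE = "PROBE"
    PARK = "PARK"
    KILL = "KILL"


@dataclass(frozen=True)
class PaperSignals:
    """Cheap, offline signals about a paper's repo (hypothetical field names)."""
    has_official_code: bool
    env_pinned: bool
    has_entrypoint: bool      # is there a documented way to run it?
    est_gpu_hours: float      # rough weight of a full run


# Assumed cutoff for "heavy"; tune to your own budget.
HEAVY_GPU_HOURS = 24.0


def triage(s: PaperSignals) -> Verdict:
    """Map signals to a verdict. Rule order is an illustrative assumption."""
    if not s.has_official_code:
        return Verdict.KILL       # nothing to run, nothing to book
    if not s.has_entrypoint:
        return Verdict.PARK       # code exists but no clear way in yet
    if not s.env_pinned or s.est_gpu_hours > HEAVY_GPU_HOURS:
        return Verdict.PROBE      # worth a cheap check before a full build
    return Verdict.BUILD


paper = PaperSignals(
    has_official_code=True, env_pinned=False,
    has_entrypoint=True, est_gpu_hours=3.0,
)
print(triage(paper).value)
# prints PROBE
```

The gate only reads metadata, so it needs no account, key, or GPU; the estimate of GPU hours describes the paper's weight, not what the gate itself consumes.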

The result on our own backlog was bracing. Most of those thirty papers were never worth a full build, and now we have a one-word verdict that says so out loud instead of a vague guilt that we should get to them. autoarxiv did not replace the hand-build. It replaced the part of the hand-build that was never the point, and in doing so it made the case for the cheapest line of code we wrote this week: the one that says KILL.

Methodology note. We tested alphaXiv's autoarxiv against our live "implement paper" backlog on 25 June 2026. The access wall (authkit, then openresearch.sh) is our own observation from the redirect. The engine-class numbers are from AutoReproduce (arXiv:2505.20662): ~82 to 90% executable, 22 to 30% performance gap, ~$1.87 per experiment with o3-mini, on a 13-paper benchmark, with repo-level automation and data preprocessing named as open problems by its authors. The "subpar for end-to-end autonomous research" framing is alphaXiv's own. The triage gate is open source in our ships mirror; verdicts are our read, not the tool's.
