Workloft
▸ WORKLOFT RESEARCH NOTE №58 · 1 JULY 2026

One-Pass AI Review Caught the Loud Bugs and Missed the Quiet Ones

We seeded twelve bugs into an AI-written pull request and reviewed it two ways. A single agentic pass caught ten. Three focused lenses caught all twelve. The two the single pass missed were the silent ones.

REG FIT ●●○ · MEDIUM · ANY AGENT-WRITTEN CODEBASE

§1 The bottleneck everyone is now describing

Addy Osmani's note on agentic code review puts a clean name on a problem the whole industry is feeling. Writing code has gone nearly free. Reading it has not. So review is the new bottleneck, and the numbers he gathers are ugly. Four independent 2026 datasets he cites, from Faros AI, CodeRabbit, GitClear and GitHub, line up: AI roughly quadruples the volume of code shipped while delivering about a 12% real productivity gain, defect rates climb from 9% to 54%, review times are up 441%, and zero-review merges are up 31%.

His core point is the one worth sitting with. Agent-written code carries no human intent behind it, so a reviewer has to reconstruct reasoning that never existed. That is a slower job than reading code a person thought through. The tempting fix is to throw another agent at it and let AI review AI. We wanted to know whether that actually works, so we ran a small, honest test on our own review tooling.

§2 What we tested

We built a fixture: a small Python billing service, then an "agent" pull request that adds invoicing, admin auth and payment capture in 97 lines. Into that diff we planted twelve realistic bugs of the kind agent-written code actually produces. Three were critical security holes: a SQL injection, a hardcoded live secret key, and an auth check that fails open and returns True on any error. Six were plain correctness or resource bugs: an off-by-one in pagination, a missing None check that crashes, a mutable default argument, an "is" comparison used where == was needed for string equality, a file opened and never closed, and a swallowed exception. Three were subtle: a check-then-act file race, a naive-versus-aware datetime comparison, and money handled as a binary float.
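To make two of those seeded classes concrete, here is a minimal sketch of what a fail-open auth check and a mutable default argument tend to look like. It is illustrative only and is not the fixture's code; every name in it (is_admin_fail_open, lookup_role, add_line_buggy and so on) is hypothetical.

```python
def is_admin_fail_open(token, lookup_role):
    """Buggy shape: any error during the lookup grants access."""
    try:
        return lookup_role(token) == "admin"
    except Exception:
        return True  # BUG: fails open, an outage becomes admin access


def is_admin_fail_closed(token, lookup_role):
    """Safer shape: any error during the lookup denies access."""
    try:
        return lookup_role(token) == "admin"
    except Exception:
        return False


def add_line_buggy(item, lines=[]):
    """Buggy shape: the default list is created once and shared by all calls."""
    lines.append(item)
    return lines


def add_line_fixed(item, lines=None):
    """Safer shape: a fresh list is created on each call that omits `lines`."""
    if lines is None:
        lines = []
    lines.append(item)
    return lines


def broken_lookup(token):
    """Hypothetical role store that is down."""
    raise ConnectionError("role store unreachable")


open_result = is_admin_fail_open("any-token", broken_lookup)
closed_result = is_admin_fail_closed("any-token", broken_lookup)

first = add_line_buggy("widget")
second = add_line_buggy("gadget")  # same list object as `first`
fresh = add_line_fixed("gadget")
```

With the role store down, the fail-open version lets the caller in and the fail-closed version does not. The second call to add_line_buggy returns a list that still holds the first caller's item, which is why this class tends to show up as invoices leaking into each other.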

Then we reviewed the same diff two ways, with the answer key kept outside the repo so no reviewer could see it. Condition A was a single agentic pass, one reviewer reading the whole change the way a busy engineer would. Condition B split the same work across three reviewers, each given one lens: security, correctness, and resource-and-concurrency. We took the union of what the three found.
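The union scoring described above can be sketched in a few lines. This is an assumption about the shape of the bookkeeping, not our actual harness, and the bug labels and per-reviewer findings below are toy data chosen for illustration, not the results of the run.

```python
from typing import Dict, Set, Tuple


def score(
    findings_by_reviewer: Dict[str, Set[str]], answer_key: Set[str]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Union all reviewers' findings and compare them with the seeded key.

    Returns (caught, missed, extras), where extras are findings that were
    reported but are not in the answer key.
    """
    union = set().union(*findings_by_reviewer.values())
    caught = union & answer_key
    missed = answer_key - union
    extras = union - answer_key
    return caught, missed, extras


# Toy answer key with four seeded bugs (hypothetical labels).
ANSWER_KEY = {"sql_injection", "unclosed_file", "bare_except", "float_money"}

single_pass = {
    "generalist": {"sql_injection", "unclosed_file"},
}
lens_split = {
    "security": {"sql_injection", "missing_timeout"},
    "correctness": {"float_money"},
    "resource": {"unclosed_file", "bare_except"},
}

caught_a, missed_a, extras_a = score(single_pass, ANSWER_KEY)
caught_b, missed_b, extras_b = score(lens_split, ANSWER_KEY)
```

Tracking extras separately from caught matters: a lens split can raise the catch count and the noise count at the same time, and only the first shows up in a simple "N / 12" figure.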

§3 The result

Condition         Setup                                   Caught
A: single pass    One generalist reviewer, one read       10 / 12
B: lens split     Three reviewers, one lens each, union   12 / 12

The single pass was not bad. It caught all three critical security holes and every crash-inducing correctness bug. It missed exactly two: the exception swallowed by a bare except, and the money handled as a float. The lens split caught both. The correctness lens alone found the float bug; the resource lens explicitly flagged the swallowed exception.

§4 Why the quiet ones slip

Look at what the single pass missed and a pattern falls out. A SQL injection shouts. A hardcoded sk_live_ key shouts. A None dereference crashes on the first test. Those are loud, and a single reader triages by salience, so they surface first and they surface reliably. A swallowed exception produces no output and no crash. Money as a float is correct to the penny almost every time and wrong occasionally, silently. Neither trips an alarm, so a reader sweeping the whole diff for "what looks dangerous" glides past them.
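A short sketch shows why neither bug announces itself. It is not the fixture's code; the function names and the PaymentError class are hypothetical, and the Decimal-based version is one common remedy rather than the only one.

```python
import logging
from decimal import Decimal

logger = logging.getLogger("billing_sketch")


class PaymentError(Exception):
    """Hypothetical error raised when a charge fails."""


def capture_swallowed(charge):
    """Quiet version: a failed charge vanishes and the caller sees success."""
    try:
        charge()
    except:  # bare except: no output, no crash, no signal
        pass
    return True


def capture_reported(charge):
    """Louder version: the failure is logged and reported to the caller."""
    try:
        charge()
    except PaymentError as exc:
        logger.error("charge failed: %s", exc)
        return False
    return True


def failing_charge():
    raise PaymentError("card declined")


def total_float(prices):
    """Quiet version: binary floats cannot represent most decimal cents."""
    return sum(prices)


def total_decimal(prices):
    """Exact version: prices are passed as strings and summed as Decimal."""
    return sum((Decimal(p) for p in prices), Decimal("0"))


swallowed = capture_swallowed(failing_charge)  # reports success on a failure
reported = capture_reported(failing_charge)    # reports the failure

float_total = total_float([0.10, 0.20])            # not exactly 0.3
decimal_total = total_decimal(["0.10", "0.20"])    # exactly 0.30
```

Both buggy functions return a plausible value and raise nothing, so a test that only checks "did it run" passes. The float total differs from 0.3 only far past the second decimal place, which is why it looks right to the penny until rounding or accumulation pushes it over.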

A lens does not sweep for danger. It sweeps for a class. The reviewer told to think only about resources is looking for unclosed handles and swallowed errors, so a bare except is the first thing it sees, not the last. That is the whole mechanism. You do not fix the quiet-bug problem by adding reviewers. You fix it by giving a reviewer a narrow enough remit that the quiet bug is the thing it is hunting.
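The narrow-remit idea can be illustrated with a deterministic stand-in: a tiny checker built on Python's standard ast module that hunts exactly one class, exception handlers that are bare or that do nothing but pass. This is a sketch of the principle, not the reviewers we ran, and the function name and sample source are hypothetical.

```python
import ast


def find_swallowed_exceptions(source: str):
    """Return line numbers of handlers that are bare or whose body is only `pass`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            is_bare = node.type is None  # `except:` with no exception class
            only_pass = all(isinstance(stmt, ast.Pass) for stmt in node.body)
            if is_bare or only_pass:
                hits.append(node.lineno)
    return sorted(hits)


# Hypothetical snippet: the first handler swallows, the second one handles.
SAMPLE = '''
def capture(charge):
    try:
        charge()
    except:
        pass

def load(path):
    try:
        return open(path).read()
    except FileNotFoundError:
        return ""
'''

flagged_lines = find_swallowed_exceptions(SAMPLE)
```

Because the checker looks for nothing else, the bare except is the only thing it can report, and it reports it every time. A lens given to a model works on the same principle with a much wider notion of "class".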

§5 The cost, stated honestly

The lens split was not free. It also produced about four findings that were not in our seeded set, among them a missing request timeout, a possible path traversal and an unencoded parameter. Most were real but lower severity, so more coverage came with more noise. The single pass produced one such extra. If you route every lens finding straight to a human, you have traded a miss problem for a triage problem. That is a fair trade for regulated or high-blast-radius code and a poor one for a throwaway script, which is Osmani's own point about matching review depth to blast radius.
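One way to keep the triage cost in check is to route findings by severity, with a threshold that depends on blast radius. The sketch below assumes a policy we made up for illustration: the severity scale, the blast-radius tiers, the thresholds and the severities attached to each finding are all hypothetical.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

# Hypothetical policy: the minimum severity that reaches a human reviewer.
MIN_SEVERITY_FOR_HUMAN = {
    "throwaway": "critical",
    "internal": "high",
    "regulated": "low",
}


@dataclass(frozen=True)
class Finding:
    title: str
    severity: str  # one of the SEVERITY_RANK keys


def route(findings, blast_radius):
    """Split findings into those sent to a human and those only logged."""
    threshold = SEVERITY_RANK[MIN_SEVERITY_FOR_HUMAN[blast_radius]]
    to_human = [f for f in findings if SEVERITY_RANK[f.severity] >= threshold]
    logged = [f for f in findings if SEVERITY_RANK[f.severity] < threshold]
    return to_human, logged


# Illustrative findings; the severities are assumptions, not measured values.
findings = [
    Finding("SQL injection", "critical"),
    Finding("possible path traversal", "medium"),
    Finding("missing request timeout", "low"),
]

regulated_human, regulated_logged = route(findings, "regulated")
throwaway_human, throwaway_logged = route(findings, "throwaway")
```

Under this policy the regulated service sends all three findings to a person, while the throwaway script escalates only the critical one and logs the rest.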

§6 The pattern worth stealing

Our own /code-review already fans a change out across dimensions rather than reading it once, and this is a small piece of evidence for why. The headline finding is narrow and we will not oversell it: one fixture, twelve bugs, one run per condition, so treat it as directional, not a benchmark. But the direction is clear and it matches the papers. Adding AI to review does not lift catch rate on its own. Structure does. If agent-written code is going to quadruple your volume, the review that keeps up is not one bigger model reading faster. It is several narrow readers, each hunting one class of quiet bug the fast pass was always going to miss.


Methodology note. Fixture: a 97-line Python "agent PR" with twelve hand-seeded defects across security, correctness, resource and concurrency classes. Reviewers were run blind, with the answer key held outside the git repository. Two conditions: a single generalist review pass, and a three-lens split (security / correctness / resource-and-concurrency) scored as a union. One run per condition on 1 July 2026, so the catch-rate figures are ours and directional, not a benchmark. The volume, defect-rate, review-time and zero-review-merge figures are from Addy Osmani's summary of the Faros AI, CodeRabbit, GitClear and GitHub 2026 datasets, not our own measurement.