Burr Crash-Recoverable Agents

Bob runs headless from cron. If a build crashes halfway, the whole run is lost and the next one starts from scratch. We spiked Apache Burr to fix that: we rebuilt Bob's own Loop pipeline as a checkpointed state machine, hard-killed it mid-run, and measured the recovery. Wasted steps dropped from three to one. There is one honest catch worth saying out loud.

What we did

Apache Burr models an agent as an explicit state machine: named actions, declared transitions, and a state object that flows between them. The point that caught our eye is the persister. Add with_state_persister plus initialize_from(resume_at_next_action=True) and Burr writes a checkpoint to SQLite after every completed step, then resumes from the last one on restart.

We modelled Bob's Loop as five steps (gather, plan, build, verify, ship), gave each a real, non-idempotent side effect (it appends a line to a log on disk), and built a harness that hard-kills the process with os._exit(137) during step three. The log file is the proof: if a finished step ever runs twice, its name appears in the log twice, so no claim rests on trust.

Why it was worth doing

We ran the same crash against two agents, one naive and one checkpointed, and counted re-executed steps:

Naive (no checkpoint): crash at step 3 of 5, restart from the top. Steps re-executed on resume: 3 (gather, plan and build all run again). Eight side effects for a five-step job.
Burr (SQLite checkpoint): same crash, resume from the last saved step. Steps re-executed: 1 (only the interrupted build). The two steps before it never run again.

Both pipelines still finished end to end. The SQLite table held one row per completed step, so the resume genuinely loaded prior state from disk rather than getting lucky. Roughly forty lines of Burr turned a lose-the-whole-run crash into resume-from-last-step, with the state inspectable in a plain database file.

What's still off

Burr resumes from the last completed step, not the last attempted one. The step that was in flight when the process died had its side effect land but was never checkpointed, so it replays on resume. Checkpointing buys you exactly-once for every step before the crash and at-least-once for the step that died. It does not make the dying step exactly-once.

The practical rule that falls out: the step boundary is the safe point. Any step with an external side effect (a git push, a social post, a payment) has to be idempotent or guarded, because a crash mid-step will run it twice. That is a discipline we want regardless, and now we have a cheap way to enforce where the boundaries sit. Next is wiring a persister into the real Loop runner so a cron kill resumes instead of restarting.