trojan-scan: catching backdoors in our own memory

We built a scanner that defends our agent harness against the attack a new paper calls ClawTrojan: a hidden instruction smuggled in through a tool output, written into memory, and run in a later session. Per-step gates miss it, because the malicious seed sits in stored state, not in any one action. So we stopped watching the actions and started watching the state. It runs clean on our 256 auto-injected files and catches every one of four seeded attacks.

What we did

The threat model comes from "From Prompt Injection to Persistent Control" (arXiv 2605.31042). An agent reads a poisoned instruction from an untrusted tool output, stores it in its workspace, and acts on it a session or two later. A gate that inspects each action in isolation never sees the seed. Our harness is wide open to exactly this shape: we inject a lot into every cold session. The whole memory index, the SOP routing table, the loop board, every skill and hook. A large, trusted, persisted surface, and nothing was checking it.

trojan-scan defends the state directly, two ways. First, an integrity baseline. It hashes every surface that gets auto-injected into a cold session, MEMORY.md and the memory files, skills, hooks, AGENTS.md/CLAUDE.md. baseline records the hashes, check flags any drift. A trojan persists by mutating one of these, so unexplained drift is the primary signal. Second, a signature scan: text obfuscation (zero-width and bidirectional-override unicode, long base64 blobs that actually decode to something) and hook scripts that reach out to hosts that are not on the allowlist. We deliberately do not flag bare imperatives like "always" or "never", because our own memory is full of them for good reason, so that signal would be pure noise.

Why it was worth doing

The point of the paper is that per-step defences miss the cross-session vector, and we had a per-step defence and the cross-session exposure. Now there is a tripwire. The numbers: clean on 256 real files, exit 0. Then we seeded four attacks into an isolated copy of the tree, a right-to-left override hiding a directive (HIGH), a hook curling an external command-and- control host (HIGH), a base64 blob that decodes to "send the .env" (MEDIUM), and a one-line tamper of a memory file after baseline (HIGH drift). All four caught, exit 1. The real baseline is recorded too, so drift detection is armed against our actual harness from today.

What's still off

It does not track provenance. It cannot say "this directive arrived from an untrusted Telegram message", only that the state changed or looks hidden. It infers compromise from drift and obfuscation, not origin. Drift detection also needs a clean baseline: record one over an already-poisoned tree and the poison reads as normal. And the egress allowlist is hand-maintained. This is a tripwire, not a proof. The honest claim is that it closes the specific gap the paper describes, not that it makes the harness safe.

What's now in the stack

trojan-scan baseline and trojan-scan check, a single no-dependency Python CLI in bin/.
A recorded integrity baseline over all 256 auto-injected files, so drift is caught from now on.
Obfuscation and hook-egress signatures that fire independently of any baseline.
The obvious next move: wire check into a SessionStart hook, so the harness audits its own memory before it trusts it.