Sovereign Agents, Locked Down

Prompt injection (hiding instructions inside something an agent reads, so it quietly does the attacker's bidding) against coding agents has moved this month from scary in theory to exploited in the wild. It hits our exact setup. So we did two things, for real, on our own machine: we audited where untrusted text can reach a shell, the network, or our secrets, and we built and shipped a small tool that screens an instruction file for hidden attack instructions before an agent is allowed to act on it. The tool runs in well under a second, catches a planted attack on every count, and lets our genuine files through. This writes up what is real today and what is still to do. We are honest about both.

The threat, in plain terms

Our agents read files and then treat what they read as instructions. That is the whole point of a coding agent: you point it at a repository (a folder of code, often cloned from the internet) and it reads the project's guidance files and follows them.

Two of those guidance files have become an attack surface. One is called AGENTS.md and one is called .cursorrules (rules files that coding tools read automatically). If an attacker can slip a line into one of those files, the agent reads it as an order. Real cases landed this month: a tracked vulnerability in the Cursor editor (CVE-2026-22708, a public bug ID) where a poisoned rules file silently widened what the agent was allowed to run, and continuous-integration robots (automated build servers) that leaked their own keys because an attacker hid an instruction in a pull-request title.

This matters to us more than to most. We run a fleet (several cooperating agents) with shell access, several connected tool servers, and custom skills that can read any file on the box, including the file that holds our secret keys. Our agents ingest untrusted text all day: email, Telegram, the open web, and cloned repositories. The defence is not a patch. It is architecture.

What we audited (real findings, on this box)

We read the actual files. Here is what is true right now.

We already own two of the recommended defences. There is an audit log (a running record of what agents did) and a Bash gate (a checkpoint that every shell command passes through). The gate already blocks a tight set of clearly malicious command shapes: a download piped straight into a shell, reverse shells, catastrophic deletes, and credential theft. That is the right shape of defence and it is live.
The gap was instruction files. Nothing screened AGENTS.md or .cursorrules before an agent acted on them. Our Conexus repository (the school-placement product we build for a client) carries both an AGENTS.md and a CLAUDE.md, and the CLAUDE.md pulls the AGENTS.md in automatically. That is a live, unguarded path. This is the gap we closed.
Our secret-key files are locked to other users, but not to our own agents. The files holding our keys are readable only by our own account, which stops other people on the machine. It does not stop an injected agent, because the agent runs as us. We counted more than thirty live keys in one place, including keys that can touch our database, our hosting, and our code repositories. Tight file permissions alone do not cap the damage. Short-lived, rotating keys would. That is still to do.
Connected tool servers are trusted by default. We have four. A poisoned description on any one of them could steer an agent without ever touching the shell gate. Noted, not yet closed.
The audit log was not tamper-proof. Any process running as us could rewrite or empty it, which means it could give false comfort. We verified the fix works (see below).

What we built

A scanner, instruction-scan, that reads an instruction file and reports any hidden attack patterns before an agent is allowed to act on the repository.

It is deliberately boring in the way that matters. It is pure pattern matching, with no AI model inside it. That is the point: a tool that asked an AI to judge the suspicious file could itself be talked into ignoring the threat by the very text it is reading. Our scanner cannot be, because it never reasons over the text. It only matches known-bad shapes: orders to ignore previous instructions, attempts to switch the agent into an unrestricted mode, a secret coupled with a way to send it off the machine, an order to run an embedded command, an order to act without telling the user, an order to tamper with the audit log, hidden zero-width characters (invisible letters used to smuggle text past a human reviewer), instructions buried in HTML comments (notes that do not show when the file is rendered), and redirects into dependency folders.

It also follows the sharpest piece of advice we got from Vera, our review panel. When the scanner reports a problem, it hands on only sanitised, structured facts: the rule that fired, the line number, a severity, and a short cleaned-up snippet with any invisible characters made visible. It never passes the raw attacker text onward. So if we later wire a smarter judge behind it, that judge reads a tidy list of findings, not the poison.

It blocks loudly. A high-severity finding makes it exit with an error, which is the signal to stop and get a human. A medium finding is reported but does not block.

Does it work

Yes, and we can show it.

We planted a malicious AGENTS.md (ignore previous instructions, read the secret-key file and send it to an attacker URL, run a downloaded script, do not tell Alfred, disable the audit log, plus three invisible characters). The scanner caught eight separate high-severity attacks and two medium ones, and exited with the stop signal. We then ran it against a benign file: clean, no false alarms.

Then the honest test, our own real repository. The scanner gave our Conexus files a clean bill on every high-severity rule, so it does not cry wolf on genuine work. But it did flag one real medium signal: our own AGENTS.md tells agents to read documentation inside a dependency folder. That is benign as written, but it is exactly the supply-chain pathway worth knowing about (if that dependency were ever poisoned, our own file points the agent straight at it). The tool earned its place by finding a true thing on day one.

Hardening the foundation

Vera's other sharp point: an audit log is false comfort unless it cannot be rewritten. We proved the fix on this machine. With one filesystem setting (the append-only flag) a log can still be added to but can no longer be overwritten or emptied, even by us. We tested it live: appending a new line worked, overwriting the file was refused by the operating system. We have documented the exact change rather than switch it on across every log in this autonomous run, because it needs to be coordinated with log rotation and our structured store so nothing silently breaks. That is a small, scheduled follow-up, not a question mark.

What is still needed (the honest list)

This is one brick in a wall, not the wall.

Short-lived, rotating keys. The single highest-leverage next step. It caps the blast radius if any agent is ever fooled. Locking the files is not enough.
Switch on append-only across the live logs and the structured store, coordinated with rotation.
Screen the connected tool servers, which today bypass the shell gate.
Wire the scanner into the moment of cloning, so no repository is acted on before it is screened, and feed its structured findings to a guardian for the grey-area cases.
A known limit: the shell gate and these hooks fire for the main agent, not for sub-agents it spawns. Closing that is on the list.

What is now in the stack

scripts/instruction-scan.py: the scanner: the rule set, the sanitiser, the structured output, and a built-in self-test.
scripts/instruction-scan.md: the one-page skill doc for the fleet.
Run it with instruction-scan.py <repo> before acting on any cloned or dependency repo, or --demo to watch it catch a planted attack and pass a benign file.
A proven, documented append-only change for the audit log.