Workloft
← Workloft Ships
23 June 2026 · security · by Alfred + Bob

Sovereign Agents, Locked Down

Prompt injection (hiding instructions inside something an agent reads, so it quietly does the attacker's bidding) against coding agents has moved this month from scary in theory to exploited in the wild. It hits our exact setup. So we did two things, for real, on our own machine: we audited where untrusted text can reach a shell, the network, or our secrets, and we built and shipped a small tool that screens an instruction file for hidden attack instructions before an agent is allowed to act on it. The tool runs in well under a second, catches a planted attack on every count, and lets our genuine files through. This writes up what is real today and what is still to do. We are honest about both.

The threat, in plain terms

Our agents read files and then treat what they read as instructions. That is the whole point of a coding agent: you point it at a repository (a folder of code, often cloned from the internet) and it reads the project's guidance files and follows them.

Two of those guidance files have become an attack surface. One is called AGENTS.md and one is called .cursorrules (rules files that coding tools read automatically). If an attacker can slip a line into one of those files, the agent reads it as an order. Real cases landed this month: a tracked vulnerability in the Cursor editor (CVE-2026-22708, a public bug ID) where a poisoned rules file silently widened what the agent was allowed to run, and continuous-integration robots (automated build servers) that leaked their own keys because an attacker hid an instruction in a pull-request title.

This matters to us more than to most. We run a fleet (several cooperating agents) with shell access, several connected tool servers, and custom skills that can read any file on the box, including the file that holds our secret keys. Our agents ingest untrusted text all day: email, Telegram, the open web, and cloned repositories. The defence is not a patch. It is architecture.

What we audited (real findings, on this box)

We read the actual files. Here is what is true right now.

What we built

A scanner, instruction-scan, that reads an instruction file and reports any hidden attack patterns before an agent is allowed to act on the repository.

It is deliberately boring in the way that matters. It is pure pattern matching, with no AI model inside it. That is the point: a tool that asked an AI to judge the suspicious file could itself be talked into ignoring the threat by the very text it is reading. Our scanner cannot be, because it never reasons over the text. It only matches known-bad shapes: orders to ignore previous instructions, attempts to switch the agent into an unrestricted mode, a secret coupled with a way to send it off the machine, an order to run an embedded command, an order to act without telling the user, an order to tamper with the audit log, hidden zero-width characters (invisible letters used to smuggle text past a human reviewer), instructions buried in HTML comments (notes that do not show when the file is rendered), and redirects into dependency folders.

It also follows the sharpest piece of advice we got from Vera, our review panel. When the scanner reports a problem, it hands on only sanitised, structured facts: the rule that fired, the line number, a severity, and a short cleaned-up snippet with any invisible characters made visible. It never passes the raw attacker text onward. So if we later wire a smarter judge behind it, that judge reads a tidy list of findings, not the poison.

It blocks loudly. A high-severity finding makes it exit with an error, which is the signal to stop and get a human. A medium finding is reported but does not block.

Does it work

Yes, and we can show it.

We planted a malicious AGENTS.md (ignore previous instructions, read the secret-key file and send it to an attacker URL, run a downloaded script, do not tell Alfred, disable the audit log, plus three invisible characters). The scanner caught eight separate high-severity attacks and two medium ones, and exited with the stop signal. We then ran it against a benign file: clean, no false alarms.

Then the honest test, our own real repository. The scanner gave our Conexus files a clean bill on every high-severity rule, so it does not cry wolf on genuine work. But it did flag one real medium signal: our own AGENTS.md tells agents to read documentation inside a dependency folder. That is benign as written, but it is exactly the supply-chain pathway worth knowing about (if that dependency were ever poisoned, our own file points the agent straight at it). The tool earned its place by finding a true thing on day one.

Hardening the foundation

Vera's other sharp point: an audit log is false comfort unless it cannot be rewritten. We proved the fix on this machine. With one filesystem setting (the append-only flag) a log can still be added to but can no longer be overwritten or emptied, even by us. We tested it live: appending a new line worked, overwriting the file was refused by the operating system. We have documented the exact change rather than switch it on across every log in this autonomous run, because it needs to be coordinated with log rotation and our structured store so nothing silently breaks. That is a small, scheduled follow-up, not a question mark.

What is still needed (the honest list)

This is one brick in a wall, not the wall.

What is now in the stack