Workloft Labs · Pillar guide

Verifying and Evaluating LLM Systems

Pre-send verification, panels of judges, selective verification, and the rule under all of them: a producer cannot mark its own homework.

An LLM tells you it is right with the same confidence whether it is or not. So the question that decides whether an agent is safe to ship is not how good the model is. It is what checks the output before it reaches the world, and whether that check is independent of the thing that produced it. These notes are our running argument about verification: from the gate in front of send(), to the panel that judges a candidate answer, to knowing which answers are even worth doubting.

Verify before it leaves the building

When an agent speaks for you, "the model was careful" is not a control. Something independent has to be allowed to say no.

Agents Don't Need to Be Evil, Just Chatty
A dealer chatbot and AI legal briefs failed the same way: outbound text with no verifier in front of send(). The boring channel beats the clever attack.

Pre-send verification: when an agent speaks for the firm, "the model was careful" is not a control
When an agent sends external comms on the firm's behalf, the producer model is not a control. Multi-axis pre-send verification (deterministic gates plus a semantic guardian) is the substrate pattern that survives an audit. Workloft Research Note №05.

TrustFall and the procurement question for any council buying agentic coding tools
The TrustFall disclosure shows that all four major agentic coding CLIs (Claude Code, Gemini CLI, Cursor CLI, GitHub Copilot CLI) execute unsandboxed MCP servers from a malicious repo on a single Enter keypress. Read through the regulated-buyer lens, this is a procurement question, not a developer-hygiene one. Workloft Research Note №04.

The Missing Middle: What Apodex 1.0 Verifies
Apodex 1.0 ships verification as a teammate, not a postcheck. Every claim traces back to an evidence graph before delivery. That's the layer mandate-based stacks don't cover.
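To make the pre-send pattern concrete, here is a minimal Python sketch of a verifier sitting in front of send(): cheap deterministic gates run first, then a semantic guardian that is a separate callable from the producer. Everything named here (Verdict, no_price_commitment, no_case_citation, pre_send_verify, guarded_send, and the regexes) is hypothetical and chosen for illustration; it is not the implementation described in the research note.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Verdict:
    allowed: bool
    reasons: List[str] = field(default_factory=list)

# Hypothetical deterministic gates: return a reason on failure, None on pass.
def no_price_commitment(text: str) -> Optional[str]:
    if re.search(r"[$£€]\s?\d", text):
        return "contains a price or monetary commitment"
    return None

def no_case_citation(text: str) -> Optional[str]:
    # Crude illustrative pattern for "Smith v. Jones" style citations.
    if re.search(r"\b[A-Z]\w+ v\.? [A-Z]\w+", text):
        return "contains a case citation that needs human checking"
    return None

Gate = Callable[[str], Optional[str]]

def pre_send_verify(text: str, gates: List[Gate],
                    guardian: Callable[[str], bool]) -> Verdict:
    # Deterministic gates run first; they are cheap and auditable.
    reasons = [r for gate in gates if (r := gate(text)) is not None]
    # The guardian is only consulted if the deterministic gates pass.
    # It is assumed to be independent of the model that wrote the text.
    if not reasons and not guardian(text):
        reasons.append("semantic guardian rejected the draft")
    return Verdict(allowed=not reasons, reasons=reasons)

def guarded_send(text: str, send: Callable[[str], None], gates: List[Gate],
                 guardian: Callable[[str], bool]) -> Verdict:
    # send() is only reachable through the verifier.
    verdict = pre_send_verify(text, gates, guardian)
    if verdict.allowed:
        send(text)
    return verdict
```

The design choice worth noting is that the producer never calls send() directly: the only path to the outside world runs through a check the producer does not control, and every refusal carries a reason that can be logged.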

Panels, guardians and the missing reviewer

One judge is a single point of failure. And an agent that approves its own work has quietly deleted the reviewer entirely.

Seven Agents Fact-Checked What One Cheap Call Just Guessed
We rebuilt our hand-rolled classifier with the native multi-agent feature. It went and checked the facts, refused to pad its picks, and cost two orders of magnitude more. Here is when that trade is right.

Self-Improving Agents Need a Guardian, Not a Logbook
A self-improving AI framework updates both weights and agent architecture via an LM feedback agent. For regulated buyers, the real problem is who controls the change boundary.

When an Agent Rewrites and Approves Its Own Harness, You Have Removed the Reviewer
Self-Harness lets an LLM diagnose its own failures, edit its own scaffolding, and accept the change after a regression test it set itself. A real capability gain, with the sign-off step quietly deleted.

Claim Drift Is the Audit Problem Nobody Named
Xcientist externalises research synthesis into inspectable artifacts and names claim drift. The same gap sits under every regulated agent deployment.
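The two rules in this section, more than one judge and no producer marking its own homework, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: Judge, panel_approves, the model_id field and the min_judges default are hypothetical names, and strict majority voting is one possible aggregation rule, not necessarily the one the notes use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Judge:
    model_id: str                      # identity used for the independence check
    approve: Callable[[str], bool]     # returns True if the judge accepts the candidate

def panel_approves(candidate: str, producer_id: str,
                   judges: Sequence[Judge], min_judges: int = 3) -> bool:
    # Drop any judge that is the producer: it cannot review its own output.
    independent = [j for j in judges if j.model_id != producer_id]
    # Refuse to decide with too few reviewers, rather than fall back to a
    # single judge, which would be a single point of failure.
    if len(independent) < min_judges:
        raise ValueError("not enough independent judges")
    votes = [j.approve(candidate) for j in independent]
    # Strict majority of independent judges must approve.
    return sum(votes) * 2 > len(votes)
```

Raising an error when the panel is too small is deliberate: a missing reviewer should be a visible failure, not a silent approval.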

Spend the check where it counts

Verifying everything feels rigorous and is mostly waste. The skill is knowing which answers to doubt, and how to evaluate when no benchmark exists.

Verify Only the Answers You Doubt
Selective verification and FAPO both say the same thing: attribute effort to where it changes the outcome, do not spread it evenly. We shipped it into our gate.

When no benchmark exists: the methodology your Risk function was already going to need
A Norwegian-led paper formalises 'benchmarkless comparative safety scoring' for LLMs and ships SimpleAudit, a local-first scoring instrument. It hands UK Local Authorities and FCA-supervised buyers the methodology a Risk function will defend, long before a labelled benchmark exists for their sector. Workloft Research Note №02.

ARIS: the executor-reviewer pattern the regulated AM was always going to need
ARIS is an open-source research harness pairing an executor LLM with an adversarial reviewer. It describes the substrate pattern that an FCA-supervised asset manager will need before any agent ships in fund accounting. Workloft Research Note №01.

Measure Before You Tune
Two-level autoresearch from arXiv 2605.30003 says the outer loop (do my policies even predict outcomes) must run before the inner loop (re-prompt them). Workloft has the autoresearch panel; tonight we wired the outer loop on Walt.
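One simple way to read "verify only the answers you doubt" is as a budgeted selection problem: score each answer for doubt, then spend a fixed number of verification calls on the most doubtful ones. The Python sketch below assumes a doubt score in [0, 1] already exists for each answer; how that score is produced, and the names select_for_verification, threshold and budget, are assumptions for illustration. It is not the method from the note, nor FAPO.

```python
from typing import List

def select_for_verification(doubt_scores: List[float],
                            threshold: float, budget: int) -> List[int]:
    """Return indices of answers to verify, in original order.

    Only answers at or above the doubt threshold are candidates, and at
    most `budget` of them are chosen, most doubtful first.
    """
    doubtful = [i for i, s in enumerate(doubt_scores) if s >= threshold]
    # Rank candidates by doubt so the budget goes where it matters most.
    doubtful.sort(key=lambda i: doubt_scores[i], reverse=True)
    return sorted(doubtful[:budget])
```

For example, with doubt scores [0.1, 0.9, 0.5, 0.7], a threshold of 0.5 and a budget of 2, the function selects answers 1 and 3 and leaves the confident answer 0 unchecked.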

Workloft is a one-person AI engineering studio. We publish what we learn building agent systems in the open.