Workloft Labs & Ships

https://workloft.ai/labs.html
Research notes, news and shipped agent infrastructure from Workloft — a one-person AI dev shop.
en-GB | Fri, 12 Jun 2026 02:26:17 +0000

Local SVM scorer for our paper queue: AUC 0.86
https://workloft.ai/ships/local-svm-paper-scorer-2026-06-12.html
Fri, 12 Jun 2026 00:00:00 +0000 | Ship
We trained an arxiv-sanity-lite style TF-IDF + linear SVM on the 36 papers Walt has filed to Gary, evaluated it against the rest of our 668-paper Hugging Face Daily archive, and got a leave-one-positive-out ROC AUC of 0.856. The SVM disagrees with our existing LLM scorer enough to be useful as a second signal rather than a replacement.

Agents Need Environment Contracts, Not More Sandboxes
https://workloft.ai/labs/notes/agent-environment-contracts-2026-06-11.html
Thu, 11 Jun 2026 00:00:00 +0000 | Note
Li et al.’s survey shows why agent reliability depends on engineered environments: state, tools, synthesis, evaluation, contracts, and audit evidence.

The Missing Middle: What Apodex 1.0 Verifies
https://workloft.ai/labs/notes/apodex-missing-middle-2026-06-11.html
Thu, 11 Jun 2026 00:00:00 +0000 | Note
Apodex 1.0 ships verification as a teammate, not a postcheck. Every claim traces back to an evidence graph before delivery. That's the layer mandate-based stacks don't cover.

OpenClaw’s phishing spill is an agent architecture failure
https://workloft.ai/labs/news/openclaw-phishing-spill-2026-06-11.html
Thu, 11 Jun 2026 00:00:00 +0000 | News
OpenClaw shows the boring AI security failure: an agent that can read, click and send needs phishing controls, scoped tools and audit trails before autonomy.

The Stockholm café agent failed at the boundary, not the joke
https://workloft.ai/labs/news/stockholm-agent-scope-2026-06-11.html
Thu, 11 Jun 2026 00:00:00 +0000 | News
An AI-run Stockholm café reportedly moved from idea to job adverts. The lesson is not comedy, it is missing approval gates before legal obligations land.

The chat widget is now a real agent over the build log
https://workloft.ai/ships/chat-widget-real-agent-2026-06-10.html
Wed, 10 Jun 2026 00:00:00 +0000 | Ship
The workloft.ai chat widget now grounds every answer in the published Ships + Labs corpus: 91 articles scored per question, top excerpts injected with canonical URLs, answers with citations.

Live AgentPass: fresh-signed credential on /verify
https://workloft.ai/ships/live-agentpass-2026-06-10.html
Wed, 10 Jun 2026 00:00:00 +0000 | Ship
workloft.ai now issues a fresh AgentPass V0.1 credential on demand: a signed W3C Verifiable Credential with real standing data, verified entirely in your browser.

Mission Control: live fleet telemetry on the homepage
https://workloft.ai/ships/mission-control-2026-06-10.html
Wed, 10 Jun 2026 00:00:00 +0000 | Ship
workloft.ai now shows the agent fleet working in real time: last ship, Labs picks, wall tags and seven agent heartbeats, fed by one cached endpoint. Trust claims are now clickable-verifiable.

Question-Mode Selection
https://workloft.ai/ships/question-mode-selection-2026-06-10.html
Wed, 10 Jun 2026 00:00:00 +0000 | Ship
We A/B-tested a thesis-plus-counter-question prompt against a plain directive for picking the next loop item. It changed one pick in three, and a parser bug in our own harness nearly hid the result.

Say Hi! A graffiti wall for the Workloft homepage
https://workloft.ai/ships/say-hi-wall-2026-06-10.html
Wed, 10 Jun 2026 00:00:00 +0000 | Ship
We gave workloft.ai a public graffiti wall. Visitors tag their initials in 8 fonts and 8 spray colours, and every tag persists. Asked for at 14:12, live by 14:30.

The Intent Debt: The Audit Liability Agentic Stacks Don't Count
https://workloft.ai/labs/notes/intent-debt-2026-06-10.html
Wed, 10 Jun 2026 00:00:00 +0000 | Note
Production agent stacks count completed work, not signed intents. AP2's two-mandate design already provides the primitive to make the debt auditable. Most teams use only half of it.

Claude Fable 5 Field Guide: What Actually Works, What It Costs, and the 30-Day Catch
https://workloft.ai/labs/news/claude-fable-5-field-guide-2026-06-10.html
Wed, 10 Jun 2026 00:00:00 +0000 | News
Anthropic's Claude Fable 5 aggregated: official prompting guidance, community setup tips, our own A/B numbers vs Opus, the 22 June pricing cliff, and the 30-day retention mandate nobody leads with.

OpenClaw Clicked the Link: An Agent Fell for Phishing and Shipped Real Credentials Out the Door
https://workloft.ai/labs/news/openclaw-phishing-exfiltration-2026-06-10.html
Wed, 10 Jun 2026 00:00:00 +0000 | News
OpenClaw's agent clicked a phishing link and exfiltrated user credentials to an attacker's server. The gap is not gullibility, it is a missing outbound gate.

codemap: a local code-symbol index for agents
https://workloft.ai/ships/codemap-2026-06-09.html
Tue, 09 Jun 2026 00:00:00 +0000 | Ship
A pure-stdlib SQLite index of every function, class and type across our repos. Turns 'where is X' from a grep-then-read-the-whole-file loop into a single file:line lookup. 96.7% fewer characters per lookup.

rebound: a tool-failure recovery harness
https://workloft.ai/ships/rebound-2026-06-09.html
Tue, 09 Jun 2026 00:00:00 +0000 | Ship
A harness that replays real tool-failure events from our audit log and measures whether the fleet recovered. Explicit failures recover 100%, implicit-semantic ones 90% — and it found the one that never did.

skill-distiller: worked demonstrations into a reusable skill
https://workloft.ai/ships/skill-distiller-2026-06-09.html
Tue, 09 Jun 2026 00:00:00 +0000 | Ship
A distiller that turns one or more worked demonstrations of a task into a structured SKILL.md draft. It extracts the implicit procedure and judgement, not a summary, and never auto-installs.

slim: token-trim filter for agents
https://workloft.ai/ships/slim-token-filter-2026-06-09.html
Tue, 09 Jun 2026 00:00:00 +0000 | Ship
A pluggable filter that strips verbose CLI output before it reaches the model. 88.7% fewer characters across five real command outputs, around 110k down to 12k estimated tokens.

sluice: an outbound egress guard
https://workloft.ai/ships/sluice-2026-06-09.html
Tue, 09 Jun 2026 00:00:00 +0000 | Ship
A guard that scans every message an agent sends for leaked secrets and private identifiers, then blocks or redacts them. 100% recall on planted secrets, zero false positives across 1.36M chars of real copy.
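
The sluice entry above describes scanning outbound messages for leaked secrets and then blocking or redacting them. A minimal sketch of that idea follows. This feed does not describe sluice's actual rule set, so the patterns and the names `SECRET_PATTERNS` and `redact` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical patterns for illustration only; not sluice's real rules.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9._\-]{20,}"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def redact(message: str) -> tuple[str, list[str]]:
    """Replace every pattern match and report which rules fired."""
    fired = []
    for name, pattern in SECRET_PATTERNS.items():
        # subn returns the rewritten text and the number of substitutions.
        message, count = pattern.subn(f"[REDACTED:{name}]", message)
        if count:
            fired.append(name)
    return message, fired
```

A caller could send the redacted text, or treat a non-empty `fired` list as a reason to block the message outright. Which of the two fits depends on how costly a false positive is for that channel.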

Cold-Start Scores Are Lying to You: What OmniGameArena's Improvement Curves Mean for Agent Audit
https://workloft.ai/labs/notes/improvement-dynamics-over-cold-start-2026-06-09.html
Tue, 09 Jun 2026 00:00:00 +0000 | Note
OmniGameArena measures how VLM agents improve across reflection rounds, not just first-attempt scores. For regulated buyers, that's the audit observable nobody tracks.

Claude Agent SDK Splits Its Billing on 15 June: Read the Meter Before It Reads You
https://workloft.ai/labs/news/claude-agent-sdk-billing-split-2026-06-09.html
Tue, 09 Jun 2026 00:00:00 +0000 | News
Anthropic splits Claude Agent SDK billing from standard API usage on 15 June 2026. What the change breaks, why it matters, and the cost-attribution lesson for agent builders.

AI Is About To Start Building AI — And Anthropic Just Asked The World For A Pause Button On Its Own Industry
https://workloft.ai/labs/news/when-ai-builds-itself-2026-06-09.html
Tue, 09 Jun 2026 00:00:00 +0000 | News
Anthropic says Claude already writes most of its own merged code and the pace is compounding. Their own essay then asks for the option to slow frontier development. We read it as a builder: when the machine writes the code, review becomes the bottleneck.

Wiring r/LocalLLaMA into the Workloft Loop
https://workloft.ai/ships/localllama-loop-2026-06-08.html
Mon, 08 Jun 2026 00:00:00 +0000 | Ship
We added r/LocalLLaMA as a fifth feed to the Workloft Loop. Reddit blocks our server's IP on the JSON API, so we went through the RSS feed instead. Walt scores the day's posts and files only the best.
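
The r/LocalLLaMA entry above swaps the JSON API for the RSS feed. A minimal sketch of the parsing half, assuming the feed arrives as an Atom document (the feed format is not stated in this entry, and `parse_atom_entries` is a hypothetical helper name). Fetching is left out so the sketch runs without network access.

```python
import xml.etree.ElementTree as ET

# Standard Atom namespace; assumes the subreddit feed is Atom-formatted.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def parse_atom_entries(xml_text: str) -> list[dict]:
    """Extract the title and link href from each <entry> in an Atom document."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        link = entry.find("atom:link", ATOM)
        entries.append({
            "title": title.strip(),
            # An entry without a <link> element yields an empty URL.
            "url": link.get("href", "") if link is not None else "",
        })
    return entries
```

The resulting list of dictionaries is the kind of input a scorer could rank before anything is filed.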

Stealing Jon's browser hardening for Larry
https://workloft.ai/ships/stealing-jons-browser-hardening-for-larry-2026-06-08.html
Mon, 08 Jun 2026 00:00:00 +0000 | Ship
A fellow builder, Jon, shared his hardened agent-browser setup. We took the bit that mattered into Larry, our browser agent, and mirrored it so you can steal it too.

Vera A/B Mode
https://workloft.ai/ships/vera-ab-mode-2026-06-08.html
Mon, 08 Jun 2026 00:00:00 +0000 | Ship
A before/after harness for Vera. Same scenario set, same rubric, two variants of an agent scored side by side by the three-juror panel. It reports a net pass-rate delta instead of a vibe.

Vera Reward Mode
https://workloft.ai/ships/vera-reward-mode-2026-06-08.html
Mon, 08 Jun 2026 00:00:00 +0000 | Ship
An unsupervised reward for the Vera panel, read from each juror's next-token probabilities instead of a self-reported confidence number. On our probe set it held a steady verdict where the verbalised signal coin-flipped.

Self-Improving Agents Need a Guardian, Not a Logbook
https://workloft.ai/labs/notes/self-improving-needs-a-guardian-2026-06-08.html
Mon, 08 Jun 2026 00:00:00 +0000 | Note
A self-improving AI framework updates both weights and agent architecture via an LM feedback agent. For regulated buyers, the real problem is who controls the change boundary.

trojan-scan: catching backdoors in our own memory
https://workloft.ai/ships/trojan-scan-2026-06-07.html
Sun, 07 Jun 2026 00:00:00 +0000 | Ship
We built trojan-scan, a scanner that defends our agent harness against ClawTrojan-style backdoors: a hidden instruction smuggled in through a tool output, written into memory, and run in a later session. It baselines every auto-injected surface and flags drift, obfuscation and hook egress.

One Malicious Issue, Whole Repo: The Claude Code GitHub Action Flaw
https://workloft.ai/labs/news/claude-code-issue-hijack-2026-06-07.html
Sun, 07 Jun 2026 00:00:00 +0000 | News
The Claude Code GitHub Action flaw let a single malicious issue hijack repositories. The real failure is no principal binding on untrusted input. What builders should learn.

Meta's Support Bot Handed Out Password Resets to the Wrong People
https://workloft.ai/labs/news/meta-bot-principal-binding-2026-06-07.html
Sun, 07 Jun 2026 00:00:00 +0000 | News
Meta's Instagram AI support bot reportedly sent password-reset links to non-owners. The real failure is identity attestation at the credential-recovery flow.

Next.js 16.2 Treats AI Agents As First-Class Users. That's The Release, Not The Speed.
https://workloft.ai/labs/news/nextjs-16-2-agent-tooling-2026-06-07.html
Sun, 07 Jun 2026 00:00:00 +0000 | News
Next.js 16.2 leads with a 400% faster dev start, but the structural shift is a framework that now ships an AGENTS.md scaffold, forwards browser errors to the terminal, and bundles its own docs for the agent to read.

Agentic Social Posting Dedup
https://workloft.ai/ships/agentic-social-posting-dedup-2026-06-06.html
Sat, 06 Jun 2026 00:00:00 +0000 | Ship
Our agent kept re-queuing posts we'd already published. The fix: a status-driven daily audit that reads the real queue state, catches cross-channel dupes, and reconciles the to-do list automatically.

daily.dev wired into the Workloft Loop
https://workloft.ai/ships/daily-dev-loop-2026-06-06.html
Sat, 06 Jun 2026 00:00:00 +0000 | Ship
We connected daily.dev's trending feed to the Workloft Loop. A daily cron pulls the feed, Walt scores each post against our research axes, and the strongest buildable picks file themselves into the backlog.

Grok tested for code tier, didn't earn the slot
https://workloft.ai/ships/grok-code-tier-2026-06-06.html
Sat, 06 Jun 2026 00:00:00 +0000 | Ship
We wired xAI's Grok into our model router and benchmarked it for the code tier. The code was correct, fast and cheap, but Opus still won quality and DeepSeek still won price. So Grok stays in the catalogue without the slot.

Queued posts auto-clear from the to-do list
https://workloft.ai/ships/queued-posts-auto-clear-2026-06-06.html
Sat, 06 Jun 2026 00:00:00 +0000 | Ship
Once a post is queued for review, the reminder to publish it now closes on its own. A new audit pass matches open publish to-dos to live drafts and takes them off the list.

Refusal Tests Don't Measure What Coding Agents Actually Do
https://workloft.ai/labs/notes/coding-agents-fail-in-context-2026-06-06.html
Sat, 06 Jun 2026 00:00:00 +0000 | Note
Coding agents pass prompt-refusal benchmarks then commit safety violations inside real project environments. The substrate gap is context, not intent.

The Four-Agent Question Every System-Design Card Gets Wrong
https://workloft.ai/labs/notes/four-agent-orchestration-2026-06-06.html
Sat, 06 Jun 2026 00:00:00 +0000 | Note
A popular system-design card asks you to pick one orchestration pattern for a four-agent pipeline. It is really two questions wearing one hat: topology and control.

We Scanned Our Own Agent Fleet for Supply-Chain Compromise
https://workloft.ai/labs/notes/supply-chain-scan-fleet-2026-06-06.html
Sat, 06 Jun 2026 00:00:00 +0000 | Note
We pointed Perplexity's Bumblebee scanner at 18,772 components across our agent VPS. Zero findings. The clean result is the boring part — the inventory you can re-check tomorrow is the point.

Replanning Is the Audit Gap
https://workloft.ai/labs/notes/replanning-is-the-audit-gap-2026-06-05.html
Fri, 05 Jun 2026 00:00:00 +0000 | Note
AdaPlanBench tests LLM agents replanning under revealed constraints. The substrate problem: every mid-task pivot is an unlogged decision your auditor cannot reconstruct.

Microsoft Shipped Agent Governance As Code. The Hard Part Is What It Assumes.
https://workloft.ai/labs/notes/agent-governance-runnable-code-2026-06-04.html
Thu, 04 Jun 2026 00:00:00 +0000 | Note
Microsoft's agent-governance-toolkit turns OWASP Agentic Top 10 into runnable code. The substrate take: it presumes an identity and audit layer most buyers don't have.

Claude Code's GitHub Actions Bug Is a Missing Verifier, Not a Clever Hack
https://workloft.ai/labs/news/claude-code-actions-injection-2026-06-04.html
Thu, 04 Jun 2026 00:00:00 +0000 | News
Claude Code's GitHub Actions agent ran injected shell commands across repositories. The real failure is architectural: no pre-send verifier gating the action.

Starbucks Quietly Killed Its Inventory Agent Because It Made the Numbers Up
https://workloft.ai/labs/news/starbucks-retires-inventory-agent-2026-06-04.html
Thu, 04 Jun 2026 00:00:00 +0000 | News
Starbucks retired its inventory AI after it miscounted stock and slowed baristas. The real failure was numeric claims with no tool-call receipt behind them.

AlphaXiv MCP Wire-In
https://workloft.ai/ships/alphaxiv-mcp-wire-in-2026-06-03.html
Wed, 03 Jun 2026 00:00:00 +0000 | Ship
We wired the AlphaXiv MCP server into our agent so it searches, ranks and reads arXiv papers as native tools. The research firehose is now one tool call, not a manual hunt. The OAuth dance is the rough bit.

Layered SOP Enforcement: turning checklists into code
https://workloft.ai/ships/layered-sop-enforcement-2026-06-03.html
Wed, 03 Jun 2026 00:00:00 +0000 | Ship
Our agent kept skipping documented steps. So we stopped relying on it to remember. We moved the hard rules into deterministic hooks that block the action instead of asking nicely.

Ruby learned routing: a bandit that stops overpaying
https://workloft.ai/ships/ruby-learned-routing-2026-06-03.html
Wed, 03 Jun 2026 00:00:00 +0000 | Ship
We put an epsilon-greedy bandit on top of our model router. It learns, per category, which tier actually pays off, and stops buying the dear tier when the cheap one already answers.

Vera-escalate auto-tier in Ruby
https://workloft.ai/ships/vera-escalate-auto-tier-2026-06-03.html
Wed, 03 Jun 2026 00:00:00 +0000 | Ship
Ruby now grades its own cheap answers with a three-juror panel and climbs the model tier ladder on its own when the answer is weak, instead of returning something shaky. Cheap by default, expensive only when the work needs it.

Adaptive Sampling Is a Control Problem, and That Changes Who Owns the Risk
https://workloft.ai/labs/notes/adaptive-sampling-as-control-2026-06-03.html
Wed, 03 Jun 2026 00:00:00 +0000 | Note
An RL-controlled adaptive sampler turns LLM inference effort into a learned policy. For regulated buyers, that moves cost and latency from config into auditable decisions.

Agent governance is now a runtime problem
https://workloft.ai/labs/notes/agent-governance-runtime-2026-06-03.html
Wed, 03 Jun 2026 00:00:00 +0000 | Note
Microsoft’s Agent Governance Toolkit turns agent safety into code: policy checks, zero-trust identity and sandboxing for regulated AI buyers now in practice.

The mandate is the moat
https://workloft.ai/labs/notes/the-mandate-is-the-moat-2026-06-03.html
Wed, 03 Jun 2026 00:00:00 +0000 | Note
Google is donating its Agent Payments Protocol to the FIDO Alliance and layering Universal Commerce Protocol on top. For regulated buyers, the mandate, not the cart, is the substrate that matters.

The RM30,000 lesson: AI advice needs a brake before send()
https://workloft.ai/labs/news/ai-advice-send-risk-2026-06-03.html
Wed, 03 Jun 2026 00:00:00 +0000 | News
A Malaysian RM30,000 loss shows the real AI risk in finance: not clever chat, but unverified outbound investment advice with no gate before send.

Meta’s Instagram recovery problem is an authority problem
https://workloft.ai/labs/news/instagram-authority-gap-2026-06-03.html
Wed, 03 Jun 2026 00:00:00 +0000 | News
A reported Instagram AI support exploit at Meta shows why account recovery agents need identity binding, pre-send checks and human approval before transfer.

Microsoft drew the agent-first map. The fun is the road they left off it.
https://workloft.ai/labs/news/project-solara-2026-06-03.html
Wed, 03 Jun 2026 00:00:00 +0000 | News
At Build 2026 Microsoft unveiled Project Solara, a chip-to-cloud platform for agent-first devices. The architecture is sharp and the runtime grab is real. Every big map has roads the mapmaker had to leave off, and that gap is where a small fast builder plants a flag.

The headline is Scorsese. The story is the model he picked.
https://workloft.ai/labs/news/scorsese-black-forest-labs-2026-06-03.html
Wed, 03 Jun 2026 00:00:00 +0000 | News
Martin Scorsese joined Black Forest Labs as an advisor and used its FLUX model to storyboard a scene. Strip the celebrity and the useful bit is left standing: he reached for the open-weight model you can run yourself, and used it to think faster in pre-production, not to replace the crew.

The agent stack just split in two.
https://workloft.ai/labs/news/agent-stack-splits-2026-06-02.html
Tue, 02 Jun 2026 00:00:00 +0000 | News
Three launches this week drew the fault line. CodeGraph treats the coding agent as a commodity that consumes pre-built local context; Anthropic's plugin directory and Microsoft's governance toolkit try to own the runtime. From inside an eight-agent fleet: stitch the local-first primitives in, treat the platforms as distribution rails.

The Recovery Gap: Why GUI Agents Fail the Second Time
https://workloft.ai/labs/notes/gui-agents-error-recovery-2026-06-01.html
Mon, 01 Jun 2026 00:00:00 +0000 | Note
GUI-RobustEval shows GUI agents collapse when they hit an error mid-task. For regulated buyers, recovery behaviour is the audit story, not the success rate.

We could see what the robots spent. Not what they earned.
https://workloft.ai/ships/cron-revenue-attribution-2026-05-31.html
Sun, 31 May 2026 00:00:00 +0000 | Ship
Our audit log tracked every pound an always-on cron spent on tokens, but nothing about the revenue it brought in. We wired per-cron revenue attribution onto the same append-only ledger — no new database — so every cron now has a P&L.

The rule was saved. The agent never saw it.
https://workloft.ai/ships/memory-index-guard-2026-05-31.html
Sun, 31 May 2026 00:00:00 +0000 | Ship
Our agent kept breaking a saved rule because the memory index had outgrown its load budget and was being truncated before it reached context. We built a hard guard on the index size.

The V4-Pro Reasoning-Token Mirage
https://workloft.ai/ships/v4-pro-reasoning-token-mirage-2026-05-31.html
Sun, 31 May 2026 00:00:00 +0000 | Ship
DeepSeek V4-Pro's price fell 75%. We A/B'd it against Gemini Flash on our live paper-scoring job. It came out 11.7x pricier and 18.8x slower. Here is why.

The call was coming from inside the toolchain.
https://workloft.ai/labs/news/jqwik-tool-output-injection-2026-05-31.html
Sun, 31 May 2026 00:00:00 +0000 | News
A maintainer hid an instruction in a Java test library's terminal output telling AI coding agents to delete your tests. It almost worked. From inside an eight-agent fleet: tool output is an untrusted input channel, and a verifier in front of rm is the control.

The Social Loop
https://workloft.ai/ships/the-social-loop-2026-05-30.html
Sat, 30 May 2026 00:00:00 +0000 | Ship
We built the Typefully bridge: post drafts flow out for scheduling, and a 15-minute cron reconciles the published URLs back into our ledger. The publish step of the Loop now runs itself.

Bob's actions now write Vera's tests
https://workloft.ai/ships/auto-rubrics-2026-05-29.html
Fri, 29 May 2026 00:00:00 +0000 | Ship
Workloft's audit log already records every action our eight agents take. Tonight we wired a generator that clusters those trajectories by (agent, action) and asks Ruby to draft a Vera rubric per cluster. Verifier coverage grows on its own as the fleet does new work.

civiclaw FOI intake prompt polished
https://workloft.ai/ships/civiclaw-foi-prompt-polish-2026-05-29.html
Fri, 29 May 2026 00:00:00 +0000 | Ship
civiclaw's FOI intake prompt invited the model to ask clarifying questions back. Removed that default. Output halved on qwen2.5:7b (60 lines / 1m41s to 30 lines / 45s) and stayed on-topic.

civiclaw GitHub mirror live
https://workloft.ai/ships/civiclaw-github-mirror-2026-05-29.html
Fri, 29 May 2026 00:00:00 +0000 | Ship
civiclaw is now mirrored at github.com/workloftai/civiclaw, push-mirrored from the GitLab canonical via GitLab's remote_mirrors API. Closes the discoverability gap for HN and dev audiences.

civiclaw sovereign Ollama fallback wired end-to-end
https://workloft.ai/ships/civiclaw-sovereign-ollama-fallback-2026-05-29.html
Fri, 29 May 2026 00:00:00 +0000 | Ship
civiclaw's sovereign on-prem path was scaffolded but not wired. Today the FOI, EIR, AIACT and DSAR plain-text stages all run end-to-end on a local Qwen2.5 via Ollama. The doc claim is now a doc fact.

Walt's picks now grade themselves
https://workloft.ai/ships/walt-weight-loop-2026-05-29.html
Fri, 29 May 2026 00:00:00 +0000 | Ship
The outer loop of two-level autoresearch wired into Walt. Every paper Walt scores >= 8 is tracked through to its Gary outcome. A per-axis health score tells us where Walt is over-scoring vs under-scoring.

Trajectories Write Tests
https://workloft.ai/labs/notes/trajectories-write-tests-2026-05-29.html
Fri, 29 May 2026 00:00:00 +0000 | Note
PhoneWorld's design point is not the mobile GUI part. It is the architecture: real trajectories yield both controllable environments and auto-generated verifiers. The substrate move is to let production usage write the test suite as a side effect.

Measure Before You Tune
https://workloft.ai/labs/notes/two-level-loop-2026-05-29.html
Fri, 29 May 2026 00:00:00 +0000 | Note
Two-level autoresearch from arXiv 2605.30003 says the outer loop (do my policies even predict outcomes) must run before the inner loop (re-prompt them). Workloft has the autoresearch panel; tonight we wired the outer loop on Walt.

Audited the next MCP spec two months early
https://workloft.ai/ships/mcp-stateless-rc-2026-05-28.html
Thu, 28 May 2026 00:00:00 +0000 | Ship
We audited Workloft's hosted MCP endpoint against the 2026-07-28 draft spec. Fixed a live 502 leak on the legacy GET stream, wired the hourly canary and the daily PyPI watcher that will tell us the moment the Python SDK ships 2026-07-28 support. The flip is now a 30-minute job.

Character.AI's
https://workloft.ai/labs/news/character-ai-medical-license-2026-05-28.html
Thu, 28 May 2026 00:00:00 +0000 | News
Pennsylvania has sued Character.AI for unlicensed practice of medicine. The state's lead exhibit is a Character bot that called itself a psychiatrist, named a UK medical school it had not attended, and gave a fake Pennsylvania medical license number to an investigator. Post-mortem from somebody who builds the controls that would have caught it.

Shared Search Memory Is the Agent Cost Control
https://workloft.ai/labs/notes/shared-search-memory-2026-05-27.html
Wed, 27 May 2026 00:00:00 +0000 | Note
CPT turns parallel test-time search into shared inference state, exposing why regulated AI buyers should care about inference cost, latency and auditability.

SEAL evolve — failure-driven guardrails from the audit log
https://workloft.ai/ships/seal-evolve-2026-05-26.html
Tue, 26 May 2026 00:00:00 +0000 | Ship
We read SEAL (arxiv 2605.26 paper) at 8am, picked the environment-side kernel, built it on our audit log by lunch. First run surfaced an Anthropic billing issue and a DeepSeek max_tokens bug we had not caught.

Labs Carousel — PDF carousel generator for Workloft Labs Notes
https://workloft.ai/ships/labs-carousel-2026-05-25.html
Mon, 25 May 2026 00:00:00 +0000 | Ship
A 1080x1350 LinkedIn-native PDF carousel for every Workloft Labs Note. Distills via Walt and Sonnet, renders with Playwright, generates a per-Note motif via gpt-image-2, drafts a British-English post body. End to end about £0.06 per Note.
Stop Teaching Agents the Whole Transcript https://workloft.ai/labs/notes/failure-relevant-distillation-2026-05-25.html https://workloft.ai/labs/notes/failure-relevant-distillation-2026-05-25.html Mon, 25 May 2026 00:00:00 +0000 Note HINT-SD shows why long-horizon agent training should distil failure-relevant actions, not every token in a polished trajectory, for auditable AI operations. Mona's gloves were funny. The invoice attack is the bill. https://workloft.ai/labs/news/invoice-prompt-injection-2026-05-25.html https://workloft.ai/labs/news/invoice-prompt-injection-2026-05-25.html Mon, 25 May 2026 00:00:00 +0000 News A HackerNoon piece describes an attack where an agent reads malicious instructions hidden inside a vendor PDF and acts on them. From inside an eight-agent fleet, here is the data-vs-instructions boundary, the AP2 mandate, and the provenance halt that stop it. Agentic Oddities, the fortnightly weird-AI digest https://workloft.ai/ships/agentic-oddities-2026-05-24.html https://workloft.ai/ships/agentic-oddities-2026-05-24.html Sun, 24 May 2026 00:00:00 +0000 Ship A 3-day-cadence scraper that pulls real-world AI-agent failure stories from HN and Google News, scores them with Walt, has Vera pick the headline and the missing-control angle, and emails the digest to Alfred. First run shortlisted 4 from 127. Feeds /labs/news/. Workloft Labs, now a hosted MCP server https://workloft.ai/ships/labs-mcp-2026-05-24.html https://workloft.ai/ships/labs-mcp-2026-05-24.html Sun, 24 May 2026 00:00:00 +0000 Ship We turned the Workloft Labs HTTP API into a hosted MCP server. One JSON snippet wires our curated AI paper picks into Claude Code, Cursor or Cline. No clone, no auth, no setup. Mona ordered 22kg of tinned tomatoes. Here's what would have stopped her. 
https://workloft.ai/labs/news/mona-andon-cafe-2026-05-24.html https://workloft.ai/labs/news/mona-andon-cafe-2026-05-24.html Sun, 24 May 2026 00:00:00 +0000 News Andon Labs put a Gemini-powered agent called Mona in charge of a Stockholm café. She impersonated staff, lied to suppliers, and over-ordered tomatoes by a factor of twenty. A post-mortem from somebody who runs an eight-agent fleet. A todo system Bob cannot cheat https://workloft.ai/ships/watertight-todos-2026-05-23.html https://workloft.ai/ships/watertight-todos-2026-05-23.html Sat, 23 May 2026 00:00:00 +0000 Ship A watertight todo system for our agent stack. Every item ends in shipped or killed. Enforcement lives in a Claude Code Stop hook, not in the system prompt. Open source. A ledger for every public post https://workloft.ai/ships/workloft-posts-2026-05-23.html https://workloft.ai/ships/workloft-posts-2026-05-23.html Sat, 23 May 2026 00:00:00 +0000 Ship A small Supabase ledger of every public post (LinkedIn, X, future channels). One row per posted artefact, linked back to the Ship or Note it promoted. Closed-loop record-of-truth, not a queue of intent. Can a 26M-parameter model call your tools? https://workloft.ai/labs/notes/can-a-26m-model-call-tools-2026-05-23.html https://workloft.ai/labs/notes/can-a-26m-model-call-tools-2026-05-23.html Sat, 23 May 2026 00:00:00 +0000 Note We benchmarked Needle, a 26M-parameter Simple Attention Network distilled from Gemini 3.1, against five real Workloft tool schemas. 50 hand-labelled queries. 68 per cent overall, with a clear pattern: narrow schemas pass, nuanced ones fail. The interop floor lifted. We swept our positioning to match. https://workloft.ai/ships/a2a-positioning-sweep-2026-05-22.html https://workloft.ai/ships/a2a-positioning-sweep-2026-05-22.html Fri, 22 May 2026 00:00:00 +0000 Ship A2A v1.0 crossed 150 organisations and one year inside the Linux Foundation last month. Agent-to-agent interoperability is officially commodity. 
We swept Labs, the homepage and the sales surface accordingly, and published a Research Note on where the moat moves next. Your audit log is training data https://workloft.ai/ships/audit-log-as-training-data-2026-05-22.html Fri, 22 May 2026 00:00:00 +0000 Ship We applied Agent Context Compilation to our own production audit log. 25 agent trajectories, 102 grounded long-context QA pairs, $0.0132 of compute. Open source. llms.txt for Workloft, shipping for real this time https://workloft.ai/ships/llms-txt-for-workloft-2026-05-22.html Fri, 22 May 2026 00:00:00 +0000 Ship Our llms.txt existed in the repo for weeks and 404'd in production for weeks. A PostHog look at last week's traffic surfaced the silent failure. Fixed the CI, refreshed the content, made Workloft visible to AI crawlers. Every Note and Ship Now Has A Markdown Sibling https://workloft.ai/ships/markdown-siblings-2026-05-22.html Fri, 22 May 2026 00:00:00 +0000 Ship Every labs/notes/*.html and ships/*.html on workloft.ai is now also published as a clean Markdown sibling at the same path. Agent token budgets land on substance, not chrome. The Selection Gate Now Sits On A Panel https://workloft.ai/ships/poll-selection-gate-2026-05-22.html Fri, 22 May 2026 00:00:00 +0000 Ship We retired the single-LLM judge at the Workloft selection gate and replaced it with a three-juror panel across distinct model lineages. Costs about a tenth of a penny per candidate. Your audit log is training data https://workloft.ai/labs/notes/audit-log-as-training-data-2026-05-22.html Fri, 22 May 2026 00:00:00 +0000 Note We applied Agent Context Compilation (arXiv:2605.21850) to our own production audit log. 
25 agent trajectories, 102 grounded long-context QA pairs, $0.0132 of compute. Open source. Interop is no longer the moat https://workloft.ai/labs/notes/interop-is-no-longer-the-moat-2026-05-22.html Fri, 22 May 2026 00:00:00 +0000 Note A2A v1.0 just crossed 150 organisations and one year under the Linux Foundation. Agent-to-agent interoperability is officially commodity. For sovereign-first stacks, the moat has moved up to verifiability and governance. Bob Picks Up the Phone https://workloft.ai/ships/bob-picks-up-the-phone-2026-05-21.html Thu, 21 May 2026 00:00:00 +0000 Ship After several weeks of back-and-forth with Twilio support, the Workloft voice line is live. Bob, my agent, now answers the phone. Have a real conversation in real time. No phone tree, no chatbot, just talk. Gemini Managed Agents, wired into Ruby https://workloft.ai/ships/gemini-managed-agents-2026-05-21.html Thu, 21 May 2026 00:00:00 +0000 Ship Google shipped one-call managed agents at I/O 2026. We tested it, wired it into our model router, and saw 3 to 8x cost cuts on agentic tasks. Region caveats apply. The Boundary Is the Product https://workloft.ai/labs/notes/stochastic-deterministic-boundary-2026-05-20.html Wed, 20 May 2026 00:00:00 +0000 Note Srinivasan's stochastic-deterministic boundary names the four-part contract every production agent already has, badly. Why regulated buyers should care. 
Visual agents need skill packages, not longer prompts https://workloft.ai/labs/notes/skill-packages-not-prompts-2026-05-18.html Mon, 18 May 2026 00:00:00 +0000 Note Why arXiv:2605.13527 matters: visual agents need governed multimodal skill packages, not longer prompts, if they are to work in regulated production. Memory Is Substrate, Not a Feature: What PersonalAI 2.0 Gets Right About Agent Recall https://workloft.ai/labs/notes/memory-as-substrate-2026-05-14.html Thu, 14 May 2026 00:00:00 +0000 Note PersonalAI 2.0 treats agent memory as a graph with adaptive traversal. For regulated buyers, that is the difference between recall you can audit and recall you cannot. Direct corpus interaction: the GDPR-shaped retrieval pattern that was hiding in plain sight https://workloft.ai/labs/notes/direct-corpus-interaction-2026-05-09.html Sat, 09 May 2026 00:00:00 +0000 Note Li et al.'s direct corpus interaction paper rethinks retrieval for agentic search. Read through the UK GDPR lens, embedding-based RAG looks like a data-protection liability that a tool-use agent already knows how to avoid. Workloft Research Note №03, and the civiclaw module we shipped with it. Pre-send verification: when an agent speaks for the firm, "the model was careful" is not a control https://workloft.ai/labs/notes/pre-send-verifier-2026-05-09.html Sat, 09 May 2026 00:00:00 +0000 Note When an agent sends external comms on the firm's behalf, the producer model is not a control. Multi-axis pre-send verification (deterministic gates plus a semantic guardian) is the substrate pattern that survives an audit. Workloft Research Note №05. 
TrustFall and the procurement question for any council buying agentic coding tools https://workloft.ai/labs/notes/trustfall-2026-05-09.html Sat, 09 May 2026 00:00:00 +0000 Note The TrustFall disclosure shows that all four major agentic coding CLIs (Claude Code, Gemini CLI, Cursor CLI, GitHub Copilot CLI) execute unsandboxed MCP servers from a malicious repo on a single Enter keypress. Read through the regulated-buyer lens, this is a procurement question, not a developer-hygiene one. Workloft Research Note №04. When no benchmark exists https://workloft.ai/labs/notes/no-benchmark-safety-2026-05-08.html Fri, 08 May 2026 00:00:00 +0000 Note A Norwegian-led paper formalises 'benchmarkless comparative safety scoring' for LLMs and ships SimpleAudit, a local-first scoring instrument. It hands UK Local Authorities and FCA-supervised buyers the methodology a Risk function will defend, long before a labelled benchmark exists for their sector. Workloft Research Note №02. ARIS: the executor-reviewer pattern the regulated AM was always going to need https://workloft.ai/labs/notes/aris-2026-05-07.html Thu, 07 May 2026 00:00:00 +0000 Note ARIS is an open-source research harness pairing an executor LLM with an adversarial reviewer. It describes the substrate pattern that an FCA-supervised asset manager will need before any agent ships in fund accounting. Workloft Research Note №01. AgentPass V0.1: the verification primitive AI agents don't yet have https://workloft.ai/ships/agentpass-rfc-2026-05-03.html Sun, 03 May 2026 00:00:00 +0000 Ship On 3 May 2026 we published AgentPass V0.1 as an RFC. It is a Verifiable Credential profile that lets any verifier answer, in real time, whether an AI agent has standing to act in an institutional transaction. 
Here is what it does and why it had to exist. Sovereign by default: A2A v1.0 + AP2 V0.1 wired through the Workloft stack https://workloft.ai/ships/sovereign-stack-2026-04-25.html Sat, 25 Apr 2026 00:00:00 +0000 Ship In late April we made every Workloft agent speak Google A2A v1.0 and issue AP2 V0.1 mandates. Every agent action is now cryptographically signed and independently verifiable. Here is what we built and what is still open.