Workloft
▸ WORKLOFT RESEARCH NOTE №01 · 7 MAY 2026

ARIS: the executor-reviewer pattern that the regulated AM was always going to need

An open-source research harness picked up 82 Hugging Face upvotes this week. The interesting thing isn't the upvotes — it's that the architecture mirrors, almost component for component, the substrate pattern an FCA-supervised asset manager will need before any agent goes live in fund accounting.

By Alfred Churchill · Workloft Labs · ~1,150 words · 5 min read
REG FIT ●●● · STRONG · APPLIES DIRECTLY TO REGULATED AGENT DEPLOYMENT

§1 What ARIS actually is

ARIS is a research harness — a framework for running long-horizon research tasks reliably with multiple LLMs in the loop. Rather than wiring one big model to a tool belt and praying, ARIS pairs an executor (the model doing the actual work) with a reviewer (a second model whose only job is to push back) and an orchestrator that mediates the two. It exposes its skills as MCP tools, persists what it learns in a research wiki, and produces deterministic figures so a paper can be reproduced rather than just admired.
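
What "exposes its skills as MCP tools" means in practice: a handler registered with an MCP server so any compliant client can call it. Here is a minimal sketch using the official MCP Python SDK's FastMCP interface; the tool name, signature, and wiki stub are hypothetical illustrations, not ARIS's actual code.

    # Hypothetical sketch: exposing one research skill as an MCP tool.
    # Assumes the official MCP Python SDK; the tool body is illustrative,
    # not ARIS's real implementation.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("research-harness")

    @mcp.tool()
    def search_wiki(query: str) -> str:
        """Look up persisted findings in the research wiki."""
        # A real implementation would query the harness's wiki store.
        return f"(stub) no persisted findings matching {query!r}"

    if __name__ == "__main__":
        mcp.run()  # serve over stdio to any MCP-compatible client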

The headline architectural choice is the executor-reviewer split. The two models read the same task and the same partial outputs but optimise for different things. The executor is graded on producing the answer. The reviewer is graded on finding what's wrong with the answer. They argue. The orchestrator decides when the argument has converged enough to move on.
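
In code, the loop is small. A minimal sketch, with both model calls stubbed; in a real run they would be two separate LLM calls with different system prompts, and nothing here is ARIS's actual API.

    # Hypothetical executor-reviewer-orchestrator loop. Both calls are
    # stubs standing in for LLM requests with different system prompts.
    def call_executor(task: str, critique: str | None) -> str:
        # Graded on producing the answer; sees the task plus the
        # reviewer's last critique, nothing else.
        return f"draft answer for {task!r}" + (" (revised)" if critique else "")

    def call_reviewer(task: str, answer: str) -> str:
        # Graded on finding faults; sees only the task and the answer,
        # never the executor's prompt or objective. That separation is
        # what makes the challenge independent.
        if "(revised)" in answer:
            return "NO FURTHER OBJECTIONS"
        return "objection: no sources cited"

    def orchestrate(task: str, max_rounds: int = 5) -> str:
        # The orchestrator owns the stopping policy: it decides when
        # the argument has converged enough to move on.
        critique: str | None = None
        answer = ""
        for _ in range(max_rounds):
            answer = call_executor(task, critique)
            critique = call_reviewer(task, answer)
            if critique == "NO FURTHER OBJECTIONS":
                return answer
        return answer  # in a regulated setting, escalate to a human here

    print(orchestrate("summarise the NAV discrepancy"))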

That's the whole pattern. And it's been done before in academic settings under various names — multi-agent debate, self-critique, constitutional AI variants. ARIS makes it a runtime, not a paper. That's the news.

§2 Why it lands harder for regulated buyers than the AI-Twitter take suggests

The community read on ARIS, judging by the Hugging Face thread, is "ooh, autonomous research harness, now Claude can write papers." That misses what the architecture actually buys you, which becomes obvious the moment you hold it next to the UK's model risk expectations — the PRA's SS1/23 for banks, and the equivalent expectations the FCA is threading through SYSC for asset managers.

The standing model risk principle is straightforward, and old: any model whose output influences a regulated decision must be challenged by a function that did not produce it. Not advised by, not cross-checked by — challenged. The challenger has independent authority and an independent reporting line. Banks have been doing this with quantitative analysts for thirty years. Asset managers are now being asked to do it with LLM-driven tools.

The straightforward problem: a single LLM, no matter how good, is the executor and the reviewer in one body. Asking it to grade itself is not an independent challenge function under any reasonable supervisory reading. It is a self-attestation.

ARIS's executor-reviewer split is, structurally, an independent challenge function. The reviewer model reports on the executor without sharing its objective. That isn't just a nice feature of ARIS. It's the thing the FCA was always going to require, expressed in code.

§3 Where this maps to actual workflows we've all been pretending we had a plan for

Three places this pattern eats real workload, in increasing order of regulatory pressure: (1) research and report drafting, where the reviewer hunts for the errors nobody budgets time to find; (2) DSAR redaction, where every redaction decision is challenged by a model that did not make it (this is the civiclaw pipeline discussed below); (3) fund accounting, where the independent-challenge expectation applies directly and no agent output should touch a regulated decision without a reviewer on record.

§4 What ARIS gets right, and where it stops just short of usable

What ARIS gets right: the orchestration is the right abstraction. Most people think the magic is in the executor model or the reviewer prompt; it isn't. It's in the policy that decides when the argument is over. ARIS exposes that as a configurable component. That's exactly where a regulated firm's policy lives, and it's exactly the thing your Risk function will want to write and version-control.
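
Concretely, that policy can be a small, reviewable artefact rather than prose buried in a prompt. A sketch under assumed parameters (round caps, objection thresholds); the field names are illustrative, not ARIS's schema.

    # Hypothetical convergence policy as a version-controlled artefact.
    # Risk writes, reviews, and signs off this file, not the prompts.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ConvergencePolicy:
        policy_version: str        # bumped on every Risk-approved change
        max_rounds: int            # hard cap before human escalation
        max_open_objections: int   # objections tolerated at sign-off
        escalate_on_timeout: bool  # never silently accept a stale answer

        def is_converged(self, open_objections: int) -> bool:
            return open_objections <= self.max_open_objections

        def must_escalate(self, round_no: int) -> bool:
            return self.escalate_on_timeout and round_no >= self.max_rounds

    POLICY = ConvergencePolicy(
        policy_version="2026-05-07.r1",
        max_rounds=5,
        max_open_objections=0,   # regulated default: zero open objections
        escalate_on_timeout=True,
    )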

Where it stops short — the assurance layer logs the outputs, but it doesn't sign them. A regulator does not want a markdown file describing what the executor and reviewer said. They want a tamper-evident chain that proves no one rewrote the conversation after the fact. That's a gap a Workloft-style audit chain (append-only, signed, timestamped, replayable) closes in about a day of integration work.
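
The shape of that chain is simple. A sketch using HMAC-SHA256 as a stand-in for a real signature scheme (a production deployment would use asymmetric keys and an append-only store); this is illustrative, not Workloft's actual chain.

    # Hypothetical append-only, signed, hash-chained audit log. Each
    # entry commits to its predecessor, so rewriting the conversation
    # after the fact breaks verification.
    import hashlib, hmac, json, time

    SIGNING_KEY = b"demo-key-do-not-use-in-production"

    def append_entry(chain: list[dict], role: str, content: str) -> None:
        prev_hash = chain[-1]["entry_hash"] if chain else "0" * 64
        body = {
            "ts": time.time(),       # timestamped
            "role": role,            # executor / reviewer / orchestrator
            "content": content,
            "prev_hash": prev_hash,  # tamper-evidence: link to predecessor
        }
        payload = json.dumps(body, sort_keys=True).encode()
        body["entry_hash"] = hashlib.sha256(payload).hexdigest()
        body["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
        chain.append(body)

    def verify_chain(chain: list[dict]) -> bool:
        prev_hash = "0" * 64
        for entry in chain:
            body = {k: entry[k] for k in ("ts", "role", "content", "prev_hash")}
            payload = json.dumps(body, sort_keys=True).encode()
            good_hash = entry["entry_hash"] == hashlib.sha256(payload).hexdigest()
            good_sig = hmac.compare_digest(
                entry["signature"],
                hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest(),
            )
            if not (good_hash and good_sig and entry["prev_hash"] == prev_hash):
                return False
            prev_hash = entry["entry_hash"]
        return True  # a verified chain is replayable as the conversation

    log: list[dict] = []
    append_entry(log, "executor", "draft answer v1")
    append_entry(log, "reviewer", "objection: missing citation")
    assert verify_chain(log)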

The frame for any agent we ship for a regulated buyer should now be: ARIS-style runtime + a signed audit chain on top. The first half is the regulator's expectation. The second half is the regulator's evidence requirement. They are not the same thing. Most vendors will sell the first as if it covers the second.

§5 What we're going to do with this

ARIS is open source. It clears all four of our implementation triggers — substrate-relevant, not duplicating our existing stack, tractable in under a week, and with a clear customer link via the civiclaw DSAR pipeline. So it goes onto the implementation candidate list.

Plan, in the order we will publicly cop to: (1) replicate the executor-reviewer-orchestrator triplet against our existing audit chain, (2) measure on a real DSAR redaction case from civiclaw with both quantitative agreement metrics and a human-in-the-loop comparison, (3) publish the code, the run, and what broke. Target: end of June.
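
On "quantitative agreement metrics": one plausible instantiation, assuming redaction is scored per token as redact/keep, is Cohen's kappa between the agent's decisions and the human reviewer's. The labels below are illustrative, not civiclaw data.

    # Hypothetical agreement metric for the redaction comparison:
    # Cohen's kappa over per-token redact (1) / keep (0) decisions.
    def cohens_kappa(a: list[int], b: list[int]) -> float:
        n = len(a)
        observed = sum(x == y for x, y in zip(a, b)) / n
        # Chance agreement from each rater's marginal redact rate.
        p_a, p_b = sum(a) / n, sum(b) / n
        expected = p_a * p_b + (1 - p_a) * (1 - p_b)
        return (observed - expected) / (1 - expected)  # assumes expected < 1

    agent = [1, 1, 0, 0, 1, 0, 0, 1]
    human = [1, 1, 0, 1, 1, 0, 0, 1]
    print(f"kappa = {cohens_kappa(agent, human):.2f}")  # 0.75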

If the replication ships and survives a real redaction case, it becomes the reference architecture we walk Risk teams through. If it doesn't, we publish what didn't work, which is the more useful Note anyway.


§6 The honest caveat

ARIS is one week old at time of writing. Citation count: zero. Hugging Face upvotes: 82. The community sometimes oversells papers that don't survive replication. The architecture, however, is older than the paper — multi-agent challenge has been a recurring move in the literature for three years. ARIS is the version that arrived at runtime maturity at the moment regulated buyers started asking hard questions, which is what makes it interesting. If it doesn't survive, the pattern will, and someone else's framework will fill the slot in eight weeks.

We'll re-check this paper's Semantic Scholar citation count in two months. If it's still on zero, we'll know the framework didn't take, and we'll write up what filled the gap instead.

Methodology note. Walt (Workloft's classification agent, Gemini 2.5 Flash) screened 71 papers for the 7 May 2026 batch; ARIS scored 10/10 on the agent infrastructure axis with a Hugging Face boost of 2 for community curation. Semantic Scholar lookup returned 0 citations as of writing. The four-trigger implementation rule (substrate-relevant, non-duplicative, ≤1 week, customer link) was applied to the top 5 picks; ARIS and BRIGHT-Pro both passed; the other three were filed as Notes-only. The civiclaw mapping was verified against the project's existing DSAR redaction skill.