Direct corpus interaction: the GDPR-shaped retrieval pattern that was hiding in plain sight
A new arXiv paper proposes that agents skip embedding retrieval entirely and read raw corpora with grep, cat and find. The community read it as a benchmark result. Read through the UK GDPR lens, it's a data-protection architecture, and the one a UK Local Authority FOI agent was always going to be forced into.
§1 What the paper actually shows
Li et al. (2026), Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction proposes that conventional retrieval is the wrong abstraction for agents. Instead of converting a corpus into vector embeddings and serving the agent a top-k slice, you give the agent terminal-style tools — list, grep, read — and let it explore the raw documents the way a human researcher would: search, read, narrow, search again.
The empirical result is that direct corpus interaction beats strong sparse, dense, and reranking baselines on BRIGHT, BEIR, BrowseComp-Plus, and multi-hop QA. The agent makes more tool calls, but each call is exact-match and verifiable, and the iterative loop converges to better answers than a single top-k step. The headline reading is: RAG-as-pipeline is over; RAG-as-agent-loop is what worked all along.
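The loop is simple enough to sketch in a few lines. This is a minimal illustration of the pattern, not the authors' implementation; the function names (list_files, grep, read) and their signatures are assumptions made for the sketch:

```python
import re
from pathlib import Path

# Hypothetical versions of the three terminal-style primitives.
# Names and signatures are illustrative, not the paper's actual API.

def list_files(root):
    """Enumerate files under the corpus root."""
    return sorted(str(p) for p in Path(root).rglob("*") if p.is_file())

def grep(root, regex):
    """Exact-match search: (path, line number, line) for every hit."""
    rx = re.compile(regex, re.IGNORECASE)
    hits = []
    for path in list_files(root):
        text = Path(path).read_text(errors="ignore")
        for n, line in enumerate(text.splitlines(), 1):
            if rx.search(line):
                hits.append((path, n, line.strip()))
    return hits

def read(path, start=1, end=None):
    """Read a line range from one document."""
    lines = Path(path).read_text(errors="ignore").splitlines()
    return "\n".join(lines[start - 1 : end])
```

The agent's loop is then just: grep, read the hits, refine the pattern, grep again. Every step is a string a human can re-run.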
That's a fine reading. But it's not the most useful one if you build for regulated buyers.
§2 Why it lands harder for regulated record holders than the AI-Twitter take suggests
The community read on this paper, judging by the Hugging Face thread, is "embeddings are dead, agents read code now." The architectural shift is real. But the regulated-buyer reason is not about benchmarks. It's about what an embedding store is, legally.
Under UK GDPR — and this is the line that matters — a vector store of personal data is itself a personal-data processing system. The embeddings are derived from the source records. They are re-identifiable in practice (the literature on embedding inversion has settled this question, uncomfortably). They sit on someone's infrastructure. They have to be inventoried in a Record of Processing Activities under Article 30. They are a fresh surface for a Subject Access Request — the data subject can ask the controller what the store contains, what was inferred, and to be erased from it. Erasure from a vector store is technically gnarly and legally underexplored. None of this is theoretical for a UK Local Authority running an FOI desk against fifteen years of cabinet minutes, planning files and email archives.
Direct corpus interaction sidesteps this entire surface. The agent reads in place. There is no derived store. There is no embedding API call leaving the perimeter. There is no second copy of the personal data sitting in an index waiting to be deleted from. The data-protection question collapses from "how do we govern the vector store" to "how do we log which documents the agent looked at" — which is a question any Information Governance team has been answering since the 1990s.
The paper says the new pattern is faster, smarter, more accurate. The regulated-buyer translation: it is also cheaper to govern. The cheapest data to defend in court is the data that was never copied.
§3 Where this maps to actual UK Local Authority workflows
Three places this pattern eats real workload, in increasing order of regulatory pressure:
- FOI request triage. A request arrives ("provide all communications between the Cabinet Member for Children's Services and Aspire Trust between Jan 2024 and Jun 2025"). The current pattern: a human IGO surveys Outlook archives by hand. The agent pattern, with direct corpus interaction: grep across a permissioned mail archive, read the hits, narrow, repeat. Every tool call lands in an audit log. The output is a candidate response set with a complete trail of what the agent searched for and what it read. Lower regulatory ceiling than redaction, but the time saving is the headline.
- DSAR fulfilment. Same mechanic, but with a higher bar: the agent must find everywhere the subject's personal data appears. The win here is the negative evidence: a clean trail showing which corpora were searched and which yielded nothing is what defends a council against an ICO complaint that the response was incomplete. With an embedding-only pipeline this is a hand-waved "we did our best." With direct corpus interaction it's a JSONL audit trail of every grep issued.
- FOI / DSAR redaction review. The strictest case, and the one where the executor-reviewer pattern from Note №01 sits on top. The reviewer needs to verify, for each redaction, that nothing was missed and nothing wrong was withheld. With direct corpus interaction, the reviewer has the same primitive view of the corpus the executor had. The argument is over the same evidence, exactly, not over two slightly different top-k slices.
§4 What direct corpus interaction gets right, and where it stops just short
What it gets right: the transparency. Every step the agent takes is a callable string with a verifiable result. grep -i 'aspire' minutes/ is a thing a Risk officer can run themselves and reproduce exactly. There is no embedding model, no temperature, no proprietary index. The agent's path through the corpus is reconstructible by a human who has never touched an LLM.
Where it stops short: the paper does not address permissioning. A council corpus is not a flat folder — different documents have different sensitivity classifications, different retention periods, different statutory bases for processing. A direct corpus interaction layer for a regulated buyer needs to wrap the file primitives with a policy boundary: which subpaths the agent can touch, which file types, which date ranges, what to do when a hit lands inside a marked-confidential section. That's a layer the paper assumes; we have to build it.
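One shape that policy boundary might take, as a sketch: the class below, its name, and its rules (permitted subdirectories, allow-listed extensions, a read-size cap) are assumptions about what a council deployment would need, not anything the paper specifies:

```python
from pathlib import Path

class PolicyBoundary:
    """Hypothetical permission layer wrapped around the file primitives.

    A sketch, not a hardened implementation: every agent tool call goes
    through check() before any file is touched."""

    def __init__(self, root, allowed_subdirs, allowed_exts,
                 max_read_bytes=65536):
        self.root = Path(root).resolve()
        self.allowed_subdirs = set(allowed_subdirs)
        self.allowed_exts = set(allowed_exts)
        self.max_read_bytes = max_read_bytes

    def check(self, path):
        """Resolve a corpus-relative path and enforce the policy."""
        p = (self.root / path).resolve()
        if self.root not in p.parents:  # blocks ../ traversal
            raise PermissionError(f"outside corpus root: {path}")
        if p.suffix not in self.allowed_exts:
            raise PermissionError(f"extension not allow-listed: {p.suffix}")
        top = p.relative_to(self.root).parts[0]
        if top not in self.allowed_subdirs:
            raise PermissionError(f"subpath not permitted: {top}")
        return p

    def read(self, path):
        """Read a permitted file, capped at max_read_bytes."""
        return self.check(path).read_text(errors="ignore")[: self.max_read_bytes]
```

Date ranges, sensitivity markings and confidential-section handling would layer on the same choke point: one check() every primitive must pass through.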
It also does not address redaction-in-place. The agent in the paper reads the corpus to answer a query. The agent in a council reads the corpus to produce a redacted version, where the output and the input differ on purpose. The corpus interaction pattern handles the read side; the write side is its own architecture. We'll write that one up separately.
§5 What we did with this
Direct corpus interaction clears all four of our implementation triggers — substrate-relevant, doesn't duplicate our existing stack, tractable in under a week, and with a clear customer link via civiclaw. So we built it.
As of this Note, civiclaw ships skills/foi/corpus.py — a small tool layer exposing list, grep, read, and snippet against a configurable council document root. Path traversal blocked. Read sizes capped. Allowed extensions allow-listed. Eleven unit tests pass. Every call appends one entry to the civiclaw audit chain, which means the agent's search path is itself an EU AI Act Article 12 record:
$ python3 -m unittest tests.test_foi_corpus -v
test_audit_chain_intact_after_mixed_calls ... ok
test_grep_audits_pattern_and_count ... ok
test_grep_invalid_regex_raises ... ok
test_grep_matches_across_files ... ok
test_list_filters_to_allowed_extensions ... ok
test_list_records_audit_entry ... ok
test_path_traversal_blocked ... ok
test_read_disallowed_extension_raises ... ok
test_read_full_file ... ok
test_read_line_range ... ok
test_snippet_returns_window ... ok

Ran 11 tests in 0.011s

OK
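The audit-chain property, that the agent's search path is tamper-evident rather than merely logged, can be illustrated with a hash chain. This is a sketch of the property only; the class and field names are invented and this is not civiclaw's implementation:

```python
import hashlib
import json

class AuditChain:
    """Illustrative append-only audit chain: each entry embeds the hash
    of the previous one, so any edit or deletion breaks verification."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self._prev = self.GENESIS

    def append(self, record):
        """Append one tool-call record, chained to the previous entry."""
        entry = {"record": record, "prev": self._prev}
        digest = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        entry["hash"] = digest
        self.entries.append(entry)
        self._prev = digest

    def verify(self):
        """Recompute every hash; False if any entry was altered or dropped."""
        prev = self.GENESIS
        for e in self.entries:
            body = {"record": e["record"], "prev": e["prev"]}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True
```

That tamper-evidence is what turns a search log into something closer to an Article 12 record: not just "we logged it" but "the log could not have been quietly edited."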
Next sprint: wire the tool layer into the FOI search stage as the default retrieval mechanism, replacing the current LLM-only "search plan" stub. The stage now becomes a real agent loop over real council documents, with a real audit trail of which folders, which patterns, which files. Then the same layer goes under DSAR. Then we publish the comparative benchmark — direct corpus interaction vs an embedding baseline on a synthetic council corpus — as Note №04.
§6 The honest caveat
This paper is six days old. Citation count: zero. Hugging Face upvotes: 56. The community sometimes oversells papers that don't survive replication, and direct corpus interaction has the same vulnerability as every benchmark-led method: maybe the gains are corpus-specific and don't generalise to your specific corpus shape. The Workloft view is that the architectural argument — read in place, log everything, no derived stores — is robust to that. Even if the paper's specific tool-use loop turns out to be sub-optimal, any version of "agent reads raw corpus" has the data-protection property that an embedding pipeline does not. The exact mechanism will get refined; the pattern will hold.
We'll re-check in two months. If the paper has 0 citations and another method has eaten the slot, we'll write up the version that won. The civiclaw module is built around an interface, not the paper's specific tool set, so swapping the underlying mechanism is a one-day change.
Commit 4f45699 on the civiclaw repo.
