May 18, 2026 · 14 min

DELEGATE-52: why frontier LLMs corrupt 25% of your work over long chains, and what it takes to fix it

If you have ever asked an AI assistant to perform a long sequence of edits on a document and watched silent damage accumulate in its output (a deleted clause here, a wrong numeric coefficient there, a subtly rewritten function signature you only noticed two iterations later), you have an intuition for what Microsoft Research just measured at scale. DELEGATE-52, published on 11 May 2026 by Philippe Laban, Tobias Schnabel and Jennifer Neville, ran 19 LLMs through round-trip relay simulations across 52 professional domains. Frontier models corrupted an average of 25% of document content over 20 delegated interactions. Adding agentic tool use made the problem worse, not better. Python programming was the only domain where most models (17 of 19) cleared the team's 98% readiness threshold. Read this before delegating any long-horizon workflow to an autonomous agent.

What DELEGATE-52 actually measures

The premise is deceptively simple. Take a real document from a professional domain: a Python source file, a crystallography CIF, a music notation file, an accounting ledger, a screenplay. Ask the LLM to perform a structural edit. Then, in a second LLM call, ask it to reverse that edit. If the model is faithful, the output of the round trip should match the input. Compose multiple such round trips and you get a delegated workflow that simulates what happens when a user farms a sequence of edits out to an agent and trusts the agent to keep the document intact.

Formally, the benchmark defines a round trip as a pair of instructions (σ, σ⁻¹). The forward instruction maps the seed document s to an intermediate state t = LLM(s; σ). The backward instruction attempts the inverse: ŝ = LLM(t; σ⁻¹). Multiple round trips compose sequentially. The main experiment uses N=10 consecutive round trips — 20 LLM interactions in total — with round-robin task scheduling.

Reconstruction Score is the headline metric:

// RS@k = similarity between the seed document s
// and the reconstruction ŝ after k LLM interactions
// (k/2 round trips), measured by a domain-specific similarity function

RS@k(s) = sim(s, ŝ_{k/2})

// Domain-specific sim() functions account for:
// - AST equivalence (Python, Docker, JSON, DNS)
// - Numeric tolerance (crystallography, accounting, weather)
// - Structural equivalence (music notation, screenplay, vector graphics)
// - Token-level diff with semantic weighting (fiction, emails, recipes)

A score of 1.0 means the round trip was lossless. A score of 0.5 means roughly half the document content failed to survive 20 interactions. The team defines readiness for a domain as RS@20 ≥ 98%. The reasoning is that anything below near-lossless is unsafe for unsupervised delegation: a 90% score on a financial spreadsheet means one in ten numbers is wrong, which is worse than not delegating at all.
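
The relay loop and the score can be sketched in a few lines. This is a minimal illustration, not the benchmark's code: `llm` is a hypothetical callable, and `difflib.SequenceMatcher` is only a stand-in for the paper's domain-specific similarity functions.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# Hypothetical signature: (document, instruction) -> edited document.
LLM = Callable[[str, str], str]


def sim(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the benchmark uses per-domain evaluators."""
    return SequenceMatcher(None, a, b).ratio()


def relay(seed: str, round_trips: List[Tuple[str, str]], llm: LLM) -> List[float]:
    """Run (forward, backward) instruction pairs in sequence.

    Returns the score after each round trip (RS@2, RS@4, ...),
    always measured against the original seed document.
    """
    scores = []
    doc = seed
    for forward, backward in round_trips:
        intermediate = llm(doc, forward)    # t = LLM(s; sigma)
        doc = llm(intermediate, backward)   # s_hat = LLM(t; sigma^-1)
        scores.append(sim(seed, doc))
    return scores


READY = 0.98  # readiness threshold quoted above


def is_ready(scores: List[float]) -> bool:
    """True if the final score clears the readiness threshold."""
    return bool(scores) and scores[-1] >= READY
```

With ten instruction pairs, the last entry of the returned list plays the role of RS@20.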

The 52 domains

Domain selection was the most underrated decision in the paper. Most agent benchmarks are dominated by software engineering, math, and a thin slice of business workflows. DELEGATE-52 deliberately covered work the way it actually exists in the labor market, across five categories:

Science & Engineering (11 domains): Circuit, Quantum, Robotics, Molecule, Star Catalog, Crystallography, Math Lean, Satellite, Weather, Aviation, Protein. These are domains where outputs are partially structured (often a custom file format like CIF or PDB) and where a single corrupted coefficient can invalidate the entire artifact.

Code & Configuration (11 domains): Python, Malware, Docker, Makefile, Database Schema, Infrastructure, Filesystem, JSON, Translation, DNS, Graphviz. These are the domains where current frontier models are strongest, partly because the training data is dense and partly because verifiers (parsers, compilers, linters) provide implicit reward signal during fine-tuning.

Creative & Media (11 domains): Screenplay, Fiction, Font Engineering, Vector, Music, Slides, Subtitles, Weaving, LaTeX, Audio Synthesis, 3D Objects. Structured formats with semantic content where surface-level changes can break the underlying artifact (a music notation file with one swapped accidental, a vector file with one mistyped path command).

Structured Records (10 domains): Library Catalog, Emails, Ham Radio, Treebank, EDIFACT, Geodata, Geotracking, Calendar, Accounting, Genealogy. Tabular and graph-shaped data where the integrity constraints are external — foreign keys, well-formed RFC formats, timezone-correct timestamps.

Everyday (9 domains): Chess, Transit, Food Menu, Recipe, Landmarks, Earnings, Job Board, Playlist, Spreadsheet. The domains a non-technical user is most likely to delegate to an assistant. The ones where users are also least likely to notice when something is silently broken.

The full dataset contains 310 work environments across the 52 domains, each environment consisting of real documents averaging around 15K tokens, with five to ten editing tasks per environment that simulate the kinds of requests an actual professional would issue. This is not synthetic data shaped to break models. It is the work itself.

The numbers

Nineteen LLMs were evaluated across six families: OpenAI, Anthropic, Google, Mistral, xAI and Moonshot. The headline averages over 20 interactions:

Final Reconstruction Score (RS@20), frontier models:
  Gemini 3.1 Pro       80.9%
  Claude 4.6 Opus      73.1%
  GPT 5.4              71.5%

Average across all 19 models:
  RS@20                ~50%   // half the document gone

Token economics under tool-augmented (agentic) mode:
  Input tokens         2.1x – 4.6x   // vs chat-only
  Output tokens        0.6x – 1.2x   // roughly flat, often fewer
  Net effect           ~6% additional degradation

Three findings deserve to be read together.

Only one domain cleared the readiness bar for most models. In Python programming, 17 of 19 models achieved RS@20 ≥ 98%. The best overall model, Gemini 3.1 Pro, was ready for only 11 of 52 domains. Catastrophic corruption (defined as RS@20 ≤ 80%) occurred in more than 80% of model-domain combinations.

Agentic tool use made things worse. The four models tested with a multi-turn agentic harness — file access, shell, code execution — degraded 6% more on average than the same models in chat-only mode. The intuition that giving a model more affordances should help long-horizon tasks turned out to be wrong for this benchmark. Two mechanisms drove the regression: tool calls inflate input token consumption two-to-five times (which itself worsens long-context fidelity), and models preferentially wrote files instead of using code execution for verification, so the tool surface added overhead without exercising its strongest affordance.

The failure pattern is not gradual. Errors clustered. Models did not lose 5% of the document each round trip; they kept the document near-intact for several rounds and then dropped 10-30 percentage points in a single interaction. Weaker models tended to fail by deletion — sections vanished outright. Frontier models tended to fail by corruption — content stayed plausible, kept its surface form, but was no longer correct. The frontier models' failure mode is harder to detect with cheap heuristics like length or schema validation.
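
Because the damage arrives in cliffs rather than as a steady slope, a per-interaction drop check is a more useful alarm than a threshold on the final score. A minimal sketch, assuming you already log a score per interaction; the 0.10 default mirrors the low end of the 10-30 point drops described above and is otherwise an arbitrary choice.

```python
from typing import List, Optional


def first_cliff(rs_curve: List[float], drop: float = 0.10) -> Optional[int]:
    """Return the 1-based position of the first score that fell by at least
    `drop` relative to the previous score, or None if there is no such cliff."""
    for i in range(1, len(rs_curve)):
        if rs_curve[i - 1] - rs_curve[i] >= drop:
            return i + 1
    return None
```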

The agentic paradox, explained

The result that agentic harnesses degrade rather than improve performance contradicts a year of marketing material and deserves a careful breakdown.

Three structural reasons emerged from the paper.

Context bloat. Each tool call adds tokens. Tool definitions, arguments, results, intermediate file contents and error traces all flow through the same context window the model is using to reason about the task. By interaction 20, an agentic run can have consumed 2-5x more input tokens than a chat-only run on the same task. Long-context performance degrades non-linearly: doubling the input does not double the error rate; it can quintuple it. The agentic harness was paying for tools by burning the long-context budget.

Tool selection bias toward writing rather than verifying. Models with shell and code execution access overwhelmingly used those tools to commit changes rather than to check them. Given the choice between running a parser on the previous output to verify schema integrity and overwriting the file with a new version, models chose to overwrite. The most valuable agentic capability — closing the loop on its own output with a deterministic verifier — was the one models used least.

Latent action conflation. A textual reasoning step and a tool call are two different surfaces, but the model treats them as compositional. When the task requires careful intermediate reasoning (which describes most non-trivial document editing), pushing reasoning into tool arguments truncates it. The model spends its reasoning budget formulating the call instead of thinking about the edit.

None of this is a verdict on agentic systems in general. DELEGATE-52 measures one specific axis — round-trip fidelity on document editing — and the agentic harness offered was deliberately basic. But it is a verdict on the prevailing assumption that "more tools is always better." For long-horizon document workflows, the marginal tool call has to pay for the long-context cost it imposes, and most marginal tool calls do not.

Failure modes you should be able to recognize

The paper's qualitative analysis (Appendix F.2) catalogs failure patterns that any practitioner running long workflows should learn to spot.

Silent deletion. The most common failure on weaker models. A section, table row, or list item disappears between rounds with no acknowledgement. Often confined to subordinate structures — footnotes, comments, optional fields.

Hallucinated correction. The model "fixes" content that was correct. A valid SMILES string is rewritten into another valid SMILES string for a different molecule. A correct chess move is replaced by a different legal but unrelated move. The output looks structured; only a domain-specific check catches it.

Drift-then-collapse. RS@k stays above 95% through rounds 1-6, drops to 70% on round 7, then stabilizes at the new lower level. The model finds a degenerate fixed point and round-trips between it and itself. Once it crosses the cliff, additional rounds do not recover.

Reformatting attacks on itself. The model "improves" the document's formatting between rounds — converting tabs to spaces, normalizing quotes, alphabetizing keys. Domain-specific evaluators count these as substantive changes when they break downstream parsers (Makefiles, EDIFACT, certain CIF dialects).

Cross-domain leakage. In agentic mode specifically, the model occasionally pulled patterns from one tool call into the document of another. A previous JSON file's keys appeared in a later screenplay's stage directions. This pattern was rare but unambiguous.

What it would take to fix this

The honest answer is that no single technique solves the long-horizon delegation problem. Closing the gap toward 100% autonomy is a stack of mitigations, each addressing a different failure surface. Here is the technique inventory the field is converging on, in roughly the order an engineering team should adopt them.

1. Structured outputs and schema-validated round trips

The cheapest win. If the artifact has a parseable schema (JSON Schema, Protobuf, AST, SMILES, CIF), wrap every model output in a validator and reject any output that fails. The validator is deterministic and cheap, and it catches the "reformatting attacks" that break downstream parsers. It does not catch semantic corruption, but it removes the syntactic failure surface. Frameworks like Instructor, Outlines, and OpenAI's structured outputs API exist for exactly this.
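
A standard-library sketch of the validate-and-reject loop, assuming the artifact is a JSON list of ledger rows. `generate` is a hypothetical callable standing in for a model call, and the ledger shape and key names are illustrative only; in practice one of the frameworks above would replace the hand-written validator.

```python
import json
from typing import Callable, Optional, Set


def validate_ledger(text: str, required_keys: Set[str]) -> Optional[str]:
    """Return None if `text` is a JSON list of objects that all carry
    `required_keys`; otherwise return a short error message."""
    try:
        rows = json.loads(text)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc.msg}"
    if not isinstance(rows, list):
        return "top-level value must be a list"
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            return f"row {i} is not an object"
        missing = required_keys - set(row)
        if missing:
            return f"row {i} missing keys: {sorted(missing)}"
    return None


def validated_call(generate: Callable[[str], str], prompt: str,
                   required_keys: Set[str], max_attempts: int = 3) -> str:
    """Call `generate` until its output validates; raise if it never does."""
    error = None
    for _ in range(max_attempts):
        # Feed the previous rejection reason back to the (hypothetical) model.
        hint = f"\nPrevious output was rejected: {error}" if error else ""
        output = generate(prompt + hint)
        error = validate_ledger(output, required_keys)
        if error is None:
            return output
    raise ValueError(f"output failed validation after {max_attempts} attempts: {error}")
```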

2. Independent verifier sub-agents

Run a second, smaller model whose only job is to compare the output of round n against the input to round n (that is, the output of round n-1) under a verification prompt. If the verifier flags the round, either retry (with the failed output included as a counterexample) or escalate to a human. The verifier is not the same instance that produced the output; ideally it is a different model family entirely to reduce correlated failures. This is the equivalent of the "two-pass" review in human workflows and consistently buys 10-20 RS@20 points in internal experiments at LLM4Agents.
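
The control flow is small enough to sketch. Both `produce` and `verify` are hypothetical callables wrapping two different models; the signatures below are assumptions made for illustration.

```python
from typing import Callable, List, Optional

# Hypothetical: (document, instruction, previously rejected outputs) -> new document.
Producer = Callable[[str, str, List[str]], str]
# Hypothetical: (document before, instruction, document after) -> accept?
Verifier = Callable[[str, str, str], bool]


def verified_round(doc: str, instruction: str, produce: Producer,
                   verify: Verifier, max_retries: int = 2) -> Optional[str]:
    """Return an accepted output, or None to signal escalation to a human."""
    rejected: List[str] = []
    for _ in range(max_retries + 1):
        candidate = produce(doc, instruction, rejected)
        if verify(doc, instruction, candidate):
            return candidate
        rejected.append(candidate)  # handed back as a counterexample on retry
    return None
```

Returning `None` instead of the last candidate is deliberate: a round the verifier never accepted should not be committed by default.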

3. State checkpointing with cryptographic content hashes

Hash every intermediate artifact (SHA-256 of canonicalized content). Keep the hash chain, together with the checkpointed artifacts, alongside the document. If a downstream round produces content whose diff against the last valid checkpoint is structurally implausible (for example, it edits regions outside the intended scope of the round's instruction), refuse the round and roll back to the last valid hash. This converts "silent deletion" into a deterministic alarm. It also produces an audit log that an ERC-8004 validator can later attest against (see our ERC-8004 post).
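
A minimal sketch of the idea using `hashlib` and `difflib`. The canonicalization (stripping trailing whitespace) and the line-range notion of "scope" are simplifying assumptions, and this version only flags modified or deleted lines, not out-of-scope insertions.

```python
import hashlib
from difflib import SequenceMatcher
from typing import List


def digest(text: str) -> str:
    # Assumed canonicalization: strip trailing whitespace per line. Real formats need more.
    canonical = "\n".join(line.rstrip() for line in text.splitlines())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_lines(before: str, after: str) -> List[int]:
    """0-based indices of lines in `before` that were modified or deleted."""
    a, b = before.splitlines(), after.splitlines()
    touched: List[int] = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            touched.extend(range(i1, i2))
    return touched


class Checkpoints:
    def __init__(self, seed: str):
        self.docs = [seed]
        self.chain = [digest(seed)]

    def commit(self, candidate: str, allowed: range) -> bool:
        """Accept `candidate` only if every changed line lies in `allowed`.
        On refusal the last valid checkpoint stays current (the rollback)."""
        if any(i not in allowed for i in changed_lines(self.docs[-1], candidate)):
            return False
        self.docs.append(candidate)
        # Each link hashes the previous link together with the new content digest.
        link = hashlib.sha256((self.chain[-1] + digest(candidate)).encode("utf-8"))
        self.chain.append(link.hexdigest())
        return True
```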

4. Sliding and rotating context windows

If long context is the primary degradation driver, do not pretend the model has long memory. Rotate explicit summaries of prior rounds through a fixed-size window, evict raw tool call histories aggressively, and keep the document itself as the canonical state — never as a long history of edits the model has to mentally replay. This is the same idea behind Graphiti and Mem0: short, dense, retrievable summaries beat raw chronology in every long-horizon benchmark we have seen.
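
A fixed-size window of summaries is a few lines with `collections.deque`. This is a sketch of the pattern only; the class name, prompt layout, and the idea that summaries arrive as ready-made strings are all assumptions.

```python
from collections import deque
from typing import Deque


class RotatingContext:
    """Keeps the current document plus the last `max_summaries` round summaries."""

    def __init__(self, document: str, max_summaries: int = 3):
        self.document = document
        self.summaries: Deque[str] = deque(maxlen=max_summaries)

    def record_round(self, new_document: str, summary: str) -> None:
        self.document = new_document    # the document itself is the canonical state
        self.summaries.append(summary)  # the oldest summary is evicted automatically

    def build_prompt(self, instruction: str) -> str:
        history = "\n".join(f"- {s}" for s in self.summaries) or "- (none)"
        return (f"Recent edits:\n{history}\n\n"
                f"Current document:\n{self.document}\n\n"
                f"Instruction: {instruction}")
```

Raw tool-call transcripts never enter the prompt here; only the summaries and the latest document do, so prompt size stays bounded no matter how many rounds have run.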

5. External knowledge-graph anchoring

For domains with stable entity sets — proteins, accounting line items, screenplay characters, music key signatures — bind the document's references to an external graph (or a structured store) and require every round to reconcile against that graph before commit. The model becomes responsible for transformations on the graph nodes, not for keeping every entity correct in free text. Graphiti's bi-temporal knowledge graph is one production-grade implementation; a Postgres table with a foreign-key constraint is another. Both work because they take the integrity burden off the model.
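
As a toy illustration of reconciling against an external store before commit, suppose the document cites entities with a `[[id]]` marker. That marker convention, the dictionary-shaped store, and the function names are assumptions made for this sketch, not anything a particular graph library prescribes.

```python
import re
from typing import Dict, Set

# Assumed convention: entity references appear in the text as [[entity-id]].
ENTITY_PATTERN = re.compile(r"\[\[([A-Za-z0-9_-]+)\]\]")


def referenced_ids(document: str) -> Set[str]:
    """Entity ids cited in the document."""
    return set(ENTITY_PATTERN.findall(document))


def reconcile(before: str, after: str, store: Dict[str, dict]) -> Dict[str, Set[str]]:
    """Compare entity references before and after an edit against the store."""
    old, new = referenced_ids(before), referenced_ids(after)
    return {
        "unknown": new - set(store),  # ids that do not exist in the store
        "dropped": old - new,         # ids that disappeared during the edit
    }
```

A round would commit only if `unknown` is empty and every id in `dropped` was meant to be removed by the instruction.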

6. Process reward models and process supervision

Most frontier models are trained with outcome supervision: the reward is on the final answer. Process reward models score every intermediate step and reward the trajectory, not just the destination. OpenAI's "Let's Verify Step by Step" line of work and Anthropic's recent process-RLHF results show that process-supervised models exhibit substantially better long-horizon reasoning. The catch is that process supervision is expensive to label and benchmark, which is why it remains under-deployed. DELEGATE-52 is exactly the kind of benchmark that should be used to fine-tune against process-level rewards.

7. Tool-cost-aware planners

The agentic-mode regression is partly a planning failure: the model does not account for the long-context cost of each tool call when deciding to make it. A planner layer that budgets tool calls against context tokens, and that explicitly trades off "use the tool" against "do the reasoning inline," recovers most of the gap. This is the same idea as resource-aware scheduling in operating systems applied to context.
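
One way to picture such a planner is a budget object that every tool call has to pass through. The policy below (always allow verification calls that fit, allow other calls only when they cost no more than reasoning inline) is an illustrative assumption, as are the token estimates it takes as inputs.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    limit_tokens: int       # total context we are willing to spend on the workflow
    used_tokens: int = 0

    def remaining(self) -> int:
        return self.limit_tokens - self.used_tokens

    def should_call_tool(self, est_tool_tokens: int, est_inline_tokens: int,
                         is_verification: bool) -> bool:
        """Approve a tool call only if it fits the remaining budget and is either
        a verification step or no more expensive than reasoning inline."""
        if est_tool_tokens > self.remaining():
            return False
        return is_verification or est_tool_tokens <= est_inline_tokens

    def charge(self, tokens: int) -> None:
        """Record tokens actually consumed by a call."""
        self.used_tokens += tokens
```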

8. Self-consistency with adversarial sampling

Run the same round trip three times with different sampling seeds. If at least two of the three agree, commit; otherwise, escalate. Self-consistency is well studied for math and code; the DELEGATE-52 data suggests it generalizes to document editing too, though at 3-5x cost. Combine with structured-output validation for cheaper rejection of clearly broken samples.
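
The voting step can be sketched as follows. `sample` is a hypothetical callable that runs one round trip for a given seed, and exact string equality after a trivial canonicalization is an assumption; structured formats would compare parsed values instead.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_output(sample: Callable[[int], str],
                    seeds: Sequence[int] = (0, 1, 2),
                    canonicalize: Callable[[str], str] = str.strip) -> Optional[str]:
    """Return the output a strict majority of samples agree on, else None
    (the caller should treat None as "escalate")."""
    outputs = [canonicalize(sample(seed)) for seed in seeds]
    winner, votes = Counter(outputs).most_common(1)[0]
    return winner if votes * 2 > len(outputs) else None
```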

9. Validation registries and attestations

For agents whose outputs are consumed by other agents — the substrate that the deAI stack is building toward — record validation results to a registry like ERC-8004 so downstream consumers can filter by attested track record rather than by reputation alone. The agent that has 5,000 round trips on-chain at RS@20 ≥ 95% under independent validators is hireable for long workflows. The agent that has 50 round trips with two corruption attestations is not. This pushes the failure problem from "trust the platform" to "consult the chain."

10. Human-in-the-loop with calibrated stop conditions

The technique nobody wants to put on the slide but every shipping team uses. After every k rounds (calibrated per domain), hand control back to a human. The interesting engineering is in choosing k from the data — domains with cliff-shaped failure curves want small k; domains that degrade smoothly want larger k with statistical stopping. DELEGATE-52's per-domain RS@k curves are exactly the input a stop-condition calibrator wants.
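
A conservative way to pick k from observed curves is to take the longest prefix that stayed above a floor in every run. This is a sketch under that assumption; the 0.98 default reuses the paper's readiness threshold, and a statistical stopping rule would replace the worst-case `min` for smoothly degrading domains.

```python
from typing import List


def safe_prefix(rs_curve: List[float], floor: float) -> int:
    """Number of leading interactions whose score stayed at or above `floor`."""
    k = 0
    for score in rs_curve:
        if score < floor:
            break
        k += 1
    return k


def calibrate_k(curves: List[List[float]], floor: float = 0.98) -> int:
    """Worst-case safe horizon across runs; requires at least one curve."""
    return min(safe_prefix(curve, floor) for curve in curves)
```

A result of 0 means the domain should not run unattended at all under that floor.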

What 100% autonomy would actually require

The honest reflection at the end of this paper is that we are not close to autonomous agents that can run any 20-step workflow without supervision. We are close to autonomous agents that can run specific 20-step workflows in specific domains with specific safeguards. The path from there to general autonomy is not a single algorithmic breakthrough; it is the slow accumulation of the techniques above into compounds that buy back the lost percentage points.

Three structural shifts have to happen before the field can talk seriously about removing the human from long-horizon delegation.

Memory has to graduate beyond the context window. The long-context degradation curve is a property of current model architectures, not a prompting failure. Until a working memory substrate (external graphs, episodic stores, retrievable summaries) sits underneath every long workflow, models will keep paying the long-context tax and keep losing fidelity to it. The Titans-style architectures, MemOS-style memory operating systems, and graph-anchored stores like Graphiti are the early shape of that substrate.

Verification has to be cheaper than generation. If verifying an output costs more than producing it, no one runs the verifier at scale. The whole point of structured outputs, schema validators, and parser-based round-trip checks is that verification is two-to-three orders of magnitude cheaper than generation. zkML proofs and TEE attestations push verification further: a third party can confirm an agent's claim without rerunning the work. ERC-8004's Validation Registry is the on-chain expression of this principle.

Process reward has to displace outcome reward in training. Frontier models are still optimized for the right final answer. As long as the loss function does not penalize trajectories that look right but drift, models will keep producing trajectories that look right but drift. The DELEGATE-52 corruption pattern — silent, plausible, cliff-shaped — is exactly what outcome-supervised training rewards: the model that gets the answer right 18 times out of 20 looks great on benchmarks even if interactions 19 and 20 silently break the document.

Where this lands for LLM4Agents

We read DELEGATE-52 as engineering reality, not as discouragement. Most of the techniques above — schema-validated outputs, independent verifier sub-agents, state checkpointing, sliding context, graph anchoring, validation attestations — are already on the LLM4Agents roadmap or shipping. The benchmark gives us a coordinate system to measure them against.

The concrete next steps we are taking in response to the paper:

DELEGATE-52 as a CI check for Agent Gen. Every agent generated by Agent Gen will run a domain-relevant subset of DELEGATE-52 as part of its release validation. An agent that fails to clear 95% RS@20 in its declared domain ships with a warning; an agent that fails to clear 80% does not ship without explicit override. The threshold is conservative because corruption is silent.
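
The gate itself is simple to express. The sketch below encodes the two thresholds stated above; the enum names, and the choice to attach a warning when an override is used, are assumptions for illustration rather than the shipped Agent Gen logic.

```python
from enum import Enum


class Verdict(Enum):
    SHIP = "ship"
    SHIP_WITH_WARNING = "ship_with_warning"
    BLOCK = "block"


def release_verdict(rs20: float, override: bool = False) -> Verdict:
    """Map an agent's RS@20 (as a fraction in [0, 1]) in its declared domain
    to a release decision."""
    if rs20 >= 0.95:
        return Verdict.SHIP
    # Below 95%: ship with a warning if at least 80%, or if explicitly overridden.
    if rs20 >= 0.80 or override:
        return Verdict.SHIP_WITH_WARNING
    return Verdict.BLOCK
```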

Structured-output verifiers as a default in the SDK. The SDK ships a wrapper that, for any tool whose output schema is declared, runs the round-trip validator on every model call. Toggle it off if you know what you are doing. On by default because the cost is negligible and the upside is large.

ERC-8004 validation receipts on RS@k. Agents registered through Agent Gen post their RS@k results as validation attestations on the ERC-8004 Validation Registry. This makes long-horizon fidelity visible to counterparties before they hire the agent, which is the only durable mechanism we know of for pushing the field toward better long-context behavior. If your benchmark is on-chain, your incentive is to optimize for it.

microVM-isolated verifier loops. The independent verifier sub-agent runs in a separate Firecracker microVM with no shared state, no network egress, and a hard timeout (we wrote about microVM sandboxes two weeks ago). Correlated failure between the producing model and the verifying model is the main risk to this design; running them in isolated sandboxes with different model families and adversarial prompts is the cheapest way to decorrelate them.

Closing

The most important thing DELEGATE-52 did was give the field a benchmark that does not embarrass itself. The frontier-model corruption rate is not a curiosity — it is the load-bearing fact that explains why long-running agent demos fail in production, why customer-facing autonomous workflows have not arrived, and why every shipping agent team has quietly converged on the same handful of mitigations.

The technical content of the paper is denser than this summary can capture; the full PDF (arXiv:2604.15597) and the public benchmark code (microsoft/DELEGATE52) deserve a careful read. If you are designing a system that will eventually delegate a 20-step workflow to a model, both belong on your desk this week.

Autonomy is not a switch. It is an asymptote. DELEGATE-52 quantifies how far we are from it. The interesting work for the next two years is in the stack of mitigations that close the gap one technique at a time — and in the validation infrastructure that makes the closing visible to the counterparties that have to trust the result.