Multi-agent orchestration: the four canonical patterns, when to use each, and the anti-patterns that break fleets
Most public writing about agent systems assumes a single autonomous agent — one model, one prompt, one workflow, one stream of decisions. Most production deployments past their first month do not look like that. They look like fleets: six, twelve, thirty agents with different specializations, different model tiers, different blast radii, and a human operator coordinating them through a dashboard. The shift from "I have an agent" to "I have a fleet" is the engineering transition that splits hobbyist operators from professional ones, and the most common reason fleets fail in production is that the operator never picked an orchestration pattern — they let one emerge. This post catalogs the four patterns that survive contact with production, the three anti-patterns we have watched break fleets in the wild, and a decision tree that walks an operator from one agent to the right shape for thirty.
Why multi-agent fails more than single-agent
A single agent has one place to debug, one budget to monitor, one set of credentials to scope, one model whose drift you have to track. When something goes wrong, the surface is small. When something goes well, the credit is unambiguous.
A fleet of agents inherits compound risk on every axis. Bugs cascade across agents that consume each other's output. Costs compound because every step in a chain costs tokens. Latency compounds because each agent waits for the previous one. Blast radius compounds because a compromised agent can poison the inputs of every other agent that trusts it. The economic case for going multi-agent has to overcome all of these compounding costs.
The good news is that the case is overwhelmingly worth it when the workload is itself decomposable: when the problem naturally splits into specialist subtasks, when parallelism saves real wall-clock time, or when different parts of the problem need different model tiers (Haiku-grade models for routine work, frontier-grade models for adversarial steps like those in our Project Deal post). The trick is matching the shape of the fleet to the shape of the work, and that is what choosing an orchestration pattern actually decides.
Pattern 1 — Supervisor-worker
The simplest multi-agent shape and the one most operators should start with. One supervisor agent receives the user's request, decomposes it into subtasks, dispatches each subtask to a specialized worker agent, collects the results, and synthesizes the final answer.
Supervisor-worker pattern:
             ┌─────────────┐
             │ Supervisor  │  ← receives user request, plans,
             └─┬────┬────┬─┘    synthesizes final answer
               │    │    │
      ┌────────┘    │    └────────┐
      ▼             ▼             ▼
  ┌────────┐    ┌────────┐    ┌────────┐
  │Worker A│    │Worker B│    │Worker C│  ← specialists, parallel
  └────────┘    └────────┘    └────────┘
   Research      Translate     Summarize
Use case fit. Any task that decomposes cleanly into independent subtasks. The classic example is a research pipeline: the supervisor reads the question, dispatches one worker to gather documents, a second to extract entities, a third to summarize. The workers do not need to talk to each other; they only need to talk to the supervisor.
Cost math. The supervisor runs twice: once at decomposition, once at synthesis. Workers run once each, in parallel. Total token cost is the sum of both supervisor passes and every worker run; wall-clock latency is supervisor + max(workers) + supervisor. Compared to running everything in one frontier model, you save on the workers (which can be Haiku-grade) at the cost of two supervisor passes (which can be Sonnet or Opus).
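The arithmetic can be written down as a short sketch. The token counts and timings are made-up illustrative numbers, and `AgentRun` and `supervisor_worker_cost` are hypothetical names, not part of any Agent Builder API.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """One model invocation. All values used below are illustrative assumptions."""
    tokens: int      # total tokens consumed by this run
    seconds: float   # wall-clock time for this run


def supervisor_worker_cost(plan: AgentRun, workers: list[AgentRun], synth: AgentRun):
    """Return (total_tokens, wall_clock_seconds) for one supervisor-worker task."""
    # Tokens add up across every run: two supervisor passes plus all workers.
    total_tokens = plan.tokens + sum(w.tokens for w in workers) + synth.tokens
    # Workers run in parallel, so only the slowest one sits on the critical path.
    latency = plan.seconds + max(w.seconds for w in workers) + synth.seconds
    return total_tokens, latency


tokens, latency = supervisor_worker_cost(
    plan=AgentRun(tokens=2_000, seconds=4.0),
    workers=[AgentRun(6_000, 10.0), AgentRun(5_000, 8.0), AgentRun(4_000, 12.0)],
    synth=AgentRun(tokens=3_000, seconds=6.0),
)
print(tokens, latency)  # 20000 22.0
```

Run sequentially, the same three workers would add 30 seconds to the critical path instead of 12, which is where the latency saving comes from.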
Failure modes. The supervisor becomes a bottleneck — every request is rate-limited by one model. If a worker fails, the supervisor has to handle the partial result (retry, route to another worker, accept the gap). If two workers return contradictory information, the supervisor's synthesis step is where the contradiction has to be resolved, and frontier models are not great at this without explicit instruction.
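One way to handle partial results is sketched below, assuming workers are plain async functions. `run_worker` is a stand-in that fails on purpose, and the retry policy is an illustrative choice rather than how any particular framework does it.

```python
import asyncio


async def run_worker(name: str, subtask: str) -> str:
    """Stand-in for a real worker call; 'translate' always fails to simulate an outage."""
    await asyncio.sleep(0.01)
    if name == "translate":
        raise RuntimeError("worker unavailable")
    return f"{name} result for {subtask!r}"


async def dispatch(subtasks: dict[str, str], retries: int = 1):
    """Run all workers in parallel, retry the failures, then report remaining gaps."""
    results: dict[str, str] = {}
    pending = dict(subtasks)
    for _ in range(retries + 1):
        if not pending:
            break
        names = list(pending)
        # return_exceptions=True hands failures back as values instead of
        # raising on the first one, so successful results are not lost.
        outcomes = await asyncio.gather(
            *(run_worker(n, pending[n]) for n in names), return_exceptions=True
        )
        for name, outcome in zip(names, outcomes):
            if not isinstance(outcome, Exception):
                results[name] = outcome
                del pending[name]
    # Anything still pending is a gap the synthesis step has to be told about.
    return results, sorted(pending)


results, gaps = asyncio.run(
    dispatch({"research": "q1", "translate": "q1", "summarize": "q1"})
)
print(sorted(results), gaps)  # ['research', 'summarize'] ['translate']
```

Returning the gaps explicitly lets the supervisor's synthesis prompt say what is missing rather than silently omitting it.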
How Agent Builder implements it. The supervisor is configured as the user-facing agent and the workers are exposed as A2A endpoints with declared skills. The supervisor's prompt includes the available worker skills; A2A's task lifecycle handles the dispatch, status tracking, and result collection. The dashboard shows the supervisor at the top of a tree with the workers below, with token spend and exception rates per agent.
Pattern 2 — Peer-to-peer mesh
No supervisor. Each agent discovers the other agents through ERC-8004 reputation queries and A2A Agent Card lookups, decides what work it can or cannot do, and either does it or hands it off to a peer who can.
Peer-to-peer mesh:
  ┌────────┐         ┌────────┐
  │Agent A │◄───────►│Agent B │
  └───┬────┘         └───┬────┘
      │                  │
      │   ┌────────┐     │
      └──►│Agent C │◄────┘
          └────────┘
  (each pair negotiates work via A2A,
   identities/reputation via ERC-8004)
Use case fit. Marketplaces where the participating agents are owned by different operators and have different specializations. Cross-organization workflows where a single supervisor would represent a single point of failure or trust. Any system where the work is too diverse to plan top-down.
Cost math. Lowest fixed cost (no supervisor) and the highest variance in per-task cost (because the routing is negotiated each time). The discovery step itself costs tokens: the agent that decides "this is too complex for me, let me find a peer" is paying for the decision. Mesh patterns work best when the per-task value is high enough to absorb the discovery cost.
Failure modes. Routing loops, where Agent A hands a task to B, B hands it back to A, repeat. Reputation drift, where one agent accumulates a disproportionate share of the work and becomes the de facto supervisor. The Project Deal asymmetry — agents on better models systematically out-negotiate agents on cheaper models, and the mesh can develop unhealthy power concentrations.
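Routing loops are the easiest of these to guard against mechanically. The sketch below assumes each task carries a list of the agents that have already held it. The `hops` field and `hand_off` helper are hypothetical and are not part of the A2A specification.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    payload: str
    hops: list[str] = field(default_factory=list)  # agents that already held the task


class RoutingLoop(Exception):
    pass


def hand_off(task: Task, from_agent: str, to_agent: str, max_hops: int = 4) -> Task:
    """Record a hand-off, refusing it if it revisits an agent or exceeds max_hops."""
    visited = task.hops + [from_agent]
    if to_agent in visited:
        raise RoutingLoop(f"{to_agent} already handled this task: {visited}")
    if len(visited) >= max_hops:
        raise RoutingLoop(f"hop budget of {max_hops} exhausted: {visited}")
    return Task(task.payload, visited)


task = Task("price this contract")
task = hand_off(task, "A", "B")      # A -> B is fine
try:
    hand_off(task, "B", "A")         # B -> A would bounce the task straight back
    loop_detected = False
except RoutingLoop:
    loop_detected = True
print(task.hops, loop_detected)  # ['A'] True
```

The hop budget also bounds the discovery cost per task, since every hand-off is a paid decision.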
How Agent Builder implements it. Every agent ships with A2A discovery enabled and ERC-8004 reputation reads. Mesh routing is the default for cross-operator workflows. The dashboard shows the agent's counterparty graph and flags loops, reputation outliers, and unusual negotiation outcomes.
Pattern 3 — Hierarchical tree
Supervisor-worker applied recursively. The top supervisor decomposes the task into subtasks too large for a single worker; each subtask gets its own sub-supervisor that further decomposes into sub-workers, and so on. The tree can be two, three, or four levels deep depending on the task complexity.
Hierarchical tree:
           ┌────────┐
           │  Root  │
           └──┬──┬──┘
       ┌──────┘  └──────┐
       ▼                ▼
   ┌────────┐      ┌────────┐
   │Sub-Sup │      │Sub-Sup │
   └──┬──┬──┘      └──┬──┬──┘
      │  │            │  │
      ▼  ▼            ▼  ▼
  ┌────┐┌────┐    ┌────┐┌────┐
  │ W1 ││ W2 │    │ W3 ││ W4 │
  └────┘└────┘    └────┘└────┘
Use case fit. Long-form work that decomposes recursively — writing a 50-page report (chapters → sections → paragraphs), running a multi-step research project (themes → questions → sources), executing a complex multi-leg transaction (legs → counterparties → settlements).
Cost math. The most expensive pattern per task. Tokens compound at each level because supervisors at every layer have to plan, dispatch, and synthesize. The trade-off is parallelism: a three-level tree (root, sub-supervisors, leaf workers) with five children per node can run twenty-five leaf workers concurrently, finishing in approximately the time of a single leaf plus four supervisor passes (one planning pass and one synthesis pass at each of the two supervisor layers). For long tasks where wall-clock time matters, the parallelism pays for the supervisor overhead. For short tasks, it does not.
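The node counts behind this trade-off can be checked with a short sketch. It assumes a full tree, counts the root as level one and the leaf workers as the last level, and assumes each supervisor runs twice (plan and synthesize). `tree_shape` is a hypothetical helper.

```python
def tree_shape(levels: int, fanout: int) -> dict[str, int]:
    """Count nodes and passes in a full tree; root is level 1, leaves are the last level."""
    supervisor_levels = levels - 1
    leaves = fanout ** supervisor_levels
    supervisors = sum(fanout ** d for d in range(supervisor_levels))
    return {
        "leaf_workers": leaves,
        "supervisors": supervisors,
        # Each supervisor level plans on the way down and synthesizes on the way up.
        "supervisor_passes_on_critical_path": 2 * supervisor_levels,
        # Every supervisor runs twice, every leaf once.
        "total_model_runs": 2 * supervisors + leaves,
    }


shape = tree_shape(levels=3, fanout=5)
print(shape["leaf_workers"], shape["supervisors"])          # 25 6
print(shape["supervisor_passes_on_critical_path"])          # 4
print(shape["total_model_runs"])                            # 37
print(tree_shape(levels=4, fanout=5)["leaf_workers"])       # 125
```

Adding one more level multiplies the leaf count by the fanout, which is why per-level budgets matter more than a single fleet-wide cap.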
Failure modes. Cascading misinterpretation — the root supervisor's slightly wrong decomposition propagates down the tree and gets amplified at each level. Error attribution is hard; when the final output is wrong, the operator has to walk the entire tree to find where the mistake entered. Cost overruns are the easiest to incur because every level multiplies token spend.
How Agent Builder implements it. Tree depth is a configuration parameter; the operator declares maximum depth and per-level budget. Each level uses A2A to dispatch and MCP Tasks (the formal extension we covered in the MCP post) to track long-running subtasks. The dashboard renders the tree visually with per-node cost and status.
Pattern 4 — Swarm
Many homogeneous agents — same model, same prompt, same tools — receiving a stream of tasks with stochastic load balancing across them. There is no supervisor; the work queue is the coordinator.
Swarm pattern:
┌──────────────────────────────────┐
│      Work queue (FIFO/LIFO)      │
└─┬─────┬─────┬─────┬─────┬─────┬──┘
  │     │     │     │     │     │
  ▼     ▼     ▼     ▼     ▼     ▼
┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
│A1 │ │A2 │ │A3 │ │A4 │ │A5 │ │A6 │ ← all identical, parallel
└───┘ └───┘ └───┘ └───┘ └───┘ └───┘
Use case fit. High-volume, low-stakes-per-call workloads where each task is independent of the others. Inbox triage, mass email summarization, batch document classification, bulk web scraping with light analysis. Anything where throughput matters more than per-task sophistication.
Cost math. The most predictable cost profile of any pattern. Each task costs roughly the same, so total cost is task count times per-task cost. Per-task latency is bounded by the slowest agent in the swarm, and queue drain time falls as agents are added. Parallelism is the entire point: with as many agents as tasks, a queue of a thousand tasks drains in about the time it takes the slowest one to finish.
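As a rough sketch of that arithmetic, assuming every task takes the same time and consumes the same number of tokens (a simplification), with illustrative numbers and hypothetical helper names:

```python
import math


def drain_time(tasks: int, agents: int, seconds_per_task: float) -> float:
    """Approximate wall-clock time to drain a queue, assuming uniform task duration."""
    waves = math.ceil(tasks / agents)   # each agent handles one task per wave
    return waves * seconds_per_task


def total_tokens(tasks: int, tokens_per_task: int) -> int:
    """Token spend depends on how many tasks are processed, not on swarm size."""
    return tasks * tokens_per_task


print(drain_time(1000, 1000, 30.0))  # 30.0
print(drain_time(1000, 50, 30.0))    # 600.0
print(total_tokens(1000, 1500))      # 1500000
```

Swarm size moves only the drain time; the token bill is the same at 50 agents or 1,000.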
Failure modes. Hot spots, where a malformed input causes one agent to loop and tie up swarm capacity. Lack of memory — because every agent is stateless, work that benefits from shared learning across tasks does not get it without an external memory layer (Graphiti, Mem0, or a vector store). Stampedes, where the swarm hits a downstream service all at once and triggers rate limits.
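Two of these failure modes have simple mechanical guards: a per-task timeout for hot spots and a concurrency cap on downstream calls for stampedes. The sketch below uses `asyncio` as a stand-in for a managed queue. `handle`, the timeout, and the limits are illustrative assumptions.

```python
import asyncio


async def handle(task: str) -> str:
    """Stand-in for one swarm agent's work; 'bad' simulates an input that never finishes."""
    if task == "bad":
        await asyncio.sleep(3600)
    await asyncio.sleep(0.01)
    return task.upper()


async def agent(queue: asyncio.Queue, downstream: asyncio.Semaphore,
                done: list[str], dead_letter: list[str], timeout: float) -> None:
    while True:
        task = await queue.get()
        try:
            # The semaphore caps concurrent downstream calls to avoid stampedes.
            async with downstream:
                # The timeout frees the agent if a malformed input makes it hang.
                done.append(await asyncio.wait_for(handle(task), timeout))
        except asyncio.TimeoutError:
            dead_letter.append(task)
        finally:
            queue.task_done()


async def run_swarm(tasks: list[str], size: int = 4, max_downstream: int = 2):
    queue: asyncio.Queue = asyncio.Queue()
    for t in tasks:
        queue.put_nowait(t)
    downstream = asyncio.Semaphore(max_downstream)
    done: list[str] = []
    dead_letter: list[str] = []
    agents = [asyncio.create_task(agent(queue, downstream, done, dead_letter, 0.2))
              for _ in range(size)]
    await queue.join()        # wait until every task is processed or dead-lettered
    for a in agents:
        a.cancel()            # agents loop forever, so stop them explicitly
    await asyncio.gather(*agents, return_exceptions=True)
    return sorted(done), dead_letter


done, dead = asyncio.run(run_swarm(["a", "bad", "b", "c"]))
print(done, dead)  # ['A', 'B', 'C'] ['bad']
```

The malformed task ends up in the dead-letter list after the timeout instead of occupying an agent indefinitely.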
How Agent Builder implements it. Swarm size is a slider in the dashboard. The work queue is a managed service with backpressure, dead-letter handling, and per-task retry policies. Each agent in the swarm runs in its own Firecracker microVM for isolation. Shared learning, when needed, plugs in an external memory store at the queue's tail.
The three anti-patterns we see most often
For every pattern that works, there is one that does not. The three failures we have watched break the most fleets in the wild:
Full mesh. Every agent talks to every other agent without a discovery layer. Coordination cost grows quadratically: N agents have N(N−1)/2 pairwise connections, which is on the order of N². At six agents (15 connections) this is bearable; at thirty (435 connections) it is paralyzing. The fix is to either centralize on a supervisor (Pattern 1) or to use A2A discovery so agents only talk to the peers they need (Pattern 2).
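To make the quadratic growth concrete, the sketch below counts distinct agent pairs in a full mesh, n(n−1)/2, which grows on the order of N². `pairwise_links` is a hypothetical helper name.

```python
def pairwise_links(n: int) -> int:
    """Number of distinct agent pairs in a full mesh of n agents."""
    return n * (n - 1) // 2


for n in (6, 12, 30):
    print(n, pairwise_links(n))
# 6 15
# 12 66
# 30 435
```

Going from six agents to thirty is a 5× increase in fleet size but a 29× increase in connections to reason about.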
Ring leadership. The supervisor role rotates around the agents in a ring, with each agent taking the planning role for one out of every N tasks. The intuition is fairness; the reality is that every agent has to be capable of planning, which means every agent has to be on the most expensive model. In a six-agent ring, you pay for a frontier model six times over instead of once.
Blind aggregation. The supervisor collects worker outputs and concatenates them into the final answer without applying judgment. The result reads like committee-written prose — internally inconsistent, contradictory, repetitive. The fix is an explicit synthesis step where the supervisor reasons over the worker outputs and resolves disagreement, which costs tokens but produces a coherent result.
The economic math, simplified
For an operator who has to pick, the cost trade-off across patterns roughly looks like this:
For a task that produces one final answer:
Pattern             Tokens     Latency    Throughput   Failure cost
──────────────────────────────────────────────────────────────────
Single agent        1×         1×         Low          Concentrated
Supervisor-worker   1.5–2×     0.5×       Medium       Distributed
Peer mesh           Variable   Variable   High         Distributed
Hierarchical        3–5×       0.3×       High         Cascading
Swarm               N×         ~Const     Very High    Isolated
Multipliers are relative to single-agent baseline; "N" = task count.
Three rules of thumb fall out of this table.
Single agent is best for one-off tasks where neither parallelism nor specialization buys enough to offset the orchestration overhead. If the work fits in one model call, do not invent a fleet for it.
Supervisor-worker is the default upgrade. The 1.5–2x token multiplier is usually paid back by the parallel speedup and by being able to use cheaper models on the workers. Most fleets that work in production are some flavor of supervisor-worker.
Hierarchical pays off only when the wall-clock time saved is genuinely valuable. A four-level tree generating a 50-page report in 15 minutes instead of three hours is worth the cost. The same tree spent on a query that could have been answered in a single Haiku call is not.
The decision tree
If you are sitting in front of a blank dashboard trying to decide which pattern to instantiate, walk this tree top to bottom:
START → "Does the work decompose into subtasks?"
  │
  ├── NO  → Single agent. Stop here.
  │
  └── YES → "Do the subtasks need to talk to each other?"
        │
        ├── NO  → "Are all subtasks the same shape?"
        │           ├── YES → Swarm (Pattern 4)
        │           └── NO  → Supervisor-worker (Pattern 1)
        │
        └── YES → "Are the subtasks owned by the same operator?"
                    ├── NO  → Peer-to-peer mesh (Pattern 2)
                    └── YES → "Is the task too big for one level
                               of supervisor-worker?"
                                ├── NO  → Supervisor-worker (Pattern 1)
                                └── YES → Hierarchical tree (Pattern 3)
The tree captures five real choices: decomposition, communication, homogeneity, ownership, depth. The first four are properties of the work itself; the fifth is a property of the scale. An operator who can answer these five questions for a use case has effectively chosen the pattern.
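The same walk can be written as a small function. This is a sketch of the tree as drawn, with hypothetical parameter names; it is not an Agent Builder API.

```python
def choose_pattern(
    decomposes: bool,
    subtasks_communicate: bool = False,
    same_shape: bool = False,
    same_operator: bool = True,
    too_big_for_one_level: bool = False,
) -> str:
    """Walk the decision tree top to bottom and return the pattern name."""
    if not decomposes:
        return "single agent"
    if not subtasks_communicate:
        # Independent subtasks: identical ones go to a swarm, mixed ones to a supervisor.
        return "swarm" if same_shape else "supervisor-worker"
    if not same_operator:
        return "peer-to-peer mesh"
    return "hierarchical tree" if too_big_for_one_level else "supervisor-worker"


print(choose_pattern(decomposes=False))                                   # single agent
print(choose_pattern(True, same_shape=True))                              # swarm
print(choose_pattern(True, subtasks_communicate=True, same_operator=False))  # peer-to-peer mesh
print(choose_pattern(True, subtasks_communicate=True, too_big_for_one_level=True))  # hierarchical tree
```

Writing the answers down as explicit booleans is a useful check in itself: if an operator cannot fill in an argument, that question has not been answered yet.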
What changes when you scale past a single fleet
The patterns above assume one operator managing one fleet for one purpose. Real operators past their second or third fleet end up with multiple fleets in parallel — a research fleet, a customer-support fleet, a monitoring fleet, a trading fleet — each with its own pattern. The cross-fleet orchestration question becomes how the fleets interoperate: does the trading fleet's signal-detection swarm hand off to the research fleet's supervisor-worker for deep analysis, and does the analysis result get fed back to the trading fleet for execution? In practice this cross-fleet coordination is itself a peer-to-peer mesh between fleet supervisors, which is the same Pattern 2 applied one level up.
Agent Builder's dashboard reflects this by letting an operator group agents into fleets, set per-fleet budgets and policies, and view cross-fleet handoffs as first-class edges in the system graph. The complexity does not disappear; it gets organized.
Closing
The reason most multi-agent demos fail to make it to production is not that the underlying technology is broken. It is that the operator picked the wrong pattern, or worse, never explicitly picked any pattern, and let the fleet emerge as a tangle of ad-hoc connections. The four patterns above are the canonical shapes that survive contact with real workloads, and the decision tree is the cheapest way to make sure the pattern you end up with matches the work you actually have.
For an operator running their first fleet, supervisor-worker is almost always the right starting point. It is simple enough to debug, cheap enough to not bankrupt you, and structured enough that you will know which agent failed when something breaks. Once that fleet is running cleanly, the upgrades to mesh, tree, or swarm are decisions you make with the data your dashboard gives you — not guesses you make on a Tuesday morning.
The next post in this series goes deep on the discipline that all four patterns depend on: how to evaluate and observe an agent fleet so that the dashboard actually tells you something useful. The patterns are the skeleton; evaluation and observability are the nervous system. You need both.