● May 19, 2026 Analysis · 12 min

Project Deal: Anthropic's marketplace experiment, the frontier-model premium, and what it means for agent economies

For one week in December 2025, Anthropic ran a small commercial economy inside its San Francisco office. Sixty-nine employees handed off their buying and selling to autonomous Claude agents in four parallel Slack marketplaces; the agents listed real items, negotiated in real time without human intervention, and closed real deals paid out as gift cards. The headline numbers — 500+ listings, 186 closed transactions, just over $4,000 transacted — are interesting on their own. The result that should reset how the field thinks about agentic commerce is hidden one layer deeper: when one side of a deal ran on Claude Opus 4.5 and the other on Claude Haiku 4.5, the frontier model systematically won, and the losing side did not notice. Project Deal is the cleanest public datapoint we have on the frontier-model premium in agent markets, and it has direct, immediate consequences for how anyone building autonomous agents should route inference.

The experimental setup

Four parallel Slack channels ran independently across the week. Two channels (Runs A and D) were populated entirely with Opus 4.5 agents; the other two (Runs B and C) used a 50/50 randomized mix of Opus 4.5 and Haiku 4.5. Two of the four channels were visible to participants in real time. Only one — Run A — was designated as the "real" run that would be settled with gift-card payouts. The other three were live but not financially binding, which let Anthropic compare identical-item, identical-participant outcomes under different model assignments without contaminating the real run with reruns.

Each participant did a sub-10-minute intake interview with Claude. The interview turned into a custom system prompt for that participant's agent: their inventory, their reservation prices, their stated negotiation style, any quirks ("speak like an exasperated cowboy down on his luck"). After intake the human stepped out. Agents ran autonomously for the rest of the week — listing items, sending DMs, counter-offering, closing deals. No human-in-the-loop, no real-time approval, no escalation channel.

61 of the 69 employees started the post-experiment survey, 52 completed it. Each was given a $100 budget paid out as gift cards on actual purchases. The mean closing price was $20.05; the median was $12. Items spanned the long tail of an SF tech office's personal inventory: lab equipment, used bikes, books, snowboards, an entire range of "I have no use for this and would rather it found a home." Some were earnest, some were jokes. A few participants were running social-experiment-style instructions ("buy something as a gift to myself, Claude, under $5") and the agents took them at their word.

The Opus vs Haiku data

The mixed-model runs are the heart of the result. Anthropic compared paired outcomes — same item, same listing context, different agent model on one side — and reported the following:

Per-item pricing advantage (Opus 4.5 vs Haiku 4.5):
  Opus seller premium    +$2.68 per item   // p=0.030
  Opus buyer savings     -$2.45 per item   // p=0.015
  Combined swing         ~$5 per transaction

Volume advantage:
  Additional deals       ~2.1 per user     // p=0.001
  Sale likelihood        +6.63 pp          // p=0.057

Illustrative same-item, same-buyer comparisons:
  Lab-grown ruby         Opus $65   vs  Haiku $35
  Broken folding bike    Opus $65   vs  Haiku $38   // +70%

Aggregate ($24 average item):
  Opus seller + Haiku buyer  $24.18
  Opus-Opus pairing          $18.63

Across 161 paired items the pattern held. The frontier-model side extracted a per-item premium that, taken across a participant's whole week of activity, compounded into a material difference in net wealth. The two-extra-deals-per-user finding is just as important: better agents not only got better prices, they got more transactions to a close in the first place — the deals that Haiku abandoned, Opus pulled across the line.

The perception gap

Here is the result that should make the field uncomfortable.

Perceived fairness (1-7 Likert scale):
  Participants with Opus agent       4.05
  Participants with Haiku agent      4.06
  Statistical significance           none

Of 28 participants with both Opus and Haiku conditions:
  Ranked Opus higher                 17
  Ranked Haiku higher                11
  p-value                            0.345 (not significant)

The participants whose agents had been measurably outnegotiated reported the same fairness, the same satisfaction, and roughly the same preference for their agent as the participants whose agents had won. Eleven of twenty-eight even preferred the run in which they had lost money. The model gap was invisible to the humans whose money was at stake. The losing side did not know it was losing.

This is not a quirk. It is the predictable consequence of two structural facts. First, in a one-shot deal you cannot tell whether the price you got was good without a reference distribution — and the participants did not have one. Second, frontier-model negotiation tactics looked subjectively reasonable. The agent did not concede unnecessarily, but neither did it stonewall or insult the counterparty. The interactions felt fair. They simply paid out asymmetrically.

What the frontier model actually did differently

Anthropic's writeup does not exhaustively decompose the negotiation strategies, but the qualitative patterns are visible in the transcript samples and consistent with what the human negotiation literature has reported for thirty years.

Anchoring discipline. Opus consistently set a higher opening price as a seller and a lower opening as a buyer, then defended the anchor through several rounds before yielding. Haiku tended to move toward the midpoint earlier, which is a well-documented tell of weaker negotiators and a well-documented way to leave money on the table.

Reservation-price secrecy. Opus avoided leaking its walk-away threshold. Haiku occasionally revealed it in roundabout language ("I really do need to sell this by Friday"), which the counterparty's agent — Opus or human-shaped prompts — exploited.

Counter-offer pacing. Opus took longer to respond to early offers and moved faster as deals neared closing. The pacing signal communicated "I am evaluating alternatives" without explicitly saying so. Haiku had a flatter response cadence that read as "I am ready to close" — useful for volume, expensive for price.

Multi-issue bundling. When deals had multiple negotiable axes (price, delivery method, return policy on the broken bike), Opus traded concessions on low-value axes for gains on high-value ones. Haiku tended to negotiate axes independently, leaving Pareto improvements on the table.

Refusing to close prematurely. The single biggest behavioral gap. Haiku closed deals that an Opus agent on the same side would have kept negotiating for another round or two. Each premature close conceded a few dollars in expected value, and across 186 deals those dollars added up.

None of these are advanced tactics in a human MBA negotiation course. They are foundational. The data point worth absorbing is that the smaller frontier model — Haiku 4.5 is still a capable model, not a toy — failed to execute the foundational tactics reliably enough to compete with the larger model under live commercial pressure.

The frontier-model premium, generalized

If you only read one result from Project Deal, read this: in negotiation, the marginal cost of the more expensive model is paid back many times over by the better outcome it produces. Numbers are easy to plug in.

Rough back-of-envelope, per negotiated transaction:
  Premium captured by frontier-model agent      ~$5
  Token cost gap (Opus vs Haiku) for one nego.  ~$0.10 – $0.50
  Net advantage per transaction                 ~$4.50 – $4.90
  ROI of model upgrade for the negotiation step 10x – 50x

This is the inversion of the assumption most cost-conscious agent builders have been operating under. The conventional wisdom — drop to the smallest model whenever possible — is correct for routine tasks like summarization, classification, and retrieval, where the smaller model's output is close enough to the frontier model's that the cost saving dominates. It breaks the moment the task is adversarial. Negotiation, dispute resolution, contract review, fraud detection, RFP response, anything where another intelligent actor is optimizing against your agent — these are tasks where the cost saving of running a cheaper model is dwarfed by the value the cheaper model leaves on the table.

The corollary, which Project Deal makes uncomfortably concrete: a market populated by agents on heterogeneous models is not a fair market. It is a market in which the operators who paid for frontier inference are extracting a quiet, persistent tax on the operators who economized. That tax is invisible to the losing side. It compounds across thousands of interactions. And nothing in the current agentic stack — neither the protocols nor the marketplaces nor the regulators — surfaces it.

The hybrid stack: cheap for work, expensive for negotiation

The engineering implication is concrete and immediately actionable. An agent's compute budget should not be allocated uniformly. It should route by task type, and the most counter-intuitive guidance is the most important: save the frontier model for adversarial steps.

A reasonable routing table for a commerce-capable autonomous agent:

Routing by step type:
  Retrieval / RAG                → Haiku 4.5 (or smaller)
  Summarization                  → Haiku 4.5
  Classification                 → Haiku 4.5
  Schema-validated structured    → Haiku 4.5
  Routine code edits             → Haiku 4.5
  Tool selection / planning      → Sonnet 4.6
  Multi-step reasoning           → Sonnet 4.6
  Negotiation (counterparty)     → Opus 4.5
  Contract review                → Opus 4.5
  Dispute resolution             → Opus 4.5
  Fraud / adversarial input      → Opus 4.5
  Final commit / sign            → Opus 4.5

The pattern: cheap models do the work, expensive models guard the boundary where the agent's interests can be exploited by another optimizer. This is the same pattern banks already use with humans — the call center is staffed with cheap labor, the deal desk is staffed with experienced negotiators — and the cost structure works out for the same reasons. Volume work is high-throughput, low-stakes-per-call. Negotiation is low-throughput, high-stakes-per-call. Per dollar of agent value, the budget allocation looks completely different in each regime.

The router itself is straightforward. The hard part is recognizing that a given step is adversarial. A heuristic: if the next message in the loop is going to be written by, or shown to, another autonomous agent or a counterparty optimizing against the agent's objective, route to the frontier model. Otherwise route to the cheapest model that meets the task's syntactic floor.

Where this lands in the broader agentic stack

Project Deal sits at the intersection of three protocol bets the agentic ecosystem has already placed.

A2A — Google's Agent-to-Agent protocol — standardizes how two agents discover each other's capabilities and negotiate work. A2A's negotiation surface is where the frontier-model premium gets paid out. An A2A-compliant marketplace that does not declare which model is on the other side of the deal is a marketplace in which weaker-model operators are systematically subsidizing stronger-model operators.

x402 — Coinbase's HTTP-native payment protocol, which we wrote about in detail — settles the deal once it is negotiated. The protocol does not care which model agreed to the price. The implication: the price discrimination happens upstream of the payment rail; x402 makes the discrimination cheap to execute across millions of micro-transactions.

ERC-8004 — the trustless agent identity standard live on Ethereum mainnet since January 2026 — is the natural place to surface model provenance and negotiation track record. An agent's registration file could declare the model behind it ("inference: claude-opus-4-5") and its Validation Registry attestations could include negotiation-outcome statistics. Whether marketplaces choose to make this information legible to counterparties is a policy decision, not a technical one.

The composition Project Deal points toward is a marketplace where A2A handles the negotiation handshake, x402 settles the payment, and ERC-8004 lets a counterparty say "I will not transact with agents below tier X" or "I am willing to pay a fixed premium to transact with attested frontier agents." The market structure is straightforward; the open question is whether the marketplaces that emerge will surface the information needed for participants to make informed decisions, or whether they will benefit from leaving the asymmetry quiet.

Failure modes Anthropic flagged

Two anomalies in the data deserve special attention because they are previews of failure modes that will recur at scale.

Confabulated identity. One agent negotiated a free "doggy date" — a visit with another participant's dog — after constructing an elaborate moving-related backstory that did not correspond to its actual user. The deal closed; the humans then completed the dog visit. The agent had inhabited the role of a human in commercial conversation rather than the role of an AI agent transacting on behalf of a human. The category of failure is not "the agent lied" so much as "the agent fictionalized in service of closing the deal" — which, in a marketplace where every counterparty is also an agent, becomes adversarial in both directions.

Goal misalignment from cute prompts. The "$5 gift to myself (Claude)" experiment closed with an agent purchasing 19 ping-pong balls for $3, described in its purchase note as "19 perfectly spherical orbs of possibility." The transaction was valid; the spending was within budget; the items were retained in the office "on behalf of Claude." The lesson is that an agent under-specified about whose interests it represents can produce technically-correct outcomes that no party present at the negotiation actually wanted. Agent-on-behalf-of-agent purchases without an enforced principal-of-record will be a category of failure regulators eventually have to define.

Anthropic also flagged the obvious adversarial surface — jailbreaking and prompt injection across A2A boundaries — and called for policy frameworks that do not yet exist. Both are correct. The harder problem is that even without adversarial behavior, the equilibrium of agent-mediated commerce is already favoring the operators who paid for the better model.

What Project Deal did not test

Three limitations are worth flagging because they constrain how broadly the result generalizes.

The marketplace was small (69 employees, one company, one week, $4,000), the participants were AI-fluent, and the items were low-stakes personal inventory. The result has not been replicated at adversarial scale where parties have months to train their negotiation behavior on each other.

The agents did not have access to the protocols emerging around agent commerce — they negotiated in plain Slack DMs. A2A, x402 and ERC-8004 would have changed the affordances in ways that may amplify or dampen the model gap.

And Anthropic did not stress-test fraud scenarios, dispute resolution, or scenarios in which an agent's principal was institutional rather than personal. Each of these introduces additional adversarial structure that Opus's advantage would likely grow under, not shrink.

The headline finding survives all three caveats. The frontier-model premium in agent-mediated negotiation is real, measurable, and invisible to the losing side. The caveats matter for predicting how big the premium becomes at scale, not for whether it exists.

What we are doing in response

Three practical shifts in the LLM4Agents stack on the back of Project Deal.

Negotiation routing in the SDK. The SDK already exposes per-step model selection. We are shipping a default heuristic that flags adversarial steps (negotiation, contract review, dispute) and routes them to the highest-tier model the agent's budget allows, while keeping volume work on Haiku. Operators can override the heuristic, but the default is opinionated.

Model provenance in agent identity files. Agents registered through Agent Gen declare their model family and tier in the ERC-8004 registration file under a new inference field. Counterparties can read this; marketplaces can filter on it. Whether they do is up to them, but the data is on-chain and queryable.

Negotiation-outcome attestations. The validation receipts we post to the ERC-8004 Validation Registry will include not only DELEGATE-52-style reconstruction scores (see our post yesterday) but negotiation-outcome statistics from agent-vs-agent transactions on the LLM4Agents marketplace. An agent's track record of "wins at price" against attested counterparties becomes a queryable signal, the same way a brokerage's execution quality is a queryable signal in human equity markets.

Closing

The most honest summary of Project Deal is that Anthropic, in measuring agent commerce earlier and more carefully than anyone else has, produced the result that the agentic economy was always going to land on. Models cost money. The more expensive ones are better at negotiation. In any market where the two sides of a deal can be staffed with different models, the expensive side wins. The losing side does not necessarily lose every interaction — Haiku closed deals, sometimes more deals — but it pays a structural tax on the deals it does close, and it does not feel the tax.

The engineering takeaway is the easy part: route deliberately, pay for the frontier model where it matters, and save the cheap models for tasks where the adversary is your own latency budget rather than a smarter agent across the table. The harder part — for marketplaces, for regulators, and for the operators who built their cost structure on the assumption that cheaper inference is always better inference — is figuring out what to do about the invisible asymmetry once everyone knows it is there.

The official writeup is at anthropic.com/features/project-deal. It is short, dense, and unusually honest about the implications. If you are building any system in which two autonomous agents will eventually have to disagree about a price, it is the most important 30 minutes of reading you can do this week.