June 4, 2026 · 14 min

The agent ecosystem competitive map 2026: frameworks, SDKs, builders, observability, and where LLM4Agents fits

Any operator deciding on their stack in 2026 spends two weeks Googling "LangGraph vs CrewAI vs Microsoft Agent Framework vs OpenAI Agents SDK vs Lindy vs Vellum vs Galileo vs LangSmith vs AgentOps." The post that would compress those two weeks into an afternoon does not exist yet, at least not one written honestly. This is that post: five categories of tooling, the four or five players in each category that matter, one sentence of strength and one of weakness per player, a cross-cutting table mapping every player against the five-layer stack, a decision framework for picking framework vs platform vs SDK, and an honest closing section on where LLM4Agents fits and where it does not. The intent is the post you would link to a colleague who is about to make a stack decision and does not want to spend their next ten evenings reading comparison threads.

The map, top down

The agent tooling ecosystem in mid-2026 splits cleanly into five categories. Most operators end up combining tools from two or three of these categories rather than one from each; the friction is picking the right tool in each category and then making them compose.

Agent ecosystem map, mid-2026:

  ┌─────────────────────────────────────────────────────────────┐
  │ 1. Open-source orchestration frameworks                     │
  │    LangGraph · AutoGen · CrewAI · Letta · Pydantic AI       │
  └─────────────────────────────────────────────────────────────┘
  ┌─────────────────────────────────────────────────────────────┐
  │ 2. Model-provider SDKs                                      │
  │    OpenAI Agents SDK · Anthropic SDK + Computer Use         │
  │    Google ADK · Microsoft Agent Framework                   │
  └─────────────────────────────────────────────────────────────┘
  ┌─────────────────────────────────────────────────────────────┐
  │ 3. No-code builder platforms                                │
  │    Lindy · Sema4 · Relevance AI · Vellum                    │
  └─────────────────────────────────────────────────────────────┘
  ┌─────────────────────────────────────────────────────────────┐
  │ 4. Evaluation + observability platforms                     │
  │    Galileo · LangSmith · AgentOps · Helicone                │
  └─────────────────────────────────────────────────────────────┘
  ┌─────────────────────────────────────────────────────────────┐
  │ 5. Marketplaces + registries                                │
  │    Agent.ai · ManusAI · Sakana · ERC-8004 native            │
  └─────────────────────────────────────────────────────────────┘

Category 1 — open-source orchestration frameworks

The category where the operator writes code in Python or TypeScript and gets fine-grained control over agent behaviour. Best for builders who already code and want maximum flexibility.

LangGraph. The graph-shaped orchestration framework from the LangChain team, by mid-2026 the de facto open-source standard for stateful agent workflows. Strength: the cleanest model for explicit state machines, persistence, and human-in-the-loop checkpoints; broad ecosystem of integrations. Weakness: still inherits LangChain's reputation for abstractions that fight the developer when the use case drifts from the framework's mental model.

AutoGen. Microsoft's multi-agent conversational framework, now consolidated with Semantic Kernel into Microsoft Agent Framework (covered in Category 2). The legacy AutoGen v0.x is still in use but is on a deprecation path. Strength: battle-tested multi-agent conversation patterns. Weakness: migrating to the unified Microsoft framework is the main time sink in 2026.

CrewAI. Role-based multi-agent framework that lets you declare "this agent is a researcher, this one is a writer, this one is a critic" and orchestrates them in supervisor-worker or peer patterns. Strength: readable, fast to prototype, an excellent fit for content workflows and team-shaped automations. Weakness: the abstractions get harder to debug when you push past the patterns the framework was designed for.

Letta (formerly MemGPT). The memory-first agent framework, and the one to reach for if your agent's value is continuity across sessions (see our memory post). Strength: the cleanest open-source implementation of structured agent memory we know of. Weakness: narrower scope than LangGraph or CrewAI; you will end up combining it with another framework for non-memory orchestration.

Pydantic AI. Type-safe agent framework from the Pydantic team, designed for production Python codebases that already use Pydantic. Strength: the type discipline and validation story is the strongest in the open-source space; fits naturally into existing Python data pipelines. Weakness: smaller community, fewer pre-built integrations than LangGraph.

Category 2 — model-provider SDKs

The official agent-building toolkits shipped by the model providers themselves. They tend to be the cleanest way to use a single provider's models with maximum feature support, at the cost of cross-provider portability.

OpenAI Agents SDK. OpenAI's production successor to its experimental Swarm framework, built on the Responses API and by 2026 offering first-class tool use, structured outputs, parallel calls and a tracing surface. Strength: the tightest integration with GPT-5.x family features (vision, structured outputs, audio); support for OpenAI's newest models lands here first. Weakness: portability. Code written against this SDK assumes OpenAI is the inference layer, and switching providers later is a real rewrite.

Anthropic SDK + Computer Use. The official Python and TypeScript SDKs plus the Computer Use tool that lets agents drive a browser or desktop. Strength: the Computer Use surface is the most capable in the market for screen-based automation, and its safety defaults are the strongest. Weakness: the SDK is intentionally minimal, so orchestration patterns like supervisor-worker are entirely the developer's responsibility.

Google ADK (Agent Development Kit). Announced at Google Cloud Next in April 2025, alongside the A2A protocol, ADK is Google's answer to the agent framework question, with first-class support for A2A and tight integration into Google Cloud's Vertex AI. Strength: the best A2A reference implementation we have seen; runs natively on Vertex AI, with Cloud Run for scaling. Weakness: heavy Google Cloud assumption; running ADK well outside the GCP ecosystem is awkward.

Microsoft Agent Framework. The Q1 2026 GA of the AutoGen + Semantic Kernel consolidation. Now the canonical Microsoft answer to agent development, with deep Azure integration. Strength: the cleanest enterprise story for organisations standardised on Azure; the consolidation absorbed the best of both predecessors. Weakness: the consolidation broke older AutoGen code paths in ways that are still settling; migration friction remains real.

Category 3 — no-code builder platforms

For operators who want to ship without writing code, the builder platforms abstract the framework layer behind a visual or configuration surface.

Lindy. Consumer-friendly agent builder with strong integrations into the everyday SaaS stack (Gmail, Slack, Notion, Calendar). Strength: the lowest learning curve in this category for a non-technical operator; ships pre-built templates for the obvious use cases. Weakness: the ceiling is lower than the framework path — once you need behaviour the templates do not cover, you hit it.

Sema4. B2B-focused builder for sales, customer-success, and revenue-ops workflows. Strength: deep CRM integrations and revenue-team-shaped templates; fastest path to value if you are automating a sales motion. Weakness: opinionated about workflow shape; agents outside the revenue-ops domain do not fit naturally.

Relevance AI. Multi-agent builder positioning itself between framework flexibility and Lindy-style simplicity, with strong vector and RAG support. Strength: the cleanest balance of flexibility and accessibility in the category; the visual graph editor is the best we have seen. Weakness: pricing scales fast with usage; the operator who succeeds on the platform sometimes ends up rebuilding on a framework to control cost.

Vellum. Prompt-engineering-first platform with evaluation and deployment surfaces wrapped around the prompt as the unit. Strength: if your value is in prompt craft and you want versioning, canary deployment and side-by-side comparison on the prompt itself, Vellum is the cleanest option. Weakness: narrower than the other builders — Vellum is excellent at the prompt layer but you need other tools for orchestration and tool use.

Category 4 — evaluation and observability platforms

The platforms that consume your agent's traces and produce the operator-grade metrics from our evaluation post: correctness, cost, latency, drift.

Galileo. Comprehensive evaluation platform with strong support for hallucination detection, drift monitoring and ground-truth comparison. Strength: the most mature evaluation product in the category; defensible methodology and good integrations. Weakness: priced for the enterprise tier; solo operators and small businesses struggle to justify it.

LangSmith. The LangChain team's observability platform, tightly integrated with LangGraph. Strength: the lowest-friction setup if you are already using LangGraph; the trace viewer is fast and the eval surface is clean. Weakness: the value shrinks if you are not on LangGraph; integrations outside the LangChain ecosystem are second-class.

AgentOps. Observability platform purpose-built for multi-agent systems, with first-class support for tracking agent-to-agent calls. Strength: the best multi-agent visualisation in the category; understands swarms and supervisor-worker patterns natively. Weakness: evaluation surface is less mature than Galileo or LangSmith; you will end up combining AgentOps for observability with another tool for evaluation.

Helicone. LLM observability with strong cost-tracking and caching support. Strength: the cleanest cost-attribution story in the category; the caching layer pays for the platform fee at moderate volume. Weakness: agent-specific features (trace correlation across multi-agent flows) are weaker than AgentOps; better for single-agent LLM workloads.

Category 5 — marketplaces and registries

The platforms where agents discover and hire other agents. The newest category in the ecosystem, still consolidating.

Agent.ai. Consumer-facing marketplace for assembled agents; what Zapier was for SaaS integrations, Agent.ai is trying to be for off-the-shelf agents. Strength: the largest catalog of pre-built agents available to non-technical buyers. Weakness: quality varies wildly across the catalog; operator-grade due diligence on individual agents is currently up to the buyer.

ManusAI. Marketplace + platform hybrid focused on long-horizon autonomous agents, with stronger track-record telemetry than Agent.ai. Strength: reputation and outcome data per agent are first-class; the closest the market has come to a Yelp for agents. Weakness: the agents on the platform are sandboxed inside ManusAI's runtime, which limits the operator's portability.

Sakana. Research-grade infrastructure including the RL Conductor we covered in our internal notes; their public-facing marketplace is small but the underlying capability is interesting. Strength: the technical depth is unmatched for novel agent architectures. Weakness: the marketplace is more research showcase than commercial reality at the time of writing.

ERC-8004 native (cohort). The class of marketplaces and validators that consume the on-chain registries we covered in the ERC-8004 post. Examples include several Base-native and Solana-native projects, plus EigenLayer's AVS validator network. Strength: portable reputation, on-chain attestations, no vendor lock-in. Weakness: still maturing; volume is small relative to ManusAI; UX is uneven across implementations.

The cross-cutting table

The category map is useful for understanding where each player lives. The decision the operator actually has to make is which combination of players covers the five layers of the agentic stack. The cross-cutting table below maps every player against the layers it touches.

Coverage by stack layer (●  primary  ◐  partial  ○  none):

  Player                  MCP   A2A   AP2   x402  ERC-8004  Eval/Obs
  ─────────────────────────────────────────────────────────────────
  LangGraph                ●     ◐     ○     ○      ○         ◐
  AutoGen / MS Agent F.    ●     ◐     ○     ○      ○         ●
  CrewAI                   ●     ◐     ○     ○      ○         ◐
  Letta                    ●     ○     ○     ○      ○         ◐
  Pydantic AI              ●     ○     ○     ○      ○         ◐

  OpenAI Agents SDK        ●     ○     ○     ○      ○         ◐
  Anthropic SDK + CU       ●     ○     ○     ○      ○         ◐
  Google ADK               ●     ●     ○     ○      ○         ◐
  MS Agent Framework       ●     ◐     ○     ○      ○         ●

  Lindy                    ●     ○     ○     ○      ○         ○
  Sema4                    ●     ○     ◐     ○      ○         ○
  Relevance AI             ●     ◐     ○     ○      ○         ◐
  Vellum                   ◐     ○     ○     ○      ○         ●

  Galileo                  ○     ○     ○     ○      ○         ●
  LangSmith                ◐     ○     ○     ○      ○         ●
  AgentOps                 ◐     ●     ○     ○      ○         ●
  Helicone                 ◐     ○     ○     ○      ○         ●

  Agent.ai                 ◐     ◐     ○     ○      ○         ○
  ManusAI                  ●     ●     ○     ○      ◐         ◐
  Sakana                   ●     ◐     ○     ○      ○         ◐
  ERC-8004 native          ◐     ●     ◐     ●      ●         ◐

  LLM4Agents Agent Builder ●     ●     ●     ●      ●         ●

Read the table by column. Most players cover MCP because tool use is table stakes. Most players have weak A2A support because the inter-agent layer is newer. AP2 and x402 support is rare: these are payment layers that require deliberate integration, and most frameworks have not done it yet. ERC-8004 is just as rare, because the on-chain identity layer is currently the differentiator of the ERC-8004-native cohort and LLM4Agents. After MCP, the eval/obs column is the most widely covered, since nearly every framework and SDK eventually adds something there; the depth varies enormously, and three players (Lindy, Sema4, Agent.ai) offer nothing at all.
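One way to check a column-by-column reading is to transcribe the table into data and tally it. The sketch below is a minimal Python example, assuming the table is copied verbatim as one string of ●/◐/○ symbols per row; `COVERAGE` and `column_counts` are hypothetical names for illustration, not part of any product.

```python
from collections import Counter

LAYERS = ("MCP", "A2A", "AP2", "x402", "ERC-8004", "Eval/Obs")

# Coverage table transcribed row by row: one symbol per layer, in LAYERS order.
# ● = primary, ◐ = partial, ○ = none.
COVERAGE = {
    "LangGraph":                "●◐○○○◐",
    "AutoGen / MS Agent F.":    "●◐○○○●",
    "CrewAI":                   "●◐○○○◐",
    "Letta":                    "●○○○○◐",
    "Pydantic AI":              "●○○○○◐",
    "OpenAI Agents SDK":        "●○○○○◐",
    "Anthropic SDK + CU":       "●○○○○◐",
    "Google ADK":               "●●○○○◐",
    "MS Agent Framework":       "●◐○○○●",
    "Lindy":                    "●○○○○○",
    "Sema4":                    "●○◐○○○",
    "Relevance AI":             "●◐○○○◐",
    "Vellum":                   "◐○○○○●",
    "Galileo":                  "○○○○○●",
    "LangSmith":                "◐○○○○●",
    "AgentOps":                 "◐●○○○●",
    "Helicone":                 "◐○○○○●",
    "Agent.ai":                 "◐◐○○○○",
    "ManusAI":                  "●●○○◐◐",
    "Sakana":                   "●◐○○○◐",
    "ERC-8004 native":          "◐●◐●●◐",
    "LLM4Agents Agent Builder": "●●●●●●",
}


def column_counts(table: dict[str, str]) -> dict[str, Counter]:
    """Tally how many rows are primary, partial or none for each layer (column)."""
    counts = {layer: Counter() for layer in LAYERS}
    for player, row in table.items():
        if len(row) != len(LAYERS):
            raise ValueError(f"{player}: expected {len(LAYERS)} symbols, got {len(row)}")
        for layer, symbol in zip(LAYERS, row):
            counts[layer][symbol] += 1
    return counts


if __name__ == "__main__":
    for layer, tally in column_counts(COVERAGE).items():
        print(f"{layer:9} primary={tally['●']:2} partial={tally['◐']:2} none={tally['○']:2}")
```

Run against the transcription, MCP comes out primary for 15 of the 22 rows, while AP2 is primary for exactly one.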

The decision framework — framework vs platform vs SDK

The category an operator should start in depends mostly on three questions about the operator and the workload, with a fourth for workloads that handle payments.

Are you going to write code? If yes, you pick a framework (Category 1) or an SDK (Category 2). If no, you pick a no-code builder (Category 3). This is the cleanest decision in the matrix; most operators answer it in five minutes.

Are you on one model provider or do you want to switch? If you have committed to OpenAI, Anthropic, Google, or Azure (which probably means OpenAI through Azure), the provider SDK is the path of least resistance for the next six months — it ships features fastest. If you want portability across providers — or you are mixing models per step à la Project Deal routing — the open-source framework path is the right choice.
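Mixing models per step mostly comes down to keeping workflow code behind a provider-agnostic interface. The sketch below is a minimal Python illustration under that assumption: `ChatModel`, `FakeModel`, `ROUTES`, `run_step` and the provider names are hypothetical stand-ins, not any vendor's SDK, and no network calls are made.

```python
from typing import Protocol


class ChatModel(Protocol):
    """Hypothetical provider-agnostic interface; not any vendor's real SDK."""

    name: str

    def complete(self, prompt: str) -> str: ...


class FakeModel:
    """Stand-in adapter for one provider's model; makes no network calls."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        # A real adapter would call its provider's SDK here.
        return f"[{self.name}] {prompt}"


# Per-step routing: each workflow step names the model it runs on.
# The provider/model names are placeholders.
ROUTES: dict[str, ChatModel] = {
    "plan": FakeModel("provider-a/large"),
    "extract": FakeModel("provider-b/small"),
    "summarise": FakeModel("provider-c/medium"),
}


def run_step(step: str, prompt: str) -> str:
    """Run one workflow step on whichever model ROUTES assigns to it."""
    return ROUTES[step].complete(prompt)


if __name__ == "__main__":
    for step in ROUTES:
        print(run_step(step, "hello"))
```

Swapping the provider for one step then means changing one entry in `ROUTES` and writing one adapter, not rewriting the workflow. A provider SDK trades away this indirection in exchange for getting new features first.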

How much agent-to-agent traffic do you expect? If your workload is mostly single agents serving end-users directly, the A2A column does not matter much. If your workload is fleets of agents calling each other and external agents (the multi-agent orchestration patterns from our orchestration post), you need real A2A support. That narrows the field to Google ADK, AgentOps for observability, ManusAI as a marketplace, and the ERC-8004-native cohort, with MS Agent Framework as a partial option. Most other players are weak on A2A.

The fourth question — whether you need payment-aware behaviour (AP2 + x402) — pushes you toward LLM4Agents or building your own AP2 integration on top of a framework. There is currently no other off-the-shelf option that covers the payment layer fully, which is both an opportunity and a sign that this part of the ecosystem is early.
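The four questions can be written down as a small decision function. This is a sketch of the framework as described above, not a tool we ship; `Workload` and `shortlist` are hypothetical names, and the A2A line follows the coverage table, where MS Agent Framework is marked partial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    writes_code: bool      # Q1: will you write Python or TypeScript?
    single_provider: bool  # Q2: committed to one model provider?
    heavy_a2a: bool        # Q3: fleets of agents calling each other?
    needs_payments: bool   # Q4: payment-aware behaviour (AP2 + x402)?


def shortlist(w: Workload) -> list[str]:
    """Walk the four questions and return where to start looking."""
    picks: list[str] = []
    # Q1 and Q2 pick the primary category.
    if not w.writes_code:
        picks.append("Category 3: no-code builder")
    elif w.single_provider:
        picks.append("Category 2: model-provider SDK")
    else:
        picks.append("Category 1: open-source framework")
    # Q3: real A2A support narrows the field (● in the table; MS Agent Framework is ◐).
    if w.heavy_a2a:
        picks.append("A2A: Google ADK, AgentOps, ManusAI, ERC-8004 native; MS Agent Framework (partial)")
    # Q4: no other off-the-shelf option covers AP2 + x402 fully.
    if w.needs_payments:
        picks.append("Payments: LLM4Agents, or your own AP2 integration on a framework")
    return picks


if __name__ == "__main__":
    print(shortlist(Workload(writes_code=True, single_provider=False,
                             heavy_a2a=True, needs_payments=True)))
```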

Where LLM4Agents fits — honestly

We have written 24 posts about the agentic stack and built a platform that operates on top of it. The honest framing of where LLM4Agents sits in the map above:

What LLM4Agents is. A control plane and operator dashboard sitting on top of the five-layer stack, with first-class support for all five protocol layers and the evaluation / observability concerns wrapped around them. Agent Builder is the no-code-ish builder surface; the dashboard is the multi-agent operator view; the catalog is the pre-connected library of MCP servers and starter templates. Inside the platform, agents are full citizens of the protocols — they speak A2A, sign AP2 mandates, settle through x402, and post to ERC-8004 — without the operator having to write that plumbing.

When LLM4Agents is the right choice. If you are running multiple agents, want the floor of the security and compliance disciplines we have written about, and are not deeply committed to writing your own orchestration code. If the agent-to-agent and payment surfaces matter to you — they are the parts of the stack where most other tools are weakest. If you want the catalog of pre-built templates as a starting point. If the bundled observability and eval surface saves you the platform-stitching work that comes with assembling AgentOps + Galileo + a separate framework + a separate builder.

When LLM4Agents is the wrong choice. If you are a deeply technical operator who wants maximum code control and is willing to wire the protocols yourself — LangGraph + LangSmith + a custom A2A implementation is the right answer for you. If your workload is single-agent, single-user, no agent-to-agent traffic, no payment surface — Lindy or the OpenAI Agents SDK is lighter weight and will get you to production faster. If you are inside a Microsoft enterprise that is standardising on Azure end-to-end — MS Agent Framework will integrate more naturally than we will. If your business depends on a vertical-specific builder (Sema4 for revenue ops, Vellum for prompt-centric workflows) — those will fit the niche better than our horizontal control plane.

The category we are not. We are not an open-source orchestration framework; we are not a model-provider SDK; we are not pure observability. We are the layer that lets an operator run a fleet without becoming a platform engineer. If that is the role you want to play, we are the natural choice. If you want to be the platform engineer yourself, you have better options in Categories 1 and 4 — and we will say so.

Composition is the norm, not the exception

One last thing the categorical map can obscure: most successful operators end up using two or three tools across categories, not picking one tool to do everything. Common stacks we see in practice:

Solo developer stack. LangGraph for orchestration + LangSmith for eval/obs + Anthropic SDK underneath + Pinecone for vectors. Highly code-centric, maximum control, no payment surface.

SaaS founder stack. OpenAI Agents SDK + Helicone for cost tracking + Vellum for prompt management + Stripe for billing. Lower-code, single-provider, fast iteration on prompts.

Multi-agent operator stack. LLM4Agents Agent Builder as control plane + AgentOps for cross-fleet observability (where the built-in is not enough) + ERC-8004 native marketplace integrations for agent discovery. Multi-protocol, multi-agent, payment-aware.

Enterprise stack. Microsoft Agent Framework on Azure + Azure OpenAI as the model layer + Galileo for evaluation + internal MCP servers for proprietary tools. Enterprise compliance, regulated workloads.

The point is that "which one do I pick" is sometimes the wrong question. The right question is "which combination of two or three covers the five layers I actually need." The decision framework in the previous section helps you pick the primary one; the secondary ones are usually obvious once the primary is chosen.
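Finding a combination that covers the layers you need is a small set-cover problem. The sketch below applies the greedy heuristic (repeatedly add the tool that covers the most still-missing layers) to a hand-picked subset of rows from the coverage table, counting only ● as coverage; `PRIMARY` and `greedy_stack` are hypothetical names. Greedy set cover is not guaranteed to find the smallest combination, but it is a reasonable first pass at this scale.

```python
# Primary (●) coverage for a subset of rows from the coverage table.
PRIMARY: dict[str, set[str]] = {
    "LangGraph": {"MCP"},
    "OpenAI Agents SDK": {"MCP"},
    "Google ADK": {"MCP", "A2A"},
    "LangSmith": {"Eval/Obs"},
    "Galileo": {"Eval/Obs"},
    "AgentOps": {"A2A", "Eval/Obs"},
    "ERC-8004 native": {"A2A", "x402", "ERC-8004"},
}


def greedy_stack(needed: set[str], candidates: dict[str, set[str]]) -> tuple[list[str], set[str]]:
    """Greedy set cover: return the tools picked and any layers left uncovered."""
    remaining = set(needed)
    picks: list[str] = []
    while remaining:
        # max() returns the first maximal item, so dict order breaks ties.
        best = max(candidates, key=lambda name: len(candidates[name] & remaining))
        gain = candidates[best] & remaining
        if not gain:
            break  # no candidate covers any of the remaining layers
        picks.append(best)
        remaining -= gain
    return picks, remaining


if __name__ == "__main__":
    print(greedy_stack({"MCP", "A2A", "Eval/Obs"}, PRIMARY))
    print(greedy_stack({"MCP", "A2A", "x402", "ERC-8004", "Eval/Obs"}, PRIMARY))
```

For a workload that needs MCP, A2A and eval/obs, this subset yields Google ADK plus LangSmith; asking for AP2 leaves that layer uncovered, since no row in the subset is ● for AP2.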

What is going to change in the next six months

The ecosystem above is a snapshot of June 2026. Three changes we are confident enough to put in writing:

Consolidation in Category 1. Five open-source frameworks doing similar things is not a stable equilibrium. We expect at least one of the five (Pydantic AI is the most likely candidate, but AutoGen is technically already on the path) to be absorbed or de facto deprecated by year end. LangGraph's lead is consolidating; CrewAI is finding its niche in role-shaped automations; Letta is finding its niche in memory; the squeezed positions are the generalists.

Eval/obs platforms will broaden into governance. The EU AI Act enforcement starting in August (we covered this in the AI Act post) is going to make every Category 4 platform extend into "AI governance" — model registers, technical files, audit trails, conformity assessment. Galileo is already moving here; expect LangSmith and AgentOps to follow.

Marketplaces will bifurcate. Agent.ai-style consumer marketplaces and ERC-8004-native marketplaces are not competing for the same buyer: one is app-store-shaped, the other is shaped by the B2B agent economy. We expect the next twelve months to make the bifurcation explicit and to see at least two new entrants in each branch.

Closing

The agent ecosystem in mid-2026 is the cleanest it has been since the category started. Five categories, 21 named players, a settled five-layer stack underneath, and decision frameworks an operator can actually use. The next twelve months will produce more consolidation than fragmentation, which is the opposite of what most observers predicted in early 2025. This is a sign of category maturity, not stagnation: the players standing twelve months from now will be more capable than the ones in the map today, and the operators who pick a stack now will be able to compound on it rather than rebuild on it.

If you are still deciding, walk the decision framework: code or no-code, single provider or portable, agent-to-agent volume, payment surface. The answer narrows the field from twenty-one players to two or three. Pick from those two or three based on the secondary criteria (community, pricing, integrations you already have). And do not overthink it: the operators who shipped on whatever was in the map nine months ago beat the operators who are still picking nine months later.

The next post in this series steps back from infrastructure entirely and updates the agent-memory landscape — Graphiti, Mem0, Titans, MemOS — for the operators who have been waiting to know if their memory layer is still the right choice. See you there.