May 12, 2026 · Research · 8 min

Agent memory in 2026: Graphiti, Mem0, and the layer we haven't built yet

An autonomous agent without memory is a chatbot with extra steps. In 2026, memory finally became a first-class component of the agent stack — with its own benchmarks, its own research literature, and a measurable performance gap between approaches. Here is what Graphiti and Mem0 do, where each one breaks, and which alternatives are worth tracking before betting your architecture.

Two years ago, "agent memory" meant a vector database with a retrieval step bolted onto a chat loop. That setup is now insufficient for anything past a demo. The reason is simple: an agent that operates for weeks across many sessions, many users and many tools needs to know what is still true, what was true and got superseded, what it has already tried, and what it should not try again. None of that is reducible to cosine similarity.

The two open-source projects that took this problem seriously and shipped production-grade answers are Graphiti (the temporal graph engine inside Zep) and Mem0. They are not competing for the same use case, and the differences are sharper than the marketing pages suggest.

Graphiti: bi-temporal knowledge graph

Graphiti models everything an agent has seen as a directed graph of entities, relationships and facts, with each fact carrying two timestamps. One records when the fact became true in the world (t_valid); the other records when the system ingested it (t_ingested). When a new fact contradicts an existing one, the old edge does not get deleted: it gets a t_invalid timestamp and stays queryable.

That bi-temporal model is the point. "Alice works at Acme" is not the same fact as "Alice works at Acme as of 2024-09 to 2025-11"; an agent that confuses the two will tell you the wrong job title with high confidence. Graphiti makes the difference explicit in the schema.
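A minimal sketch of that idea, assuming a simplified schema: the field names t_valid, t_ingested and t_invalid come from the description above, while FactEdge, add_fact, facts_as_of and the second employer "Globex" are hypothetical and are not Graphiti's API. It also assumes each (subject, predicate) pair holds one value at a time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FactEdge:
    """One fact in a bi-temporal graph (hypothetical schema, not Graphiti's API)."""
    subject: str
    predicate: str
    obj: str
    t_valid: datetime                     # when the fact became true in the world
    t_ingested: datetime                  # when the system learned about it
    t_invalid: Optional[datetime] = None  # when the fact stopped being true


def add_fact(edges: list[FactEdge], new: FactEdge) -> None:
    """Append a fact; close out any open fact it contradicts instead of deleting it."""
    for edge in edges:
        same_slot = (edge.subject, edge.predicate) == (new.subject, new.predicate)
        if same_slot and edge.t_invalid is None and edge.obj != new.obj:
            edge.t_invalid = new.t_valid
    edges.append(new)


def facts_as_of(edges: list[FactEdge], when: datetime) -> list[FactEdge]:
    """Return the facts that were true in the world at `when`."""
    return [
        e for e in edges
        if e.t_valid <= when and (e.t_invalid is None or when < e.t_invalid)
    ]


edges: list[FactEdge] = []
add_fact(edges, FactEdge("Alice", "works_at", "Acme",
                         datetime(2024, 9, 1), datetime(2024, 9, 3)))
add_fact(edges, FactEdge("Alice", "works_at", "Globex",  # hypothetical new employer
                         datetime(2025, 11, 1), datetime(2025, 11, 5)))

print([e.obj for e in facts_as_of(edges, datetime(2025, 1, 1))])  # ['Acme']
print([e.obj for e in facts_as_of(edges, datetime(2026, 1, 1))])  # ['Globex']
```

Both edges survive the update, so the same store answers "where does Alice work now?" and "where did she work in January 2025?" without guessing.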

Retrieval is hybrid: dense embeddings for semantic similarity, BM25-style keyword search for exact entity match, and graph traversal for multi-hop reasoning. The combination matters because it removes the need for an LLM summarisation pass at query time — Graphiti returns ranked subgraphs in roughly constant time regardless of graph size, which is what makes it usable for interactive agents.
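To make the "hybrid" part concrete, here is a sketch of reciprocal rank fusion, one common way to merge ranked lists from several retrievers without an LLM in the loop. It illustrates the general technique only; this article does not establish which fusion method Graphiti uses, and the edge ids and the constant k=60 are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids; each item scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by id so ties are deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result lists from the three retrievers for one query.
dense = ["edge_7", "edge_2", "edge_9"]    # embedding similarity
keyword = ["edge_2", "edge_7", "edge_4"]  # BM25-style keyword match
graph = ["edge_2", "edge_9", "edge_1"]    # graph traversal

fused = reciprocal_rank_fusion([dense, keyword, graph])
print(fused[:3])  # ['edge_2', 'edge_7', 'edge_9']
```

An item that ranks well in several lists rises to the top even if no single retriever put it first, which is the property a hybrid pipeline is after.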

On Zep's published benchmarks, Graphiti-backed memory outperforms MemGPT on Deep Memory Retrieval and scores around fifteen points higher than Mem0 on LongMemEval's temporal-reasoning slice. The cost is operational: you are running a graph database (Neo4j or FalkorDB), and you are paying for an LLM call on every ingestion to extract entities and relationships from raw text.

When Graphiti is the right answer — long-lived agents where facts about the same entity change over time (a CRM assistant, a personal trainer, a domain expert), and where wrong-time-period answers are user-visible failures. Less obviously right when most of your memory is one-shot factoids with no temporal structure.

Mem0: vector-first, managed by default

Mem0 takes the opposite starting point. The default store is a vector database with key-value metadata; extraction, conflict detection and retrieval are handled by a small pipeline that the user does not have to assemble. There is a managed offering at mem0.ai and an open-source library that mirrors most of its capabilities.

The numbers Mem0 publishes are aggressive: 91.6 on LoCoMo, 93.4 on LongMemEval, around 200ms p95 retrieval latency, and roughly 7,000 tokens per retrieval versus 25,000+ for a full-context approach. The 3–4× token cost reduction is the real headline for production deployments — when an agent talks to a user for thousands of turns, the savings compound.
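The arithmetic behind that headline can be checked in a few lines. The two token counts are the figures quoted above (25,000 is a lower bound); the assumption of exactly one retrieval per turn and the helper tokens_saved are ours.

```python
FULL_CONTEXT_TOKENS = 25_000  # lower bound quoted for a full-context approach
MEMORY_TOKENS = 7_000         # approximate tokens per Mem0 retrieval

ratio = FULL_CONTEXT_TOKENS / MEMORY_TOKENS
print(f"reduction: {ratio:.1f}x")  # reduction: 3.6x


def tokens_saved(turns: int) -> int:
    """Input tokens saved over `turns` turns, assuming one retrieval per turn."""
    return turns * (FULL_CONTEXT_TOKENS - MEMORY_TOKENS)


print(tokens_saved(1_000))  # 18000000
```

At a thousand turns that is roughly 18 million input tokens not sent to the model, before any pricing is applied.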

Mem0 also ships a graph-augmented variant called Mem0g, which builds a directed labelled knowledge graph alongside the vector store during extraction. On multi-hop questions Mem0g scores 68.4% on the LLM-as-a-judge metric versus 66.9% for vanilla Mem0. The improvement is real but smaller than the gap Graphiti achieves on temporal reasoning, because Mem0g treats the graph as a retrieval booster rather than as the primary data model.

The tradeoff Mem0 makes is in temporal precision. Conflicts get detected and the newer fact wins, but the history is not bi-temporal by default — you lose the ability to answer "what did the system believe last month?" cleanly. For most consumer assistants this is acceptable. For an agent that has to justify its decisions to a human reviewer or a compliance check, it is not.
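The difference is easy to see in a toy model. The sketch below contrasts a newest-wins store with an append-only log; it is a generic illustration, not Mem0's implementation, and the key "user.diet", the helper names and the dates are hypothetical.

```python
from datetime import datetime
from typing import Optional

# Newest-wins store: one value per key, so earlier beliefs are overwritten.
latest: dict[str, str] = {}

# Append-only log of (key, value, recorded_at), kept in ingestion order.
log: list[tuple[str, str, datetime]] = []


def remember(key: str, value: str, recorded_at: datetime) -> None:
    """Record a fact in both stores so the two behaviours can be compared."""
    latest[key] = value
    log.append((key, value, recorded_at))


def believed_at(key: str, when: datetime) -> Optional[str]:
    """Value the system held for `key` at time `when`, or None if unknown then."""
    answer = None
    for k, value, recorded_at in log:
        if k == key and recorded_at <= when:
            answer = value
    return answer


remember("user.diet", "vegetarian", datetime(2026, 3, 1))
remember("user.diet", "vegan", datetime(2026, 5, 1))

print(latest["user.diet"])                              # vegan
print(believed_at("user.diet", datetime(2026, 4, 15)))  # vegetarian
```

Only the log can answer the reviewer's question about mid-April; the newest-wins dictionary has already forgotten that the older value existed.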

When Mem0 is the right answer — personalisation agents, support bots, anything where the goal is "remember the user across sessions and retrieve relevant context fast" and you do not want to run a graph database. Also the right pick when you need a managed service rather than infrastructure to maintain.

The choice between them, in one sentence

Pick Graphiti when the agent needs to reason about how facts changed over time. Pick Mem0 when the agent needs to retrieve the right context cheaply at scale. Both teams know this: the Mem0 team has openly acknowledged that Zep scores higher on temporal reasoning, and the Zep team has acknowledged that Mem0 wins on token efficiency and ecosystem breadth.

If you need both, you can layer them: Mem0 as the working memory for the active conversation, Graphiti as the long-term fact store the agent queries when it needs grounded history. We have not seen a publicly documented production deployment that runs both yet, but the architecture is straightforward and we expect it to appear within the next quarter.
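As a rough sketch of that layering, under the assumption that both layers can be hidden behind one small search interface: MemoryLayer, InMemoryLayer and build_context are hypothetical names, and the in-memory class is a stand-in for real backends, not the Mem0 or Graphiti client API.

```python
from typing import Protocol


class MemoryLayer(Protocol):
    def search(self, query: str, limit: int) -> list[str]: ...


class InMemoryLayer:
    """Stand-in for a real backend: word-in-string match over stored items."""

    def __init__(self, items: list[str]) -> None:
        self.items = items

    def search(self, query: str, limit: int) -> list[str]:
        words = query.lower().split()
        hits = [i for i in self.items if any(w in i.lower() for w in words)]
        return hits[:limit]


def build_context(query: str, working: MemoryLayer, long_term: MemoryLayer,
                  needs_history: bool, limit: int = 3) -> list[str]:
    """Always read working memory; consult the fact store only when asked to."""
    context = working.search(query, limit)
    if needs_history:
        context += [f for f in long_term.search(query, limit) if f not in context]
    return context


working = InMemoryLayer(["User prefers email over phone"])
long_term = InMemoryLayer([
    "2024-09: account plan was Starter",
    "2025-11: account plan changed to Pro",
])

print(build_context("plan", working, long_term, needs_history=False))  # []
print(build_context("plan", working, long_term, needs_history=True))
```

The design choice worth noting is the needs_history flag: the cheaper layer is read on every turn, and the graph-backed layer is only paid for when the question actually depends on how a fact changed.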

Letta: the agent decides what to remember

Letta (the framework formerly known as MemGPT) approaches the problem from a different angle. Instead of an extraction pipeline that builds memory automatically, Letta gives the agent tools to manage its own memory — three explicit tiers borrowed from operating-system design: core memory in the context window (RAM), recall memory as searchable conversation history (disk cache), and archival memory as cold long-term storage queried by tool call.

The agent uses tool calls to read, write, edit and consolidate across the tiers. On LongMemEval, Letta reaches roughly 83.2% overall, which is competitive with the best graph-based systems on tasks that reward judgment rather than retrieval accuracy.
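The three tiers can be sketched as a small class whose methods stand in for the tools the agent would call. This is a hypothetical illustration, not Letta's API, and the automatic oldest-first eviction is a simplification: in the design described above, the agent itself decides what to move between tiers.

```python
from collections import deque


class TieredMemory:
    """Three memory tiers exposed as tool-like methods (hypothetical sketch)."""

    def __init__(self, core_limit: int = 2) -> None:
        self.core: dict[str, str] = {}     # always in the prompt (RAM)
        self.recall: deque[str] = deque()  # searchable conversation history
        self.archival: list[str] = []      # cold storage, reached by tool call
        self.core_limit = core_limit

    def core_write(self, key: str, value: str) -> None:
        """Write to core memory; move the oldest entry to archival when full."""
        if key not in self.core and len(self.core) >= self.core_limit:
            oldest = next(iter(self.core))  # dicts preserve insertion order
            self.archival.append(f"{oldest}: {self.core.pop(oldest)}")
        self.core[key] = value

    def recall_search(self, text: str) -> list[str]:
        """Case-insensitive substring search over conversation history."""
        return [m for m in self.recall if text.lower() in m.lower()]

    def archival_search(self, text: str) -> list[str]:
        """Case-insensitive substring search over cold storage."""
        return [m for m in self.archival if text.lower() in m.lower()]


mem = TieredMemory(core_limit=2)
mem.recall.append("user: I moved to Lisbon last week")
mem.core_write("name", "Alice")
mem.core_write("city", "Lisbon")
mem.core_write("goal", "run a marathon")  # core is full, so "name" is archived

print(sorted(mem.core))              # ['city', 'goal']
print(mem.archival_search("alice"))  # ['name: Alice']
```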

The honest weakness: memory quality is now entirely a function of the model's judgment. If the model writes a wrong note to core memory, future reasoning compounds the error. Letta is the right pick when your agent is long-running, runs on a capable model, and can be trusted to curate itself. It is the wrong pick when you need deterministic recall guarantees.

What is coming next, worth tracking

Four directions are worth research time before the end of the year.

Titans, and test-time memorization at the architecture level. Titans (Google Research) builds memory directly into the transformer, using a "surprise" metric to decide which new tokens to commit to a neural memory module during inference. It scales to 2M+ context with higher needle-in-haystack accuracy than long-context baselines, and it does not need an external store. If a successor architecture ships in a frontier model with native long-term memory, the entire external-memory layer becomes optional for many use cases.

MemOS, and memory as a schedulable resource. MemOS proposes a memory operating system that unifies three memory types under a single abstraction (called MemCube): plaintext memory, activation memory (KV-cache states), and parametric memory (weights). The interesting claim is that memory should be a schedulable resource the agent allocates and evicts deliberately, not a passive store. This is the cleanest path we have seen toward agents that consciously trade context-window space for retrieval cost.

Procedural memory. Most current systems handle episodic memory (what happened) and semantic memory (what is known). Almost none handle procedural memory cleanly — "how to do this kind of task," reusable across sessions. Letta's self-editing comes closest, but the field is wide open. Expect the first dedicated procedural-memory libraries within twelve months.

Multi-graph and self-evolving memory. Recent papers — MAGMA (multi-graph agentic memory), MemRL (reinforcement-learned episodic memory), Agentic Memory (unified long/short-term management) — point at memory systems that learn their own retrieval policy rather than relying on a fixed extraction pipeline. None are production-ready today; all are worth reading.

Where this lands for builders today

If you are shipping an agent in the next quarter, pick Mem0 or Graphiti based on whether your hard problem is token economics or temporal reasoning. If you are doing personalisation and you want to move fast, Mem0's managed service is the path of least resistance. If you are doing anything where time matters — finance, healthcare, regulated workflows, long-running operations — start with Graphiti and accept the operational overhead.

Treat Letta as a serious option for agents you trust to self-manage, and run the LongMemEval and LoCoMo benchmarks against your own data before committing. The numbers in the papers are real but they are not your numbers.

Track Titans, MemOS, and procedural-memory research as the things that might make your current architectural decision obsolete in 2027. None are production-ready in May 2026; all are credible enough that the call on "external memory layer vs native model memory" is worth revisiting in twelve months.
