Technical Deep Dive

Agent Memory: From Stateless to Stateful AI

LLMs are stateless by design. Agents require state. The memory architectures—context management, vector stores, knowledge graphs—that transform amnesiacs into collaborators.

MMNTM Research Team
12 min read
#AI Agents · #Memory · #Vector Database · #Knowledge Graph · #Architecture

What is Agent Memory?

Agent memory is the architecture that transforms stateless LLMs into persistent collaborators using four memory types: working (context window), episodic (timestamped conversation logs in vector databases), semantic (facts in knowledge graphs), and procedural (tool definitions). Without memory, agents suffer digital anterograde amnesia—forcing users to re-declare context every session. Production systems use hybrid retrieval (vector + graph + keyword), MemGPT-style paging, and reflection for self-improvement without retraining.



The Amnesiac Genius

The Transformer architecture that powers every modern LLM is a stateless inference engine. It processes an input sequence, predicts the next token, and resets. No residual activation. No hidden state. No memory of the previous call.

In computer science terms, an LLM is a pure function: given the same input and random seed, it produces the same output, completely unaffected by what happened before.

This architectural purity—essential for training stability and generalized reasoning—creates a profound problem when these models are deployed as agents. An agent, by definition, perceives its environment, reasons, acts, and learns from the consequences. Without persistence, an AI agent suffers digital anterograde amnesia. It meets you for the first time at the start of every session.

Memory architecture is one of the core patterns explored in The Agent Thesis.

The Operational Cost of Forgetting

The consequences extend far beyond user inconvenience:

Workflow Fragmentation

Consider a coding agent refactoring a legacy codebase. If it can't persist the architectural constraints from turn one ("Use Repository pattern," "No circular dependencies"), by turn ten it's generating code that violates those constraints. The user re-injects context repeatedly, burning tokens and patience.

Lost Personalization

A financial advisor that can't remember your risk tolerance from three months ago forces you to re-declare your financial identity every session. This friction drives churn. Users expect relationships that evolve. Stateless agents offer only transactional utility.

Circular Reasoning

Advanced reasoning requires maintaining investigation state across turns. When an agent can't persist intermediate findings—hypotheses generated, paths explored—it loses momentum. In root cause analysis, this means re-investigating cleared suspects because the record of previous deductions is gone.

The Business Case for Memory

The math is straightforward. If a user repeats a 500-token project description ten times, that's 5,000 redundant input tokens. At scale, this creates linear cost growth that stateful architectures avoid.

More strategically: memory is a moat. A system that remembers preferences, history, and constraints accumulates switching costs. A stateless agent is trivially replaceable by any model with similar reasoning capability.

Memory architecture is not a feature. It's a strategic asset.

The Memory Taxonomy

To bridge stateless models and stateful behavior, AI architects borrowed from cognitive psychology—mapping biological memory types to software components.

| Memory Type | Biological Function | Engineering Implementation | Retrieval Trigger |
|---|---|---|---|
| Working | Active processing | Context window | Immediate attention |
| Episodic | Event recollection | Vector database (logs) | Similarity + time decay |
| Semantic | Fact retention | Knowledge graph / RAG | Entity linking |
| Procedural | Skill execution | Tool registry / code | Intent classification |

Working Memory is the context window—the model's immediate attention span. Finite and volatile, but high-fidelity. The model can attend to it directly via self-attention.

Episodic Memory stores sequences of experience. "We discussed API authentication last Tuesday." Implemented as timestamped, embedded conversation logs in vector databases.

Semantic Memory stores facts decoupled from when they were learned. "Python is a programming language." "Company policy forbids Level 3 data on public clouds." Often implemented as knowledge graphs or structured document collections.

Procedural Memory is the knowledge of how—tool definitions, few-shot examples, executable capabilities. Currently mostly static (developer-defined), though research on self-evolving agents suggests this may change.

Context Window: The Art of Scarcity

The context window is working memory—and it's expensive. Despite models with 128K or 1M token windows, filling them introduces latency and cost. Worse, models exhibit the "Lost in the Middle" phenomenon: retrieval accuracy for information in the middle of large contexts degrades significantly compared to the beginning or end.

Managing this finite resource is an optimization problem: maximize relevance while minimizing tokens.

Strategy 1: Sliding Window

The simplest approach—a FIFO queue of recent messages. As new turns enter, oldest are evicted.

Problem: "Context rot." Critical instructions from the beginning get discarded. The agent loses the conversation's original goal—the goldfish effect. This management challenge is amplified by the gap between claimed and effective context windows—The Context Window Race reveals how 10M token windows can collapse to ~1K effective tokens in practice.

Strategy 2: Recursive Summarization

Compress old history into narrative summaries. When the buffer fills, an LLM summarizes the oldest segment.

Trade-off: This enables "infinite" conversations in theory. But summarization is lossy compression. Specific code snippets or error logs become "User provided error logs"—useless for debugging.
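A sketch of the compaction step, with `llm_summarize` standing in for whatever completion call you use:

```python
def compact_history(turns: list[dict], llm_summarize, max_turns: int = 20) -> list[dict]:
    """Fold the oldest half of the buffer into a single summary turn once it overflows.

    `llm_summarize` is a placeholder: it takes a transcript string and returns
    a short narrative summary.
    """
    if len(turns) <= max_turns:
        return turns

    old, recent = turns[: len(turns) // 2], turns[len(turns) // 2:]
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in old)
    summary = llm_summarize(f"Summarize this conversation segment:\n{transcript}")

    # The summary replaces the old turns; specifics (code, logs) are lost at this point.
    return [{"role": "system", "content": f"Earlier conversation summary: {summary}"}] + recent
```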

Strategy 3: MemGPT (OS-Style Management)

Treat context management as an operating system capability. The context window is RAM; external stores are disk. The LLM manages data flow between them.

Through function calling, the agent issues commands like core_memory_append to save critical facts, or archival_memory_search to page data back into context.

Impact: The model decides what's important enough to keep in focus. This breaks dependency on heuristic buffers and enables coherence over significantly longer horizons.
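A sketch of how those memory operations can be exposed as tools. The two tool names come from the text above; the OpenAI-style schema and the dispatch wiring are illustrative assumptions:

```python
# Tool schemas exposed to the model (function-calling format).
MEMORY_TOOLS = [
    {
        "name": "core_memory_append",
        "description": "Append a critical fact to always-in-context core memory.",
        "parameters": {
            "type": "object",
            "properties": {"fact": {"type": "string"}},
            "required": ["fact"],
        },
    },
    {
        "name": "archival_memory_search",
        "description": "Search long-term archival storage and page results back into context.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]

def dispatch(tool_call: dict, core_memory: list[str], archive):
    """Route the model's tool calls: core memory is 'RAM', the archive is 'disk'."""
    if tool_call["name"] == "core_memory_append":
        core_memory.append(tool_call["arguments"]["fact"])
        return "ok"
    if tool_call["name"] == "archival_memory_search":
        return archive.search(tool_call["arguments"]["query"])  # any vector store works here
```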

Token Budget Allocation

Production systems enforce strict partitions:

| Allocation | Tokens | Purpose |
|---|---|---|
| System prompt | 1,000 | Instructions, persona, tools |
| Episodic history | 2,000 | Recent conversation turns |
| Retrieved context | 3,000 | RAG chunks |
| Output buffer | 2,000 | CoT reasoning + response |

This ensures retrieved knowledge doesn't crowd out system instructions—which can cause behavioral degradation or jailbreaks.
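A minimal enforcement sketch, assuming the same rough characters-per-token heuristic as above; budget values mirror the table:

```python
# Each section is truncated to its partition before assembly, so retrieved
# chunks can never crowd out the system prompt.
BUDGET = {"system": 1000, "history": 2000, "retrieved": 3000}  # output buffer reserved separately

def truncate(text: str, max_tokens: int) -> str:
    return text[: max_tokens * 4]  # rough 4-chars-per-token heuristic

def assemble_prompt(system: str, history: str, retrieved: str) -> str:
    return "\n\n".join([
        truncate(system, BUDGET["system"]),
        truncate(retrieved, BUDGET["retrieved"]),
        truncate(history, BUDGET["history"]),
    ])
```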

Vector Memory: The Mechanics of Recall

For episodic and semantic memory, the dominant pattern is the vector database. Text is embedded into high-dimensional vectors; retrieval is similarity search in that space.

The Embedding Pipeline

  1. Embed: Map text to vectors (OpenAI text-embedding-3-small, open-source alternatives)
  2. Store: Index vectors with metadata (timestamps, user IDs, source)
  3. Query: Embed the query, find nearest neighbors
  4. Return: Retrieve top-k chunks for context injection

The vector space is constructed so semantically similar text clusters geometrically. Distance (typically cosine similarity) proxies for relevance.
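A sketch of the full pipeline using the OpenAI embeddings client and an in-memory index standing in for a vector database; the chunk texts and the choice of top-k are illustrative:

```python
import numpy as np
from openai import OpenAI  # assumes the openai>=1.x client; any embedding model works here

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1-2. Embed and store (here: a numpy matrix instead of a real vector database)
chunks = ["We agreed to use the Repository pattern.", "Deploy target is eu-west-1."]
index = embed(chunks)

# 3-4. Query: embed the question, rank by cosine similarity, return top-k chunks
def search(query: str, k: int = 2) -> list[str]:
    q = embed([query])[0]
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]
```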

Chunking: The Forgotten Step

Before embedding, continuous text must be segmented into chunks. This choice determines retrieval quality more than most realize.

Fixed-size chunking (every 512 tokens) is naive. It cuts sentences mid-thought, separates questions from answers, destroys semantic integrity.

Semantic chunking splits at natural boundaries—sentences, paragraphs, topic shifts. Each chunk represents a coherent thought.

Propositional chunking (state-of-the-art) uses an LLM to rewrite text into atomic, standalone propositions. "He went to the store" becomes "John Smith went to the grocery store." This decontextualization ensures retrievability without surrounding text.
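A sketch of the semantic-chunking middle ground: split at paragraph boundaries and pack paragraphs up to a size limit, never cutting mid-sentence. The size limit is an illustrative default:

```python
import re

def semantic_chunks(text: str, max_chars: int = 1500) -> list[str]:
    """Split at paragraph boundaries, packing paragraphs until a size limit.

    A fuller pipeline would also split on topic shifts (e.g., embedding distance
    between adjacent sentences); this sketch only respects paragraph boundaries.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```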

Vector Database Landscape

| Feature | Pinecone | Weaviate | Qdrant | pgvector |
|---|---|---|---|---|
| Type | Managed SaaS | Open source | Open source | Postgres extension |
| Latency | <100ms | Low | Very low | Moderate |
| Best For | Zero-ops enterprise | Hybrid search | High throughput | Existing Postgres |

Qdrant excels at payload-based filtering—complex queries like "memories about Python, created by User_123, in the last 30 days."

pgvector appeals to teams wanting simplicity—co-locate vectors with relational data, simplify backups. May lag at massive scale (>100M vectors).
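A sketch of the kind of payload-filtered query described above, using the qdrant-client Python API; the collection name, payload field names, and URL are assumptions, and the query vector is a placeholder:

```python
import time
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

client = QdrantClient(url="http://localhost:6333")  # local instance; URL is illustrative

query_embedding = [0.0] * 1536  # placeholder; produce this with your embedding model
thirty_days_ago = time.time() - 30 * 24 * 3600

hits = client.search(
    collection_name="memories",                      # collection and field names are illustrative
    query_vector=query_embedding,
    query_filter=Filter(must=[
        FieldCondition(key="topic", match=MatchValue(value="python")),
        FieldCondition(key="user_id", match=MatchValue(value="User_123")),
        FieldCondition(key="created_at", range=Range(gte=thirty_days_ago)),
    ]),
    limit=5,
)
```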

Beyond Cosine: Hybrid Search and Re-ranking

Raw vector search returns false positives—chunks that are semantically similar but factually irrelevant.

Production systems use multi-stage retrieval:

  1. Retrieve: Top 20-50 candidates via cosine similarity
  2. Hybrid: Combine with keyword (BM25) results. Catches proper nouns that embeddings miss ("Project Hades")
  3. Re-rank: Pass combined list to a cross-encoder model (Cohere Rerank) for precise relevance scoring
  4. Select: Top 3-5 chunks with highest re-ranked scores

This pipeline significantly reduces hallucinations from irrelevant context pollution.
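A sketch of the four stages, assuming `vector_search` and `keyword_search` placeholders for your dense and BM25 retrievers; the cross-encoder model name is an illustrative open-source stand-in for a hosted re-ranker like Cohere Rerank:

```python
from sentence_transformers import CrossEncoder  # any cross-encoder re-ranker works here

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # model choice is illustrative

def retrieve(query: str, vector_search, keyword_search, k: int = 5) -> list[str]:
    """Multi-stage retrieval: broad recall first, precise re-ranking second."""
    # 1-2. Retrieve a wide candidate pool and merge dense + sparse results (dedup, keep order)
    candidates = list(dict.fromkeys(vector_search(query, 30) + keyword_search(query, 30)))

    # 3. Score every (query, chunk) pair with the cross-encoder
    scores = reranker.predict([(query, c) for c in candidates])

    # 4. Keep only the top-k highest-scoring chunks for context injection
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:k]]
```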

Beyond Vectors: Structured Memory

Vector search treats the world as a bag of isolated concepts. It has no edges, no relationships. This breaks multi-hop reasoning.

If a user asks "How is the project manager related to the author of the compliance doc?", a vector store might retrieve both documents. But if the connection requires traversing Author → Department → Manager, the LLM must deduce the link. For complex graphs, this fails.

Knowledge Graphs (GraphRAG)

Structure memory as nodes (entities) and edges (relationships):

  1. Entity extraction: Background process analyzes text for Person, Organization, Project entities
  2. Relationship extraction: WORKS_FOR, LOCATED_IN, DISCUSSED edges
  3. Storage: Graph database (Neo4j, Memgraph)
  4. Retrieval: Query for subgraphs ("All entities within 2 hops of Project X")

This provides the LLM with a pre-validated map of relationships, dramatically reducing hallucination in structural reasoning.
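A sketch of the retrieval step (step 4) against Neo4j using its official Python driver; the node labels, property names, and connection details are assumptions about the graph schema:

```python
from neo4j import GraphDatabase  # official Neo4j Python driver; schema below is illustrative

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Pull the subgraph within 2 hops of a project entity, then hand the triples
# to the LLM as a pre-validated relationship map.
CYPHER = """
MATCH (p:Project {name: $name})-[r*1..2]-(e)
UNWIND r AS rel
RETURN startNode(rel).name AS source, type(rel) AS relation, endNode(rel).name AS target
"""

def project_subgraph(name: str) -> list[str]:
    with driver.session() as session:
        records = session.run(CYPHER, name=name)
        return [f"{rec['source']} -[{rec['relation']}]-> {rec['target']}" for rec in records]
```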

Hybrid Systems (Mem0)

Mem0 combines both approaches. When memory is added:

  • Simultaneously embedded (for vector search)
  • Processed for entity extraction (for graph storage)

During retrieval, vector search finds relevant nodes; graph traversal pulls connected context.

Critical innovation: Contradiction handling. If a user says "I moved to Berlin," the system detects conflict with "User lives in London." An LLM-based resolver updates the fact, maintaining consistent state rather than accumulating contradictory logs.

Performance: 26% higher accuracy in response generation compared to pure vector retrieval, primarily from better entity relationship handling.

For enterprises deploying agents at scale, unified data platforms solve the "hidden infrastructure" problem. Databricks integrates vector search directly into Unity Catalog (with automatic permission inheritance), provides Delta Live Tables for real-time ingestion, and offers Agent Bricks for building production agents on top of governed data. See Databricks Foundation for why this unified stack matters.

Retrieval Patterns: When to Remember

Naive systems retrieve on every turn. This is costly and introduces noise. Advanced agents use active retrieval:

Classifier-based: A lightweight model analyzes the query. "Hello" triggers no retrieval. "What did we decide about the database?" triggers search.

Query expansion: Don't just search the raw query. Generate permutations. For "that project," search "Project Alpha," "Q3 Objectives," "Marketing Plan." Higher recall, more latency.
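A sketch of both patterns, with `llm_classify` and `llm_rewrite` as placeholders for a lightweight model call; in production the retrieval gate would typically be a small fine-tuned classifier rather than a prompt:

```python
def needs_retrieval(message: str, llm_classify) -> bool:
    """Gate: only search memory when the message plausibly depends on prior context."""
    prompt = (
        "Does answering this message require recalling prior conversations or stored facts? "
        f"Answer yes or no.\n\nMessage: {message}"
    )
    return llm_classify(prompt).strip().lower().startswith("yes")

def expand_query(message: str, llm_rewrite) -> list[str]:
    """Query expansion: generate paraphrases/aliases to raise recall, at the cost of latency."""
    variants = llm_rewrite(f"Rewrite as 3 alternative search queries, one per line:\n{message}")
    return [message] + [v for v in variants.splitlines() if v.strip()]
```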

The Latency Budget

Retrieval adds to Time-To-First-Token:

| Step | Latency |
|---|---|
| Vector search | 20-50ms |
| Graph traversal | 50-200ms |
| Re-ranking | 200-500ms |
| Total added | 300-600ms |

For real-time chat with ~1-2 second total budget, re-ranking often pushes the limit. Production systems cache frequent queries or run retrieval asynchronously—loading user profile data at session start before the user types.

Memory Maintenance: Learning to Forget

A memory system that only adds data collapses under its own weight—slow, expensive, noisy.

Time-Based Decay

Implement scoring that penalizes older vectors:

Score = Similarity × e^(-λ × Time)

Recent information is prioritized unless older information is extremely relevant.
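A sketch of the decay formula with the rate expressed as a half-life (the 30-day default is illustrative), so a memory's score halves every half-life period:

```python
import math
import time

def decayed_score(similarity: float, created_at: float, half_life_days: float = 30.0) -> float:
    """Exponential time decay: score = similarity * exp(-lambda * age).

    lambda is derived from a half-life, so a 30-day-old memory scores half
    of an identical fresh one.
    """
    age_days = (time.time() - created_at) / 86400
    decay_rate = math.log(2) / half_life_days
    return similarity * math.exp(-decay_rate * age_days)
```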

Reflection (Generative Agents)

Stanford's Generative Agents research introduced reflection as maintenance. The agent periodically scans recent episodic memories and synthesizes high-level insights:

  • Observation 1: User asks for Python help
  • Observation 2: User asks for Django tutorials
  • Reflection: "User is a web developer learning Django"

The reflection is stored as new memory. Future queries match against this insight rather than scattered observations. This mimics how humans consolidate experience into wisdom.
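A sketch of the consolidation step, with `llm` and `store` as placeholders for a completion call and a memory store:

```python
def reflect(recent_memories: list[str], llm, store) -> str:
    """Synthesize scattered observations into one high-level insight and store it."""
    observations = "\n".join(f"- {m}" for m in recent_memories)
    insight = llm(
        "What high-level insight about the user can you infer from these observations? "
        f"Answer in one sentence.\n{observations}"
    )
    store.add(insight, metadata={"type": "reflection"})  # retrievable like any other memory
    return insight
```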

Reflexion (Self-Improvement)

Reflexion loops agent output back as critique:

  1. Act → Attempt task
  2. Evaluate → Check result
  3. Self-reflect → Generate verbal analysis ("I failed because I didn't handle empty lists")
  4. Store → Save reflection as future hint

On subsequent attempts, the agent retrieves its own critique. This improves benchmark performance without updating model weights—learning through memory rather than training.
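A sketch of the loop, with `act`, `evaluate`, `llm`, and `memory` as placeholders for the agent step, the success check, a completion call, and a memory store:

```python
def reflexion_loop(task: str, act, evaluate, llm, memory, max_attempts: int = 3):
    """Act -> evaluate -> self-reflect -> retry, carrying critiques forward as hints."""
    for attempt in range(max_attempts):
        hints = memory.search(task)                       # retrieve prior self-critiques
        result = act(task, hints=hints)                   # 1. attempt the task
        if evaluate(result):                              # 2. check the outcome
            return result
        critique = llm(                                   # 3. verbal self-reflection
            f"The attempt failed.\nTask: {task}\nResult: {result}\n"
            "In one sentence, explain what went wrong and how to avoid it next time."
        )
        memory.add(critique)                              # 4. store as a hint for the next try
    return None
```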

Framework Comparison

The tooling landscape offers distinct philosophies:

LangChain

LEGO blocks approach. Granular primitives: ConversationBufferMemory, ConversationSummaryMemory, ConversationKGMemory, VectorStoreRetrieverMemory.

Trade-off: Highly flexible but requires significant wiring. Powerful for prototyping; abstraction overhead in production.

LlamaIndex

Data-centric. Treats conversation history as another data source to index. Strong retrieval engines (recursive, auto-merging) that can be applied to chat logs.

ChatMemoryBuffer and ChatSummaryMemoryBuffer offer robust sliding window and summarization implementations.

Mem0

Specialized "Memory Layer" abstraction. Simple API (memory.add(), memory.search()) that handles entity extraction, embedding, and conflict resolution internally.

Architected specifically for user personalization—automatically organizes by user_id, consolidates facts into persistent personas.
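A minimal usage sketch of that API, assuming the mem0ai package with default configuration; exact defaults and return shapes vary by version:

```python
from mem0 import Memory  # assumes the mem0ai package; configuration defaults vary by version

memory = Memory()

# Adding a memory triggers extraction, embedding, and conflict resolution internally.
memory.add("I moved to Berlin last month.", user_id="user_123")

# Retrieval is scoped to the user; contradictory older facts should have been superseded.
results = memory.search("Where does the user live?", user_id="user_123")
```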

Zep

Sidecar service, not library. Runs as standalone server, ingests chat logs asynchronously, performs summarization and embedding in the background.

Benefit: Memory maintenance doesn't block the main agent loop. Response latency stays snappy while memory curates in near-real-time.

Production Reality

Multi-Tenancy

In B2B contexts, strict data isolation is non-negotiable:

Metadata filtering: Every vector tagged with tenant_id. Every query includes filter={tenant_id: "client_X"}.

Namespaces: Physical partitions (Pinecone). Queries in Namespace A cannot access Namespace B.

RBAC: Access control lists in vector metadata. Filter results based on user permissions ("only documents tagged Public or Dept_Engineering").
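A sketch combining all three controls in the data-access layer, so no caller can issue an unscoped query. The `vector_store` client and its filter syntax are assumptions (shown in a Pinecone-like metadata-filter form):

```python
class TenantScopedMemory:
    """Wraps a vector store so every write and query carries the tenant filter."""

    def __init__(self, vector_store, tenant_id: str):
        self.store = vector_store
        self.tenant_id = tenant_id

    def add(self, text: str, metadata: dict) -> None:
        self.store.add(text, metadata={**metadata, "tenant_id": self.tenant_id})

    def search(self, query: str, user_groups: list[str], k: int = 5):
        # RBAC: restrict to this tenant AND to documents the user's groups may see
        return self.store.search(
            query,
            filter={"tenant_id": self.tenant_id, "acl": {"$in": ["public", *user_groups]}},
            top_k=k,
        )
```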

Context Pollution

When RAG retrieves irrelevant chunks, the model gets confused. User asks "How do I fix the bug?" System retrieves documentation about a different bug. Model confidently provides the wrong fix.

Mitigations (a brief sketch of the first two follows the list):

  • Strict similarity thresholds (discard chunks < 0.75)
  • Context purification (secondary LLM reviews chunks for relevance)
  • Source citation (ground generation in specific evidence)
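A minimal sketch of threshold filtering plus optional purification, where `hits` are (chunk, similarity) pairs and `llm_judge` is a placeholder relevance checker:

```python
def filter_chunks(hits: list[tuple[str, float]], llm_judge=None, threshold: float = 0.75):
    """Drop low-similarity chunks; optionally let a secondary LLM veto off-topic ones."""
    kept = [(c, s) for c, s in hits if s >= threshold]
    if llm_judge:
        kept = [
            (c, s) for c, s in kept
            if llm_judge(f"Is this passage relevant to the user's question? yes/no\n{c}")
            .strip().lower().startswith("yes")
        ]
    return kept
```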

Cost Analysis

Memory is a cost center. For ~1,000 active users per month:

| Component | Monthly Cost |
|---|---|
| Vector storage (1GB) | $70-100 |
| Embeddings (1M tokens) | $0.10-0.20 |
| RAG retrieval (context injection) | $500-1,000 |
| Async maintenance | $50-100 |
| Total | $600-1,200 |

RAG retrieval dominates—every chunk injected incurs input token costs. At GPT-4 prices, this scales linearly with volume.

The Frontier

MemGPT: LLMs as Operating Systems

MemGPT proves that fixed context windows aren't a fundamental limitation if the model can page data. It enables processing massive datasets (entire legal case histories) by streaming data through the "processor" while maintaining state on "disk."

Generative Agents: Memory as Social Intelligence

Stanford's simulation gave agents episodic memory and reflection. The result: emergent social behavior—planning parties, sharing news, forming opinions—that was never explicitly programmed. Memory, not model size, enabled believable agency.

The Vision

The transition from stateless to stateful is the defining engineering challenge of agentic systems. It transforms the LLM from a brilliant amnesiac into a coherent, evolving collaborator.

The path forward:

  1. Tiered architecture: Context window for immediate reasoning, vector stores for episodes, graphs for structure
  2. Active management: Pruning, summarization, contradiction resolution—not just accumulation
  3. Hybrid retrieval: Vector + graph + keyword. No single approach captures human recall's nuance

Solve memory, and you unlock agents that don't just process commands but understand context, remember history, and grow with you over time.


See also: The Probabilistic Stack for engineering non-deterministic systems, Agent Failure Modes for what breaks without proper memory, and Agent Economics for the cost modeling that makes memory decisions concrete.
