Module 6.1
Agent Memory Systems — Deep Dive
A stateless chatbot resets on every session. It doesn't know you, doesn't remember what you discussed last week, and can't improve its behavior based on past interactions. That's fine for a FAQ bot. It's not acceptable for a personal assistant, a business agent, or anything that's supposed to get better over time.
The difference between a stateless agent and one that genuinely knows you is memory architecture. Phase 0 introduced the three memory types at a high level. This module goes deep on how to actually build them.
The Four Memory Types
In-context memory is whatever is currently in the model's active context window. It's fast and exact — the model has perfect recall of everything in its window. The limitations are hard: it disappears at session end, and it's capped by the context window size. An agent relying purely on in-context memory resets to zero every conversation. Nothing carries forward.
Semantic memory is the agent's general knowledge base — facts, domain information, and concepts stored externally (typically in a vector database). Unlike in-context memory, semantic memory is permanent and grows over time. When you build a RAG system, you're giving your agent semantic memory. The contents are stable and reusable across sessions and users.
Episodic memory captures specific past interactions: what happened, when, with whom, and what was decided. Think of it as the agent's diary. "On February 10th, the user asked about Q1 budget and decided to extend the deadline to March 15th." Episodic memory is time-stamped and personal — it's what enables the agent to say "as we discussed last week..." and mean it.
Procedural memory encodes how to do things — learned patterns, routines, and user-specific habits. "This user prefers concise bullet-point summaries." "This user's meetings are always in the morning." Procedural memory improves execution quality over repeated interactions; it's the layer where agents genuinely get better at serving a specific person or use case.
Why it matters for agents specifically: Most production agents in 2025 use only in-context memory (the current session) and semantic memory (a RAG knowledge base). Episodic and procedural memory are the missing layers that separate a capable agent from one that truly knows you. The difference is enormous in practice — and the tooling to build it properly now exists.
mem0 — The Dedicated Memory Layer
mem0 is the most purpose-built framework for agent memory. It's an open-source universal memory layer that sits between your application and any LLM — works with Claude, OpenAI, Ollama, or anything else. It raised $24M in October 2025. The core claim: 90%+ token savings compared to context-dumping (the naive approach of injecting all past conversation history into every prompt) and 91% lower latency.
The key mental model: mem0 treats memory as an active system, not a log. It doesn't just store everything. It extracts what's worth keeping, consolidates redundant entries, and decays low-relevance memories over time. The result is a compact, high-signal memory store rather than an ever-growing pile of conversation history.
How the architecture works:
- After each interaction, mem0 runs priority scoring — it decides what information from the exchange is actually worth storing
- New memories are checked against existing ones — if they overlap, they're consolidated rather than duplicated
- Stored memories are organized into three tiers: user memory (persists across all conversations with this user), session memory (current conversation), and agent memory (agent-specific state, not user-specific)
- Older memories decay in relevance weight over time unless they're reinforced by repeated relevance
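The decay behavior in the last bullet can be sketched as exponential time decay offset by reinforcement. This is a hypothetical scoring rule for illustration, not mem0's actual internals:

```python
import math
import time

def relevance_weight(base_score: float, created_at: float,
                     reinforcements: int, half_life_days: float = 30.0) -> float:
    # Exponential time decay with a reinforcement boost: repeatedly
    # relevant memories keep their weight longer than untouched ones.
    age_days = (time.time() - created_at) / 86400
    decay = math.exp(-math.log(2) * age_days / half_life_days)
    boost = 1.0 + 0.2 * reinforcements  # repeated relevance slows effective decay
    return base_score * min(1.0, decay * boost)

# A 60-day-old memory with no reinforcement falls to ~25% of its base score
stale = relevance_weight(1.0, time.time() - 60 * 86400, reinforcements=0)
fresh = relevance_weight(1.0, time.time(), reinforcements=0)
```

With a 30-day half-life, a 60-day-old memory retains about a quarter of its weight unless reinforced.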
```
pip install mem0ai
```

```python
from mem0 import Memory

memory = Memory()

def chat_with_memory(message: str, user_id: str) -> str:
    # Retrieve relevant memories before calling the LLM
    relevant = memory.search(query=message, user_id=user_id, limit=3)

    # Inject relevant memories into context
    memory_context = "\n".join(
        f"- {m['memory']}" for m in relevant.get("results", [])
    )
    system = f"You are a helpful assistant.\n\nRelevant memories:\n{memory_context}"

    # Call your LLM (Claude, OpenAI, etc.) with memory-enriched context
    response = call_llm(system=system, user_message=message)

    # Store this interaction for future sessions
    memory.add(
        [{"role": "user", "content": message},
         {"role": "assistant", "content": response}],
        user_id=user_id,
    )
    return response
```

The result: the agent's context window stays small (only relevant memories are injected, not the full history), costs drop dramatically, and the agent accumulates genuine knowledge about the user across sessions.
Performance benchmarks from the mem0 paper: a 26% accuracy improvement over baseline LLMs on user-specific tasks, roughly 90% token savings over context-dumping approaches, and 91% lower latency.
LangChain Memory Classes
If you're already using LangChain, it has built-in memory abstractions — less sophisticated than mem0 but simpler to wire into an existing chain.
ConversationBufferMemory stores the full conversation history verbatim and injects it on every turn. Simple. Accurate. Doesn't scale — memory grows unbounded and eventually exceeds the context window.
VectorStoreRetrieverMemory embeds past messages and stores them in a vector database. When a new query arrives, it retrieves semantically similar past exchanges. Scales much better; trades exact recall for approximate relevance matching.
CombinedMemory merges both — buffer memory for current-session accuracy, vector store memory for long-term retrieval. This is the production pattern: keep recent messages in the buffer, archive older messages to the vector store.
```python
from langchain.memory import (
    ConversationBufferWindowMemory,
    VectorStoreRetrieverMemory,
    CombinedMemory,
)
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Long-term memory (vector store)
vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
long_term = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

# Short-term memory (last 5 turns; the windowed variant takes k,
# plain ConversationBufferMemory keeps the full history)
short_term = ConversationBufferWindowMemory(memory_key="recent_history", k=5)

# Combine them
memory = CombinedMemory(memories=[short_term, long_term])
```

The limitation vs mem0: LangChain memory classes don't do intelligent filtering, consolidation, or forgetting. You manage the strategy yourself.
Cross-Session Persistence: What to Store, What to Forget, How to Retrieve
The design decisions matter enormously here.
Store:
- User preferences and habits ("prefers bullet points," "works in PST timezone")
- Completed tasks and decisions ("extended Q1 deadline to March 15th")
- Summarized episodic events (the summary, not the full transcript)
- Stable facts ("user's company is Acme Corp," "primary contact is Sarah at Engineering")
Forget:
- Transient session state (what the user said 3 turns ago in a session that ended last month)
- Superseded decisions ("old meeting time" once a new one is set)
- Noise and off-topic remarks
- PII that violates compliance requirements — anonymize before storage
Retrieve:
- Semantic search (vector embeddings) for "what did we discuss about X?"
- Metadata queries for structured facts ("what's this user's timezone?")
- Recency weighting — recent memories should rank higher than old ones for the same relevance score
Gotchas
Memory bloat is the failure mode of storing everything without filtering. The retriever starts returning noise, the agent starts making decisions based on irrelevant old memories, and token costs climb back up despite using a memory system. Fix: priority scoring and intelligent extraction — mem0 handles this automatically; with LangChain you implement it manually.
Retrieval failures happen when embedding quality degrades or the similarity threshold is wrong. The agent asks for information the user already provided. Decisions become inconsistent because different memories surface on different runs. Fix: tune your embedding model and retrieval parameters against real user queries.
Privacy is non-negotiable. Memories are user-specific and frequently contain PII — names, emails, financial details, health information, preferences that reveal sensitive patterns. Every memory store needs encryption at rest, anonymization for sensitive fields, and user consent controls. "The agent remembers everything" is a privacy liability if not handled correctly.
Scaling cost: as the memory store grows, retrieval latency increases. Use hot/cold storage tiers — recent and high-relevance memories in fast storage (Redis), older memories in cheaper storage (object store + periodic re-embedding).
LangGraph Native Memory — Checkpointers and the Memory Store
LangGraph ships its own memory primitives, and they're worth knowing because they integrate directly into the graph execution model without requiring a separate service.
Checkpointers handle short-to-medium term state — the conversation history and any accumulated state during a multi-step run. Three backends, choose based on deployment context:
```python
import os

from langgraph.checkpoint.memory import MemorySaver               # in-process, dev only
from langgraph.checkpoint.sqlite import SqliteSaver               # disk-backed, single server
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver  # distributed, production

# In recent versions, from_conn_string returns an async context manager
async with AsyncPostgresSaver.from_conn_string(os.environ["DATABASE_URL"]) as checkpointer:
    graph = builder.compile(checkpointer=checkpointer)

    # Each conversation gets a thread_id — state is isolated per thread
    config = {"configurable": {"thread_id": user_id}}
    result = await graph.ainvoke({"messages": [...]}, config=config)
```

The LangGraph Memory Store is a separate primitive for cross-thread, long-term memory — exactly the episodic and semantic layer described above. It's a key-value store with namespacing and semantic search, and it persists independently of conversation threads.
```python
import os

from langgraph.store.memory import InMemoryStore             # dev / testing
from langgraph.store.postgres.aio import AsyncPostgresStore  # production

async with AsyncPostgresStore.from_conn_string(os.environ["DATABASE_URL"]) as store:
    # Save a memory (namespace, key, value)
    await store.aput(
        ("user", user_id, "preferences"),
        "communication_style",
        {"style": "bullet_points", "detail_level": "concise"},
    )

    # Retrieve — supports semantic search when an embedding fn is configured
    memories = await store.asearch(
        ("user", user_id),  # namespace prefix is passed positionally
        query="how does this user like to receive information",
        limit=3,
    )
```

The key insight: checkpointer = session state, store = long-term knowledge. Use both together — checkpointer keeps the current thread's history, store accumulates what's worth keeping across sessions.
The Production Memory Architecture
For a real production agent, the full picture looks like this:
User sends message
↓
1. Load relevant long-term memories (store.asearch)
2. Load current thread history (checkpointer)
3. Build context: [system prompt + memories + thread history + new message]
4. Run LLM
5. Execute tool calls
6. Update thread state (checkpointer auto-handles)
7. Extract worth-keeping insights
8. Save to long-term store (store.aput)
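The eight steps can be sketched end to end with stand-in functions. Everything here is an illustrative stub — plain dicts in place of a real store and checkpointer, and a placeholder in place of the LLM call:

```python
def handle_message(user_id: str, thread_id: str, message: str,
                   store: dict, threads: dict) -> str:
    # `store` maps user_id -> long-term memory strings; `threads` maps
    # thread_id -> session history. Both are stand-ins for real backends.
    # 1. Load relevant long-term memories (stand-in for store.asearch)
    memories = store.get(user_id, [])[-3:]
    # 2. Load current thread history (stand-in for the checkpointer)
    history = threads.setdefault(thread_id, [])
    # 3. Build memory-enriched context for the model
    system = "Relevant memories:\n" + "\n".join(f"- {m}" for m in memories)
    # 4-5. Run the LLM and any tool calls (stubbed)
    response = f"[reply to: {message}]"
    # 6. Update thread state (a real checkpointer does this automatically)
    history.extend([("user", message), ("assistant", response)])
    # 7-8. Extract worth-keeping insights and save them (naive placeholder rule)
    if "prefer" in message.lower():
        store.setdefault(user_id, []).append(message)
    return response

store, threads = {}, {}
handle_message("u42", "t1", "I prefer bullet points", store, threads)
```

After this turn, `threads["t1"]` holds the session history and `store["u42"]` holds the extracted preference — the split that steps 6 and 8 describe.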
Steps 7–8 are where mem0 earns its keep: it decides what's worth saving and deduplicates automatically. Without that logic, you manually implement the extraction and deduplication — which is doable but non-trivial.
The sweet spot for most agents: LangGraph checkpointer for session continuity + mem0 for cross-session learning. They're complementary, not competing.
Sources
- mem0 — GitHub Repository
- mem0 — Research Paper (arXiv)
- MongoDB — Powering Long-Term Memory for Agents with LangGraph
- arXiv — Episodic Memory is the Missing Piece for Long-Term LLM Agents
- Tribe AI — Context-Aware Memory Systems 2025
- LangGraph — Memory Store Documentation