AI Engineering Curriculum
Phase 2: Single AI Agent Development

Module 2.6

LangGraph

What is LangGraph?

LangGraph is a low-level library for building stateful, multi-step AI applications as explicit graphs. It lets you define exactly how an agent thinks, acts, and moves through a workflow — as nodes (steps) connected by edges (transitions), all sharing a common state object.

It was built by the LangChain team but is a fundamentally different tool. Where LangChain gives you pre-built abstractions, LangGraph gives you control. You're not calling create_agent() and letting the framework decide how the loop works — you're drawing the loop yourself.

LangGraph is what powers LangChain agents under the hood. When you use create_agent(), you're running a LangGraph graph without seeing it. When you use LangGraph directly, you get access to everything that create_agent() hides from you: how state flows between steps, when to branch, when to pause, when to loop.


Real-World Use Cases

These are real companies that chose LangGraph specifically because they needed more control than create_agent() could give them.

Uber — Large-scale code migration. Uber needed to migrate a massive codebase across their developer platform. They built a network of specialized LangGraph agents where each one owned a specific step: one reads and understands the old code, one generates the migrated version, one writes unit tests, one validates correctness. The explicit graph let them control exactly which agent ran when, retry failed steps without restarting the whole pipeline, and inspect state at every checkpoint. A task that would take engineers weeks runs autonomously.

LinkedIn — SQL Bot. LinkedIn's internal data team built an AI assistant where employees can ask data questions in plain English. The LangGraph workflow is: understand the question → find the relevant tables → write a SQL query → execute it → if there's an error, diagnose and fix → return the result. The key reason they used LangGraph: the "diagnose and fix" step is a loop — it has to potentially retry multiple times. You can't build a reliable retry loop with a straight LCEL chain.

Elastic — Real-time threat detection. Elastic orchestrates a network of security agents using LangGraph. When a threat signal arrives, a routing agent decides which specialist agents to dispatch — one checks IP reputation, one analyzes log patterns, one queries threat databases — and their findings are synthesized into a response recommendation. The parallel fan-out (running multiple security checks simultaneously) and the conditional routing (different threats trigger different agent combinations) are both things LangGraph handles natively.

AppFolio — Realm-X property management copilot. Property managers interact with a conversational agent to understand their portfolio, execute bulk actions, and get help with complex tasks. The agent often needs to pause mid-task and ask the manager a clarifying question before proceeding — then resume with the answer. That pause-and-resume pattern, where state persists through a human interaction, is exactly what LangGraph's checkpointing and human-in-the-loop features are designed for. The result: property managers save over 10 hours per week.

Replit — Live code generation. Replit uses LangGraph to power real-time code generation in their IDE. The agent generates code, runs it in a sandboxed environment, observes errors, fixes them, and loops until the code works. A live coding agent that can't loop and self-correct is just a text generator — LangGraph is what makes it an actual agent.

The pattern across all of these: LangGraph is the choice when the workflow is too complex for a straight loop — when you need branches, retries, parallelism, human checkpoints, or state that survives beyond a single session.


The Problem LangChain Can't Solve

create_agent() is excellent for the common case. But real production agents need more: the ability to branch, loop back based on results, pause for human approval, and survive crashes mid-task.

LangGraph makes all of this possible. It exposes the agent loop as an explicit, programmable graph — nodes, edges, and state — so you control exactly what happens, when, and in what order.

The key mental model: LangChain agents are a black box loop. LangGraph is that same loop, opened up and put on the table. Every step is visible. Every transition is yours to control.


The Graph Paradigm

LangGraph graphs are cyclic — they can loop. An agent that calls a tool, gets a result, decides to call another tool, and loops back requires a cycle. LCEL chains are DAGs — data flows one way and stops. Real agent behavior requires cycles.

Three concepts make up every LangGraph application:

  • State — a shared TypedDict all nodes read from and write to
  • Nodes — Python functions that do work (receive state, return dict of updates)
  • Edges — transitions that define what runs next

State and Reducers

State is a TypedDict representing the complete snapshot of your application. The critical concept is reducers — they control what happens when a node writes to a state key.

Without a reducer, the last write wins — returning {"messages": [new_msg]} replaces the entire messages list. With operator.add, it appends. With add_messages (built into MessagesState), it handles deduplication and proper sequencing automatically.

Python
from typing import TypedDict, Annotated
from operator import add

class AgentState(TypedDict):
    messages: list                      # no reducer = last write wins
    results: Annotated[list[str], add]  # add reducer = appends across turns

For most agents, use MessagesState — it handles everything:

Python
from langgraph.graph import MessagesState

# Already includes: messages: Annotated[list[AnyMessage], add_messages]

Why reducers matter for parallel nodes: two parallel nodes writing the same key without a reducer = nondeterministic result. Reducers make parallel execution safe.


Nodes

A node is just a Python function — receives state, returns a dict of updates:

Python
from langchain_anthropic import ChatAnthropic

model = ChatAnthropic(model="claude-opus-4-6")

def call_model(state: MessagesState) -> dict:
    response = model.invoke(state["messages"])
    return {"messages": [response]}  # add_messages reducer appends this

Each node is registered under a string name — the name you pass to add_node(), or the function name if you omit it — and that string is what edge definitions reference. Typos in node names cause runtime errors, not compile-time ones.


Edges

Normal edges — A always goes to B:

Python
builder.add_edge(START, "agent")
builder.add_edge("tools", "agent")  # the loop

Conditional edges — routing function reads state, returns the next node name:

Python
def should_continue(state: MessagesState) -> str:
    if state["messages"][-1].tool_calls:
        return "tools"  # model wants a tool — keep going
    return END  # model is done — stop

This single function is what creates the agent loop. If tool calls exist, go execute them and come back. If not, stop. The entire ReAct cycle, made explicit.

Send — dynamic fan-out for parallel work:

Python
from langgraph.types import Send

def spawn_research(state: AgentState):
    return [Send("research_topic", {"topic": t}) for t in state["topics"]]

Building, Compiling, and Running

Python
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode, tools_condition
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool

@tool
def search(query: str) -> str:
    """Search the web for information."""
    return f"Results for: {query}"

tools = [search]
model = ChatAnthropic(model="claude-opus-4-6").bind_tools(tools)

def call_model(state: MessagesState):
    return {"messages": [model.invoke(state["messages"])]}

builder = StateGraph(MessagesState)
builder.add_node("agent", call_model)
builder.add_node("tools", ToolNode(tools))  # prebuilt: handles parallel execution + errors
builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", tools_condition)  # prebuilt routing
builder.add_edge("tools", "agent")  # the loop

# .compile() is REQUIRED — validates the graph, returns a runnable
graph = builder.compile()

result = graph.invoke({"messages": [{"role": "user", "content": "Search for LangGraph tutorials"}]})

tools_condition — prebuilt router: if last message has tool calls → "tools", else → END. ToolNode — prebuilt node: handles parallel tool execution and error handling automatically.


Checkpointing — Surviving Crashes

By default, state only lives for one .invoke() call. Checkpointing saves state after every node — conversations persist across sessions, and crashes don't lose work.

Python
from langgraph.checkpoint.memory import MemorySaver  # dev
# from langgraph.checkpoint.postgres import PostgresSaver  # production

graph = builder.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "user-123"}}

# First message
graph.invoke({"messages": [{"role": "user", "content": "My name is Tommy"}]}, config=config)

# Second message — Claude remembers
graph.invoke({"messages": [{"role": "user", "content": "What's my name?"}]}, config=config)
# "Your name is Tommy."

Same thread_id = same conversation thread. Different thread_id = new conversation.


Human-in-the-Loop

Pause the graph before a dangerous action. A human reviews. The graph resumes from the exact checkpoint.

Python
graph = builder.compile(
    checkpointer=MemorySaver(),
    interrupt_before=["tools"]  # pause before executing tool calls
)
config = {"configurable": {"thread_id": "session-1"}}

# Graph runs until it reaches the tools node, then pauses
graph.invoke({"messages": [{"role": "user", "content": "Delete all test data"}]}, config=config)

# Inspect what Claude was about to do
state = graph.get_state(config)
pending = state.values["messages"][-1].tool_calls
print(f"Claude wants to: {pending}")

# If approved — resume from checkpoint
graph.invoke(None, config=config)

The graph pauses, state is saved. The human inspects. invoke(None) resumes. No context is lost.


When to Use LangGraph vs LangChain

Situation                          | Use
-----------------------------------|--------------------------
RAG pipeline, document Q&A         | LangChain create_agent()
Fast prototype, simple agent       | LangChain create_agent()
Branching or conditional paths     | LangGraph
Agent loops back based on results  | LangGraph
Human must approve steps           | LangGraph
State persists across sessions     | LangGraph
Multi-agent handoffs               | LangGraph
Going to production                | LangGraph

In practice: most production systems use both — LangChain components (tools, retrievers, models) running inside LangGraph nodes.


Where Things Go Wrong

Forgetting .compile(). StateGraph(...) is a builder — calling .invoke() on it fails. Always compile.

Reducer confusion. Returning {"messages": [new_msg]} without add_messages replaces your entire conversation history with one message. Use MessagesState.

Infinite loops. A routing function that never returns END runs forever. Always test routing with edge cases. Set recursion_limit in config as a safety net.

Parallel state conflicts. Two parallel nodes writing the same key without a reducer = nondeterministic. Define reducers for any key touched by parallel branches.

