AI Engineering Curriculum
Phase 3: Multi-Agent Systems

Module 3.6

Putting It All Together — Designing a Real Multi-Agent System

What This Module Is

You've learned the patterns (3.1), the frameworks (3.2–3.4), and how to operate them (3.5). This module shows how all of it connects when you actually sit down to design something real.

Most tutorials show you code first and explain later. This module does the opposite. We'll walk through the design of a complete production multi-agent system — an AI Research Team — making every decision step by step, explaining the reasoning behind each one, before showing a single line of code.

By the end, you won't just know how to copy a multi-agent system — you'll know how to design one.


The System We're Designing

AI Research Team: a multi-agent system that takes any research topic and produces a comprehensive, sourced report autonomously.

  • Input: a research topic or question
  • Output: a structured report with findings, analysis, and cited sources
  • Constraint: must be production-ready — checkpointed, observable, cost-controlled, with human review before publishing

This is the kind of system that's useful for consulting work (client research), product work (market analysis), and personal productivity. It's also a clean vehicle for applying every concept from Phase 3.


Step 1: Problem Analysis — Do We Even Need Multi-Agent?

Before picking a pattern, always ask the first question: can a single agent with good tools solve this?

For a research task: load one agent with web search, a PDF reader, and a writing tool. Give it a well-crafted system prompt. Ask for a report.

That will work for simple topics. It fails at scale for three reasons:

  1. Context window. A single agent doing deep research accumulates enormous context — hundreds of search results, scraped pages, drafts, revisions. It hits the ceiling.
  2. Specialization. A single agent trying to simultaneously be a researcher, fact-checker, and writer is doing three different cognitive jobs. Models perform better with focused roles.
  3. Parallelism. Web research and academic research can happen simultaneously. Sequential search doubles the latency for no reason.

Multi-agent is justified here. Next question: what shape should it take?


Step 2: Identifying Agent Boundaries

Draw the task on paper. What are the natural phases?

1. Research: gather raw information from multiple sources
2. Fact check: verify key claims
3. Synthesis: structure findings into coherent insights
4. Writing: produce the final polished report
5. Review: human approves before publishing

Which of these are genuinely independent? Research (web) and research (academic) can run simultaneously — neither depends on the other. Fact-checking depends on research being done first. Synthesis depends on both research and fact-checking. Writing depends on synthesis. Review waits for writing.

This gives us a natural agent structure:

[Web Researcher]  ←── run in parallel ──→  [Academic Researcher]
         ↓                                           ↓
         └──────────→  [Fact Checker]  ←─────────────┘
                       (waits for both)
                              ↓
                       [Synthesizer]
                              ↓
                         [Writer]
                              ↓
                   [Human Review Gate]
                              ↓
                        [Published]
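The dependency analysis above can be checked mechanically. A short sketch using Python's standard-library graphlib walks the stages in "waves" — every stage in the same wave is safe to run in parallel. The stage names mirror the diagram; nothing here is LangGraph-specific:

```python
from graphlib import TopologicalSorter

# Each stage maps to the stages it depends on, straight from the diagram
deps = {
    "web_research": set(),
    "academic_research": set(),
    "fact_check": {"web_research", "academic_research"},
    "synthesis": {"fact_check"},
    "writing": {"synthesis"},
    "human_review": {"writing"},
}

ts = TopologicalSorter(deps)
ts.prepare()

waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # everything whose dependencies are satisfied
    waves.append(ready)
    ts.done(*ready)

print(waves[0])    # → ['academic_research', 'web_research'] — the parallel phase
print(len(waves))  # → 5: one parallel research wave, then four sequential stages
```

If a later design change accidentally introduces a cycle (say, synthesis feeding back into research), `prepare()` raises immediately — a cheap sanity check before any agent code exists.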

Step 3: Pattern Selection — Which Architecture?

Looking at the structure above, which pattern from Module 3.1 fits?

  • Sequential? No — the research phase can be parallel. Sequential would waste time.
  • Concurrent? Partially — but the phases after research are sequential and dependent.
  • Orchestrator-Worker? Close — but we have a clear structure with quality validation that needs supervision.
  • Supervisor? Yes — a supervisor that routes to parallel researchers, then passes synthesized results through the pipeline, with a human checkpoint at the end.

Decision: Supervisor pattern + Parallel execution for the research phase + Human-in-the-loop before publication.

This is the hybrid pattern principle from Module 3.1: use the right pattern for each stage, not one pattern for everything.


Step 4: Framework Selection — LangGraph or CrewAI?

Ask the key question: does the workflow structure emerge at runtime, or is it mostly known upfront?

The structure is mostly known: research in parallel → fact check → synthesize → write → human review. The only dynamic part is the supervisor deciding when research is complete enough.

This could work in CrewAI with a hierarchical process. But three things push toward LangGraph:

  1. Human-in-the-loop. We need to pause before publishing and wait for a human to review. LangGraph's interrupt pattern handles this natively. CrewAI's human_input=True is less controllable.
  2. Parallel execution. The Send API in LangGraph gives us true dynamic parallelism. CrewAI's parallelism is more limited.
  3. Checkpointing. A research run that takes 5 minutes can't afford to start over if something fails. LangGraph's PostgresSaver checkpoints after every node.
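The pause-and-resume behavior that makes human-in-the-loop work can be sketched without any framework at all. Here a plain Python generator stands in for the graph: it runs until the review gate, hands a payload to the human, and resumes with the human's decision. This is conceptual only — not LangGraph's actual API:

```python
def research_pipeline(topic: str):
    """Stand-in for the real pipeline: runs, then pauses for human review."""
    report = f"Draft report on {topic}"
    decision = yield {"awaiting_review": report}  # pause: hand control to a human
    if decision:
        return f"PUBLISHED: {report}"
    return f"REJECTED: {report}"

# Phase 1: run until the pipeline pauses at the review gate
run = research_pipeline("RAG systems")
pause = next(run)
print(pause["awaiting_review"])  # → Draft report on RAG systems

# Phase 2: resume with the human's decision
try:
    run.send(True)  # approve
except StopIteration as done:
    print(done.value)  # → PUBLISHED: Draft report on RAG systems
```

LangGraph's `interrupt` adds what this sketch lacks: the paused state is persisted to a checkpointer, so the resume can happen hours later, from a different process.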

Decision: LangGraph with langgraph-supervisor.


Step 5: State Design

What does shared state need to hold?

Python
from operator import add
from typing import Annotated

from langgraph.graph import MessagesState


class ResearchState(MessagesState):
    # The research topic
    topic: str

    # Accumulated findings from parallel researchers.
    # add reducer = results from both researchers get appended, not overwritten
    raw_findings: Annotated[list[str], add]

    # Fact check result
    verified_claims: list[str]

    # Synthesized findings
    synthesis: str

    # Final report
    report: str

    # Quality score from the reflection loop
    quality_score: float

    # Whether the human approved
    human_approved: bool

Why Annotated[list[str], add] for raw_findings? Because both the web researcher and academic researcher run in parallel and both write to this field. Without the add reducer, the second one to finish would overwrite the first. With add, they accumulate. This is the parallel state conflict gotcha from Module 3.2 — solved upfront in the design.
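To see exactly what the reducer buys us, here is the merge in isolation. `merge_update` is a hypothetical stand-in for the merge LangGraph performs internally, not a real framework function:

```python
from operator import add

def merge_update(state: dict, update: dict, reducers: dict) -> dict:
    """Hypothetical stand-in: merge one node's write into shared state."""
    merged = dict(state)
    for key, value in update.items():
        reducer = reducers.get(key)
        # With a reducer: combine old and new. Without one: overwrite.
        merged[key] = reducer(merged.get(key, []), value) if reducer else value
    return merged

reducers = {"raw_findings": add}
state = {"raw_findings": []}

# Both researchers finish and write to the same field
state = merge_update(state, {"raw_findings": ["web: 3 sources found"]}, reducers)
state = merge_update(state, {"raw_findings": ["arxiv: 2 papers found"]}, reducers)
print(state["raw_findings"])  # → both writes survive

# Same two writes with no reducer registered:
without = merge_update({"raw_findings": []}, {"raw_findings": ["web: 3 sources found"]}, {})
without = merge_update(without, {"raw_findings": ["arxiv: 2 papers found"]}, {})
print(without["raw_findings"])  # → only the last write — the first was lost
```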


Step 6: Model Routing — Assign the Right Model to Each Agent

Not every agent needs Opus. Match the model to the cognitive complexity of the task:

Agent               | Task complexity                   | Model
Web Researcher      | Find and extract information      | Haiku (fast, cheap)
Academic Researcher | Find and extract information      | Haiku (fast, cheap)
Fact Checker        | Compare claims to sources         | Sonnet (needs reasoning)
Synthesizer         | Find patterns, structure insights | Sonnet
Writer              | Produce polished output           | Sonnet
Supervisor          | Route and coordinate              | Sonnet
Opus is not used here. The tasks don't require it. If a future step — say, a complex legal analysis — genuinely needs deeper reasoning, that's when you upgrade that specific agent to Opus. Not before.

Estimated cost comparison:

  • All Opus: ~$0.45 per run
  • Routed as above: ~$0.08 per run
  • Same quality. Roughly 5.6x cheaper ($0.45 / $0.08 ≈ 5.6).
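A back-of-the-envelope cost model makes the comparison concrete. All numbers below — per-million-token prices and per-agent token counts — are illustrative assumptions for this sketch, not measured or published figures:

```python
# Assumed (input, output) USD per million tokens — illustrative only
PRICE_PER_MTOK = {
    "opus":   (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku":  (1.00, 5.00),
}

# Assumed token usage per agent per run: (input_tokens, output_tokens)
USAGE = {
    "web_researcher":      (3_000, 300),
    "academic_researcher": (3_000, 300),
    "fact_checker":        (4_000, 400),
    "synthesizer":         (3_000, 500),
    "writer":              (3_000, 800),
    "supervisor":          (4_000, 300),
}

ROUTING = {
    "web_researcher": "haiku", "academic_researcher": "haiku",
    "fact_checker": "sonnet", "synthesizer": "sonnet",
    "writer": "sonnet", "supervisor": "sonnet",
}

def run_cost(model_for_agent) -> float:
    """Total cost of one run, given a function mapping agent name -> model."""
    total = 0.0
    for agent, (tok_in, tok_out) in USAGE.items():
        p_in, p_out = PRICE_PER_MTOK[model_for_agent(agent)]
        total += tok_in / 1e6 * p_in + tok_out / 1e6 * p_out
    return total

all_opus = run_cost(lambda a: "opus")
routed = run_cost(lambda a: ROUTING[a])
print(f"all Opus: ${all_opus:.2f}  routed: ${routed:.2f}  "
      f"({all_opus / routed:.1f}x cheaper)")
```

The exact ratio depends entirely on the assumed token counts; the structural point is that routing the high-volume research work to the cheapest capable model dominates the savings.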

Step 7: The Complete Annotated System

Now the code — with every design decision explained inline:

Python
import os
from operator import add
from typing import Annotated

from langchain.agents import create_react_agent  # LangGraph v1+
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import MessagesState
from langgraph.types import Command, interrupt
from langgraph_supervisor import create_supervisor

# ─── Models — right model for right task ──────────────────────────────────
haiku = ChatAnthropic(model="claude-haiku-4-5-20251001")  # research workers
sonnet = ChatAnthropic(model="claude-sonnet-4-6")         # reasoning tasks

# ─── State ────────────────────────────────────────────────────────────────
class ResearchState(MessagesState):
    topic: str
    raw_findings: Annotated[list[str], add]  # add reducer for parallel writes
    verified_claims: list[str]
    synthesis: str
    report: str
    quality_score: float
    human_approved: bool

# ─── Tools ────────────────────────────────────────────────────────────────
@tool
def search_web(query: str) -> str:
    """Search the web for current information. Use for news, recent events,
    company data, and any information that may have changed recently."""
    # real implementation: Serper, Tavily, etc.
    return f"Web results for '{query}': ..."

@tool
def search_arxiv(query: str) -> str:
    """Search academic papers on ArXiv. Use for research findings,
    technical details, and peer-reviewed evidence."""
    # real implementation: arxiv API
    return f"Academic papers on '{query}': ..."

@tool
def verify_claim(claim: str, source: str) -> str:
    """Cross-check a specific claim against a source URL or known fact.
    Returns: VERIFIED, UNVERIFIED, or CONTRADICTED with explanation."""
    return f"Verification result for: '{claim}'"

# ─── Worker Agents ────────────────────────────────────────────────────────
# Haiku for research workers — searching is simple, cheap is fine
web_researcher = create_react_agent(
    model=haiku,
    tools=[search_web],
    name="web_researcher",
    prompt=(
        "You are a web research specialist. Search thoroughly for current "
        "information on the given topic. Find at least 3 authoritative sources. "
        "Always include source URLs in your findings."
    ),
)

academic_researcher = create_react_agent(
    model=haiku,
    tools=[search_arxiv],
    name="academic_researcher",
    prompt=(
        "You are an academic research specialist. Find peer-reviewed evidence "
        "and technical depth on the given topic. Focus on recent papers (2023+). "
        "Summarize key findings with paper citations."
    ),
)

fact_checker = create_react_agent(
    model=sonnet,  # fact-checking requires reasoning — upgrade to Sonnet
    tools=[verify_claim, search_web],
    name="fact_checker",
    prompt=(
        "You verify the accuracy of research claims. For each key claim, "
        "cross-check it against your sources. Mark claims as VERIFIED, "
        "UNVERIFIED, or CONTRADICTED. Only pass verified claims forward."
    ),
)

synthesizer = create_react_agent(
    model=sonnet,
    tools=[],
    name="synthesizer",
    prompt=(
        "You synthesize research findings into structured insights. "
        "Identify the 5 most important findings, spot patterns across sources, "
        "note any contradictions between web and academic findings, "
        "and organize everything logically."
    ),
)

writer = create_react_agent(
    model=sonnet,
    tools=[],
    name="writer",
    prompt=(
        "You write clear, well-structured research reports. "
        "Use the synthesized findings to produce a polished report with: "
        "executive summary, key findings, supporting evidence, "
        "and cited sources. Write for an intelligent non-specialist reader."
    ),
)

# ─── Supervisor ───────────────────────────────────────────────────────────
# Sonnet for the supervisor — routing requires moderate intelligence
workflow = create_supervisor(
    agents=[web_researcher, academic_researcher, fact_checker,
            synthesizer, writer],
    model=sonnet,
    output_mode="last_message",  # supervisor sees only final results, not full histories
    prompt=(
        "You coordinate a research team. Follow this sequence:\n"
        "1. Send the topic to BOTH web_researcher and academic_researcher simultaneously\n"
        "2. Once both return, send all findings to fact_checker\n"
        "3. Send verified claims to synthesizer\n"
        "4. Send synthesis to writer\n"
        "5. Return the final report\n"
        "Do not skip steps. Each agent's output feeds the next."
    ),
)

# ─── Human Review Node ────────────────────────────────────────────────────
# This node pauses execution — a human reviews before anything is published.
# Note: it must be registered on the graph under the name "human_review"
# for interrupt_before below to find it.
def human_review_gate(state: ResearchState):
    """Pause here for human review. The graph saves state and waits."""
    decision = interrupt({
        "message": "Research report ready for review. Approve for publication?",
        "report_preview": state["messages"][-1].content[:500] + "...",
    })
    return {"human_approved": decision}

# ─── Compile with Production Features ─────────────────────────────────────
DB_URI = os.environ["DATABASE_URL"]  # never hardcode credentials

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates required tables — run once at startup

    app = workflow.compile(
        checkpointer=checkpointer,          # crash recovery
        interrupt_before=["human_review"],  # pause before publishing
    )

    # ─── Run ──────────────────────────────────────────────────────────────
    config = {"configurable": {"thread_id": "research-001"}}

    # Phase 1: run the research pipeline
    result = app.invoke(
        {"messages": [{"role": "user",
                       "content": "Research the latest advances in RAG systems"}]},
        config,
    )

    # Execution pauses here — human reviews in the LangSmith UI or your app
    print("Pipeline paused. Report ready for review.")
    print(app.get_state(config).values["messages"][-1].content)

    # Phase 2: resume after human approval
    # Human approves:
    final = app.invoke(Command(resume=True), config)
    # Human rejects:
    # final = app.invoke(Command(resume=False), config)

Step 8: What We Applied From Each Module

Looking at this system, every module from Phase 3 contributed something:

What's in the system                                    | From module
Supervisor + Parallel pattern chosen deliberately       | 3.1 — Architecture Patterns
Annotated[list, add] reducer for parallel writes        | 3.2 — LangGraph Multi-Agent
create_supervisor + output_mode="last_message"          | 3.2 — LangGraph Multi-Agent
PostgresSaver checkpointing                             | 3.2 — LangGraph Multi-Agent
Human-in-the-loop interrupt                             | 3.2 — LangGraph Multi-Agent
Role-based agent specialization                         | 3.3 — CrewAI concepts (applied in LangGraph)
Model routing (Haiku for workers, Sonnet for reasoning) | 3.5 — Production Ops
LangSmith tracing (3 env vars, already active)          | 3.5 — Production Ops
max_concurrency on compiled graph                       | 3.5 — Production Ops

This is what it looks like when the concepts from each module become actual decisions in a real system. Each choice was made for a specific reason — not because a tutorial said so.


The Design Mindset

The most important thing this module is trying to show isn't the code — it's the thinking that precedes the code.

Every time you start building a multi-agent system, ask these questions in order:

  1. Do I need multi-agent at all? What's the simplest thing that could work?
  2. What are the natural agent boundaries? Where does specialization genuinely add value?
  3. What can run in parallel? What must be sequential? Draw the dependency graph.
  4. Which pattern fits? Sequential, Supervisor, Parallel, Swarm, or a combination?
  5. LangGraph or CrewAI? Do I need fine-grained control, or is speed to prototype the priority?
  6. What goes in shared state? Which fields need reducers?
  7. Which model for which agent? Match cognitive complexity to model capability.
  8. Where are the checkpoints and human gates? What can't be undone?
  9. What am I monitoring? Cost per run, failure rate, latency per agent.

Answer those nine questions before writing any code, and the implementation almost writes itself.

