Module 3.3
CrewAI
What Is CrewAI?
CrewAI is an open-source framework for building teams of AI agents that collaborate like a real work crew. You describe who each agent is — their role, their goal, their personality — and what you need done. CrewAI handles the coordination, the information flow, and the execution.
Where LangGraph thinks in graphs and state machines, CrewAI thinks in people and teams. You're not wiring nodes and edges — you're hiring specialists and assigning them work. A researcher. A writer. A reviewer. A data analyst. Each agent has a defined role and a clear objective. You put them in a crew, give the crew a task, and let them work.
This is why CrewAI has a much lower learning curve than LangGraph. The concepts map directly to how teams actually work. If you can describe a team in words, you can build it in CrewAI.
Real-World Use Cases
PwC — code generation pipeline. A crew of agents collaborates on software tasks: one writes the code, one reviews it, one checks security, one writes documentation. Before CrewAI, their code generation accuracy was around 10%. After deploying a properly structured crew, it reached 70%. The coordination — one agent's review feeding into another's revision — made the difference.
DocuSign — lead qualification. Multiple agents extract data from different internal systems (CRM, product usage database, marketing platform), then a synthesis agent scores and qualifies the lead. No single agent had access to all the data, but together they could paint a complete picture. Result: dramatically faster time-to-first-contact with qualified leads.
Gelato — lead enrichment. Agents gather company size, printer infrastructure details, and revenue estimates from multiple public and private sources. A synthesis agent combines this into an enriched lead profile. The specialization — one agent per data source — meant each agent could be optimized for exactly what it needed to find.
General enterprise pattern: Most companies start with internal process automation (30–60 days to production), then expand to customer-facing use cases once they trust the system.
Key Terms for This Module
Role — the job title and function of an agent. "Senior Data Researcher." "Code Review Specialist." The model shapes its behavior around this description.
Goal — what the agent is trying to achieve. Specific and outcome-oriented. "Uncover the top 5 trends in {topic} with supporting evidence."
Backstory — the agent's background and personality. Shapes how it approaches problems and communicates. "You are a veteran researcher who always cites sources and flags uncertainty."
Task — a specific assignment. Has a description of what to do and an expected_output describing what the result should look like.
Crew — the team. Contains agents and tasks. Has a process defining execution order.
Process — how the crew executes. Either Sequential (tasks run one after another in order) or Hierarchical (a manager agent dynamically allocates tasks).
Flow — the production orchestration layer that wraps Crews with deterministic control: branching, loops, conditional logic, and explicit state management.
Delegation — when an agent decides it needs help and hands a sub-task to another agent in the crew. Only works when allow_delegation=True.
Guardrail — a validation function attached to a task. If the output doesn't pass, the task retries automatically (up to 3 times by default).
The Role-Playing Trinity
The three most important fields in any CrewAI agent are role, goal, and backstory. Together they shape how the underlying LLM behaves for the entire duration of the task.
This works because LLMs respond powerfully to persona framing. An agent told it's a "veteran data analyst" with "10 years of experience identifying patterns" will approach problems differently than one told it's a "general assistant." The backstory gives the model context about how to reason, what to prioritize, and how to communicate.
```python
from crewai import Agent

researcher = Agent(
    role="Senior Data Researcher",
    goal="Uncover cutting-edge developments in {topic} with evidence from authoritative sources",
    backstory=(
        "You are a veteran research analyst with 10 years of experience. "
        "You always cite your sources, flag uncertainty clearly, and "
        "organize findings so others can act on them immediately."
    ),
    llm="claude-opus-4-6",
    tools=[search_tool, scrape_tool],
    max_iter=10,  # max attempts before forced answer
    verbose=True,
    allow_delegation=False,  # this agent doesn't hand off work
)
```

The interpolation trick: `{topic}` in the goal is a template variable. When you call `crew.kickoff(inputs={"topic": "AI agents in 2025"})`, it fills in automatically. This makes agents reusable for different inputs without rewriting them.
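The mechanic behind this is ordinary string templating. A minimal sketch in plain Python (no CrewAI required; `interpolate` is a hypothetical stand-in, since the real framework performs the substitution internally at kickoff):

```python
# Toy illustration of how {topic}-style placeholders resolve from kickoff inputs.
goal_template = "Uncover cutting-edge developments in {topic} with supporting evidence"

def interpolate(template: str, inputs: dict) -> str:
    # str.format_map fills each {name} placeholder from the inputs dict
    return template.format_map(inputs)

goal = interpolate(goal_template, {"topic": "AI agents in 2025"})
print(goal)
# → Uncover cutting-edge developments in AI agents in 2025 with supporting evidence
```

The payoff is reuse: the same agent definition serves every topic you pass in, so agents behave like parameterized templates rather than one-off prompts.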
Tasks — Defining the Work
A task is a specific, scoped assignment. The description tells the agent what to do. The expected_output tells it what the result should look like. Both matter.
```python
from crewai import Task

research_task = Task(
    description=(
        "Research {topic} thoroughly. Find at least 5 authoritative sources. "
        "Identify key developments, major players, and emerging trends. "
        "Collect the URLs of every source you use."
    ),
    expected_output=(
        "A comprehensive research summary with: key facts, statistics, "
        "notable quotes, and source URLs organized by subtopic."
    ),
    agent=researcher,
)
```

Piping outputs between tasks — the `context` parameter makes one task's output available to another:
```python
synthesis_task = Task(
    description=(
        "Synthesize the research into a structured report. "
        "Identify the 5 most important findings and 3 recommended actions."
    ),
    expected_output="A structured report with findings and recommendations.",
    context=[research_task],  # researcher's output injected here automatically
    agent=analyst,
)
```

Structured output with Pydantic — instead of free text, get a typed object back:
```python
from pydantic import BaseModel
from typing import List

class ResearchReport(BaseModel):
    topic: str
    key_findings: List[str]
    sources: List[str]
    recommended_actions: List[str]
    confidence: str  # "High" | "Medium" | "Low"

synthesis_task = Task(
    description="...",
    expected_output="A structured research report.",
    context=[research_task],
    agent=analyst,
    output_pydantic=ResearchReport,  # output is now a typed object
)
```

Guardrails — automatic retry on bad output:
```python
def validate_has_sources(output) -> tuple[bool, str]:
    if "http" not in output.raw:
        return False, "Output must include at least one URL source."
    return True, output

research_task = Task(
    description="...",
    expected_output="...",
    guardrails=[validate_has_sources],  # retries up to 3x if validation fails
    agent=researcher,
)
```
)This is enormously useful in production. Instead of getting a beautiful but uncited report, the task keeps retrying until the output actually meets your quality criteria.
Crews and Processes
A Crew is the team — it holds the agents, tasks, and execution strategy.
Sequential process (the default):
Tasks run in the order you define them. Each task's output automatically becomes context for the next. Deterministic, predictable, easy to debug.
```python
from crewai import Crew, Process

crew = Crew(
    agents=[researcher, analyst, writer],
    tasks=[research_task, synthesis_task, writing_task],
    process=Process.sequential,
    memory=True,  # enable shared memory across agents
    verbose=True,
    output_log_file="output/run_log.json",  # full audit trail
)

result = crew.kickoff(inputs={"topic": "multi-agent AI systems"})

# Access results
print(result.raw)          # plain text output
print(result.pydantic)     # structured output (if output_pydantic was set)
print(result.token_usage)  # how much this run cost in tokens
```

Hierarchical process — when you want a manager:
A manager agent is automatically created (or you provide your own). The manager dynamically decides which agent gets which task based on their capabilities. Tasks are NOT pre-assigned to agents — the manager allocates at runtime.
```python
crew = Crew(
    agents=[researcher, analyst, writer, data_engineer],
    tasks=[task1, task2, task3, task4],  # not pre-assigned
    process=Process.hierarchical,
    manager_llm="claude-opus-4-6",  # capable model for the manager
    verbose=True,
)
```

When sequential, when hierarchical:
| | Sequential | Hierarchical |
|---|---|---|
| Task assignment | Fixed upfront | Dynamic at runtime |
| Predictability | High | Lower |
| Debugging | Easy | Harder |
| Best for | Clear linear pipelines | Ambiguous tasks, dynamic allocation |
| Cost | Predictable | Higher (manager LLM adds calls) |
The key question: do you know exactly which agent should do which task? If yes — sequential. If the right agent depends on what the task turns out to be — hierarchical.
Tools — Giving Agents Hands
Without tools, agents can only reason and write text. Tools let them take actions in the world: search the web, read files, query databases, execute code.
The @tool decorator — the simplest approach:
```python
from crewai.tools import tool

@tool("Web Search")
def web_search(query: str) -> str:
    """Search the web for current information on a topic.
    Use when you need recent data, news, or live information."""
    # your real implementation here
    return search_results

@tool("Database Query")
def query_db(sql: str) -> str:
    """Execute a read-only SQL query against the analytics database.
    Returns results as formatted text. Use for historical data and metrics."""
    return run_query(sql)
```

The BaseTool subclass — full control:
```python
from crewai.tools import BaseTool
from pydantic import BaseModel, Field
from typing import Type

class SearchInput(BaseModel):
    query: str = Field(..., description="The search query to execute")
    max_results: int = Field(default=5, description="Number of results to return")

class WebSearchTool(BaseTool):
    name: str = "Web Search"
    description: str = (
        "Search the web for current information. Returns titles, URLs, and snippets. "
        "Use for recent events, current data, or anything that may have changed recently."
    )
    args_schema: Type[BaseModel] = SearchInput

    def _run(self, query: str, max_results: int = 5) -> str:
        return perform_search(query, max_results)

    async def _arun(self, query: str, max_results: int = 5) -> str:
        return await async_search(query, max_results)

    def cache_function(self, args, result):
        # Cache search results to avoid duplicate API calls
        return True
```

30+ built-in tools are available out of the box: SerperDevTool (Google search), ScrapeWebsiteTool, PDFSearchTool, CSVSearchTool, CodeInterpreterTool, PGSearchTool (PostgreSQL), GithubSearchTool, and more. You don't need to implement common integrations from scratch.
The docstring is the tool description. The model reads it to decide when to call the tool. Vague docstrings lead to wrong tool calls. Treat them with the same care as Anthropic's tool description best practices.
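To see why the docstring matters, here is a toy sketch of what a decorator like `@tool` plausibly does: capture the function's name and docstring as the metadata the model reads when deciding which tool to call. This is an illustration of the mechanic, not CrewAI's actual implementation.

```python
import functools

def tool(name: str):
    """Toy @tool decorator: attaches name + docstring as model-facing metadata."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        wrapper.tool_name = name
        wrapper.tool_description = (fn.__doc__ or "").strip()  # what the model reads
        return wrapper
    return decorator

@tool("Web Search")
def web_search(query: str) -> str:
    """Search the web for current information on a topic."""
    return f"results for {query}"

print(web_search.tool_name)         # → Web Search
print(web_search.tool_description)  # → Search the web for current information on a topic.
```

If the docstring were just "searches stuff", the model would have no way to distinguish this tool from a database lookup, which is exactly how wrong tool calls happen.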
Memory — Agents That Remember
By default, each crew.kickoff() starts fresh. With memory enabled, agents can recall what they learned in previous runs — across sessions.
```python
crew = Crew(
    agents=[researcher, analyst],
    tasks=[...],
    memory=True,
    embedder={
        "provider": "openai",
        "config": {"model": "text-embedding-3-small"}
    }
)
```

CrewAI uses a unified Memory class that retrieves memories by ranking them on three dimensions simultaneously: semantic similarity (is this memory relevant to the current query?), recency (how recent is it?), and importance (how significant was it?). Each dimension has a configurable weight.
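The three-way ranking can be sketched as a weighted sum. This is illustrative only: the weight values and function name here are assumptions, not CrewAI's internal fields.

```python
def memory_score(similarity: float, recency: float, importance: float,
                 w_sim: float = 0.5, w_rec: float = 0.3, w_imp: float = 0.2) -> float:
    """Combine the three dimensions (each normalized to 0..1) into one ranking score."""
    return w_sim * similarity + w_rec * recency + w_imp * importance

# A highly relevant but old memory vs. a recent, less relevant one
old_relevant = memory_score(similarity=0.9, recency=0.1, importance=0.7)  # ≈ 0.62
recent_vague = memory_score(similarity=0.4, recency=0.9, importance=0.3)  # ≈ 0.53
print(old_relevant > recent_vague)  # → True
```

The point of the weights is exactly this kind of trade-off: with similarity weighted highest, an old but on-topic finding can still outrank a fresh but tangential one.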
The practical effect: after your research crew runs on "LLM agents" today, when it runs on "multi-agent systems" tomorrow, the researcher will automatically recall relevant findings from yesterday's run without being explicitly told to.
Storage defaults to LanceDB at ./.crewai/memory. In production, configure this to a persistent store so memories survive restarts.
Flows — The Production Orchestration Layer
Crews handle the autonomous, intelligent work. Flows handle the structure around it — conditional logic, loops, quality gates, branching.
The production pattern: Flow manages the application; Crews do the intelligent work inside Flow steps.
```python
from crewai.flow.flow import Flow, listen, start
from pydantic import BaseModel

class ContentState(BaseModel):
    topic: str = ""
    report: str = ""
    quality_score: float = 0.0
    approved: bool = False

class ContentFlow(Flow[ContentState]):
    @start()
    def initialize(self):
        print(f"Starting content creation for: {self.state.topic}")

    @listen(initialize)
    def run_research_crew(self):
        result = research_crew.kickoff(inputs={"topic": self.state.topic})
        self.state.report = result.raw

    @listen(run_research_crew)
    def quality_check(self):
        # Deterministic quality gate — not LLM-driven
        word_count = len(self.state.report.split())
        has_sources = "http" in self.state.report
        if word_count < 300 or not has_sources:
            print("Quality check failed — re-running research")
            return self.run_research_crew()  # conditional retry loop
        self.state.approved = True

    @listen(quality_check)
    def publish(self):
        if self.state.approved:
            with open("output/report.md", "w") as f:
                f.write(self.state.report)
            print("Report published.")

flow = ContentFlow()
flow.kickoff(inputs={"topic": "multi-agent AI systems in 2025"})
```

Why Flows matter: Without a Flow, your Crew runs once and you get whatever it produces. With a Flow, you can loop until quality criteria are met, branch based on what the crew found, chain multiple crews in sequence, and recover gracefully from failures — all with explicit, readable Python logic instead of hoping the LLM makes the right decisions.
CrewAI vs LangGraph — The Honest Comparison
| | CrewAI | LangGraph |
|---|---|---|
| Mental model | Role-based team | State machine / graph |
| Learning curve | Low | High |
| Time to first working system | Hours | Days |
| Flexibility | Lower | Higher |
| State management | Crew context + memory | Explicit TypedDict with reducers |
| Debugging | Verbose logs, task replay | Time-travel, breakpoints, checkpoints |
| Human-in-the-loop | human_input=True on tasks | Built-in interrupt pattern |
| Production stability | Stable | v1.0 (Oct 2025), battle-tested |
| Monthly downloads | 1.38M | 6.17M |
The 2025 emerging pattern: "Prototype with CrewAI, productionize with LangGraph." Some teams even nest CrewAI Crews inside LangGraph nodes — autonomous agent collaboration within a well-controlled LangGraph execution graph.
When to pick CrewAI:
- Your problem maps naturally to a team of role-based specialists
- You need to move fast (days, not weeks)
- Workflow is mostly linear with occasional delegation
- Your team isn't deeply familiar with state machines and graph theory
When to pick LangGraph:
- Fine-grained control over every transition is required
- Conditional branching or cycles are core to the design
- Compliance, auditability, or exact audit trails are required
- Scaling to millions of executions per month
Where Things Go Wrong
Cost explosion. A 4-agent crew on Claude Opus can cost 10–50x a single API call. Every agent's full conversation history gets sent on every iteration. Use max_iter to hard-cap loops, route cheaper tasks to smaller models (llm="claude-haiku-4-5-20251001"), and enable cache=True to avoid duplicate tool calls.
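Because the full history is resent on every iteration, input tokens grow roughly quadratically with iteration count. A back-of-envelope sketch (all numbers hypothetical):

```python
def total_input_tokens(iterations: int, tokens_per_turn: int) -> int:
    """Each iteration resends the entire history so far, so input cost accumulates."""
    history = 0
    total = 0
    for _ in range(iterations):
        history += tokens_per_turn  # one more turn appended to the conversation
        total += history            # the whole history is sent as input again
    return total

# One agent, 10 iterations at ~500 tokens/turn vs. a single one-shot call
print(total_input_tokens(1, 500))   # → 500
print(total_input_tokens(10, 500))  # → 27500  (55x the one-shot input cost)
```

Multiply that by four agents and an expensive model and the 10-50x figure above stops looking surprising; this is why `max_iter` caps and cheaper models for routine tasks matter so much.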
Non-determinism. The same crew doesn't always produce the same result. For critical tasks, use output_pydantic for structured output and attach guardrails. Free-text output with no validation is the leading cause of downstream failures.
Manager bottleneck in hierarchical mode. Every task routes through the manager LLM. A confused manager breaks the entire crew. Use a capable model for the manager, write an explicit and detailed manager backstory, and test manager routing in isolation before running the full crew.
Context overflow in long runs. Enable respect_context_window=True (auto-summarizes when approaching token limits) and break long workflows into multiple sequential Crew invocations via a Flow rather than one enormous single run.
Observability gap. CrewAI's built-in logging is basic. For production, connect LangSmith or Langfuse using step_callback and task_callback hooks. You cannot reliably improve what you can't see.
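The callbacks themselves are plain Python functions. A minimal sketch of what you might hand to those hooks (the record fields and log destination here are assumptions; adapt them to whatever your tracing backend expects):

```python
import json
import time

def log_step(step_output) -> dict:
    """Intended as a step callback: runs after every agent step."""
    record = {
        "ts": time.time(),
        "type": "step",
        "payload": str(step_output),  # payload shape varies by version; stringify defensively
    }
    print(json.dumps(record))  # replace with your LangSmith/Langfuse client call
    return record

def log_task(task_output) -> dict:
    """Intended as a task callback: runs once per completed task."""
    record = {"ts": time.time(), "type": "task", "payload": str(task_output)}
    print(json.dumps(record))
    return record

# Wiring sketch, assuming the Crew constructor accepts these hooks:
# crew = Crew(agents=[...], tasks=[...], step_callback=log_step, task_callback=log_task)
```

Even this minimal version gives you a timestamped, greppable trail of every step and task, which is the baseline you need before any serious debugging or cost analysis.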
Sources
- CrewAI Documentation
- CrewAI Agents
- CrewAI Tasks
- CrewAI Memory
- ZenML: LangGraph vs CrewAI
- DataCamp: CrewAI vs LangGraph vs AutoGen
- CrewAI Production Insights