Module 3.5
Production — Debugging, Observability & Cost Control
What Is This Module About?
Building a multi-agent system that works in a demo is one thing. Building one that works reliably in production — day after day, for real users, at real cost — is a completely different challenge.
This module is about everything that happens after you get the agents to collaborate. How do you see what they're doing? How do you find bugs in a system where the behavior is emergent and non-deterministic? How do you stop your API bill from exploding when you scale? How do you test systems that don't produce the same output twice?
These aren't afterthoughts. They're the difference between a system that gets deployed and a system that gets shut down.
Why This Matters — The Production Reality
Published failure rates for multi-agent systems in production range from 41% to 86.7%, and 40% of multi-agent pilots are shut down within 6 months of going live. The failures aren't usually bugs in individual agents. They're coordination failures, cascading errors, silent quality degradation, and cost explosions that nobody saw coming because nobody was watching.
The good news: production teams that implement proper tracing and observability report a 70% reduction in mean time to resolution for multi-agent failures compared to log-based debugging. The visibility itself is the fix for most problems.
Real-World Examples
Exa's research system. As they scaled from prototype to a system processing hundreds of research queries daily, they found that token observability from LangSmith was "really important" for pricing their product. They couldn't set a price without knowing what each query actually cost to run. Observability wasn't a nice-to-have — it was a business requirement.
Elastic's threat detection. They discovered through LangSmith traces that their reflection loop rarely produced meaningful improvements after 3 iterations. Before tracing, they had no way to know this. After seeing it in the data, they capped the loop at 3 — cutting costs without impacting quality.
BudgetMLAgent (research project). By routing tasks to cheap models (GPT-4o mini) for simple steps and only using expensive models (GPT-4) for hard reasoning, they achieved a 94.2% cost reduction — from $0.931 per run to $0.054 — while actually improving success rates from 22.72% to 32.95%. The right model for the right task beats one expensive model for everything.
Key Terms for This Module
Trace — the complete record of one agent execution from start to finish: every LLM call, every tool call, every intermediate step, in a nested tree.
Run — one individual step within a trace. An LLM call is a run. A tool invocation is a run. Runs are nested inside traces.
Thread — a collection of traces that belong to the same conversation. If a user sends 5 messages in the same conversation, each turn produces its own trace, and all 5 traces belong to one thread.
LLM-as-judge — using a language model to evaluate the quality of another model's output. The alternative to exact-match assertions, which don't work for probabilistic outputs.
Model routing — dynamically choosing which model handles a given task based on its complexity and cost. Simple tasks → cheap fast model. Hard reasoning → expensive capable model.
Token-aware rate limiting — rate limiting based on tokens consumed, not just requests per second. LLM calls vary enormously in compute load — a single complex prompt can consume 10,000 tokens.
Circuit breaker — a pattern that stops sending requests to a failing service before the failures cascade. Like a fuse box for your API calls.
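The circuit-breaker idea can be sketched in a few lines of plain Python. This is a minimal illustration of the pattern, not any particular library's API; the class name and thresholds are invented:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after max_failures consecutive errors,
    reject calls for reset_timeout seconds instead of hammering the API."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: service marked unhealthy")
            self.opened_at = None  # timeout elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

Wrap each downstream API call in `breaker.call(...)`; once the breaker trips, callers fail fast instead of queuing up behind a dead service.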
LangSmith — Seeing Inside Your Agents
LangSmith is the observability platform for LangChain and LangGraph. It records everything that happens inside your agents and makes it searchable, visualizable, and analyzable.
Setup — three environment variables:
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="lsv2_..."
export LANGSMITH_PROJECT="my-agent-project"
That's it. Once these are set, every LangGraph and LangChain execution is automatically traced — no code changes needed. Every LLM call, every tool invocation, every agent handoff appears in the LangSmith UI as a nested tree.
What you see in a trace:
For a 3-agent supervisor system, a single user request might generate a trace that looks like this:
[Trace: "Research and write report on RAG"]
├── [Run: Supervisor LLM call] 2.1s | 1,243 tokens
├── [Run: Transfer to researcher]
│ ├── [Run: Researcher LLM call] 3.8s | 2,891 tokens
│ ├── [Run: search_web("RAG")] 0.9s | tool call
│ └── [Run: search_web("vector")] 0.7s | tool call
├── [Run: Transfer to writer]
│ ├── [Run: Writer LLM call] 4.2s | 3,104 tokens
│ └── [Run: write_file("report")] 0.1s | tool call
└── [Run: Supervisor synthesis] 1.6s | 892 tokens
─────────────────────
Total: 12.4s | 8,130 tokens | ~$0.09
This tells you: where time was spent, which agent was expensive, which tool call was slow, exactly what the supervisor said to route to the researcher, and what the researcher found before the writer got it. When something goes wrong, you trace back to exactly which step and why.
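To see what such a tree rolls up to, here is a toy version of the aggregation. The `Run` structure below is invented for illustration; real LangSmith runs carry far more metadata:

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    """Toy stand-in for one traced step: a name, its own token count,
    and any nested child runs."""
    name: str
    tokens: int = 0
    children: list = field(default_factory=list)

def total_tokens(run: Run) -> int:
    """Sum token usage over a run and all nested child runs."""
    return run.tokens + sum(total_tokens(c) for c in run.children)

# The supervisor trace from the example above, as nested runs
trace = Run("supervisor", 1243, [
    Run("researcher", 2891),
    Run("writer", 3104),
    Run("synthesis", 892),
])

print(total_tokens(trace))  # 8130
```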
Tracing code outside LangChain:
For custom functions or raw API calls, use @traceable:
from langsmith import traceable

@traceable  # this function and everything it calls gets traced
def my_rag_pipeline(question: str) -> str:
    docs = retrieve_relevant_docs(question)
    context = format_docs(docs)
    return generate_answer(question, context)

@traceable(name="Document Retrieval", tags=["retrieval"])
def retrieve_relevant_docs(query: str) -> list:
    # appears as a nested run inside the parent trace
    return vector_store.similarity_search(query)
Polly — AI-assisted debugging (July 2025):
For deep agents that run hundreds of steps over several minutes, manually reading trace logs is painful. Polly is an AI assistant built into LangSmith that you can ask questions about a trace directly:
- "Did the agent make any mistakes in this trace?"
- "Why did the supervisor route to the researcher instead of the writer?"
- "Which step was responsible for the incorrect final answer?"
Instead of scanning a 200-step trace manually, you ask and get a specific answer pointing to the exact run.
LangSmith Fetch CLI:
pip install langsmith-fetch
# Grab the most recent trace immediately after a run
langsmith-fetch traces --project-uuid <uuid> --limit 1
# Bulk export for evaluation
langsmith-fetch traces --project-uuid <uuid> --last-n-minutes 60
Cost Control — The Math That Kills Multi-Agent Systems
A single Claude Opus call costs roughly $0.015 per 1,000 output tokens. A 4-agent supervisor system where each agent averages 2,000 output tokens, running 5 turns each, generates ~40,000 output tokens per user request. That's $0.60 per request. At 1,000 requests a day, that's $18,000 a month — for one feature.
Most people don't calculate this before building. The ones who do build sustainable systems. The ones who don't get surprised by their API bill.
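That arithmetic is worth scripting before you build anything. A back-of-the-envelope estimator, with every number an assumption you should replace with your own measurements:

```python
def monthly_cost(agents, tokens_per_turn, turns, price_per_1k_output,
                 requests_per_day, days=30):
    """Back-of-the-envelope output-token cost for a multi-agent feature.
    Ignores input tokens, caching, and retries, so treat it as a floor."""
    tokens_per_request = agents * tokens_per_turn * turns
    cost_per_request = tokens_per_request / 1000 * price_per_1k_output
    return round(cost_per_request * requests_per_day * days, 2)

# The scenario from the text: 4 agents x 2,000 tokens x 5 turns
# at $0.015 per 1k output tokens, 1,000 requests/day
print(monthly_cost(4, 2000, 5, 0.015, 1000))  # 18000.0
```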
Strategy 1: Model routing — the highest leverage change
Don't use Opus for everything. Route tasks to the cheapest model that can handle them reliably:
from langchain_anthropic import ChatAnthropic

# Fast and cheap — for classification, routing, simple extraction
haiku = ChatAnthropic(model="claude-haiku-4-5-20251001")

# Balanced — for most agent work
sonnet = ChatAnthropic(model="claude-sonnet-4-6")

# Expensive — only for hard reasoning, complex synthesis
opus = ChatAnthropic(model="claude-opus-4-6")

# In your supervisor:
research_agent = create_react_agent(
    model=haiku,  # searching is simple — Haiku handles it
    tools=[search_web],
    name="researcher"
)

synthesis_agent = create_react_agent(
    model=opus,  # synthesizing complex research needs Opus
    tools=[],
    name="synthesizer"
)

supervisor = create_supervisor(
    [research_agent, synthesis_agent],
    model=sonnet,  # routing decisions need moderate intelligence
)
Result from research (BudgetMLAgent): Proper model routing achieved a 94.2% cost reduction while improving task success rates. The expensive model was reserved for the steps that genuinely required it.
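When routing decisions happen outside a supervisor, the same idea can be sketched as a small dispatch function. The tiers and keyword heuristics below are placeholders; a cheap classifier model is the more robust choice in practice:

```python
def pick_model(task: str) -> str:
    """Route a task description to a model tier by crude keyword heuristics.
    The marker lists and tier names are illustrative, not a real taxonomy."""
    hard_markers = ("synthesize", "trade-off", "multi-step")
    simple_markers = ("classify", "extract", "lookup", "search")
    t = task.lower()
    if any(m in t for m in hard_markers):
        return "opus"    # expensive: genuine reasoning required
    if any(m in t for m in simple_markers):
        return "haiku"   # cheap: mechanical work
    return "sonnet"      # default middle tier

print(pick_model("Extract the dates from this email"))  # haiku
```

Note the order: the hard-task check runs first, so a task like "synthesize the search results" is not misrouted to the cheap tier just because it mentions searching.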
Strategy 2: Prompt caching
If your system prompt is large and repeated across many calls, caching eliminates the cost of re-sending it:
# Anthropic — explicit cache_control
system=[
    {"type": "text", "text": "You are a research analyst..."},
    {
        "type": "text",
        "text": large_knowledge_base,  # could be 50,000 tokens
        "cache_control": {"type": "ephemeral"}
    }
]

# First call: normal cost. Subsequent calls: ~90% cheaper.
Strategy 3: Context engineering — pass less between agents
Exa's key insight: agents only need the cleaned, final output from prior agents — not the entire reasoning chain. Every intermediate thought token you pass downstream is a token you're paying for in the next agent's input.
# Instead of passing the full agent output (includes reasoning, tool calls, etc.)
# Extract just the actionable findings before passing to the next agent
def extract_findings(agent_output: str) -> str:
    """Strip intermediate reasoning, return only the final findings."""
    # Parse and return the structured part of the output
    return agent_output.split("FINAL FINDINGS:")[1].strip()
Strategy 4: output_mode="last_message" in supervisors
In LangGraph supervisor, this single parameter change can cut supervisor token costs by 60–80%:
workflow = create_supervisor(
    [researcher, writer],
    model=model,
    output_mode="last_message"  # supervisor sees only final result, not full history
)
Strategy 5: Cap iterations everywhere
Reflection loops and retry patterns have no natural upper bound. The agent will keep trying if it keeps failing. Always cap:
# LangGraph — recursion_limit is set in the run config, not at compile time
graph = builder.compile()
graph.invoke(inputs, config={"recursion_limit": 10})  # hard cap on graph cycles

# CrewAI
agent = Agent(max_iter=5, max_execution_time=120)  # 5 attempts, 2 min max

# AutoGen
termination = MaxMessageTermination(15)  # never more than 15 turns
Rate Limiting and Concurrency
When your multi-agent system runs parallel agents, they all hit the API simultaneously. A 10-agent parallel search hitting Claude at once will almost certainly trigger rate limits.
The key insight: LLM rate limits are measured in tokens per minute, not just requests per minute. A single complex prompt can consume 5,000 tokens. Ten simultaneous ones consume 50,000 tokens in seconds.
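To make the tokens-per-minute idea concrete, here is a toy client-side budget tracker. This is an illustration of the concept, not a production limiter; the class and parameter names are invented, and real gateways enforce this server-side:

```python
import time

class TokenBudgetLimiter:
    """Tracks estimated tokens consumed in the current one-minute window
    and sleeps until the window resets when a request would exceed it."""

    def __init__(self, tokens_per_minute: int):
        self.budget = tokens_per_minute
        self.window_start = time.monotonic()
        self.used = 0

    def acquire(self, estimated_tokens: int) -> None:
        now = time.monotonic()
        if now - self.window_start >= 60:
            # A new minute has started: reset the window
            self.window_start, self.used = now, 0
        if self.used + estimated_tokens > self.budget:
            # Over budget: wait out the remainder of the window
            time.sleep(max(60 - (now - self.window_start), 0))
            self.window_start, self.used = time.monotonic(), 0
        self.used += estimated_tokens
```

Call `limiter.acquire(estimate)` before each LLM request; ten parallel agents sharing one limiter can no longer burn the whole minute's budget in the first second.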
LangGraph concurrency control:
# max_concurrency is part of the run config, applied at invoke time
graph.invoke(inputs, config={"max_concurrency": 3})  # never more than 3 parallel calls
LangChain rate limiter:
from langchain_core.rate_limiters import InMemoryRateLimiter

rate_limiter = InMemoryRateLimiter(
    requests_per_second=2,  # max 2 requests per second
    check_every_n_seconds=0.1,
    max_bucket_size=10
)

model = ChatAnthropic(
    model="claude-opus-4-6",
    rate_limiter=rate_limiter
)
Exponential backoff for 429 errors:
import time, anthropic
def call_with_backoff(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return fn()
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries — surface the error
            wait = 2 ** attempt  # 1s → 2s → 4s → 8s between attempts
            time.sleep(wait)
For production at scale: implement a token-aware gateway (AWS API Gateway, Azure APIM, or open-source Agentgateway) that enforces per-user, per-route, and per-model quotas before requests reach the LLM APIs.
Testing Multi-Agent Systems
Traditional software testing doesn't translate to multi-agent systems. You can't assert output == expected_output because LLM outputs are probabilistic — the same input produces different (but equally valid) outputs each time.
What actually works:
1. Test each agent in isolation first:
Before testing how agents collaborate, verify each one behaves correctly on its own. A supervisor that gets bad inputs from a broken researcher will fail in confusing ways.
# Test the researcher agent independently
result = researcher.invoke({"messages": [
    {"role": "user", "content": "Research recent advances in transformer architectures"}
]})

# Assert structure, not exact content
assert len(result["messages"]) > 0
assert "transformer" in result["messages"][-1].content.lower()
# the final answer rarely carries tool calls — check the whole history
assert any(getattr(m, "tool_calls", None) for m in result["messages"])  # should have searched
2. LLM-as-judge for quality evaluation:
Use a model to evaluate output quality against criteria — not exact match:
from langsmith.evaluation import evaluate
from langsmith import Client

client = Client()

def quality_evaluator(run, example):
    """Use Claude to evaluate if the output meets quality criteria."""
    evaluation_prompt = f"""
    Rate the following research report on a scale of 1-5 for:
    - Accuracy (are claims supported?)
    - Completeness (does it address the question?)
    - Citations (are sources included?)

    Report: {run.outputs["report"]}
    Question: {example.inputs["question"]}

    Return JSON: {{"accuracy": N, "completeness": N, "citations": N, "overall": N}}
    """
    # call Claude to evaluate
    score = claude_evaluate(evaluation_prompt)
    return {"score": score["overall"] / 5, "key": "quality"}

results = evaluate(
    my_research_pipeline,
    data="research-eval-dataset",  # dataset in LangSmith
    evaluators=[quality_evaluator],
    experiment_prefix="v2-test"
)
3. Run evaluations multiple times:
Single-run tests are unstable. Run your evaluation suite 3–5 times and look at the distribution, not the single result.
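A sketch of what "look at the distribution" means in practice, using plain Python statistics. The scores below are made-up example data:

```python
import statistics

def summarize_eval_runs(scores_per_run):
    """Aggregate repeated evaluation runs: report mean, spread across runs,
    and the worst run, rather than a single unstable number."""
    means = [statistics.mean(run) for run in scores_per_run]
    return {
        "mean_of_means": round(statistics.mean(means), 3),
        "stdev_across_runs": round(statistics.stdev(means), 3),
        "worst_run": round(min(means), 3),
    }

# Three repetitions of the same eval suite (per-example scores in [0, 1])
runs = [[0.8, 0.9, 0.7], [0.85, 0.8, 0.75], [0.6, 0.9, 0.8]]
print(summarize_eval_runs(runs))
```

A high stdev across runs tells you the system (or the judge) is unstable before any single run's number is worth trusting.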
4. System-level integration tests:
After testing individual agents, test the full multi-agent workflow end-to-end with realistic inputs. The most interesting bugs only appear when agents actually interact.
Monitoring in Production
What to track per agent, per run:
- Input and output token counts
- Latency (time to first token, total duration)
- Tool call count and which tools were called
- Failure rate (did the agent error or produce invalid output?)
- Cost (tokens × model price)
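One minimal way to record the fields above is a small per-run dataclass with a cost helper. The field names and the prices in the example are illustrative assumptions, not any platform's schema:

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    """Per-agent, per-run record matching the tracking checklist."""
    agent: str
    input_tokens: int
    output_tokens: int
    latency_s: float
    tool_calls: int
    failed: bool

    def cost(self, in_price_per_1k: float, out_price_per_1k: float) -> float:
        """tokens x model price, split by input/output rates."""
        return (self.input_tokens / 1000 * in_price_per_1k
                + self.output_tokens / 1000 * out_price_per_1k)

# Example: one researcher run at assumed $0.003/1k in, $0.015/1k out
m = RunMetrics("researcher", 2500, 900, 3.8, 2, False)
print(round(m.cost(0.003, 0.015), 4))  # 0.021
```

Emit one such record per agent per run to your metrics backend, and the per-agent cost/latency/failure dashboard falls out of a simple group-by.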
What to set alerts on:
- Cost-per-run exceeds threshold (catches runaway loops early)
- Failure rate spikes above baseline
- Average latency degrades significantly
- Any individual agent consistently failing its task
The dashboard you want: a per-agent breakdown showing cost, latency, and failure rate over time. When one agent starts degrading, you catch it before it degrades the whole system.
LangSmith provides this out of the box for LangGraph and LangChain systems. For custom systems, use the @traceable decorator and emit metrics to your monitoring platform of choice (Datadog, Grafana, CloudWatch).
Sources
- Debugging Deep Agents with LangSmith
- Why Multi-Agent AI Systems Fail — Galileo
- Multi-Agent System Reliability — Maxim AI
- BudgetMLAgent — arxiv 2411.07464
- Multi-Agent AI Testing Guide 2025 — Zyrix
- LangSmith Documentation
- Rate Limiting LLMs — LangChain Docs