Module 6.4
Production Infrastructure
An agent that works on your laptop is a prototype. An agent that works for real users at 2am when you're asleep is a product. The gap between those two is production infrastructure — and it's larger than most people expect the first time they cross it.
This module covers the four layers you need: packaging (Docker Compose), deployment (cloud platforms), observability (monitoring), and reliability (CI/CD, alerting).
The Docker Compose Stack
Docker Compose is the standard packaging format for agent stacks. The same configuration that runs locally runs in production — no rewrites, no environment drift.
A typical agent stack has four services:

```yaml
# docker-compose.yml
services:
  agent:
    build: .
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
    depends_on: [redis, vectordb]
  vectordb:
    image: chromadb/chroma:latest  # or weaviate, qdrant (Pinecone is managed-only)
    volumes:
      - chroma_data:/data
  redis:
    image: redis:7-alpine
    command: redis-server --save 60 1  # persist to disk
    volumes:
      - redis_data:/data
  api:
    image: nginx:alpine  # reverse proxy + rate limiting
    ports:
      - "80:80"
    depends_on: [agent]
volumes:
  chroma_data:
  redis_data:
```

Why each layer: Vector DB for semantic memory and RAG. Redis for session state, caching, and distributed locks (prevents race conditions when multiple agent instances run). API gateway for rate limiting, authentication, and routing. The agent itself is stateless — state lives in Redis and the vector DB, which means you can scale agent instances horizontally without coordination issues.
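The distributed-lock point deserves a sketch. With redis-py you would use `SET` with `nx=True, ex=ttl` (or the client's built-in `Lock` helper); the version below uses a plain dict standing in for Redis so it runs anywhere, and the function names are illustrative, not a real API:

```python
import time
import uuid
from typing import Optional

# Minimal sketch of the SET-NX-with-TTL lock pattern that keeps two agent
# instances from processing the same session concurrently. A dict stands in
# for Redis; with redis-py, acquire would be r.set(key, token, nx=True, ex=ttl)
# and release a compare-and-delete (usually done atomically via a Lua script).
store: dict = {}

def acquire_lock(key: str, ttl: float = 30.0) -> Optional[str]:
    """Return a token if the lock was acquired, else None."""
    now = time.monotonic()
    holder = store.get(key)
    if holder is None or holder[1] < now:  # free, or previous holder expired
        token = uuid.uuid4().hex
        store[key] = (token, now + ttl)
        return token
    return None

def release_lock(key: str, token: str) -> bool:
    """Release only if we still hold the lock (compare-and-delete)."""
    holder = store.get(key)
    if holder and holder[0] == token:
        del store[key]
        return True
    return False

t1 = acquire_lock("session:42")
t2 = acquire_lock("session:42")  # second instance is locked out
print(t1 is not None, t2)  # → True None
```

The TTL matters: if an agent instance crashes while holding the lock, the lock expires instead of deadlocking the session forever.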
Docker Offload (introduced in 2025) lets you test locally, then offload GPU-intensive workloads to cloud infrastructure with a single command — logs stream back to your terminal.
Cloud Platform Decision
| Platform | Best For | Notes |
|---|---|---|
| Railway | AI-native workloads, getting to production fast | Built-in MCP support, $5/month hobby tier, 25M monthly deployments, low-latency edge |
| Fly.io | Global distribution, pay-as-you-go | $3.15/month for small VM, strong DevOps tooling, deploys close to users worldwide |
| GCP Cloud Run | Serverless, cost-sensitive, bursty traffic | Free tier, supports gcloud run compose up, 100–500ms cold start |
| AWS ECS/Fargate | Enterprise, compliance, Bedrock integration | Maximum ecosystem, scalability, unpredictable cost without careful tuning |
The decision rule: Railway or Fly.io for most agent workloads — simpler, cheaper, faster to deploy. AWS or GCP when you have enterprise compliance requirements, need Bedrock integration, or already have existing cloud infrastructure to integrate with.
Avoid serverless (Cloud Run, Lambda) for stateful agents — cold starts add 100–500ms, and state management is painful. Use it for stateless, bursty workloads where cost efficiency matters more than latency.
CI/CD Pipeline for Agents
Agent code has an unusual property: the prompts are as important as the code. A two-word change to a system prompt can break production behavior in ways that pass all unit tests but fail real users. Your CI/CD pipeline must account for this.
```yaml
# .github/workflows/agent-deploy.yml
name: Agent Deploy Pipeline
on: [push]
jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Unit tests
        run: pytest tests/unit/  # tool integrations, parsing, schemas
      - name: Integration tests
        run: pytest tests/integration/ --mock-external  # full flow, mocked APIs
      - name: Run eval suite
        # Fails the pipeline if eval pass rate drops below 85%
        run: python run_evals.py --threshold 0.85
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Build and push image
        run: docker build -t agent:${{ github.sha }} . && docker push ...
      - name: Blue-green deploy
        run: |
          # Spin up new container, health check, then route traffic
          railway deploy --image agent:${{ github.sha }}
```

Version control everything that affects agent behavior: prompts (including system prompts), model configs, RAG parameters (chunk size, overlap, similarity threshold), and tool definitions. Treat a prompt change exactly like a code change — it goes through review, tests, and staged rollout.
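At its core, the eval-gate step is a pass-rate comparison with a non-zero exit on failure. A minimal sketch of what a script like `run_evals.py` might do, assuming each eval case reduces to a pass/fail flag (that result format is an assumption):

```python
import sys

def eval_gate(results: list, threshold: float) -> bool:
    """Return True if the pass rate meets the threshold.

    `results` is assumed to be one boolean pass/fail flag per eval case.
    """
    pass_rate = sum(results) / len(results)
    print(f"eval pass rate: {pass_rate:.2%} (threshold {threshold:.0%})")
    return pass_rate >= threshold

# 17 of 20 cases pass: exactly 85%, which meets a 0.85 threshold
results = [True] * 17 + [False] * 3
if not eval_gate(results, threshold=0.85):
    sys.exit(1)  # non-zero exit fails the CI job
```

The non-zero exit code is the whole mechanism: CI runners treat it as a failed step, which blocks the build-and-deploy steps that follow.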
Blue-green deployment: always keep the previous working image available. If error rate spikes after a deploy, rollback is a one-command operation to the previous container image — not a code revert and rebuild cycle.
Monitoring — What to Instrument
Instrument from day one. Retrofitting observability into a running production system is painful; adding it at the start costs nothing.
The core metrics:
Latency (P50, P95, P99): break it down by component — LLM inference time, tool call time, vector retrieval time. P99 latency is what your worst-case users experience. A high P99 with a low P50 usually points to a specific tool call or retrieval pattern that occasionally goes slow.
Cost per run: this is the metric most teams forget to track until they get a surprise bill. Track total tokens consumed per agent run, multiply by model pricing. Set alerts. An agent that suddenly starts using 10x more tokens is either looping, hitting a failure mode, or being abused.
Error rate: tool call failures, invalid output formats, timeouts. Break down by error type — a 5% error rate concentrated in one specific tool call tells you exactly where to look.
Tool call success rate: what percentage of tool invocations produce valid results? Track this separately from the overall error rate — a tool call that returns a result but with wrong parameters is a different failure mode than one that times out.
Agent loop depth: for agentic workflows, track how many steps each run takes. Sudden increases in average step count mean the agent is looping, getting confused, or hitting edge cases that require more back-and-forth.
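The cost-per-run arithmetic from above is simple enough to sketch. The per-token prices here are illustrative placeholders, not current rates for any model; plug in your provider's actual pricing:

```python
# Cost per run = tokens consumed x per-token price, summed over input/output.
# Prices are placeholders in USD per million tokens, not real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def cost_per_run(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

# A run with 12k input tokens and 2k output tokens:
print(f"${cost_per_run(12_000, 2_000):.4f}")  # → $0.0660
```

Log this number per run, aggregate it per user and per day, and alert on the aggregate; a single expensive run is normal, a sustained 10x jump is not.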
Datadog LLM Observability (2025) auto-instruments Anthropic, OpenAI, LangChain, and Bedrock with no code changes. It produces an interactive agent decision graph — you can visualize the full trace of inputs → tool calls → LLM reasoning → outputs for any run. Critical for debugging non-obvious failures in multi-step agent workflows.
LangSmith works with any LLM framework (not just LangChain). Custom dashboards for cost, latency, error rates, and feedback scores. OpenTelemetry compatible — forward traces to Datadog, Grafana, or New Relic if you have an existing observability stack.
Alerting Without Alert Fatigue
Alert on trends, not single spikes. A single anomalous LLM call is noise. The same pattern sustained for 5 minutes is a signal.
Alert on:
- Cost per interaction >5x baseline (sustained 5 min) — agent is probably looping
- Error rate >5% (sustained 5 min) — something broke in a tool or model response
- P95 latency >2x baseline (sustained 5 min) — infrastructure or model issue
- Loop depth >N steps — agent is stuck; N depends on your max expected task complexity
- Unusual API key patterns or malformed tool calls — potential security incident
Suppress:
- Single-spike errors (retry logic handles these)
- Expected maintenance windows
- Correlated alerts from the same root cause (aggregate them into one alert)
The goal: when a page or notification fires, it should require immediate human attention. If engineers start ignoring alerts because they're noisy, you've lost your safety net.
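The "sustained for 5 minutes" rule reduces to a small streak check. A minimal sketch (class and method names are illustrative, not from any monitoring library):

```python
import time
from typing import Optional

class SustainedAlert:
    """Fire only when a condition holds continuously for `window` seconds.

    A single healthy sample resets the streak, so one-off spikes never page
    anyone; only a breach sustained for the full window fires.
    """
    def __init__(self, window: float = 300.0):
        self.window = window
        self.breach_start = None  # when the current breach streak began

    def observe(self, breached: bool, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if not breached:
            self.breach_start = None  # healthy sample resets the streak
            return False
        if self.breach_start is None:
            self.breach_start = now
        return now - self.breach_start >= self.window

alert = SustainedAlert(window=300)
alert.observe(True, now=0)             # spike begins; no alert yet
assert not alert.observe(True, now=120)
assert alert.observe(True, now=300)    # sustained 5 min, so it fires
```

Real monitoring systems (Datadog monitors, Prometheus `for:` clauses) implement this same idea; the sketch just makes the mechanism explicit.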
Key Gotchas
Cold start latency bites hard for interactive agents on serverless platforms. A 300ms cold start on Cloud Run or Lambda is imperceptible in a batch job; it's a terrible user experience in a real-time chat interface. Use Railway or Fly.io for latency-sensitive workloads.
Stateful agents on stateless infrastructure fail in subtle ways. If your agent stores conversation state in memory (a Python dict), it works great with one instance and breaks immediately with two. Store all state externally in Redis or a database from day one — it costs almost nothing and makes horizontal scaling trivial.
Secret management: API keys, database credentials, and tool secrets should never be in Dockerfiles, .env files committed to git, or environment variables set manually. Use platform-native secret management (Railway Secrets, Fly.io Secrets, AWS Secrets Manager). Rotate keys quarterly and immediately after any suspected compromise.
Prompt drift: model APIs are silently updated. Provider upgrades can shift model behavior in ways that pass your unit tests but change real user experience. Monitor output distribution — if the typical response length, structure, or vocabulary shifts suddenly after a weekend, a model update probably happened. Track model version in your logs.
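Response length is the cheapest of those drift signals to compute. A sketch that compares a rolling window of recent responses against a frozen baseline (the 30% tolerance is an illustrative threshold, not a recommendation):

```python
from collections import deque
from statistics import mean

class LengthDriftMonitor:
    """Flag a sudden shift in typical response length.

    Sketch only: a production monitor would also track structure and
    vocabulary, and use a proper statistical test rather than a fixed
    percentage threshold.
    """
    def __init__(self, baseline_mean: float, window: int = 100,
                 tolerance: float = 0.3):
        self.baseline = baseline_mean   # mean length from a known-good period
        self.recent = deque(maxlen=window)
        self.tolerance = tolerance      # flag a >30% shift by default

    def record(self, response_text: str) -> bool:
        self.recent.append(len(response_text))
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough samples to judge yet
        shift = abs(mean(self.recent) - self.baseline) / self.baseline
        return shift > self.tolerance

monitor = LengthDriftMonitor(baseline_mean=400, window=10)
for _ in range(10):
    monitor.record("x" * 150)  # responses suddenly much shorter
# once the window fills, the ~62% shift from baseline is flagged
```

Pair this with the model-version field in your logs: a flagged shift plus an unchanged version string on your side points at a silent provider-side update.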
Health Checks — The Minimum Production Signal
A health check is a lightweight endpoint that tells your load balancer or orchestrator whether this instance is ready to handle traffic. Without it, your platform might route traffic to a container that started but can't actually serve requests.
```python
# FastAPI health check endpoints — add to every agent API
import os

import anthropic
import redis.asyncio as redis
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
async def health():
    """Liveness check — is the process alive?"""
    return {"status": "ok"}

@app.get("/ready")
async def ready():
    """Readiness check — can this instance actually serve traffic?"""
    checks = {}
    # Check Anthropic API reachability
    try:
        client = anthropic.Anthropic()
        client.models.list(limit=1)  # lightweight API check
        checks["anthropic_api"] = "ok"
    except Exception as e:
        checks["anthropic_api"] = f"error: {e}"
    # Check Redis connectivity
    try:
        r = redis.from_url(os.environ["REDIS_URL"])
        await r.ping()
        checks["redis"] = "ok"
        await r.aclose()
    except Exception as e:
        checks["redis"] = f"error: {e}"
    healthy = all(v == "ok" for v in checks.values())
    return {"status": "ready" if healthy else "degraded", "checks": checks}
```

Docker Compose uses this for its `healthcheck` directive. Railway and Fly.io use it to know when deployments are complete. The readiness check is more important than the liveness check — it validates that all dependencies are actually reachable, not just that the process started.
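Wiring this into the Compose stack is one `healthcheck` block on the agent service. A sketch; the intervals, the port 8000, and the assumption that `curl` exists in the image are all placeholders to adjust:

```yaml
# docker-compose.yml fragment: healthcheck against the readiness endpoint
services:
  agent:
    build: .
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/ready"]
      interval: 30s      # how often to probe
      timeout: 5s        # fail the probe if it hangs
      retries: 3         # consecutive failures before marking unhealthy
      start_period: 15s  # grace period while the agent boots
```

With this in place, `depends_on` with `condition: service_healthy` lets downstream services wait for a genuinely ready agent, not just a started container.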
Load Testing Agents Before Launch
You wouldn't ship a web API without load testing. The same applies to agents — and agent load testing is more nuanced because:
- Each request triggers LLM inference (expensive and slow)
- Concurrent requests hit rate limits faster than you expect
- Redis connection pools exhaust under load before the app server does
```python
# Locust load test for an agent API
import uuid

from locust import HttpUser, task, between

class AgentUser(HttpUser):
    wait_time = between(1, 5)  # simulate realistic human pacing

    def on_start(self):
        # Locust users have no built-in ID; generate one per simulated user
        self.user_id = f"test_user_{uuid.uuid4().hex[:8]}"

    @task(3)
    def short_query(self):
        """Common case — quick question, fast response"""
        self.client.post("/agent/chat", json={
            "user_id": self.user_id,
            "message": "What's the status of my last order?",
        })

    @task(1)
    def complex_query(self):
        """Heavy case — multi-step reasoning"""
        self.client.post("/agent/chat", json={
            "user_id": self.user_id,
            "message": "Analyze my spending across all categories for Q4 "
                       "and give me recommendations",
        }, timeout=60)  # complex tasks take longer
```

What to look for in agent load tests:
- Rate limit errors (429) from the LLM provider — add token budget tracking and backoff
- Redis connection pool exhaustion — increase `max_connections` in the Redis client config
- Response time distribution — agents have higher P99 than typical APIs; set realistic SLA expectations
- Error rate at concurrent load — LLM calls occasionally fail; retry logic must work under pressure
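The 429 bullet usually resolves to exponential backoff with jitter. A minimal sketch; `RateLimitError` and the `call` signature are generic stand-ins, not a specific SDK's API:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for an SDK's 429 error type."""

def with_backoff(call, max_retries: int = 5, base: float = 1.0,
                 sleep=time.sleep):
    """Retry `call` on rate limits with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # budget exhausted; surface the error
            # 1s, 2s, 4s, ... plus jitter so clients don't retry in lockstep
            delay = base * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError()
    return "ok"

print(with_backoff(flaky, sleep=lambda _: None))  # → ok
```

The jitter term is the part teams skip and regret: without it, every client that got rate-limited at the same moment retries at the same moment, recreating the spike.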
Run at 1x, 2x, 5x, 10x expected peak traffic before launch. Find the breaking point before your users do.
Sources
- Docker — Build AI Agents with Docker Compose
- Datagrid — AI Agent CI/CD Pipeline Guide
- Datadog — LLM Observability
- LangChain — LangSmith Observability
- Omega.ai — Top 5 AI Agent Observability Platforms 2026
- Railway — AI-Native Cloud Infrastructure
- AWS Prescriptive Guidance — CI/CD for Serverless Agentic AI