AI Engineering Curriculum
Phase 6: Advanced Topics

Module 6.4

Production Infrastructure

An agent that works on your laptop is a prototype. An agent that works for real users at 2am when you're asleep is a product. The gap between those two is production infrastructure — and it's larger than most people expect the first time they cross it.

This module covers the four layers you need: packaging (Docker Compose), deployment (cloud platforms), observability (monitoring), and reliability (CI/CD, alerting).

The Docker Compose Stack

Docker Compose is the standard packaging format for agent stacks. The same configuration that runs locally runs in production — no rewrites, no environment drift.

A typical agent stack has four services:

YAML
# docker-compose.yml
services:
  agent:
    build: .
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
    depends_on: [redis, vectordb]
  vectordb:
    image: chromadb/chroma:latest  # or pinecone, weaviate
    volumes:
      - chroma_data:/data
  redis:
    image: redis:7-alpine
    command: redis-server --save 60 1  # persist to disk
    volumes:
      - redis_data:/data
  api:
    image: nginx:alpine  # reverse proxy + rate limiting
    ports:
      - "80:80"
    depends_on: [agent]
volumes:
  chroma_data:
  redis_data:

Why each layer: Vector DB for semantic memory and RAG. Redis for session state, caching, distributed locks (prevents race conditions when multiple agent instances run). API gateway for rate limiting, authentication, and routing. The agent itself is stateless — state lives in Redis and the vector DB, which means you can scale agent instances horizontally without coordination issues.
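The distributed-lock pattern mentioned above can be sketched with Redis's SET NX EX primitive. This is a minimal sketch, assuming a redis-py-style client and a hypothetical shared job name; real production code would release the lock atomically with a Lua script.

```python
# Distributed lock via Redis SET NX EX: only the instance that sets the
# key first gets the lock; every other instance skips the duplicated work.
import uuid

LOCK_KEY = "lock:refresh_embeddings"  # hypothetical shared job
LOCK_TTL = 30  # seconds; the lock auto-expires if its holder crashes

def try_acquire(client, key=LOCK_KEY, ttl=LOCK_TTL):
    """Return a lock token if acquired, else None."""
    token = uuid.uuid4().hex
    # SET key token NX EX ttl succeeds only if the key does not exist yet
    if client.set(key, token, nx=True, ex=ttl):
        return token
    return None

def release(client, token, key=LOCK_KEY):
    # Compare-and-delete so we never release a lock another instance now
    # holds. (Against real Redis, a Lua script makes this check atomic.)
    if client.get(key) == token:
        client.delete(key)
```

With two agent instances calling try_acquire concurrently, exactly one gets a token and runs the job; the TTL guarantees a crashed holder cannot wedge the lock forever.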

Docker's docker offload feature (introduced in 2025) lets you test locally, then offload GPU-intensive workloads to cloud infrastructure with a single command — logs stream back to your terminal.

Cloud Platform Decision

| Platform | Best For | Notes |
|---|---|---|
| Railway | AI-native workloads, getting to production fast | Built-in MCP support, $5/month hobby tier, 25M monthly deployments, low-latency edge |
| Fly.io | Global distribution, pay-as-you-go | $3.15/month for small VM, strong DevOps tooling, deploys close to users worldwide |
| GCP Cloud Run | Serverless, cost-sensitive, bursty traffic | Free tier, supports gcloud run compose up, 100–500ms cold start |
| AWS ECS/Fargate | Enterprise, compliance, Bedrock integration | Maximum ecosystem, scalability, unpredictable cost without careful tuning |

The decision rule: Railway or Fly.io for most agent workloads — simpler, cheaper, faster to deploy. AWS or GCP when you have enterprise compliance requirements, need Bedrock integration, or already have existing cloud infrastructure to integrate with.

Avoid serverless (Cloud Run, Lambda) for stateful agents — cold starts add 100–500ms, and state management is painful. Use it for stateless, bursty workloads where cost efficiency matters more than latency.

CI/CD Pipeline for Agents

Agent code has an unusual property: the prompts are as important as the code. A two-word change to a system prompt can break production behavior in ways that pass all unit tests but fail real users. Your CI/CD pipeline must account for this.

YAML
# .github/workflows/agent-deploy.yml
name: Agent Deploy Pipeline
on: [push]
jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Unit tests
        run: pytest tests/unit/  # tool integrations, parsing, schemas
      - name: Integration tests
        run: pytest tests/integration/ --mock-external  # full flow, mocked APIs
      - name: Run eval suite
        run: python run_evals.py --threshold 0.85
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        # Fails the pipeline if eval pass rate drops below 85%
      - name: Build and push image
        run: docker build -t agent:${{ github.sha }} . && docker push ...
      - name: Blue-green deploy
        run: |
          # Spin up new container, health check, then route traffic
          railway deploy --image agent:${{ github.sha }}

Version control everything that affects agent behavior: prompts, model configs, RAG parameters (chunk size, overlap, similarity threshold), tool definitions, system prompts. Treat a prompt change exactly like a code change — it goes through review, tests, and staged rollout.
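One way to make that concrete: load everything behavior-affecting from a config file that lives in the repo, so a prompt edit shows up in a diff and triggers CI like any code change. A minimal sketch with hypothetical field names:

```python
# Hypothetical versioned agent config: it lives in the repo next to the
# code, so a prompt edit produces a reviewable diff and a CI run.
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    system_prompt: str
    model: str
    chunk_size: int            # RAG chunking parameters
    chunk_overlap: int
    similarity_threshold: float

def load_config(path: str) -> AgentConfig:
    with open(path) as f:
        return AgentConfig(**json.load(f))
```

Loading from a file rather than hardcoding means the eval suite in CI exercises exactly the prompt that will ship.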

Blue-green deployment: always keep the previous working image available. If error rate spikes after a deploy, rollback is a one-command operation to the previous container image — not a code revert and rebuild cycle.

Monitoring — What to Instrument

Instrument from day one. Retrofitting observability into a running production system is painful; adding it at the start costs nothing.

The core metrics:

Latency (P50, P95, P99): break it down by component — LLM inference time, tool call time, vector retrieval time. P99 latency is what your worst-case users experience. A high P99 with a low P50 usually points to a specific tool call or retrieval pattern that occasionally goes slow.

Cost per run: this is the metric most teams forget to track until they get a surprise bill. Track total tokens consumed per agent run, multiply by model pricing. Set alerts. An agent that suddenly starts using 10x more tokens is either looping, hitting a failure mode, or being abused.

Error rate: tool call failures, invalid output formats, timeouts. Break down by error type — a 5% error rate concentrated in one specific tool call tells you exactly where to look.

Tool call success rate: what percentage of tool invocations produce valid results? Track this separately from the overall error rate — a tool call that returns a result but with wrong parameters is a different failure mode than one that times out.

Agent loop depth: for agentic workflows, track how many steps each run takes. Sudden increases in average step count mean the agent is looping, getting confused, or hitting edge cases that require more back-and-forth.
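These metrics can be tracked with a few counters long before you adopt a full observability platform. A minimal in-process sketch; the per-token prices are assumptions, so substitute your provider's current pricing:

```python
# Minimal in-process metrics sketch. In production these would be exported
# to Datadog/Prometheus, but the quantities tracked are the same.
import statistics

PRICE_IN = 3.00 / 1_000_000    # $/input token (assumed pricing)
PRICE_OUT = 15.00 / 1_000_000  # $/output token (assumed pricing)

class RunMetrics:
    def __init__(self):
        self.latencies, self.costs, self.steps = [], [], []
        self.errors = 0
        self.runs = 0

    def record_run(self, latency_s, tokens_in, tokens_out, n_steps, error=False):
        self.runs += 1
        self.latencies.append(latency_s)
        self.costs.append(tokens_in * PRICE_IN + tokens_out * PRICE_OUT)
        self.steps.append(n_steps)
        if error:
            self.errors += 1

    def summary(self):
        # quantiles(n=100) returns 99 cut points; indices 49/94/98 are P50/P95/P99
        q = statistics.quantiles(self.latencies, n=100)
        return {
            "p50_s": q[49],
            "p95_s": q[94],
            "p99_s": q[98],
            "cost_per_run": sum(self.costs) / self.runs,
            "avg_loop_depth": sum(self.steps) / self.runs,
            "error_rate": self.errors / self.runs,
        }
```

An alert on cost_per_run exceeding 5x its baseline catches a looping agent within minutes, not at invoice time.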

Datadog LLM Observability (2025) auto-instruments Anthropic, OpenAI, LangChain, and Bedrock with no code changes. It produces an interactive agent decision graph — you can visualize the full trace of inputs → tool calls → LLM reasoning → outputs for any run. Critical for debugging non-obvious failures in multi-step agent workflows.

LangSmith works with any LLM framework (not just LangChain). Custom dashboards for cost, latency, error rates, and feedback scores. OpenTelemetry compatible — forward traces to Datadog, Grafana, or New Relic if you have an existing observability stack.

Alerting Without Alert Fatigue

Alert on trends, not single spikes. A single anomalous LLM call is noise. The same pattern sustained for 5 minutes is a signal.

Alert on:

  • Cost per interaction >5x baseline (sustained 5 min) — agent is probably looping
  • Error rate >5% (sustained 5 min) — something broke in a tool or model response
  • P95 latency >2x baseline (sustained 5 min) — infrastructure or model issue
  • Loop depth >N steps — agent is stuck; N depends on your max expected task complexity
  • Unusual API key patterns or malformed tool calls — potential security incident

Suppress:

  • Single-spike errors (retry logic handles these)
  • Expected maintenance windows
  • Correlated alerts from the same root cause (aggregate them into one alert)

The goal: when a page or notification fires, it should require immediate human attention. If engineers start ignoring alerts because they're noisy, you've lost your safety net.
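The "sustained for 5 minutes" rule is simple to implement: keep a sliding window of samples and fire only once the window is full and every sample breaches the threshold. A minimal sketch:

```python
# Sustained-threshold alert sketch: a single spike never fires, because
# firing requires every sample in the window to breach the threshold.
import time
from collections import deque

class SustainedAlert:
    def __init__(self, threshold, window_s=300):
        self.threshold = threshold
        self.window_s = window_s
        self.samples = deque()  # (timestamp, value) pairs

    def observe(self, value, now=None):
        now = time.time() if now is None else now
        self.samples.append((now, value))
        # Drop samples older than the window
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()
        # Fire only if the window spans its full duration AND every sample breaches
        window_full = self.samples[-1][0] - self.samples[0][0] >= self.window_s
        return window_full and all(v > self.threshold for _, v in self.samples)
```

The same class covers every rule in the list above: instantiate it once per metric (error rate, cost multiplier, P95 multiplier) with the matching threshold.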

Key Gotchas

Cold start latency bites hard for interactive agents on serverless platforms. A 300ms cold start on Cloud Run or Lambda is imperceptible in a batch job; it's a terrible user experience in a real-time chat interface. Use Railway or Fly.io for latency-sensitive workloads.

Stateful agents on stateless infrastructure fail in subtle ways. If your agent stores conversation state in memory (a Python dict), it works great with one instance and breaks immediately with two. Store all state externally in Redis or a database from day one — it costs almost nothing and makes horizontal scaling trivial.
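The fix is mechanical: route all session reads and writes through a store backed by Redis, so any instance can serve any user. A minimal sketch assuming a redis-py-style client and a hypothetical session schema:

```python
# External session state sketch: the agent process holds nothing in
# memory, so horizontal scaling needs no coordination between instances.
import json

class SessionStore:
    def __init__(self, client, ttl_s=3600):
        self.client = client
        self.ttl_s = ttl_s  # expire idle conversations automatically

    def load(self, user_id):
        raw = self.client.get(f"session:{user_id}")
        return json.loads(raw) if raw else {"messages": []}

    def save(self, user_id, state):
        self.client.set(f"session:{user_id}", json.dumps(state), ex=self.ttl_s)
```

Every request becomes load, mutate, save; the in-memory dict never outlives a single request.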

Secret management: API keys, database credentials, and tool secrets should never be in Dockerfiles, .env files committed to git, or environment variables set manually. Use platform-native secret management (Railway Secrets, Fly.io Secrets, AWS Secrets Manager). Rotate keys quarterly and immediately after any suspected compromise.

Prompt drift: model APIs are silently updated. Provider upgrades can shift model behavior in ways that pass your unit tests but change real user experience. Monitor output distribution — if the typical response length, structure, or vocabulary shifts suddenly after a weekend, a model update probably happened. Track model version in your logs.
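A cheap first-pass drift detector: compare the mean response length of recent runs against a baseline window and flag shifts of several standard deviations. This is a sketch, not a substitute for proper output evals:

```python
# Drift sketch: a sudden shift in mean response length is a cheap proxy
# for a silent model update or prompt regression.
import statistics

def length_drift(baseline_lengths, recent_lengths, z_threshold=3.0):
    """Flag if the recent mean sits > z_threshold baseline stdevs from the baseline mean."""
    mu = statistics.mean(baseline_lengths)
    sigma = statistics.stdev(baseline_lengths)
    if sigma == 0:
        return statistics.mean(recent_lengths) != mu
    z = abs(statistics.mean(recent_lengths) - mu) / sigma
    return z > z_threshold
```

The same comparison works for any cheap output statistic: token count, number of tool calls per run, fraction of responses containing code blocks.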

Health Checks — The Minimum Production Signal

A health check is a lightweight endpoint that tells your load balancer or orchestrator whether this instance is ready to handle traffic. Without it, your platform might route traffic to a container that started but can't actually serve requests.

Python
# FastAPI health check endpoint — add to every agent API
import os

import anthropic
import redis.asyncio as redis
from fastapi import FastAPI

app = FastAPI()


@app.get("/health")
async def health():
    """Liveness check — is the process alive?"""
    return {"status": "ok"}


@app.get("/ready")
async def ready():
    """Readiness check — can this instance actually serve traffic?"""
    checks = {}

    # Check Anthropic API reachability
    try:
        client = anthropic.Anthropic()
        client.models.list(limit=1)  # lightweight API check
        checks["anthropic_api"] = "ok"
    except Exception as e:
        checks["anthropic_api"] = f"error: {e}"

    # Check Redis connectivity
    try:
        r = await redis.from_url(os.environ["REDIS_URL"])
        await r.ping()
        checks["redis"] = "ok"
        await r.aclose()
    except Exception as e:
        checks["redis"] = f"error: {e}"

    healthy = all(v == "ok" for v in checks.values())
    return {"status": "ready" if healthy else "degraded", "checks": checks}

Docker Compose uses this endpoint for its healthcheck directive. Railway and Fly.io use it to know when deployments are complete. The readiness check is more important than the liveness check — it validates that all dependencies are actually reachable, not just that the process started.

Load Testing Agents Before Launch

You wouldn't ship a web API without load testing. The same applies to agents — and agent load testing is more nuanced because:

  1. Each request triggers LLM inference (expensive and slow)
  2. Concurrent requests hit rate limits faster than you expect
  3. Redis connection pools exhaust under load before the app server does
Python
# Locust load test for an agent API
import uuid

from locust import HttpUser, task, between


class AgentUser(HttpUser):
    wait_time = between(1, 5)  # simulate realistic human pacing

    def on_start(self):
        # Locust users have no built-in ID; generate one per simulated user
        self.user_id = uuid.uuid4().hex[:8]

    @task(3)
    def short_query(self):
        """Common case — quick question, fast response"""
        self.client.post("/agent/chat", json={
            "user_id": f"test_user_{self.user_id}",
            "message": "What's the status of my last order?"
        })

    @task(1)
    def complex_query(self):
        """Heavy case — multi-step reasoning"""
        self.client.post("/agent/chat", json={
            "user_id": f"test_user_{self.user_id}",
            "message": "Analyze my spending across all categories for Q4 and give me recommendations"
        }, timeout=60)  # complex tasks take longer

What to look for in agent load tests:

  • Rate limit errors (429) from the LLM provider — add token budget tracking and backoff
  • Redis connection pool exhaustion — increase max_connections in Redis client config
  • Response time distribution — agents have higher P99 than typical APIs; set realistic SLA expectations
  • Error rate at concurrent load — LLM calls occasionally fail; retry logic must work under pressure
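The backoff from the first bullet can be sketched as a small wrapper. RateLimitError here stands in for whatever exception your client raises on a 429 (the official Anthropic SDK retries these for you, but custom tool servers usually don't):

```python
# Retry with exponential backoff and jitter for rate-limited calls.
import random
import time

class RateLimitError(Exception):
    """Stand-in for a client's 429 exception."""

def with_backoff(fn, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up; surface the 429 to the caller
            # Exponential backoff with jitter: ~1s, 2s, 4s, ... ± 25%,
            # so retries from many instances don't re-collide in lockstep
            delay = base_delay * (2 ** attempt)
            time.sleep(delay * random.uniform(0.75, 1.25))
```

Under load-test pressure this is the code path that actually runs, which is exactly why it needs to be exercised before launch.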

Run at 1x, 2x, 5x, 10x expected peak traffic before launch. Find the breaking point before your users do.
