Module 5.7
Evals & Agent Testing
Why Evals — The Gap Between "It Works" and "It Works Reliably"
Most agents fail in production for a reason that has nothing to do with the model. The model does exactly what it's always done. What changed is the data, the user, the edge case, the adjacent tool call — and nobody noticed because there was nothing in place to notice.
That's the gap evals close. Not "does this agent work" — you can answer that by trying it. The question evals answer is: does it still work after I changed X? Does it work for user type Y? Does it degrade gracefully when Z goes wrong? Those questions can't be answered by trying it once.
Real-World Use Cases
- A RAG agent returns confident, plausible answers that turn out to be unfaithful to the retrieved documents. Without evals, this goes undetected until a user complains or acts on bad information.
- A customer support agent handles 95% of cases correctly but consistently fails a specific intent (say, refund requests with gift card payments). Without evals against a labeled dataset, the 5% failure rate is invisible.
- A code generation agent worked perfectly on GPT-4o. After a model version bump, it started producing subtly different tool call sequences that broke a downstream parser. The change in behavior was never caught — there were no regression tests.
Every one of these is a real failure mode. Every one would have been caught early with a minimal eval setup.
Why Agent Evals Are Harder Than LLM Evals
Evaluating a single LLM response is relatively tractable. You have an input, you have an output, you judge the output.
Agents are different. An agent operates across multi-step workflows — it plans, selects tools, processes results, decides next steps, and adapts over dozens of interactions. The final output might look correct while the path to get there was completely wrong — the agent got lucky, or it succeeded for the wrong reasons, or it succeeded on this input but will fail on a slight variation.
This means evaluating agents requires two things single-turn eval misses:
- Trajectory-level scoring — evaluating the full sequence of decisions, not just the last output
- Behavioral consistency — does the agent handle the same task the same way reliably, or does it drift across runs?
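Behavioral consistency can be measured directly: run the same task several times and check how often the agent takes the same tool sequence. A rough sketch, assuming each run yields a list of tool names:

```python
from collections import Counter

def consistency_score(runs: list[list[str]]) -> float:
    """Fraction of runs that follow the modal (most common) tool sequence."""
    sequences = Counter(tuple(run) for run in runs)
    _, modal_count = sequences.most_common(1)[0]
    return modal_count / len(runs)

runs = [
    ["search", "summarize"],
    ["search", "summarize"],
    ["search", "browse", "summarize"],  # drifted run
]
consistency_score(runs)  # 2 of 3 runs match the modal sequence -> ~0.67
```

A score well below 1.0 on a task the agent "passes" is an early warning: it is succeeding by luck, not by a stable policy.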
The Mindset Shift
The teams that build reliable agents treat evals the same way software engineers treat tests: write them before you need them, run them automatically, and never ship without them.
The mindset shift: evals aren't a QA step at the end. They're the specification for what success means — written before the agent is built, updated as the agent evolves, run on every change. Without that, you're not building an agent. You're building a demo.
Evaluation Approaches
There are four ways to evaluate an agent output, and none of them is sufficient alone. Production eval stacks combine all four.
LLM-as-Judge
An LLM evaluates the output of another LLM. You write an evaluation prompt that defines the criteria, pass the agent's output (and optionally the input and any reference answer), and get back a score or verdict.
import anthropic
import json

client = anthropic.Anthropic()

def llm_judge(question: str, agent_response: str, criteria: str) -> dict:
    prompt = f"""You are an expert evaluator. Score the following response on the given criteria.

Question: {question}
Response: {agent_response}
Criteria: {criteria}

Respond with JSON only:
{{"score": <0.0-1.0>, "reasoning": "<one sentence>", "verdict": "pass" | "fail"}}"""
    result = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(result.content[0].text)

# Example usage
verdict = llm_judge(
    question="What is our return policy?",
    agent_response="You can return any item within 30 days for a full refund.",
    criteria="Is the response accurate, on-topic, and free of hallucination?"
)
# {"score": 0.9, "reasoning": "Accurate and directly answers the question.", "verdict": "pass"}

State-of-the-art models align with human judgment ~85% of the time. That's good enough for automated screening — not good enough to replace human judgment on high-stakes decisions.
The self-evaluation problem: asking the same model that generated the response to evaluate it is unreliable. Use a different model, or a different model family, for judging. Claude evaluating Claude outputs has systematic biases. Claude evaluating GPT-4o outputs is more independent.
Multi-agent judge panels: for high-stakes evals, run multiple judge agents with different roles — a domain expert judge, a critic judge, a user-perspective judge — and aggregate their scores. This emulates a panel of human reviewers and reduces single-judge bias.
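A sketch of panel aggregation. The role names and the 0.7 pass threshold are illustrative, and the stub judges stand in for real LLM calls (each would wrap something like the llm_judge above with a role-specific prompt):

```python
def panel_judge(question: str, response: str, judges: dict) -> dict:
    """Aggregate scores from role-specific judges; judges maps role -> scoring function."""
    scores = {role: judge(question, response) for role, judge in judges.items()}
    mean = sum(scores.values()) / len(scores)
    return {"scores": scores, "mean": mean, "verdict": "pass" if mean >= 0.7 else "fail"}

# Stub judges standing in for real LLM calls with role-specific prompts
judges = {
    "domain_expert": lambda q, r: 0.9,
    "critic": lambda q, r: 0.6,
    "user_perspective": lambda q, r: 0.8,
}
panel_judge("What is our return policy?", "30 days, full refund.", judges)
# mean ≈ 0.77 -> pass at the illustrative 0.7 threshold
```

Aggregation can be a mean, a majority vote on verdicts, or a weighted sum; the point is that disagreement between roles is itself a signal worth surfacing.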
Human Evaluation
The ground truth. LLM-as-judge gets you scale. Human eval gets you accuracy.
74% of deployed production agents still rely primarily on human-in-the-loop evaluation. 52% use LLM-as-judge and human verification — the two work together. LLM-as-judge screens at volume. Humans validate the borderline cases and calibrate the judge periodically.
The practical setup: maintain a labeled evaluation dataset built from real production failures and edge cases. When LLM-as-judge flags something as borderline (score between 0.4–0.7), route it to human review. Human verdicts feed back into the eval dataset, improving future LLM-judge calibration.
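The routing logic itself is a few lines. The 0.4–0.7 band comes from the setup above; the queue and results lists are stand-ins for whatever review tooling you actually use:

```python
def route_verdict(judge_score: float, human_queue: list, auto_results: list,
                  low: float = 0.4, high: float = 0.7) -> str:
    """Auto-accept high scores, auto-fail low ones, send the borderline band to humans."""
    if low <= judge_score <= high:
        human_queue.append(judge_score)
        return "human_review"
    auto_results.append(judge_score)
    return "auto_pass" if judge_score > high else "auto_fail"

queue, auto = [], []
assert route_verdict(0.9, queue, auto) == "auto_pass"
assert route_verdict(0.55, queue, auto) == "human_review"
assert route_verdict(0.2, queue, auto) == "auto_fail"
assert queue == [0.55]
```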
Automated / Code-Based Metrics
For anything with a deterministic ground truth, don't use an LLM — use code. It's faster, cheaper, and more reliable.
def evaluate_tool_call(expected: dict, actual: dict) -> dict:
    """Grade a tool call deterministically."""
    correct_tool = expected["tool"] == actual["tool"]
    correct_params = all(
        actual["params"].get(k) == v
        for k, v in expected["params"].items()
    )
    return {
        "tool_correct": correct_tool,
        "params_correct": correct_params,
        "passed": correct_tool and correct_params,
        "score": (correct_tool + correct_params) / 2
    }

def evaluate_task_completion(task_state: dict) -> bool:
    """Check whether required state was reached, not just whether output looks right."""
    return (
        task_state.get("file_created") is True and
        task_state.get("email_sent") is False and  # should NOT have sent email
        task_state.get("output_format") == "markdown"
    )

The key insight: evaluate state, not just output text. An agent that produces the right words but takes wrong actions failed. An agent that takes right actions but produces clunky prose succeeded. State-based graders capture the former correctly; text-based graders don't.
Benchmark Suites
A curated set of tasks with known correct answers, representative of real-world usage. Expensive to build — requires domain expertise, labeled data, and ongoing maintenance. But once built, a benchmark suite is the most valuable eval artifact you have.
The lifecycle of a benchmark: first it's a capability eval (can the agent do this class of task at all?). Once it passes reliably, it becomes a regression suite (did the latest change break anything?). Run it on every commit.
RAG Metrics — What RAGAS Measures
The specific failure modes of RAG systems require specific metrics. Generic response quality scores miss the ways RAG actually breaks. RAGAS (Retrieval Augmented Generation Assessment) provides a research-backed framework of four core metrics that together characterize RAG system health.
The Four Core RAGAS Metrics
Faithfulness measures whether the response is factually consistent with the retrieved context. It answers: did the model make things up, or did it stay grounded in what it retrieved?
faithfulness = statements_supported_by_context / total_statements_in_response
A response that cites facts not present in any retrieved chunk scores near 0. A response that accurately restates retrieved content scores near 1. This is the most important RAG metric — hallucination on top of retrieval is the core failure mode.
Answer Relevance measures how pertinent the response is to the original question. It penalizes irrelevant tangents, incomplete answers, and redundant padding.
answer_relevance ≈ mean(cosine_similarity(embed(question), embed(generated_question_i)))
The trick RAGAS uses: prompt an LLM to generate several questions that the response would answer, embed them, and compare each against the original question's embedding. If they cluster tightly around the original, the response is relevant.
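A minimal sketch of that computation. The toy 2-d vectors stand in for real embeddings, and the question generation step (an LLM call in RAGAS) is assumed to have already happened:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def answer_relevance(question_vec: list[float], generated_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question and questions generated from the answer."""
    sims = [cosine_similarity(question_vec, v) for v in generated_vecs]
    return sum(sims) / len(sims)

# Toy 2-d vectors standing in for real embeddings
question = [1.0, 0.0]
generated = [[1.0, 0.0], [0.8, 0.6]]
answer_relevance(question, generated)  # (1.0 + 0.8) / 2 = 0.9
```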
Context Precision measures signal-to-noise in what was retrieved. Are the retrieved chunks actually relevant to answering the question, or is the retriever pulling in noise?
context_precision = relevant_chunks_retrieved / total_chunks_retrieved
Low precision means your retriever is noisy — the model has to work harder to find signal, and hallucination risk goes up.
Context Recall measures whether all the information needed to answer the question was actually retrieved. Requires ground truth labels (you must know what a correct, complete answer looks like).
context_recall = facts_in_answer_attributable_to_context / total_facts_in_ground_truth_answer
Low recall means relevant documents aren't making it into the context window — your chunking, embedding model, or similarity threshold is too aggressive.
Using RAGAS in Practice
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Build eval dataset from real agent runs
eval_data = {
    "question": ["What is our return policy?", "How do I reset my password?"],
    "answer": [agent_answers[0], agent_answers[1]],          # agent's actual responses
    "contexts": [retrieved_chunks[0], retrieved_chunks[1]],  # what was retrieved
    "ground_truth": ["30 days full refund...", "Go to settings > security..."],  # correct answers
}

dataset = Dataset.from_dict(eval_data)
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# {'faithfulness': 0.87, 'answer_relevancy': 0.92, 'context_precision': 0.79, 'context_recall': 0.84}

What the scores tell you:
- Low faithfulness → the model is hallucinating; fix with prompt hardening or smaller, more focused context
- Low answer relevancy → responses are wandering off-topic; tighten the system prompt
- Low context precision → retriever is noisy; improve chunking strategy or similarity threshold
- Low context recall → relevant content isn't being retrieved; check embedding model, chunk size, or coverage gaps in the knowledge base
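Those diagnoses can be wired into the eval pipeline directly. A minimal sketch; the 0.8 threshold is illustrative, not a RAGAS default:

```python
def diagnose_rag_scores(scores: dict, threshold: float = 0.8) -> list[str]:
    """Map low RAGAS scores to the likely fix, per the list above."""
    fixes = {
        "faithfulness": "model is hallucinating; harden prompt or shrink context",
        "answer_relevancy": "responses wandering off-topic; tighten system prompt",
        "context_precision": "retriever is noisy; improve chunking or similarity threshold",
        "context_recall": "relevant content not retrieved; check embeddings, chunk size, coverage",
    }
    return [f"{metric}: {fix}" for metric, fix in fixes.items()
            if scores.get(metric, 1.0) < threshold]

diagnose_rag_scores({"faithfulness": 0.87, "answer_relevancy": 0.92,
                     "context_precision": 0.79, "context_recall": 0.84})
# flags only context_precision at this threshold
```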
Task Agent Metrics
RAG metrics are about information retrieval quality. Task agent metrics are about whether the agent successfully completes work in the world — tool calls, multi-step execution, state management.
Task Completion Rate
The bluntest metric and often the most important: what percentage of tasks does the agent complete end-to-end without user intervention?
def measure_completion_rate(agent, test_tasks: list[dict]) -> float:
    completed = 0
    for task in test_tasks:
        result = agent.run(task["input"])
        if task["completion_check"](result):
            completed += 1
    return completed / len(test_tasks)

The completion_check is a function you write per task — it inspects the final state, not the output text. Did the file get created? Did the right API get called? Did the workflow reach the expected terminal state?
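What a test task might look like, assuming the agent's run() returns a state dict at the end (the field names here are illustrative, not a fixed schema):

```python
test_tasks = [
    {
        "input": "Summarize the Q3 report into summary.md",
        # Inspect final state, not output text: was the file actually written?
        "completion_check": lambda state: (
            state.get("files_written", []) == ["summary.md"]
            and state.get("status") == "done"
        ),
    },
]

# A stub final state to show the check in action
final_state = {"files_written": ["summary.md"], "status": "done"}
assert test_tasks[0]["completion_check"](final_state)
```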
Tool Call Accuracy
Agents fail in subtle ways: calling the right tool with wrong parameters, calling tools in the wrong sequence, calling unnecessary tools, or skipping required ones. Text-based eval misses all of this. Tool call evaluation requires logging the full trace.
def evaluate_tool_trace(expected_trace: list[dict], actual_trace: list[dict]) -> dict:
    """
    expected_trace: [{"tool": "search", "params": {"query": "..."}}, ...]
    actual_trace: the agent's logged tool calls
    """
    if len(actual_trace) != len(expected_trace):
        return {"passed": False,
                "reason": f"Step count mismatch: expected {len(expected_trace)}, got {len(actual_trace)}"}
    errors = []
    for i, (exp, act) in enumerate(zip(expected_trace, actual_trace)):
        if exp["tool"] != act["tool"]:
            errors.append(f"Step {i}: wrong tool (expected {exp['tool']}, got {act['tool']})")
        for param, value in exp["params"].items():
            if act["params"].get(param) != value:
                errors.append(f"Step {i}: wrong param '{param}'")
    return {"passed": len(errors) == 0, "errors": errors,
            "score": 1 - len(errors) / len(expected_trace)}

Step Efficiency
Measures whether the agent takes the optimal path or wastes steps. An agent that uses 12 steps to do what should take 4 is hallucinating work, looping unnecessarily, or failing to use tools effectively.
def step_efficiency(optimal_steps: int, actual_steps: int) -> float:
    """1.0 = optimal. Lower = less efficient. Can exceed 1.0 if agent found a shortcut."""
    return optimal_steps / max(actual_steps, 1)

Track this over time. A sudden drop in step efficiency after a change usually means the agent is looping or getting confused by new prompts.
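Tracking it over time can be as simple as comparing each batch of runs against a recorded baseline. A sketch; the drop tolerance is an illustrative choice:

```python
def efficiency_alert(baseline: float, current_runs: list[float],
                     drop_tolerance: float = 0.2) -> bool:
    """Return True if mean efficiency dropped more than drop_tolerance below baseline."""
    current = sum(current_runs) / len(current_runs)
    return current < baseline - drop_tolerance

# Baseline 0.9; recent runs averaging 0.55 trip the alert, 0.875 does not
assert efficiency_alert(0.9, [0.5, 0.6]) is True
assert efficiency_alert(0.9, [0.85, 0.9]) is False
```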
Trajectory-Level Scoring
The most powerful eval for complex agents — score the entire chain of decisions, not just the final output. This is what LLM-as-judge does at the trajectory level.
import json
import anthropic

aclient = anthropic.AsyncAnthropic()  # async client, since this judge is awaited

async def trajectory_judge(task: str, full_trace: list[dict]) -> dict:
    """
    full_trace: list of {step, thought, tool_called, result} dicts from agent run
    """
    trace_str = "\n".join(
        f"Step {t['step']}: Thought: {t['thought']} → Called: {t['tool_called']} → Got: {t['result'][:100]}"
        for t in full_trace
    )
    prompt = f"""Evaluate this agent's reasoning trace for the task: "{task}"

{trace_str}

Score 0-1 on:
1. Reasoning quality: did the agent's thinking lead logically to its actions?
2. Tool selection: did it use the right tools in the right order?
3. Recovery: did it handle errors or unexpected results gracefully?
4. Efficiency: did it avoid unnecessary steps?

Return JSON: {{"reasoning": 0.0-1.0, "tools": 0.0-1.0, "recovery": 0.0-1.0, "efficiency": 0.0-1.0, "overall": 0.0-1.0}}"""
    result = await aclient.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(result.content[0].text)

Eval Frameworks
You don't have to build eval infrastructure from scratch. These frameworks exist specifically for agent evaluation.
Braintrust is the most framework-agnostic option. It works with LangChain, LlamaIndex, OpenAI Agents SDK, and plain API calls. Strong at trajectory-level analysis and unified observability — you can see both development evals and production metrics in one place. The right default choice if you're not locked into a single framework.
RAGAS is purpose-built for RAG evaluation. The four core metrics (faithfulness, relevance, precision, recall) are research-backed and well-calibrated. It also includes testset generation — given a knowledge base, it can automatically generate question-answer pairs for evaluation. Narrow focus: excellent for RAG, not designed for task agents.
LangSmith is the natural choice if you're already using LangChain. Deep tracing integration means every LangChain run is automatically logged and evaluable. Less flexible outside the LangChain ecosystem.
Weave (Weights & Biases) bridges development evals with production monitoring. If you already use W&B for ML experiment tracking, Weave extends that to agent evaluation. Best for teams that want a single observability platform across training and inference.
DeepEval is open-source and lightweight. Write custom evaluation metrics in Python, define your own scoring rubrics, run locally without a commercial platform. The right choice when you need proprietary evaluation criteria or want to avoid vendor lock-in.
Building a Regression Test Suite
The goal: a suite of tests that runs automatically on every change and tells you immediately if something broke.
Step 1: Collect Real Failures First
Don't invent test cases. Pull them from production. The first 20–50 tests in your suite should be real failure cases your agent encountered — inputs that produced wrong outputs, tasks that didn't complete, edge cases users hit.
Real failures make the best tests because they represent actual distribution of user behavior, not your assumptions about it.
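One way to make that mechanical: every triaged production failure becomes a test-case record in the suite. A sketch; the field names are illustrative, not a required schema:

```python
def failure_to_test_case(failure: dict, expected_behavior: str) -> dict:
    """Turn a logged production failure into a regression test entry."""
    return {
        "id": f"prod-{failure['trace_id']}",
        "input": failure["user_input"],  # replay the exact input that failed
        "expected": expected_behavior,   # labeled by the human who triaged it
        "source": "production",
        "tags": ["regression", failure.get("intent", "unknown")],
    }

case = failure_to_test_case(
    {"trace_id": "a1b2", "user_input": "Refund to my gift card", "intent": "refund"},
    expected_behavior="Explain gift-card refund process without escalating",
)
# case["id"] == "prod-a1b2"; tagged ["regression", "refund"]
```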
Step 2: Establish Baselines
Run your eval suite against the current agent version. Record the scores. These are your baselines.
import json
from datetime import datetime, timezone

def run_baseline(agent, eval_suite: list[dict]) -> dict:
    results = {}
    for test in eval_suite:
        output = agent.run(test["input"])
        score = test["grader"](output)  # grade once, reuse the score
        results[test["id"]] = {
            "score": score,
            "passed": score >= test["threshold"],
        }
    summary = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pass_rate": sum(r["passed"] for r in results.values()) / len(results),
        "mean_score": sum(r["score"] for r in results.values()) / len(results),
        "results": results,
    }
    with open("eval_baseline.json", "w") as f:
        json.dump(summary, f, indent=2)
    return summary

Step 3: Automate in CI/CD
Run evals on every PR, every model change, every prompt change. Fail the deployment if key metrics drop below threshold.
# .github/workflows/agent-evals.yml
name: Agent Eval Suite
on: [push, pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run eval suite
        run: python run_evals.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Check pass rate
        run: |
          python - <<'EOF'
          import json
          rate = json.load(open("eval_results.json"))["pass_rate"]
          assert rate >= 0.85, f"Eval pass rate {rate:.0%} below 85% threshold"
          EOF

Step 4: Separate Quality Benchmarks from Regression Tests
Two distinct suites with different purposes:
Quality benchmarks — broad capability testing. "Can the agent do this class of task at all?" Run less frequently (weekly, or before major releases). Accepts some failures — the goal is to understand the capability frontier.
Regression tests — specific known-good behaviors that must not break. "This exact input must produce this exact class of output." Run on every commit. Zero tolerance for failures — a regression test that fails means something broke.
Step 5: The Dev → Production Feedback Loop
[Offline evals] → [Ship] → [Production monitoring] → [New failures] → [Back to offline evals]
In development, you run evals against curated datasets. In production, you monitor live traffic for quality signals — user thumbs-down, escalations, anomalous tool call patterns. Every production failure that slips through becomes a new test case in the offline suite. The suite grows over time to cover your actual failure distribution.
This loop is what separates an eval suite that's useful at launch from one that's still useful six months later.
Sources
- Anthropic — Demystifying Evals for AI Agents
- Anthropic — Bloom: Open Source Tool for Automated Behavioral Evaluations
- RAGAS — Available Metrics Documentation
- Braintrust — Best LLM Evaluation Platforms 2025
- Evidently AI — LLM-as-a-Judge: A Complete Guide
- Confident AI — LLM Agent Evaluation: Complete Guide
- arXiv — Measuring Agents in Production