AI Engineering Curriculum
Phase 2: Single AI Agent Development

Module 2.4

RAG (Retrieval-Augmented Generation)

The Problem RAG Solves

A model knows what it was trained on. Nothing more.

Ask Claude about a news event from last week and it doesn't know. Ask it about your internal company docs and it doesn't know. Ask it about a codebase you haven't shown it and it doesn't know. The context window gives you one escape hatch — you can paste things in — but the context window is finite. You can't fit a 10,000-page knowledge base into a single prompt.

RAG is the solution. Instead of putting everything in the prompt, you build a searchable knowledge base and retrieve only the relevant pieces at query time. The agent asks a question, the relevant chunks are fetched and injected into the prompt, and Claude answers using that fresh context.

The key mental model: RAG turns an unbounded knowledge base into a targeted injection. The model never sees everything — it only ever sees the pieces most relevant to what it's currently working on.


The Four-Stage Pipeline

RAG has two distinct phases: ingestion (done once, upfront) and retrieval (done at query time, every time).

INGESTION (offline):
Documents → Chunk → Embed → Store in vector DB

RETRIEVAL (online, per query):
User query → Embed query → Search vector DB → Inject top chunks → Generate answer

Understanding these as two separate phases matters because they have different performance characteristics, different failure modes, and different optimization strategies.


Stage 1: Embeddings — Turning Text into Numbers

An embedding is a list of numbers — typically 1024 floating-point values — that encodes the meaning of a piece of text. Two sentences that mean the same thing, even if worded differently, will have embeddings that are mathematically close to each other. Two sentences about completely different topics will have embeddings far apart.

This is what makes semantic search possible. Instead of matching keywords, you're matching meaning.
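Under the hood, "mathematically close" usually means cosine similarity. A toy sketch in plain Python, with 3-dimensional vectors standing in for real 1024-dimensional embeddings (the values here are made up purely for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: two "similar" sentences vs. an unrelated one
refund_policy = [0.9, 0.1, 0.2]   # "We offer refunds within 30 days"
money_back    = [0.8, 0.2, 0.3]   # "You can get your money back in a month"
gpu_kernels   = [0.1, 0.9, 0.7]   # "CUDA kernels run on the GPU"

print(cosine_similarity(refund_policy, money_back))   # high: same meaning
print(cosine_similarity(refund_policy, gpu_kernels))  # low: different topic
```

Real embedding models produce these vectors for you; the comparison step is exactly this simple.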

Anthropic does not have its own embedding model. The recommended provider is Voyage AI, which Anthropic has partnered with:

| Model | Use for |
| --- | --- |
| voyage-3-large | Best quality, general purpose |
| voyage-3.5 | Balanced quality and cost |
| voyage-3.5-lite | Lowest latency and cost |
| voyage-code-3 | Code and technical documentation |
| voyage-finance-2 | Finance domain |
| voyage-law-2 | Legal domain |
```shell
pip install -U voyageai
export VOYAGE_API_KEY="your-key"
```
```python
import voyageai

vo = voyageai.Client()

# Embed documents at ingestion time
doc_embeddings = vo.embed(
    documents,
    model="voyage-3.5",
    input_type="document"  # important — tells the model these are docs to be retrieved
).embeddings

# Embed a query at retrieval time
query_embedding = vo.embed(
    [user_query],
    model="voyage-3.5",
    input_type="query"  # important — tells the model this is a search query
).embeddings[0]
```

The input_type parameter is not optional. When you specify "document", Voyage prepends "Represent the document for retrieval: " to the text before embedding. For "query", it prepends "Represent the query for retrieving supporting documents: ". These different prefixes produce better-separated vector spaces for retrieval. Omit input_type and your retrieval quality quietly degrades.

The rule you must never break: Always use the same embedding model for both ingestion and querying. The vector spaces are model-specific — mixing models makes the similarity scores meaningless.


Stage 2: Chunking — Splitting Documents Intelligently

Before you can embed a document, you have to split it into chunks. The model can't embed a 200-page PDF as a single vector — and even if it could, retrieving the whole document when you only need one paragraph is wasteful.

Fixed-size chunking is the simplest approach and a fine starting point:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # characters per chunk (use .from_tiktoken_encoder to count tokens instead)
    chunk_overlap=50,     # overlap between consecutive chunks
    add_start_index=True  # track position in original document
)
chunks = splitter.split_documents(docs)
```

The overlap matters. Without it, a sentence that spans a chunk boundary gets split in half — one half goes into one chunk, the other half into the next. Neither chunk contains the full sentence, so neither will be retrieved when someone searches for it. Overlap of 50–100 tokens is a reasonable default.
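To make the mechanics concrete, here is a minimal fixed-size splitter in plain Python. It counts characters rather than tokens, purely for illustration:

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks, stepping back `overlap` characters each time."""
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

text = "abcdefghij" * 10  # 100 characters of sample text
chunks = chunk_text(text, chunk_size=30, overlap=10)

# The last 10 characters of each chunk reappear at the start of the next,
# so content near a boundary always exists whole in at least one chunk.
assert chunks[0][-10:] == chunks[1][:10]
```

With zero overlap, the two assertions above would fail: boundary content would be split across chunks with no chunk holding it intact.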

Recursive chunking (the default in LangChain's RecursiveCharacterTextSplitter) is smarter — it tries to split on paragraph breaks first, then sentence breaks, then word breaks. This preserves more natural language structure than cutting at fixed character counts.

Semantic chunking uses the embedding model itself to find topic boundaries — where the similarity between consecutive sentences drops sharply. More expensive (requires embedding every sentence), but produces chunks that correspond to actual idea boundaries rather than arbitrary lengths. Worth it for dense technical or research documents.
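The boundary-detection idea can be sketched on precomputed sentence embeddings. The 2-dimensional vectors below are made up for illustration; in practice you would embed each sentence with your embedding model first:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_boundaries(sentence_embeddings: list[list[float]], threshold: float = 0.5) -> list[int]:
    """Return indices where similarity between consecutive sentences drops below
    the threshold; each is a candidate chunk boundary."""
    boundaries = []
    for i in range(len(sentence_embeddings) - 1):
        if cosine(sentence_embeddings[i], sentence_embeddings[i + 1]) < threshold:
            boundaries.append(i + 1)  # a new chunk starts at sentence i+1
    return boundaries

# Three sentences on topic A, then two on topic B
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
print(semantic_boundaries(embeddings))  # one boundary, where the topic shifts
```

The threshold is a tuning knob: lower it and you get fewer, larger chunks; raise it and you split more aggressively.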

Always add metadata to every chunk:

```python
{
    "content": "...chunk text...",
    "metadata": {
        "source": "product-manual-v3.pdf",
        "page": 12,
        "section": "Installation"
    }
}
```

The metadata travels with the chunk through the vector DB. When you retrieve it, Claude can cite the source — essential for any RAG system that needs to be trustworthy.
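One way to use that metadata at answer time is to label each retrieved chunk with a numbered source line that Claude can cite. A sketch, assuming the chunk shape shown above:

```python
def build_context(documents: list[str], metadatas: list[dict]) -> str:
    """Format retrieved chunks with numbered source labels for citation."""
    blocks = []
    for i, (doc, meta) in enumerate(zip(documents, metadatas), start=1):
        label = f"[{i}] {meta['source']}, p.{meta['page']} ({meta['section']})"
        blocks.append(f"{label}\n{doc}")
    return "\n\n".join(blocks)

context = build_context(
    ["Plug in the device before powering on."],
    [{"source": "product-manual-v3.pdf", "page": 12, "section": "Installation"}],
)
print(context)
```

You can then instruct Claude in the system prompt to cite sources by their bracketed numbers.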


Stage 3: Vector Databases — Storing and Searching Embeddings

A vector database stores embeddings and answers one question efficiently: "Given this query vector, which stored vectors are most similar?" That's nearest-neighbor search, and it's the core operation of every RAG system.
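At small scale, nearest-neighbor search is just "score every stored vector against the query and keep the top k"; a vector database does the same thing efficiently over millions of vectors using specialized indexes. A brute-force sketch with toy vectors:

```python
import math

def top_k(query: list[float], vectors: list[list[float]], k: int = 2) -> list[tuple[int, float]]:
    """Brute-force nearest-neighbor search by cosine similarity; returns (index, score) pairs."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

stored = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_k([0.9, 0.1], stored, k=2))  # the two vectors closest in direction to the query
```

Brute force is O(n) per query, which is fine for thousands of chunks; vector databases earn their keep when n reaches millions.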

| Database | Best for |
| --- | --- |
| ChromaDB | Local development, getting started fast — runs in-process |
| Pinecone | Production, managed, no infrastructure to run |
| Qdrant | Open-source, rich filtering, self-hosted or cloud |
| pgvector | Already using PostgreSQL — add vector search without a new system |
| FAISS | Research, batch processing, GPU acceleration — not a server |

For learning and prototyping, ChromaDB is the easiest starting point:

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("my-docs")

# Store chunks
collection.add(
    documents=[chunk.page_content for chunk in chunks],
    metadatas=[chunk.metadata for chunk in chunks],
    ids=[f"chunk-{i}" for i in range(len(chunks))]
)

# Search
results = collection.query(
    query_texts=[user_query],
    n_results=5
)
```

Stage 4: Retrieval and Generation

At query time, embed the user's question, find the most similar chunks, inject them into the prompt, and let Claude answer:

```python
import anthropic
import voyageai

vo = voyageai.Client()
client = anthropic.Anthropic()

def rag_query(user_question: str, collection) -> str:
    # 1. Embed the question
    query_embedding = vo.embed(
        [user_question], model="voyage-3.5", input_type="query"
    ).embeddings[0]

    # 2. Retrieve relevant chunks
    results = collection.query(query_embeddings=[query_embedding], n_results=5)
    context = "\n\n".join(results["documents"][0])

    # 3. Generate answer with context injected
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="Answer questions using only the provided context. If the answer isn't in the context, say so.",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {user_question}"
        }]
    )
    return response.content[0].text
```

Contextual Retrieval — Anthropic's September 2024 Innovation

Standard RAG has a subtle failure mode: chunks lose context when extracted from their source.

Imagine a chunk that says: "The mechanism was introduced in version 3.2 and deprecated in 4.0." Extracted from a 200-page manual, the chunk doesn't say which mechanism, so most of the queries that should retrieve it won't.

Contextual Retrieval fixes this by prepending a short AI-generated summary to each chunk before embedding, explaining where it fits in the broader document:

```python
import anthropic

client = anthropic.Anthropic()

def add_context_to_chunk(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheap model for this task
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Document:
<document>{document}</document>

Chunk to contextualize:
<chunk>{chunk}</chunk>

Write a short 1-2 sentence context explaining where this chunk fits in the document. Be concise."""
        }]
    )
    context = response.content[0].text
    return f"{context}\n\n{chunk}"  # prepend context, then original chunk
```

The result: each chunk carries its own context. The mechanism chunk becomes "This chunk describes the lifecycle of the authentication mechanism in the security module. The mechanism was introduced in version 3.2 and deprecated in 4.0."

According to Anthropic's own benchmarks, contextual embeddings reduce failed retrievals by 35% on their own, by 49% when combined with contextual BM25, and by 67% when re-ranking is added on top.


Where Things Go Wrong

Mixing embedding models. Embed your documents with voyage-3.5 and query with voyage-3-large and your similarity scores are garbage. The vector spaces are incompatible. Always use the same model, always.

Skipping input_type. Documents and queries need different input types. Without them, Voyage embeds both the same way and retrieval quality drops quietly — no error, just worse results.

Stale embeddings. If your source documents update, your embeddings are now stale. They still return results — just wrong ones. Build a re-ingestion pipeline from the start.
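One simple approach: store a content hash with each document at ingestion time and re-embed only when the hash changes. A minimal sketch, where stored_hashes stands in for metadata persisted in your vector DB:

```python
import hashlib

def content_hash(text: str) -> str:
    """Deterministic fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reingestion(doc_id: str, text: str, stored_hashes: dict[str, str]) -> bool:
    """True if the document is new or its content changed since last ingestion."""
    return stored_hashes.get(doc_id) != content_hash(text)

stored_hashes = {"manual": content_hash("v1 text")}

print(needs_reingestion("manual", "v1 text", stored_hashes))  # unchanged: no re-embed needed
print(needs_reingestion("manual", "v2 text", stored_hashes))  # edited: re-embed
```

When a document changes, delete its old chunk IDs from the collection before adding the new ones, or you will retrieve both versions.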

k too small or too large. k=1 misses relevant chunks. k=20 floods the context with noise and costs more tokens. Start at k=5 and tune based on answer quality.

Chunk size tradeoff. Large chunks → better recall (less likely to miss info), more noise per chunk. Small chunks → more precise, can lose context. 512 tokens with 50-token overlap is a proven starting point.


Sources