Module 2.4
RAG (Retrieval-Augmented Generation)
The Problem RAG Solves
A model knows what it was trained on. Nothing more.
Ask Claude about a news event from last week and it doesn't know. Ask it about your internal company docs and it doesn't know. Ask it about a codebase you haven't shown it and it doesn't know. The context window gives you one escape hatch — you can paste things in — but the context window is finite. You can't fit a 10,000-page knowledge base into a single prompt.
RAG is the solution. Instead of putting everything in the prompt, you build a searchable knowledge base and retrieve only the relevant pieces at query time. The agent asks a question, the relevant chunks are fetched and injected into the prompt, and Claude answers using that fresh context.
The key mental model: RAG turns an unbounded knowledge base into a targeted injection. The model never sees everything — it only ever sees the pieces most relevant to what it's currently working on.
The Four-Stage Pipeline
RAG has two distinct phases: ingestion (done once, upfront) and retrieval (done at query time, every time).
INGESTION (offline):
Documents → Chunk → Embed → Store in vector DB
RETRIEVAL (online, per query):
User query → Embed query → Search vector DB → Inject top chunks → Generate answer
Understanding these as two separate phases matters because they have different performance characteristics, different failure modes, and different optimization strategies.
Stage 1: Embeddings — Turning Text into Numbers
An embedding is a list of numbers — typically 1024 floating-point values — that encodes the meaning of a piece of text. Two sentences that mean the same thing, even if worded differently, will have embeddings that are mathematically close to each other. Two sentences about completely different topics will have embeddings far apart.
This is what makes semantic search possible. Instead of matching keywords, you're matching meaning.
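"Close" here is usually measured with cosine similarity. A toy sketch with hand-made 3-dimensional vectors (real embeddings have ~1024 dimensions, but the math is identical; these numbers are illustrative, not real model output):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: near 1.0 = same meaning,
    # near 0.0 = unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for embeddings (real ones come from the embedding model)
refund_policy = np.array([0.9, 0.1, 0.2])
money_back    = np.array([0.8, 0.2, 0.3])   # similar meaning, different words
gpu_drivers   = np.array([0.1, 0.9, 0.1])   # unrelated topic

print(cosine_similarity(refund_policy, money_back))   # high
print(cosine_similarity(refund_policy, gpu_drivers))  # low
```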
Anthropic does not have its own embedding model. The recommended provider is Voyage AI, which Anthropic has partnered with:
| Model | Use for |
|---|---|
| voyage-3-large | Best quality, general purpose |
| voyage-3.5 | Balanced quality and cost |
| voyage-3.5-lite | Lowest latency and cost |
| voyage-code-3 | Code and technical documentation |
| voyage-finance-2 | Finance domain |
| voyage-law-2 | Legal domain |
```shell
pip install -U voyageai
export VOYAGE_API_KEY="your-key"
```

```python
import voyageai

vo = voyageai.Client()

# Embed documents at ingestion time
doc_embeddings = vo.embed(
    documents,
    model="voyage-3.5",
    input_type="document"  # important — tells the model these are docs to be retrieved
).embeddings

# Embed a query at retrieval time
query_embedding = vo.embed(
    [user_query],
    model="voyage-3.5",
    input_type="query"  # important — tells the model this is a search query
).embeddings[0]
```

The input_type parameter is not optional. When you specify "document", Voyage prepends "Represent the document for retrieval: " to the text before embedding. For "query", it prepends "Represent the query for retrieving supporting documents: ". These different prefixes produce better-separated vector spaces for retrieval. Omit input_type and your retrieval quality quietly degrades.
The rule you must never break: Always use the same embedding model for both ingestion and querying. The vector spaces are model-specific — mixing models makes the similarity scores meaningless.
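One cheap way to enforce the rule (a defensive pattern, not an official API; the helper and metadata key are hypothetical): stamp the collection with the embedding model's name at ingestion time and check it before every query.

```python
EMBED_MODEL = "voyage-3.5"

def assert_same_model(collection_metadata: dict, query_model: str = EMBED_MODEL) -> None:
    # Refuse to search if the stored vectors came from a different model
    stored = collection_metadata.get("embedding_model")
    if stored != query_model:
        raise ValueError(
            f"Collection embedded with {stored!r} but query uses {query_model!r}; "
            "similarity scores across models are meaningless."
        )

# At ingestion time, record the model in the collection's metadata:
meta = {"embedding_model": EMBED_MODEL}
assert_same_model(meta)  # passes; a mismatch raises before any bad search runs
```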
Stage 2: Chunking — Splitting Documents Intelligently
Before you can embed a document, you have to split it into chunks. The model can't embed a 200-page PDF as a single vector — and even if it could, retrieving the whole document when you only need one paragraph is wasteful.
Fixed-size chunking is the simplest approach and a fine starting point:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # characters per chunk by default; pass a token counter for token-based sizing
    chunk_overlap=50,      # overlap between consecutive chunks
    add_start_index=True   # track position in original document
)
chunks = splitter.split_documents(docs)
```

The overlap matters. Without it, a sentence that spans a chunk boundary gets split in half: one half goes into one chunk, the other half into the next. Neither chunk contains the full sentence, so neither will be retrieved when someone searches for it. Overlap of 50–100 tokens is a reasonable default.
Recursive chunking (the default in LangChain's RecursiveCharacterTextSplitter) is smarter — it tries to split on paragraph breaks first, then sentence breaks, then word breaks. This preserves more natural language structure than cutting at fixed character counts.
Semantic chunking uses the embedding model itself to find topic boundaries — where the similarity between consecutive sentences drops sharply. More expensive (requires embedding every sentence), but produces chunks that correspond to actual idea boundaries rather than arbitrary lengths. Worth it for dense technical or research documents.
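A minimal sketch of the idea, using toy sentence vectors in place of real embeddings (the boundary rule is the point; the helper name and threshold are illustrative):

```python
import numpy as np

def semantic_chunks(sentences, embeddings, threshold=0.5):
    # Start a new chunk wherever similarity between consecutive sentences drops sharply
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = np.asarray(embeddings[i - 1]), np.asarray(embeddings[i])
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim < threshold:              # sharp drop -> topic boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sentences = ["Install the package.", "Then run the setup script.", "Billing is monthly."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy: first two similar, third unrelated
print(semantic_chunks(sentences, vectors))
```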
Always add metadata to every chunk:
```json
{
  "content": "...chunk text...",
  "metadata": {
    "source": "product-manual-v3.pdf",
    "page": 12,
    "section": "Installation"
  }
}
```

The metadata travels with the chunk through the vector DB. When you retrieve it, Claude can cite the source — essential for any RAG system that needs to be trustworthy.
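A small helper (hypothetical, assuming ChromaDB's query-result shape of parallel documents/metadatas lists) can turn retrieved chunks into a context block with inline source labels that Claude can cite:

```python
def format_context(results: dict) -> str:
    # ChromaDB returns parallel lists, one inner list per query
    blocks = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        label = meta.get("source", "unknown")
        if meta.get("page") is not None:
            label += f", p. {meta['page']}"
        blocks.append(f"[Source: {label}]\n{doc}")
    return "\n\n".join(blocks)

fake_results = {
    "documents": [["Connect the power cable before booting."]],
    "metadatas": [[{"source": "product-manual-v3.pdf", "page": 12}]],
}
print(format_context(fake_results))
```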
Stage 3: Vector Databases — Storing and Searching Embeddings
A vector database stores embeddings and answers one question efficiently: "Given this query vector, which stored vectors are most similar?" That's nearest-neighbor search, and it's the core operation of every RAG system.
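The operation itself is simple enough to sketch by hand; what a dedicated database adds is approximate indexing (HNSW and similar) so it stays fast at millions of vectors. A brute-force NumPy version, fine for small collections:

```python
import numpy as np

def top_k_cosine(query, vectors, k=5):
    # Normalize both sides, then a dot product is cosine similarity;
    # argsort on negated scores gives the top-k indices
    q = np.asarray(query, dtype=float)
    v = np.asarray(vectors, dtype=float)
    q = q / np.linalg.norm(q)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = v @ q
    idx = np.argsort(-sims)[:k]
    return idx.tolist(), sims[idx].tolist()

stored = [[1, 0], [0, 1], [0.9, 0.4]]
idx, scores = top_k_cosine([1, 0.1], stored, k=2)
print(idx)  # indices of the two closest stored vectors
```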
| Database | Best for |
|---|---|
| ChromaDB | Local development, getting started fast — runs in-process |
| Pinecone | Production, managed, no infrastructure to run |
| Qdrant | Open-source, rich filtering, self-hosted or cloud |
| pgvector | Already using PostgreSQL — add vector search without a new system |
| FAISS | Research, batch processing, GPU acceleration — not a server |
For learning and prototyping, ChromaDB is the easiest starting point:
```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("my-docs")

# Store chunks: pass the Voyage vectors explicitly, otherwise Chroma embeds
# the text with its own default model and you end up mixing vector spaces
collection.add(
    documents=[chunk.page_content for chunk in chunks],
    embeddings=doc_embeddings,
    metadatas=[chunk.metadata for chunk in chunks],
    ids=[f"chunk-{i}" for i in range(len(chunks))]
)

# Search with a query vector from the same Voyage model
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5
)
```

Stage 4: Retrieval and Generation
At query time, embed the user's question, find the most similar chunks, inject them into the prompt, and let Claude answer:
```python
import anthropic
import voyageai

vo = voyageai.Client()
client = anthropic.Anthropic()

def rag_query(user_question: str, collection) -> str:
    # 1. Embed the question
    query_embedding = vo.embed(
        [user_question], model="voyage-3.5", input_type="query"
    ).embeddings[0]

    # 2. Retrieve relevant chunks
    results = collection.query(query_embeddings=[query_embedding], n_results=5)
    context = "\n\n".join(results["documents"][0])

    # 3. Generate answer with context injected
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="Answer questions using only the provided context. If the answer isn't in the context, say so.",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {user_question}"
        }]
    )
    return response.content[0].text
```

Contextual Retrieval — Anthropic's September 2024 Innovation
Standard RAG has a subtle failure mode: chunks lose context when extracted from their source.
Imagine a chunk that says: "The mechanism was introduced in version 3.2 and deprecated in 4.0." Extracted from a 200-page manual, the chunk doesn't say which mechanism. A query like "when was the authentication mechanism deprecated?" will never retrieve it, because the chunk itself never names the mechanism.
Contextual Retrieval fixes this by prepending a short AI-generated summary to each chunk before embedding, explaining where it fits in the broader document:
```python
def add_context_to_chunk(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheap model for this task
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Document:
<document>{document}</document>

Chunk to contextualize:
<chunk>{chunk}</chunk>

Write a short 1-2 sentence context explaining where this chunk fits in the document. Be concise."""
        }]
    )
    context = response.content[0].text
    return f"{context}\n\n{chunk}"  # prepend context, then original chunk
```

The result: each chunk carries its own context. The mechanism chunk becomes "This chunk describes the lifecycle of the authentication mechanism in the security module. The mechanism was introduced in version 3.2 and deprecated in 4.0."
According to Anthropic's own benchmarks, Contextual Retrieval (contextual embeddings combined with contextual BM25) reduces failed retrievals by 49%, and by 67% when a re-ranking step is added.
Where Things Go Wrong
Mixing embedding models. Embed your documents with voyage-3.5 and query with voyage-3-large and your similarity scores are garbage. The vector spaces are incompatible. Always use the same model, always.
Skipping input_type. Documents and queries need different input types. Without them, Voyage embeds both the same way and retrieval quality drops quietly — no error, just worse results.
Stale embeddings. If your source documents update, your embeddings are now stale. They still return results — just wrong ones. Build a re-ingestion pipeline from the start.
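One common pattern for that pipeline (a sketch, not from the source; names are illustrative): derive chunk IDs from a content hash, so re-ingestion becomes a set diff. Unchanged chunks are skipped, edited chunks get new IDs, and orphaned IDs are deleted from the vector DB.

```python
import hashlib

def chunk_id(source: str, text: str) -> str:
    # Same content -> same ID; any edit -> a different ID
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{source}:{digest}"

def plan_reingestion(stored_ids: set, fresh_chunks: list) -> tuple:
    fresh_ids = {chunk_id(src, txt) for src, txt in fresh_chunks}
    to_add = fresh_ids - stored_ids      # new or edited: embed and upsert
    to_delete = stored_ids - fresh_ids   # stale: remove from the vector DB
    return to_add, to_delete

old = {chunk_id("manual.pdf", "Install step one."), chunk_id("manual.pdf", "Old pricing.")}
new = [("manual.pdf", "Install step one."), ("manual.pdf", "New pricing.")]
to_add, to_delete = plan_reingestion(old, new)
print(len(to_add), len(to_delete))  # 1 1
```

Only the changed chunk gets re-embedded, which keeps re-ingestion cheap even for large corpora.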
k too small or too large. k=1 misses relevant chunks. k=20 floods the context with noise and costs more tokens. Start at k=5 and tune based on answer quality.
Chunk size tradeoff. Large chunks → better recall (less likely to miss info), more noise per chunk. Small chunks → more precise, can lose context. 512 tokens with 50-token overlap is a proven starting point.
Sources
- Anthropic Embeddings Guide (official)
- Anthropic — Contextual Retrieval (official, September 2024)
- Voyage AI Documentation