Module 6.2
Fine-Tuning
Most people fine-tune when they should be prompting. This is the most common expensive mistake in AI development. Fine-tuning takes days, costs money, and creates a model you now have to maintain. If a well-crafted system prompt or a RAG layer solves the problem, use that instead.
That said, when you actually need fine-tuning, nothing else substitutes for it. This module covers the decision, the technique, and the traps.
When Fine-Tuning Beats Prompting
Fine-tuning earns its cost in four situations:
Consistent domain behavior at scale. Your agent must reliably produce outputs in a specific format, tone, or vocabulary across thousands of calls. Prompted behavior is probabilistic; fine-tuning bakes the behavior into the weights, making it far more consistent. If you're generating structured JSON for a downstream system and format errors break things, fine-tuning for format consistency is often worth it.
Prompts have plateaued. You've tried every prompting technique. Few-shot examples, chain-of-thought, role specification — all of it. Performance is stuck below your threshold. Fine-tuning injects the capability into the model weights rather than relying on it to follow instructions at inference time.
Latency matters at scale. Fine-tuning can eliminate the need for long system prompts, RAG retrieval, and few-shot examples — all of which add tokens and latency. A fine-tuned model that knows your domain can respond faster and cheaper per call than a base model with a 2,000-token system prompt.
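The cost side of that claim is easy to estimate. A back-of-envelope sketch; the function name and the per-million-token price are illustrative assumptions, not any provider's rate card:

```python
# Back-of-envelope: input-token cost saved by replacing a long system
# prompt with a fine-tuned model. The price below is hypothetical.

def prompt_savings(system_prompt_tokens: int, calls: int,
                   usd_per_million_input_tokens: float) -> float:
    """Dollars saved by dropping the system prompt from every call."""
    saved_tokens = system_prompt_tokens * calls
    return saved_tokens / 1_000_000 * usd_per_million_input_tokens

# A 2,000-token system prompt eliminated across 1M calls, at a
# hypothetical $0.25 per million input tokens:
print(prompt_savings(2_000, 1_000_000, 0.25))  # → 500.0
```

And that is before counting the latency of processing those 2,000 extra tokens on every single request.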
Real benchmark wins justify it. When Anthropic fine-tuned Claude 3 Haiku for a specific task, the fine-tuned version improved F1 score by 24.6% and outperformed the larger Claude 3.5 Sonnet base model by 9.9% on that task. That's the kind of win that justifies the investment — but you have to measure it.
When Prompting Beats Fine-Tuning
Prototyping and MVPs. No training infrastructure, no dataset curation, no retraining cycle. A prompt change ships in seconds. Build the thing first; fine-tune later if it's actually needed.
Frequently changing requirements. If your product's behavior needs to adapt weekly based on user feedback, fine-tuning is too slow. Prompts iterate instantly.
The task is within model capability. Don't fine-tune to teach a model something it can already do. If Claude Sonnet can produce the output you need with a clear, specific prompt, fine-tuning is wasted effort.
The hierarchy to follow: strong prompts → RAG for fresh/private data → fine-tuning only for persistent skills at scale.
LoRA and QLoRA — How Parameter-Efficient Fine-Tuning Works
Full fine-tuning updates every weight in the model — billions of parameters. For a 70B model, this requires enormous GPU clusters and costs thousands of dollars. LoRA makes this tractable.
The key mental model: instead of rewriting the model's entire brain, LoRA teaches it a new skill by attaching small "adapter" modules. The original weights are frozen. Only the adapters — which represent about 0.2–0.3% of total parameters — are trained. The result is competitive with full fine-tuning on most tasks, at a fraction of the cost.
Concretely: training a 7B model normally requires 100–120GB of GPU memory. With LoRA, that drops to 10–20GB — trainable on a single high-end GPU.
QLoRA goes further. It loads the base model weights in 4-bit quantized format (NF4 — a compressed representation that preserves numerical precision where it matters most), then trains LoRA adapters in 16-bit. The memory savings are dramatic: the QLoRA paper fine-tuned a 65B model on a single 48GB GPU, and a 7B model fits comfortably on a consumer card. The accuracy trade-off vs standard LoRA is negligible in practice for most agent tasks.
- Full fine-tuning: all weights updated, 100–120GB VRAM for a 7B model
- LoRA: ~0.3% of weights updated, 10–20GB VRAM for a 7B model
- QLoRA: LoRA + 4-bit base model, ~8GB VRAM for a 7B model
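The ~0.3% figure can be derived by counting adapter parameters directly. A sketch using standard Llama-7B layer shapes and an illustrative rank of r=8 — LoRA adds r × (d_in + d_out) parameters per adapted linear layer:

```python
# Rough parameter count for LoRA adapters on a Llama-7B-shaped model.
# Layer dimensions are the standard Llama-7B shapes; rank r=8 is illustrative.

hidden, inter, layers, vocab = 4096, 11008, 32, 32000
r = 8  # LoRA rank

# Each adapted linear layer W (d_out x d_in) gains two small matrices:
# A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) extra params.
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

per_layer_lora = (
    4 * lora_params(hidden, hidden, r)    # q/k/v/o projections
    + 2 * lora_params(hidden, inter, r)   # gate/up projections
    + lora_params(inter, hidden, r)       # down projection
)
total_lora = layers * per_layer_lora

# The frozen base model, counting the same linear layers plus embeddings/head:
per_layer_base = 4 * hidden * hidden + 3 * hidden * inter
total_base = layers * per_layer_base + 2 * vocab * hidden

print(f"LoRA params: {total_lora / 1e6:.1f}M")           # → LoRA params: 20.0M
print(f"Fraction trained: {total_lora / total_base:.2%}")  # → Fraction trained: 0.30%
```

Doubling the rank to r=16 (as in the training run later in this module) roughly doubles the adapter size — still well under 1% of the model.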
Training Data Requirements
Volume: 500–5,000 high-quality instruction-response pairs is the practical range. Quality compounds. 1,000 carefully curated examples consistently outperform 10,000 mediocre ones. You don't need millions of examples — fine-tuning is not pretraining.
Format (instruction tuning, the standard for agent tasks):
{"instruction": "Analyze this customer support ticket and classify its priority",
 "input": "[ticket content]",
 "output": "[priority: HIGH. Reason: customer reporting data loss...]"}

Coverage: Include edge cases and failure modes, not just the happy path. For agent tasks, include examples of how to handle tool call errors, ambiguous inputs, and multi-step reasoning. Diversity across your domain matters — if you only train on easy examples, the model only improves on easy examples.
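Cheap insurance before a training run: check that every line of the dataset file parses as JSON and carries the three expected keys. A minimal sketch; the function name is illustrative:

```python
# Sanity-check an instruction-tuning JSONL dataset: every line must parse
# as JSON and contain the instruction/input/output keys.

import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_jsonl(lines):
    """Return (valid_count, list of (line_no, error)) for a JSONL dataset."""
    errors = []
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append((i, f"bad JSON: {e}"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            errors.append((i, f"missing keys: {sorted(missing)}"))
    return len(lines) - len(errors), errors

good = '{"instruction": "Classify this ticket", "input": "...", "output": "LOW"}'
bad = '{"instruction": "Classify this ticket"}'
print(validate_jsonl([good, bad]))  # 1 valid record; line 2 flagged for missing keys
```

A few seconds of validation beats discovering a malformed line halfway through a GPU run.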
The catastrophic forgetting fix: mix 10–20% general instruction data into your domain-specific training set. Without this, the model slowly overwrites general capabilities to hyper-specialize, and starts struggling with tasks it used to handle easily.
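The mixing step itself is a few lines. A sketch with illustrative names and a 15% default; the arithmetic just keeps general data at the target fraction of the final set:

```python
# Sketch of the catastrophic-forgetting fix: blend sampled general
# instruction data into the domain set at a target fraction.

import random

def mix_datasets(domain, general, general_fraction=0.15, seed=42):
    """Return domain examples plus enough sampled general examples that
    general data makes up roughly `general_fraction` of the final set."""
    n_general = int(len(domain) * general_fraction / (1 - general_fraction))
    rng = random.Random(seed)
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed

# 100 domain examples + 17 general examples ≈ 15% general data:
print(len(mix_datasets(list(range(100)), list(range(1000, 1100)))))  # → 117
```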
Fine-Tuning Claude (2026 Status)
Fine-tuning for Claude 3 Haiku is generally available via Amazon Bedrock (US West (Oregon) Region). Direct fine-tuning through Anthropic's native API is not available to general users.
If you need fine-tuning for production Claude agents today, the path is: Amazon Bedrock → Claude 3 Haiku → instruction-completion pairs → fine-tuned model endpoint. Context window up to 32K tokens.
For most use cases where you want fine-tuning right now without waiting for expanded availability, open-source models (Llama 3, Mistral, Qwen) fine-tuned locally are the more accessible path.
Open-Source Fine-Tuning Frameworks
Unsloth — the fastest LoRA training on NVIDIA GPUs, dead simple API. Best for getting started quickly and iterating fast.
Axolotl — battle-tested, production-grade. Handles LoRA, QLoRA, and DPO (Direct Preference Optimization — for alignment-style tuning). The serious practitioner's choice.
LLaMA-Factory — supports 100+ models with a unified interface. Strong for working across multiple base models.
HuggingFace PEFT — the official standard. More verbose but maximum control and ecosystem compatibility.
Gotchas
Catastrophic forgetting happens when training on a narrow domain gradually overwrites general capabilities. The model gets excellent at your task and worse at everything else. The fix: mix 10–20% general instruction data into training, and use lower learning rates (1e-4 for QLoRA vs 5e-4 for full fine-tuning).
Overfitting happens when the model memorizes training examples instead of generalizing from them. 500-example datasets are easy to memorize. The fix: 80/20 train/validation split, monitor validation loss during training, stop when validation loss starts rising (early stopping), add dropout in LoRA adapters.
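The early-stopping rule is simple enough to state in code. Trainer frameworks implement this for you; this standalone sketch (function name illustrative) just shows the logic:

```python
# Early stopping in miniature: stop when validation loss has not improved
# for `patience` consecutive evaluations.

def early_stop_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which to stop, or None if the loss
    was still improving (or within patience) when training ended."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return None

# Validation loss bottoms out at epoch 3, then rises for two epochs:
print(early_stop_epoch([0.9, 0.7, 0.6, 0.65, 0.72]))  # → 5
```

The model checkpoint you keep is the one from the best epoch (epoch 3 here), not the one where you stopped.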
The evaluation gap is the most dangerous: your fine-tuned model tests great on your evaluation set, then fails in production. This usually means your eval set was too similar to your training set — you tested memorization, not generalization. Fix: evaluate on tasks genuinely unseen during training, test full multi-step agent workflows (not just single-turn QA), and compare against the base model on standard benchmarks to verify you haven't regressed on general capability.
SFT vs DPO — Two Different Training Objectives
All the techniques above (LoRA, QLoRA) are infrastructure. The training objective — what you're actually optimizing for — is a separate decision.
Supervised Fine-Tuning (SFT) is the default. You provide input→output pairs. The model learns to produce your preferred outputs. SFT is what you want when you have clear right answers: "given this ticket, classify it as HIGH/MEDIUM/LOW." Deterministic, measurable, straightforward.
Direct Preference Optimization (DPO) is the alternative for style and alignment. Instead of right/wrong pairs, you provide chosen/rejected pairs — "given this prompt, I prefer response A over response B." The model learns your preferences without a reward model. Use DPO when you're trying to shape output style, tone, safety behavior, or helpfulness — anything subjective where there's no single "correct" answer.
# DPO training data format
dataset = [
    {
        "prompt": "Summarize this legal document:",
        "chosen": "The agreement grants exclusive license...",  # preferred response
        "rejected": "This legal document says that...",  # worse response
    }
]

The practical hierarchy: SFT first to teach the domain. DPO second if the style/tone/safety needs further shaping after SFT. Most production fine-tunes only need SFT — DPO is for when the behavior is directionally right but you need preference-level refinement.
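For intuition, the DPO objective itself is compact: the negative log-sigmoid of the policy's preference margin over the frozen reference model, scaled by β. A minimal numeric sketch with made-up sequence log-probabilities:

```python
# The DPO loss for a single chosen/rejected pair, given sequence
# log-probabilities under the trained policy and the frozen reference
# model. The log-prob values below are made up for illustration.

import math

def dpo_loss(policy_chosen, policy_rejected,
             ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)), where the margin is how much more the
    policy prefers chosen over rejected, relative to the reference."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy already prefers the chosen response more than the reference does
# (positive margin), so the loss falls below log(2) ≈ 0.693, its value at
# zero margin:
print(round(dpo_loss(-12.0, -20.0, -14.0, -19.0), 4))  # → 0.5544
```

Minimizing this pushes the margin up — the policy shifts probability mass toward chosen responses without ever seeing a scalar reward.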
A Real Unsloth Fine-Tuning Run
End-to-end code to fine-tune a 7B model with QLoRA on a single GPU using Unsloth:
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
# 1. Load model with 4-bit quantization (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.1-8B-Instruct",
max_seq_length=4096,
load_in_4bit=True, # QLoRA — fits on 8GB VRAM
)
# 2. Add LoRA adapters (the 0.3% that gets trained)
model = FastLanguageModel.get_peft_model(
model,
r=16, # rank — higher = more params = more capacity
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=32,
lora_dropout=0.05,
bias="none",
)
# 3. Format your dataset
def format_prompt(example):
return f"""### Instruction:
{example['instruction']}
### Input:
{example['input']}
### Response:
{example['output']}"""
dataset = load_dataset("json", data_files="my_dataset.jsonl")["train"]
dataset = dataset.map(lambda x: {"text": format_prompt(x)})
# 4. Train
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=4096,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # effective batch = 8
num_train_epochs=3,
learning_rate=2e-4,
fp16=True,
logging_steps=10,
save_strategy="epoch",
output_dir="./output",
),
)
trainer.train()
# 5. Save just the LoRA adapter weights (~50MB, not 15GB)
model.save_pretrained("my-lora-adapter")
tokenizer.save_pretrained("my-lora-adapter")

Total training time for 1,000 examples on an RTX 4090: roughly 15–30 minutes. The saved adapter is ~50MB and loads on top of the base model at inference time.
Sources
- Anthropic — Fine-Tune Claude 3 Haiku on Amazon Bedrock (GA)
- NVIDIA — How to Fine-Tune an LLM with Unsloth
- SuperAnnotate — Fine-Tuning LLMs in 2025
- Pieces — Claude Fine-Tuning: A Complete Guide
- Index.dev — LoRA vs QLoRA: Best AI Fine-Tuning Tools 2026