AI Engineering Curriculum
Phase 6: Advanced Topics

Module 6.2

Fine-Tuning

Most people fine-tune when they should be prompting. This is the most common expensive mistake in AI development. Fine-tuning takes days, costs money, and creates a model you now have to maintain. If a well-crafted system prompt or a RAG layer solves the problem, use that instead.

That said, when you actually need fine-tuning, nothing else substitutes for it. This module covers the decision, the technique, and the traps.

When Fine-Tuning Beats Prompting

Fine-tuning earns its cost in four situations:

Consistent domain behavior at scale. Your agent must reliably produce outputs in a specific format, tone, or vocabulary across thousands of calls. Prompted behavior is probabilistic; fine-tuning doesn't make it deterministic, but baking the behavior into the weights makes deviations far rarer. If you're generating structured JSON for a downstream system and format errors break things, fine-tuning for format consistency is often worth it.

Prompts have plateaued. You've tried every prompting technique. Few-shot examples, chain-of-thought, role specification — all of it. Performance is stuck below your threshold. Fine-tuning injects the capability into the model weights rather than relying on it to follow instructions at inference time.

Latency matters at scale. Fine-tuning can eliminate the need for long system prompts, RAG retrieval, and few-shot examples — all of which add tokens and latency. A fine-tuned model that knows your domain can respond faster and cheaper per call than a base model with a 2,000-token system prompt.
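The cost side of that trade is worth running for your own volumes. A back-of-the-envelope sketch, where the per-token price and call volume are made-up placeholders rather than any provider's actual rates:

```python
# Daily cost of a 2,000-token system prompt that a fine-tune would eliminate.
# Price and volume below are illustrative assumptions, not real rate-card numbers.
PRICE_PER_M_INPUT_TOKENS = 3.00   # dollars per million input tokens (assumed)

system_prompt_tokens = 2_000      # tokens the fine-tuned model no longer needs
calls_per_day = 100_000

extra_tokens_per_day = system_prompt_tokens * calls_per_day
extra_cost_per_day = extra_tokens_per_day / 1_000_000 * PRICE_PER_M_INPUT_TOKENS

print(f"{extra_tokens_per_day:,} extra input tokens/day")
print(f"${extra_cost_per_day:,.2f}/day, ${extra_cost_per_day * 30:,.2f}/month")
```

At these assumed numbers the prompt alone costs $600/day before any latency benefit is counted; plug in your own volumes before deciding.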

Real benchmark wins justify it. When Anthropic fine-tuned Claude 3 Haiku for a specific task, the fine-tuned version improved F1 score by 24.6% and outperformed the larger Claude 3.5 Sonnet base model by 9.9% on that task. That's the kind of win that justifies the investment — but you have to measure it.

When Prompting Beats Fine-Tuning

Prototyping and MVPs. No training infrastructure, no dataset curation, no retraining cycle. A prompt change ships in seconds. Build the thing first; fine-tune later if it's actually needed.

Frequently changing requirements. If your product's behavior needs to adapt weekly based on user feedback, fine-tuning is too slow. Prompts iterate instantly.

The task is within model capability. Don't fine-tune to teach a model something it can already do. If Claude Sonnet can produce the output you need with a clear, specific prompt, fine-tuning is wasted effort.

The hierarchy to follow: strong prompts → RAG for fresh/private data → fine-tuning only for persistent skills at scale.

LoRA and QLoRA — How Parameter-Efficient Fine-Tuning Works

Full fine-tuning updates every weight in the model — billions of parameters. For a 70B model, this requires enormous GPU clusters and costs thousands of dollars. LoRA makes this tractable.

The key mental model: instead of rewriting the model's entire brain, LoRA teaches it a new skill by attaching small "adapter" modules. The original weights are frozen. Only the adapters — which represent about 0.2–0.3% of total parameters — are trained. The result is competitive with full fine-tuning on most tasks, at a fraction of the cost.

Concretely: training a 7B model normally requires 100–120GB of GPU memory. With LoRA, that drops to 10–20GB — trainable on a single high-end GPU.
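The arithmetic behind those percentages is easy to verify. For a weight matrix of shape d_out × d_in, a rank-r adapter adds r·(d_in + d_out) trainable parameters while the original d_out·d_in stay frozen. A sketch with an illustrative 4096-wide layer, not exact Llama shapes:

```python
# Trainable-parameter fraction for a rank-16 LoRA on one projection matrix.
# Dimensions are illustrative (a 4096-wide transformer layer).
d_in, d_out, r = 4096, 4096, 16

frozen = d_in * d_out            # original weight W, never updated
adapter = r * (d_in + d_out)     # low-rank factors A (r x d_in) and B (d_out x r)

fraction = adapter / frozen
print(f"frozen: {frozen:,}  adapter: {adapter:,}  fraction: {fraction:.2%}")
```

That comes out under 1% per adapted matrix; embeddings and untouched layers dilute the whole-model fraction down to the 0.2–0.3% range quoted above.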

QLoRA goes further. It loads the base model weights in 4-bit quantized format (NF4, a compressed representation that preserves numerical precision where it matters most), then trains LoRA adapters in 16-bit. The memory savings are dramatic: a 65B-class model fits on a single 48GB GPU, and a 7B model trains comfortably on a consumer card like an RTX 4090. The accuracy trade-off vs standard LoRA is negligible in practice for most agent tasks.

Full fine-tuning:  all weights updated, 100-120GB VRAM for 7B model
LoRA:              0.3% of weights updated, 10-20GB VRAM for 7B model
QLoRA:             LoRA + 4-bit base model, ~8GB VRAM for 7B model
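The rows above follow from bytes-per-parameter arithmetic. A rough sketch using the usual fp16-plus-Adam accounting; activations and framework overhead, which push real usage up toward the quoted ranges, are excluded:

```python
# Rough VRAM floor for fine-tuning a 7B model (activations excluded).
params = 7e9

# Full fine-tuning: fp16 weights (2) + fp16 grads (2) + fp32 Adam moments (4 + 4)
# + fp32 master weights (4) = 16 bytes per parameter.
full_ft_gb = params * 16 / 1e9

# LoRA: the frozen fp16 base dominates; adapter grads/optimizer states are tiny.
lora_gb = params * 2 / 1e9

# QLoRA: 4-bit base weights at 0.5 bytes per parameter.
qlora_gb = params * 0.5 / 1e9

print(f"full: ~{full_ft_gb:.0f} GB  lora: ~{lora_gb:.0f} GB  qlora: ~{qlora_gb:.1f} GB")
```

These floors (112 / 14 / 3.5 GB) sit below the table's ranges because activations, gradients of adapter layers, and allocator overhead add the rest.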

Training Data Requirements

Volume: 500–5,000 high-quality instruction-response pairs is the practical range. Quality compounds. 1,000 carefully curated examples consistently outperform 10,000 mediocre ones. You don't need millions of examples — fine-tuning is not pretraining.

Format (instruction tuning, the standard for agent tasks):

JSON
{"instruction": "Analyze this customer support ticket and classify its priority", "input": "[ticket content]", "output": "[priority: HIGH. Reason: customer reporting data loss...]"}

Coverage: Include edge cases and failure modes, not just the happy path. For agent tasks, include examples of how to handle tool call errors, ambiguous inputs, and multi-step reasoning. Diversity across your domain matters — if you only train on easy examples, the model only improves on easy examples.

The catastrophic forgetting fix: mix 10–20% general instruction data into your domain-specific training set. Without this, the model slowly overwrites general capabilities to hyper-specialize, and starts struggling with tasks it used to handle easily.
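That mixing step is a few lines of code. A minimal sketch, where the dataset contents and exact ratio are placeholders:

```python
import random

def mix_datasets(domain_examples, general_examples, general_ratio=0.15, seed=0):
    """Blend general instruction data into a domain set to resist forgetting."""
    rng = random.Random(seed)
    # Solve for how many general examples make up `general_ratio` of the final mix.
    n_general = round(len(domain_examples) * general_ratio / (1 - general_ratio))
    mixed = domain_examples + rng.sample(general_examples, n_general)
    rng.shuffle(mixed)
    return mixed

domain = [{"instruction": f"domain task {i}"} for i in range(850)]
general = [{"instruction": f"general task {i}"} for i in range(5000)]

mixed = mix_datasets(domain, general, general_ratio=0.15)
print(len(mixed))  # 850 domain + 150 general = 1000
```

Fixing the random seed keeps the mix reproducible across retraining runs.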

Fine-Tuning Claude (2026 Status)

Anthropic's fine-tuning API is available for Claude 3 Haiku in preview status via Amazon Bedrock (US West Oregon region). Direct fine-tuning through Anthropic's native API is not yet available to general users.

If you need fine-tuning for production Claude agents today, the path is: Amazon Bedrock → Claude 3 Haiku → instruction-completion pairs → fine-tuned model endpoint. Context window up to 32K tokens.

For most teams that want fine-tuning right now, without waiting for expanded availability, fine-tuning an open-source model (Llama 3, Mistral, Qwen) locally is the more accessible path.

Open-Source Fine-Tuning Frameworks

Unsloth — the fastest LoRA training on NVIDIA GPUs, dead simple API. Best for getting started quickly and iterating fast.

Axolotl — battle-tested, production-grade. Handles LoRA, QLoRA, and DPO (Direct Preference Optimization — for alignment-style tuning). The serious practitioner's choice.

LLaMA-Factory — supports 100+ models with a unified interface. Strong for working across multiple base models.

HuggingFace PEFT — the official standard. More verbose but maximum control and ecosystem compatibility.

Gotchas

Catastrophic forgetting happens when training on a narrow domain gradually overwrites general capabilities. The model gets excellent at your task and worse at everything else. The fix: mix 10–20% general instruction data into training, and use lower learning rates (1e-4 for QLoRA vs 5e-4 for full fine-tuning).

Overfitting happens when the model memorizes training examples instead of generalizing from them. 500-example datasets are easy to memorize. The fix: 80/20 train/validation split, monitor validation loss during training, stop when validation loss starts rising (early stopping), add dropout in LoRA adapters.
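The early-stopping rule is simple enough to sketch without a framework (trainers such as Hugging Face's ship an equivalent callback; the loss curve below is invented for illustration):

```python
def should_stop(val_losses, patience=2):
    """Stop once validation loss has failed to improve for `patience` evals."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])
    return all(loss >= best for loss in val_losses[-patience:])

# Invented loss curve: improves, then starts rising -> overfitting.
history = [1.90, 1.42, 1.15, 1.08, 1.11, 1.19]
for epoch in range(1, len(history) + 1):
    if should_stop(history[:epoch]):
        print(f"early stop at epoch {epoch}")  # fires at epoch 6
        break
```

The `patience` window prevents stopping on a single noisy eval; two consecutive non-improving evals is a common starting point.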

The evaluation gap is the most dangerous: your fine-tuned model tests great on your evaluation set, then fails in production. This usually means your eval set was too similar to your training set — you tested memorization, not generalization. Fix: evaluate on tasks genuinely unseen during training, test full multi-step agent workflows (not just single-turn QA), and compare against the base model on standard benchmarks to verify you haven't regressed on general capability.
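One cheap guard against that gap is measuring how much the eval set near-duplicates the training set before trusting eval numbers. A sketch using word-level Jaccard similarity, where the 0.8 threshold is an arbitrary starting point rather than a standard:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level set overlap between two texts, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def leakage_report(train_texts, eval_texts, threshold=0.8):
    """Count eval examples suspiciously similar to some training example."""
    leaked = sum(
        1 for e in eval_texts
        if any(jaccard(e, t) >= threshold for t in train_texts)
    )
    return leaked, len(eval_texts)

train = ["classify the priority of this ticket about data loss"]
evals = [
    "classify the priority of this ticket about data loss",  # near-duplicate
    "draft a refund policy response for an upset customer",  # genuinely unseen
]
leaked, total = leakage_report(train, evals)
print(f"{leaked}/{total} eval examples overlap training data")  # 1/2
```

For large datasets, swap the pairwise scan for MinHash or embedding-based deduplication; the principle is the same.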

SFT vs DPO — Two Different Training Objectives

All the techniques above (LoRA, QLoRA) are infrastructure. The training objective — what you're actually optimizing for — is a separate decision.

Supervised Fine-Tuning (SFT) is the default. You provide input→output pairs. The model learns to produce your preferred outputs. SFT is what you want when you have clear right answers: "given this ticket, classify it as HIGH/MEDIUM/LOW." Deterministic, measurable, straightforward.

Direct Preference Optimization (DPO) is the alternative for style and alignment. Instead of right/wrong pairs, you provide chosen/rejected pairs — "given this prompt, I prefer response A over response B." The model learns your preferences without a reward model. Use DPO when you're trying to shape output style, tone, safety behavior, or helpfulness — anything subjective where there's no single "correct" answer.

Python
# DPO training data format
dataset = [
    {
        "prompt": "Summarize this legal document:",
        "chosen": "The agreement grants exclusive license...",  # preferred response
        "rejected": "This legal document says that...",         # worse response
    }
]

The practical hierarchy: SFT first to teach the domain. DPO second if the style/tone/safety needs further shaping after SFT. Most production fine-tunes only need SFT — DPO is for when the behavior is directionally right but you need preference-level refinement.

A Real Unsloth Fine-Tuning Run

End-to-end code to fine-tune a 7B model with QLoRA on a single GPU using Unsloth:

Python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# 1. Load model with 4-bit quantization (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.1-8B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,  # QLoRA — fits on 8GB VRAM
)

# 2. Add LoRA adapters (the 0.3% that gets trained)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # rank — higher = more params = more capacity
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
)

# 3. Format your dataset
def format_prompt(example):
    return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""

dataset = load_dataset("json", data_files="my_dataset.jsonl")["train"]
dataset = dataset.map(lambda x: {"text": format_prompt(x)})

# 4. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch = 8
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        save_strategy="epoch",
        output_dir="./output",
    ),
)
trainer.train()

# 5. Save just the LoRA adapter weights (~50MB, not 15GB)
model.save_pretrained("my-lora-adapter")
tokenizer.save_pretrained("my-lora-adapter")

Total training time for 1,000 examples on an RTX 4090: roughly 15–30 minutes. The saved adapter is ~50MB and loads on top of the base model at inference time.
