Fine-Tuning
You have a powerful foundation model. It can write code, answer questions, and summarize documents. But it uses an inconsistent tone, doesn’t follow your company’s specific format, sometimes refuses tasks it shouldn’t, and doesn’t know your domain-specific terminology. Fine-tuning is how you fix that.
Fine-Tuning vs. Prompting vs. RAG
Before spending money on fine-tuning, ask yourself whether it’s actually the right tool:
| Approach | Best For | Cost | Latency | Effort |
|---|---|---|---|---|
| Prompting | Format/style changes, general tasks | Low | Baseline | Low |
| RAG | Knowledge injection, up-to-date facts | Medium | Slightly higher | Medium |
| Fine-tuning | Style, behavior, format consistency | High (upfront) | Can be lower | High |
| Fine-tuning + RAG | Complex domain apps | Highest | Variable | Highest |
Fine-tuning wins when:
- The model needs to output in a very specific format consistently
- You want to teach a writing style or persona that’s hard to specify in prompts
- Latency matters and you need to shorten system prompts
- Your proprietary domain data is too sensitive to send to a third-party API, so you need a self-hosted model adapted to your domain
- The task requires capabilities the base model genuinely lacks
Fine-tuning loses when:
- The issue is knowledge cutoff (use RAG)
- A better prompt would solve it (try prompt engineering first)
- You have fewer than ~100 high-quality examples (likely to overfit)
Supervised Fine-Tuning (SFT)
The simplest form: show the model examples of input → desired output, and train it to reproduce those outputs.
Example training pair:

    {
      "instruction": "Summarize this legal contract in plain English.",
      "input": "WHEREAS Party A agrees to... [1200 words of legalese]",
      "output": "This contract is a service agreement between Company X and Vendor Y for software development services running from Jan 2025 to Dec 2025, at a total cost of $240,000..."
    }

Training on a few thousand such examples, with a small learning rate (so you don’t destroy the base model’s capabilities), will make the model reliably follow your desired output format and style.
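SFT datasets in this shape are commonly stored as JSON Lines, one record per line. The sketch below validates and serializes records, assuming the three field names from the example pair; the helper names `validate_record` and `to_jsonl` are hypothetical and not part of any library.

```python
import json

# Field names taken from the example training pair above.
REQUIRED_KEYS = ("instruction", "input", "output")


def validate_record(record: dict) -> None:
    """Raise ValueError if a field is missing or the instruction/output is empty."""
    for key in REQUIRED_KEYS:
        if key not in record:
            raise ValueError(f"missing field: {key}")
    if not record["instruction"].strip() or not record["output"].strip():
        raise ValueError("instruction and output must be non-empty")


def to_jsonl(records: list[dict]) -> str:
    """Serialize validated records as JSON Lines (one JSON object per line)."""
    lines = []
    for record in records:
        validate_record(record)
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


example = {
    "instruction": "Summarize this legal contract in plain English.",
    "input": "WHEREAS Party A agrees to...",
    "output": "This contract is a service agreement between Company X and Vendor Y...",
}
jsonl_text = to_jsonl([example])
```

Validating before writing catches empty targets early, which matters more than usual when the dataset has only a few hundred examples.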
Dataset size guidelines:
- Style/format adaptation: 100–500 examples
- Domain knowledge: 1,000–10,000 examples
- Full task specialization: 10,000–100,000 examples
Parameter-Efficient Fine-Tuning: LoRA
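Full fine-tuning updates all model parameters. For a 70B model, the 16-bit weights alone take about 140GB, and the gradients and Adam optimizer states add several times that, so the total is typically on the order of 1TB of GPU memory before counting activations. For most teams, that’s impractical.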
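A back-of-the-envelope estimate makes the scale concrete. The sketch below assumes mixed-precision training with Adam, where each parameter needs 16-bit weights and gradients plus 32-bit master weights and two 32-bit optimizer moments (16 bytes in total). Other optimizers or precisions change the numbers, activations are ignored, and the function name is hypothetical.

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Estimate training memory in GB, ignoring activations.

    Assumes mixed-precision Adam: 16 bytes per parameter in total.
    """
    bytes_per_param = {
        "weights_16bit": 2,
        "gradients_16bit": 2,
        "master_weights_fp32": 4,
        "adam_first_moment_fp32": 4,
        "adam_second_moment_fp32": 4,
    }
    breakdown = {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown


estimate = full_finetune_memory_gb(70e9)
for name, gigabytes in estimate.items():
    print(f"{name:>26}: {gigabytes:8.0f} GB")
```

Under these assumptions a 70B model needs about 140GB for weights and roughly 1.1TB overall, which is why parameter-efficient methods are attractive.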
LoRA (Low-Rank Adaptation) is the most widely used solution. Instead of updating the full weight matrix W, you freeze W and learn two small matrices A and B so that the effective weight becomes W + A×B, where the product A×B has rank at most r, much smaller than the original weight dimensions.
    Original weight:    W (d × d)               e.g., 4096 × 4096 ≈ 16.8M parameters
    LoRA decomposition: A (d × r) × B (r × d)   e.g., 4096 × 16 + 16 × 4096 ≈ 131K parameters
    Update:             W' = W + (A × B)        W is frozen; only A and B are trained

With rank r=16, LoRA reduces trainable parameters for this matrix by 128× (roughly two orders of magnitude) compared to full fine-tuning, while matching or approaching full fine-tuning quality on many tasks.
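The arithmetic and the update rule above can be checked with a few lines of plain Python. This is a minimal sketch using nested lists instead of a tensor library; the function names are hypothetical, and the `scale` argument stands in for the alpha/r scaling factor that LoRA implementations usually apply.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one d x d weight."""
    full = d * d
    lora = d * r + r * d  # A is d x r, B is r x d
    return full, lora


def matmul(X, Y):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_effective_weight(W, A, B, scale=1.0):
    """Compute W' = W + scale * (A x B); W itself is never modified."""
    delta = matmul(A, B)
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


full, lora = lora_param_counts(4096, 16)
reduction = full / lora

# Toy 3 x 3 example with rank r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0], [2.0], [3.0]]   # d x r
B_zero = [[0.0, 0.0, 0.0]]  # r x d, zero-initialized so training starts at W
B_trained = [[1.0, 0.0, 0.0]]

W_start = lora_effective_weight(W, A, B_zero)
W_after = lora_effective_weight(W, A, B_trained)
```

Initializing one of the two factors to zero is the usual choice because it makes the adapted model identical to the base model at step zero.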
Common LoRA targets: Q and V projection matrices in attention layers (original paper recommendation), though training all attention projections and FFN layers typically works better.
QLoRA: Fine-Tuning Large Models on Consumer Hardware
QLoRA combines quantization with LoRA to make fine-tuning 65B-class models feasible on a single 48GB or 80GB GPU such as an A100, and models up to roughly 33B feasible on a 24GB consumer card such as a 4090.
QLoRA approach:

1. Load the base model in 4-bit NF4 quantization (reduces 70B from ~140GB to ~35GB)
2. Keep frozen base model weights in 4-bit
3. Add LoRA adapters in 16-bit precision (float16 or bfloat16)
4. Train only the LoRA adapters
5. Whenever a layer is used, in training and at inference, dequantize its weights from 4-bit to 16-bit on the fly

This made fine-tuning very large models accessible to researchers and companies without massive GPU clusters. As of 2025, it remains one of the most important practical techniques in the field.
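The quantize-then-dequantize idea in the steps above can be illustrated in plain Python. Note the assumption: this sketch uses uniform absmax quantization to 4-bit integers, which is simpler than the NF4 data type QLoRA actually uses (NF4 picks its 16 levels from quantiles of a normal distribution). The function names are hypothetical; real training would rely on a library such as bitsandbytes.

```python
def quantize_block(values, levels=16):
    """Absmax-quantize one block of floats to signed 4-bit codes in [-7, 7]."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    half = levels // 2 - 1
    codes = [round(v / absmax * half) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Map 4-bit codes back to approximate float values."""
    half = levels // 2 - 1
    return [c / half * absmax for c in codes]


def quantized_size_gb(n_params, bits=4):
    """Storage for the quantized weights alone, ignoring per-block scales."""
    return n_params * bits / 8 / 1e9


weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.48, -0.02, 0.25]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, round(max_error, 4), quantized_size_gb(70e9))
```

The rounding error per weight is bounded by half a quantization step, and the LoRA adapters trained on top can compensate for part of that loss.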
RLHF: Alignment via Human Feedback
Reinforcement Learning from Human Feedback taught models to be helpful, harmless, and honest. The process:
Step 1: Collect preference data
- Human rater sees two outputs → selects preferred one
- Thousands of such comparisons

Step 2: Train reward model
- RM takes (prompt, response) → score
- Trained to predict human preferences

Step 3: RL fine-tuning
- Policy LLM generates responses
- RM scores them
- PPO/GRPO updates policy to maximize reward
- KL penalty keeps policy close to SFT model (limits reward hacking)

RLHF is a large part of why ChatGPT and Claude feel so different from raw base models such as GPT-3. The instruction-following, the helpful tone, the appropriate refusals: much of that comes from SFT followed by RLHF.
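Steps 2 and 3 each reduce to a small formula. The sketch below assumes the common Bradley-Terry formulation for the reward model (the loss falls as the chosen response outscores the rejected one) and a per-sample KL estimate based on the log-probability difference between the policy and the SFT model. The function names and the `kl_coef` default are illustrative assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss: low when the chosen response gets the higher score."""
    return -log_sigmoid(score_chosen - score_rejected)


def kl_penalized_reward(rm_score: float, logp_policy: float,
                        logp_sft: float, kl_coef: float = 0.1) -> float:
    """Reward used for the RL update: RM score minus a penalty for drifting
    away from the SFT model on this response."""
    return rm_score - kl_coef * (logp_policy - logp_sft)


print(reward_model_loss(2.0, 0.0))            # RM ranks the pair correctly
print(reward_model_loss(0.0, 2.0))            # RM ranks the pair incorrectly
print(kl_penalized_reward(1.0, -3.0, -5.0))   # policy drifted from the SFT model
```

The penalty term is what discourages the policy from finding outputs that score highly under the reward model but are implausible under the SFT model.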
Cost: RLHF requires human raters at scale, making it expensive. A well-annotated RLHF dataset from professional raters can cost hundreds of thousands of dollars.
DPO: Simpler Alignment Without RL
Direct Preference Optimization (2023) simplified RLHF dramatically. Instead of training a separate reward model and doing RL, DPO directly optimizes the LLM on preference data using a classification-style loss.
DPO objective (simplified):

    For each (prompt, chosen_response, rejected_response):
      Maximize: log [ P_model(chosen) / P_ref(chosen) ]
      Minimize: log [ P_model(rejected) / P_ref(rejected) ]

Here P_ref is a frozen copy of the starting model, and the loss rewards widening the gap between the two log-ratios. Results are comparable to RLHF on many benchmarks, with dramatically simpler implementation and more stable training. Widely adopted in open-source fine-tuning (Axolotl, TRL, LLaMA-Factory all support it).
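For a single preference pair, the DPO loss can be written in a few lines. This sketch assumes you already have summed log-probabilities of each response under the trained model and under the frozen reference model; `beta` is the usual temperature-like hyperparameter, and the default of 0.1 is an illustrative choice.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), computed stably for either sign of the margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference learned yet.
loss_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the chosen response and lowered the rejected one.
loss_better = dpo_loss(-8.0, -15.0, -10.0, -12.0)
print(loss_start, loss_better)
```

Because the loss depends only on log-probabilities, no sampling and no separate reward model are needed during training.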
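GRPO (Group Relative Policy Optimization, introduced with DeepSeekMath and used in DeepSeek-R1) is a PPO variant that eliminates the separate value (critic) model. For each prompt it samples a group of responses and uses each response's reward relative to the group average as the advantage, which simplifies the RL pipeline while achieving strong reasoning capabilities.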
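The core of GRPO is its advantage estimate: rewards for several responses to the same prompt are normalized against each other instead of against a learned value function. The sketch below assumes normalization by the group mean and population standard deviation; the function name and the `eps` term are illustrative choices, and implementations differ in such details.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the other responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal.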
Practical Fine-Tuning Stack (2026)
For most teams, the practical stack looks like:
    Model:     LLaMA 3.1 8B or Mistral 7B v0.3 (good balance of size and capability)
    Method:    QLoRA with r=64, alpha=128
    Framework: Unsloth (2x faster than standard training) or Axolotl
    Hardware:  1–4× A100 80GB or H100
    Data:      500–10K high-quality instruction pairs
    Time:      2–12 hours depending on dataset size

Tools worth knowing:
- Unsloth: advertises 2–5× faster QLoRA training and lower memory use than a standard Hugging Face setup
- Axolotl: Flexible YAML-based fine-tuning orchestration
- LLaMA-Factory: Comprehensive fine-tuning UI + CLI
- TRL (HuggingFace): SFT, DPO, RLHF implementations
- Together AI / Modal / Lambda: Cloud GPU rentals when local hardware is insufficient
Common Fine-Tuning Mistakes
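Catastrophic forgetting: Training too aggressively on a narrow dataset can degrade general capabilities. Keep LoRA rank moderate and learning rate low (1e-4 to 3e-4 is a common range for LoRA adapters; full fine-tuning usually calls for a much lower rate, around 1e-5).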
Data leakage: Validation set contaminated with training examples gives falsely optimistic metrics. Use held-out test sets from a different time period or source.
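A cheap first check for leakage is to look for validation prompts that also appear in the training set after light normalization. The sketch below assumes records with `instruction` and `input` fields, as in the SFT example; the function names are hypothetical, and exact-match hashing will not catch paraphrased duplicates.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def fingerprint(example: dict) -> str:
    """Hash the normalized prompt side of an example (not the output)."""
    key = normalize(example["instruction"]) + "\n" + normalize(example["input"])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def find_leaks(train: list[dict], validation: list[dict]) -> list[dict]:
    """Return validation examples whose prompt also occurs in the training set."""
    train_prints = {fingerprint(ex) for ex in train}
    return [ex for ex in validation if fingerprint(ex) in train_prints]


train = [{"instruction": "Summarize this contract.", "input": "WHEREAS Party A..."}]
validation = [
    {"instruction": "summarize  this contract.", "input": "whereas party a..."},
    {"instruction": "Translate to French.", "input": "Good morning."},
]
leaks = find_leaks(train, validation)
print(len(leaks))
```

Any hit should be removed from the validation set before metrics are reported.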
Not evaluating on real tasks: Fine-tuning metrics (training loss, SFT eval loss) don’t always correlate with downstream task performance. Always evaluate on actual use cases.
Over-training: More epochs aren’t always better. Monitor validation loss carefully and stop when it plateaus or begins to rise.
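The stopping rule can be made explicit with a patience counter. This is a minimal sketch that works on a list of per-epoch validation losses; the function name and the defaults for `patience` and `min_delta` are illustrative assumptions. In a real training loop you would also save a checkpoint at each new best epoch.

```python
def early_stopping_epoch(val_losses: list[float], patience: int = 2,
                         min_delta: float = 0.0) -> int:
    """Return the 0-based index of the best epoch seen before stopping.

    Training stops once `patience` epochs pass without the validation
    loss improving by more than `min_delta`. Returns -1 for an empty list.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch


losses = [1.00, 0.80, 0.70, 0.72, 0.75, 0.60]
best = early_stopping_epoch(losses, patience=2)
print(best)
```

With a patience of 2 the loop stops after the fifth epoch and never sees the later drop to 0.60, which is the trade-off a patience setting makes between compute and the chance of a late improvement.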
Ignoring prompt format: The model needs to see the same prompt format during fine-tuning as it will during inference. Inconsistent formatting degrades output quality, because the model is conditioned on a layout it was never trained on.
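One way to guarantee consistency is to route both training and inference through the same formatting function. The sketch below uses a hypothetical Alpaca-style template and a placeholder end-of-sequence token; for chat models you would normally use the model's own chat template instead (for example through the tokenizer's `apply_chat_template` method in Hugging Face Transformers).

```python
# Hypothetical template; substitute the format your base model expects.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str, input_text: str = "") -> str:
    """Format the prompt side. Used unchanged at inference time."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip(),
                                  input=input_text.strip())


def build_training_text(example: dict, eos_token: str = "</s>") -> str:
    """Format a full training example: the same prompt plus the target output."""
    prompt = build_prompt(example["instruction"], example["input"])
    return prompt + example["output"].strip() + eos_token


example = {
    "instruction": "Summarize this legal contract in plain English.",
    "input": "WHEREAS Party A agrees to...",
    "output": "This contract is a service agreement...",
}
training_text = build_training_text(example)
inference_prompt = build_prompt(example["instruction"], example["input"])
```

Because the training text is built by appending to the inference prompt, the two cannot drift apart when the template changes.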