Fine-Tuning

You have a powerful foundation model. It can write code, answer questions, and summarize documents. But it uses an inconsistent tone, doesn’t follow your company’s specific format, sometimes refuses tasks it shouldn’t, and doesn’t know your domain-specific terminology. Fine-tuning is how you fix that.

Fine-Tuning vs. Prompting vs. RAG

Before spending money on fine-tuning, ask yourself whether it’s actually the right tool:

Approach	Best For	Cost	Latency	Effort
Prompting	Format/style changes, general tasks	Low	Baseline	Low
RAG	Knowledge injection, up-to-date facts	Medium	Slightly higher	Medium
Fine-tuning	Style, behavior, format consistency	High (upfront)	Can be lower	High
Fine-tuning + RAG	Complex domain apps	Highest	Variable	Highest

Fine-tuning wins when:

The model needs to output in a very specific format consistently
You want to teach a writing style or persona that’s hard to specify in prompts
Latency matters and you need to shorten system prompts
Your data is too sensitive to send to a third-party API, so you need a self-hosted model tuned on it
The task requires capabilities the base model genuinely lacks

Fine-tuning loses when:

The issue is knowledge cutoff (use RAG)
A better prompt would solve it (try prompt engineering first)
You have fewer than ~100 high-quality examples (likely to overfit)

Supervised Fine-Tuning (SFT)

The simplest form: show the model examples of input → desired output, and train it to reproduce those outputs.

Example training pair:
{
  "instruction": "Summarize this legal contract in plain English.",
  "input": "WHEREAS Party A agrees to... [1200 words of legalese]",
  "output": "This contract is a service agreement between Company X and
             Vendor Y for software development services running from
             Jan 2025 to Dec 2025, at a total cost of $240,000..."
}

Training on a few thousand such examples, with a small learning rate (so you don’t destroy the base model’s capabilities), will make the model reliably follow your desired output format and style.
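
Before training, each record is usually rendered into a prompt the model sees and a completion it must reproduce. Here is a minimal sketch of that step. The template wording, the "###" section markers, and the function names are assumptions for illustration, not a format the example above prescribes.

```python
import json

# Hypothetical prompt template; the section markers are an assumption.
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_example(record: dict) -> dict:
    """Split one record into the prompt the model sees and the target text."""
    prompt = TEMPLATE.format(
        instruction=record["instruction"], input=record["input"]
    )
    return {"prompt": prompt, "completion": record["output"]}

def to_jsonl(records: list) -> str:
    """Serialize formatted examples as one JSON object per line."""
    return "\n".join(json.dumps(build_example(r)) for r in records)

record = {
    "instruction": "Summarize this legal contract in plain English.",
    "input": "WHEREAS Party A agrees to...",
    "output": "This contract is a service agreement between...",
}
example = build_example(record)
print(example["prompt"] + example["completion"])
```

Many SFT setups compute the loss only on the completion tokens, which is why the prompt and completion are kept separate here.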

Dataset size guidelines:

Style/format adaptation: 100–500 examples
Domain knowledge: 1,000–10,000 examples
Full task specialization: 10,000–100,000 examples

Parameter-Efficient Fine-Tuning: LoRA

Full fine-tuning updates all model parameters. For a 70B model, the fp16 weights and gradients alone take about 280GB, and fp32 Adam optimizer states push the total past 1TB before counting activations. For most teams, that’s impractical.
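
How large the footprint gets depends on what is counted and at what precision. The back-of-envelope sketch below assumes a common mixed-precision Adam setup (fp16 weights and gradients, plus fp32 master weights and two fp32 Adam moments) and ignores activations. The byte counts per component are that assumption, not a measurement.

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Approximate memory per component in GB (1 GB = 1e9 bytes), no activations."""
    bytes_per_param = {
        "fp16 weights": 2,
        "fp16 gradients": 2,
        "fp32 master weights": 4,
        "fp32 Adam first moment": 4,
        "fp32 Adam second moment": 4,
    }
    breakdown = {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown

memory = full_finetune_memory_gb(70e9)
for name, gb in memory.items():
    print(f"{name:>24}: {gb:,.0f} GB")
```

Under these assumptions the weights and gradients account for 280 GB and the optimizer-related state for the rest, which is why methods that shrink the set of trainable parameters matter so much.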

LoRA (Low-Rank Adaptation) is the solution. Instead of updating the full weight matrix W, you learn two small matrices A and B such that the update is W + A×B, where the rank r of A×B is much smaller than the original weight dimensions.

Original weight: W (d × d)        e.g., 4096 × 4096 = 16.7M parameters

LoRA decomposition:
  A (d × r)  ×  B (r × d)        e.g., 4096 × 16 + 16 × 4096 = 131K parameters

Update: W' = W + (A × B)          Only A and B are trained
        ↑ frozen   ↑ trained

With rank r=16, LoRA cuts trainable parameters by about 128× for each adapted 4096 × 4096 matrix (16.7M vs. 131K), and by more across the whole model when only some layers get adapters, while matching or approaching full fine-tuning quality on many tasks.
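
The arithmetic and the forward pass can be checked with a few lines of plain Python. This is a sketch using the A (d × r) and B (r × d) convention from the diagram above; the helper names are hypothetical, and real implementations use tensor libraries and usually scale the update by alpha / r.

```python
def lora_param_counts(d: int, r: int) -> tuple:
    """Trainable parameters for a full d x d update vs. a rank-r LoRA update."""
    full = d * d
    lora = d * r + r * d  # A is d x r, B is r x d
    return full, lora

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x):
    """y = W x + A (B x). The d x d product A x B is never materialized."""
    base = matvec(W, x)                 # frozen path
    low_rank = matvec(A, matvec(B, x))  # trained path, through the rank-r bottleneck
    return [b + l for b, l in zip(base, low_rank)]

full, lora = lora_param_counts(4096, 16)
print(full, lora, full / lora)  # 16777216 131072 128.0

# Tiny example with d = 2, r = 1
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]
B = [[3.0, 4.0]]
y = lora_forward(W, A, B, [1.0, 1.0])
print(y)  # [8.0, 15.0]
```

Because the adapter path is just an extra additive term, A × B can be merged into W after training so that inference costs the same as the base model.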

Common LoRA targets: Q and V projection matrices in attention layers (original paper recommendation), though training all attention projections and FFN layers typically works better.

QLoRA: Fine-Tuning Large Models on Consumer Hardware

QLoRA combines quantization with LoRA to make fine-tuning 65B-class models feasible on a single 48GB or 80GB GPU such as an A100, and roughly 30B-class models on a 24GB card such as a 4090.

QLoRA approach:
1. Load the base model in 4-bit NF4 quantization (reduces 70B from ~140GB to ~35GB)
2. Keep frozen base model weights in 4-bit
3. Add LoRA adapters in 16-bit precision (float16 or bfloat16)
4. Train only the LoRA adapters
5. In every forward and backward pass, during training as well as at inference, dequantize 4-bit → 16-bit on the fly
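
The quantize and dequantize steps can be illustrated with a toy block-wise quantizer. This is a sketch only: it uses a uniform 16-level grid with one absmax scale per block, whereas NF4 uses non-uniform levels suited to normally distributed weights. The function names and the sample block are hypothetical.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

def quantize_block(values, levels=16):
    """Map floats to integer codes 0..levels-1 using a per-block absmax scale."""
    scale = max(abs(v) for v in values) or 1.0
    # Codes cover the range [-scale, +scale] uniformly (unlike real NF4).
    codes = [round((v / scale + 1) / 2 * (levels - 1)) for v in values]
    return codes, scale

def dequantize_block(codes, scale, levels=16):
    """Recover approximate floats from the codes and the block scale."""
    return [(c / (levels - 1) * 2 - 1) * scale for c in codes]

print(weight_memory_gb(70e9, 16), weight_memory_gb(70e9, 4))  # 140.0 35.0

block = [0.12, -0.5, 0.33, 0.9, -0.07, 0.0, -0.9, 0.41]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, round(max_error, 4))
```

The stored model keeps only the 4-bit codes plus one scale per block. The 16-bit values exist transiently while a layer is being computed, which is where the memory saving comes from.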

This made fine-tuning frontier-scale models accessible to researchers and companies without massive GPU clusters. As of 2025, it remains one of the most important practical techniques in the field.

RLHF: Alignment via Human Feedback

Reinforcement Learning from Human Feedback taught models to be helpful, harmless, and honest. The process:

Step 1: Collect preference data
        Human rater sees two outputs → selects preferred one
        Thousands of such comparisons

Step 2: Train reward model
        RM takes (prompt, response) → score
        Trained to predict human preferences

Step 3: RL fine-tuning
        Policy LLM generates responses
        RM scores them
        PPO/GRPO updates policy to maximize reward
        KL penalty keeps policy close to SFT model (limits reward hacking)
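
Steps 2 and 3 each reduce to a small formula. The sketch below assumes the commonly used pairwise (Bradley-Terry) loss for the reward model and a per-response KL-style penalty on the reward. The function names, the beta value, and the scalar inputs are illustrative assumptions.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss: -log sigmoid(score_chosen - score_rejected)."""
    return softplus(-(score_chosen - score_rejected))

def kl_penalized_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward used in the RL step: RM score minus beta * (log pi - log pi_sft)."""
    return rm_score - beta * (logp_policy - logp_sft)

# The loss is small when the RM already ranks the preferred response higher.
print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))

# A response the policy has made much more likely than the SFT model did
# gets its reward reduced.
print(kl_penalized_reward(rm_score=1.0, logp_policy=-4.0, logp_sft=-5.0))
```

The penalty term is what discourages the policy from drifting toward outputs that score well with the reward model but that the SFT model would consider very unlikely.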

RLHF is a large part of why ChatGPT and Claude feel so different from raw base models like GPT-3. The instruction-following, the helpful tone, the appropriate refusals: these come largely from SFT followed by preference-based training such as RLHF.

Cost: RLHF requires human raters at scale, making it expensive. A well-annotated preference dataset from professional raters can cost hundreds of thousands of dollars.

DPO: Simpler Alignment Without RL

Direct Preference Optimization (2023) simplified RLHF dramatically. Instead of training a separate reward model and doing RL, DPO directly optimizes the LLM on preference data using a classification-style loss.

DPO objective (simplified):
For each (prompt, chosen_response, rejected_response):
  Raise:  log [ P_model(chosen) / P_ref(chosen) ]
  Lower:  log [ P_model(rejected) / P_ref(rejected) ]
  Loss:   -log sigmoid( beta × (chosen log-ratio - rejected log-ratio) )
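
For a single preference pair, the objective can be written in a few lines. This is a minimal sketch assuming the sequence log-probabilities have already been computed under the trained model and the frozen reference model; the argument names and the beta value are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Stable form of log(1 + exp(-margin))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Model identical to the reference: loss is log(2)
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Model now prefers the chosen response more than the reference did
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(start, improved)
```

The loss looks like binary classification on the difference of log-ratios, which is why no separate reward model or sampling loop is needed during training.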

Results are comparable to RLHF on most benchmarks, with dramatically simpler implementation and more stable training. Widely adopted in open-source fine-tuning (Axolotl, TRL, LLaMA-Factory all support it).

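GRPO (Group Relative Policy Optimization, used in DeepSeek-R1) is a PPO-style RL method rather than a DPO variant. It eliminates the separate value (critic) model by sampling a group of responses per prompt and using each response’s reward relative to the group as its advantage. It still applies a KL penalty against a reference model. This simplifies the RL pipeline while achieving strong reasoning capabilities.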
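
The "group relative" idea can be sketched as follows: sample several responses for the same prompt, score each one, and normalize the scores within the group so the group mean acts as the baseline. The function name, the epsilon, and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1 if correct and 0 otherwise
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]

# If every response gets the same reward, there is no learning signal
print(group_relative_advantages([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```

Responses that beat their group's average get a positive advantage and are reinforced; the rest are pushed down.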

Practical Fine-Tuning Stack (2026)

For most teams, the practical stack looks like:

Model:      LLaMA 3.1 8B or Mistral 7B v0.3 (good balance of size and capability)
Method:     QLoRA with r=64, alpha=128
Framework:  Unsloth (reported 2–5× faster than standard training) or Axolotl
Hardware:   1–4× A100 80GB or H100
Data:       500–10K high-quality instruction pairs
Time:       2–12 hours depending on dataset size

Tools worth knowing:

Unsloth: 2–5× faster QLoRA training, less memory than standard HuggingFace
Axolotl: Flexible YAML-based fine-tuning orchestration
LLaMA-Factory: Comprehensive fine-tuning UI + CLI
TRL (HuggingFace): SFT, DPO, RLHF implementations
Together AI / Modal / Lambda: Cloud GPU rentals when local hardware is insufficient

Common Fine-Tuning Mistakes

Catastrophic forgetting: Training too aggressively on a narrow dataset can degrade general capabilities. Keep LoRA rank moderate and the learning rate conservative (roughly 1e-4 to 3e-4 for LoRA; full fine-tuning typically needs a much lower rate, around 1e-5).

Data leakage: Validation set contaminated with training examples gives falsely optimistic metrics. Use held-out test sets from a different time period or source.
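
A cheap first check for contamination is to fingerprint every training example and look for validation examples with the same fingerprint. The sketch below normalizes case and whitespace before hashing; the function names and sample strings are hypothetical, and it only catches exact duplicates after normalization, not paraphrases.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_leaks(train_texts, val_texts):
    """Return validation examples whose fingerprint also appears in training."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return [v for v in val_texts if fingerprint(v) in train_hashes]

train = ["Summarize this contract.", "Translate to French."]
val = ["summarize   this CONTRACT.", "Explain this clause."]
leaks = find_leaks(train, val)
print(leaks)  # ['summarize   this CONTRACT.']
```

Near-duplicate detection (for example n-gram overlap) would be needed to catch lightly edited copies.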

Not evaluating on real tasks: Fine-tuning metrics (training loss, SFT eval loss) don’t always correlate with downstream task performance. Always evaluate on actual use cases.

Over-training: More epochs aren’t always better. Monitor validation loss carefully and stop when it plateaus or begins to rise.
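
One simple way to act on this is patience-based early stopping: stop once validation loss has failed to improve for a set number of evaluations, then keep the best checkpoint. The sketch below is illustrative; the function name, patience value, and loss values are assumptions.

```python
def early_stopping(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch); stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Loss bottoms out at epoch 2, then rises for two evaluations in a row
stop, best = early_stopping([1.0, 0.8, 0.7, 0.72, 0.75, 0.9], patience=2)
print(stop, best)  # 4 2
```

Training halts at epoch 4, and the checkpoint from epoch 2 is the one to keep.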

Ignoring prompt format: The model needs to see the same prompt format during fine-tuning as it will during inference. A mismatch, such as a missing chat template or different role markers, tends to produce degraded or erratic outputs.