Fine-Tuning

You have a powerful foundation model. It can write code, answer questions, and summarize documents. But it uses an inconsistent tone, doesn’t follow your company’s specific format, sometimes refuses tasks it shouldn’t, and doesn’t know your domain-specific terminology. Fine-tuning is how you fix that.


Fine-Tuning vs. Prompting vs. RAG

Before spending money on fine-tuning, ask yourself whether it’s actually the right tool:

Approach           Best For                               Cost            Latency          Effort
Prompting          Format/style changes, general tasks    Low             Baseline         Low
RAG                Knowledge injection, up-to-date facts  Medium          Slightly higher  Medium
Fine-tuning        Style, behavior, format consistency    High (upfront)  Can be lower     High
Fine-tuning + RAG  Complex domain apps                    Highest         Variable         Highest

Fine-tuning wins when:

Fine-tuning loses when:


Supervised Fine-Tuning (SFT)

The simplest form: show the model examples of input → desired output, and train it to reproduce those outputs.

Example training pair:
{
  "instruction": "Summarize this legal contract in plain English.",
  "input": "WHEREAS Party A agrees to... [1200 words of legalese]",
  "output": "This contract is a service agreement between Company X and Vendor Y for software development services running from Jan 2025 to Dec 2025, at a total cost of $240,000..."
}

Training on a few thousand such examples, with a small learning rate (so you don’t destroy the base model’s capabilities), will make the model reliably follow your desired output format and style.
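As a rough illustration of how one such record becomes a training example, the sketch below splits it into a prompt and a completion and builds a mask so that loss is computed only on the completion. The template text, the function names, and the completion-only masking are assumptions for illustration, not something this guide prescribes; whitespace splitting stands in for a real tokenizer.

```python
# Hypothetical Alpaca-style template; real projects should normally use the
# chat template that ships with their base model's tokenizer.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_sft_example(record: dict) -> dict:
    """Split a training record into the prompt and the target completion."""
    prompt = PROMPT_TEMPLATE.format(
        instruction=record["instruction"], input=record["input"]
    )
    return {"prompt": prompt, "completion": record["output"]}

def loss_mask(prompt_tokens: list, completion_tokens: list) -> list:
    """1 where the loss is computed (completion), 0 where it is ignored (prompt)."""
    return [0] * len(prompt_tokens) + [1] * len(completion_tokens)

record = {
    "instruction": "Summarize this legal contract in plain English.",
    "input": "WHEREAS Party A agrees to...",
    "output": "This contract is a service agreement...",
}
example = build_sft_example(record)
# Whitespace splitting is a stand-in for a real tokenizer.
mask = loss_mask(example["prompt"].split(), example["completion"].split())
```

Masking the prompt is a common choice because the model only needs to learn to produce the output, not to reproduce the instruction.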

Dataset size guidelines:


Parameter-Efficient Fine-Tuning: LoRA

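Full fine-tuning updates every model parameter. With Adam in mixed precision, weights, gradients, and optimizer states take roughly 16 bytes per parameter, so a 70B model needs on the order of 1TB of GPU memory before activations are counted. For most teams, that’s impractical.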
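A back-of-the-envelope estimate shows why. The sketch below assumes a common mixed-precision Adam setup (16-bit weights and gradients, plus 32-bit master weights and two 32-bit Adam moments) and ignores activations; the byte counts are assumptions about that setup, and other optimizers or sharding schemes will change the numbers.

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Rough GPU memory for mixed-precision Adam, excluding activations."""
    bytes_per_param = {
        "fp16 weights": 2,
        "fp16 gradients": 2,
        "fp32 master weights": 4,
        "fp32 Adam first moment": 4,
        "fp32 Adam second moment": 4,
    }
    breakdown = {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown

estimate = full_finetune_memory_gb(70e9)
for name, gb in estimate.items():
    print(f"{name:>24}: {gb:,.0f} GB")
```

Under these assumptions the total is about 1,120 GB for 70B parameters, far beyond a single accelerator.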

LoRA (Low-Rank Adaptation) is the solution. Instead of updating the full weight matrix W, you freeze it and learn two small matrices A and B whose product A×B is the update, so the adapted weight is W + A×B. The rank r of A×B is much smaller than the original weight dimensions.

Original weight:  W (d × d)             e.g., 4096 × 4096 = 16.7M parameters (frozen)
LoRA factors:     A (d × r), B (r × d)  e.g., 4096 × 16 + 16 × 4096 = 131K parameters (trained)
Update:           W' = W + (A × B)      only A and B are trained

With rank r=16 on a 4096 × 4096 matrix, LoRA reduces trainable parameters by about 128× (16.7M down to 131K), and the overall saving is larger still when only some layers get adapters. Quality matches or approaches full fine-tuning on most tasks.
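The arithmetic and the forward pass can be sketched in plain Python. The class and function names here are hypothetical, the matrices follow this guide's naming (A is d × r, B is r × d), and the alpha / r scaling and the zero initialization of one factor are conventions borrowed from the LoRA paper rather than anything stated above. Real implementations use tensor libraries; lists are used only to keep the sketch dependency-free.

```python
import random

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in A (d_in x r) plus B (r x d_out)."""
    return d_in * r + r * d_out

def matmul(x: list, m: list) -> list:
    """Multiply a row vector x (length n) by a matrix m (n x k)."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * (A x B)."""

    def __init__(self, weight: list, r: int, alpha: float, seed: int = 0):
        rng = random.Random(seed)
        d_in, d_out = len(weight), len(weight[0])
        self.weight = weight  # frozen base weight, never updated
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(r)] for _ in range(d_in)]
        self.b = [[0.0] * d_out for _ in range(r)]  # zeros: update starts at 0
        self.scale = alpha / r

    def forward(self, x: list) -> list:
        base = matmul(x, self.weight)
        delta = matmul(matmul(x, self.a), self.b)  # x A B, never forms d x d
        return [y + self.scale * d for y, d in zip(base, delta)]

print(lora_param_count(4096, 4096, 16))                  # 131072
print(4096 * 4096 // lora_param_count(4096, 4096, 16))   # 128

layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], r=1, alpha=2)
print(layer.forward([1.0, 1.0]))  # equals the base output until B is trained
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen base layer, which is what makes training start from the pretrained model's behavior.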

Common LoRA targets: Q and V projection matrices in attention layers (original paper recommendation), though training all attention projections and FFN layers typically works better.


QLoRA: Fine-Tuning Large Models on Consumer Hardware

QLoRA combines quantization with LoRA to make fine-tuning a 65B model feasible on a single 48GB or 80GB GPU such as an A100, and models of roughly 30B on a 24GB consumer card such as a 4090.

QLoRA approach:
1. Load the base model in 4-bit NF4 quantization (reduces 70B from ~140GB to ~35GB)
2. Keep the base model weights frozen in 4-bit throughout training
3. Add LoRA adapters in 16-bit precision (the QLoRA paper uses bfloat16)
4. Train only the LoRA adapters
5. In every forward and backward pass, dequantize 4-bit → 16-bit on the fly for the matrix multiplications
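To make the quantize-then-dequantize idea concrete, here is a deliberately simplified sketch. It uses uniform absmax quantization to signed 4-bit integer codes, which is not NF4 (NF4 uses non-uniform levels matched to normally distributed weights), and all function names are hypothetical. It also reproduces the weight-memory arithmetic from step 1.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory for the weights alone at a given bit width."""
    return n_params * bits / 8 / 1e9

def quantize_block(values: list, bits: int = 4):
    """Absmax-quantize one block of floats to signed integer codes plus a scale."""
    max_code = 2 ** (bits - 1) - 1  # 7 for 4-bit
    absmax = max(abs(v) for v in values)
    scale = absmax / max_code if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes: list, scale: float) -> list:
    """Recover approximate floats; this is the step performed on the fly."""
    return [c * scale for c in codes]

print(weight_memory_gb(70e9, 16))  # 140.0
print(weight_memory_gb(70e9, 4))   # 35.0

block = [0.62, -0.35, 0.10, 0.0, -0.70, 0.28]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, round(max_error, 4))
```

The rounding error per value is bounded by half the block's scale, which is why quantizing in small blocks (each with its own scale) keeps the error manageable.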

This made fine-tuning frontier-scale models accessible to researchers and companies without massive GPU clusters. As of 2025, it remains one of the most important practical techniques in the field.


RLHF: Alignment via Human Feedback

Reinforcement Learning from Human Feedback taught models to be helpful, harmless, and honest. The process:

Step 1: Collect preference data
  Human rater sees two outputs → selects preferred one
  Thousands of such comparisons
Step 2: Train reward model
  RM takes (prompt, response) → score
  Trained to predict human preferences
Step 3: RL fine-tuning
  Policy LLM generates responses
  RM scores them
  PPO/GRPO updates policy to maximize reward
  KL penalty keeps policy close to SFT model (prevents reward hacking)
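Two pieces of this pipeline reduce to short formulas, sketched below. The pairwise loss is the standard Bradley-Terry form commonly used for reward models, and the penalized reward is a per-sample approximation of the KL term; the function names and the default beta are assumptions for illustration, and production code would work on batched tensors.

```python
import math

def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss for Step 2: -log sigmoid(score_chosen - score_rejected)."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

def kl_penalized_reward(rm_score: float, logp_policy: float,
                        logp_sft: float, beta: float = 0.1) -> float:
    """Reward used in Step 3: RM score minus a penalty for drifting from SFT.

    logp_policy and logp_sft are log-probabilities of the same response
    under the current policy and under the frozen SFT model.
    """
    return rm_score - beta * (logp_policy - logp_sft)

# The loss is log(2) when the RM cannot tell the two responses apart,
# and shrinks as it scores the preferred response higher.
print(reward_model_loss(0.0, 0.0))
print(reward_model_loss(2.0, 0.0))

# A response the policy now rates far more likely than the SFT model did
# is penalized, even if the reward model likes it.
print(kl_penalized_reward(1.0, logp_policy=-4.0, logp_sft=-9.0))
```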

RLHF is a large part of why ChatGPT and Claude feel so different from raw base models such as GPT-3. The instruction-following, the helpful tone, the appropriate refusals: much of that comes from SFT followed by RLHF-style preference training.

Cost: RLHF requires human raters at scale, making it expensive. A well-annotated RLHF dataset from professional raters can cost hundreds of thousands of dollars.


DPO: Simpler Alignment Without RL

Direct Preference Optimization (2023) simplified RLHF dramatically. Instead of training a separate reward model and doing RL, DPO directly optimizes the LLM on preference data using a classification-style loss.

DPO objective (simplified):
For each (prompt, chosen_response, rejected_response):
  Maximize: log [ P_model(chosen) / P_ref(chosen) ]
  Minimize: log [ P_model(rejected) / P_ref(rejected) ]
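The simplified objective corresponds to a single loss term per preference pair. The sketch below follows the form given in the DPO paper, -log sigmoid(beta × (chosen log-ratio − rejected log-ratio)), on scalar inputs; the function name and the default beta are assumptions, and each log-probability is taken to be the sum over the response's tokens.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple."""
    chosen_logratio = logp_chosen - ref_logp_chosen        # log P_model/P_ref
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# At initialization the model equals the reference, so the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Raising the chosen response's likelihood relative to the reference,
# and lowering the rejected one's, reduces the loss.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The reference model appears only through its log-probabilities, which can be precomputed once for the whole dataset.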

Results are comparable to RLHF on most benchmarks, with dramatically simpler implementation and more stable training. Widely adopted in open-source fine-tuning (Axolotl, TRL, LLaMA-Factory all support it).

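GRPO (Group Relative Policy Optimization, used in DeepSeek-R1) is a different kind of simplification. It is a PPO variant that drops the separate value (critic) model and instead scores each response relative to a group of responses sampled for the same prompt. It still applies a KL penalty against a reference model, and it was central to DeepSeek-R1’s strong reasoning capabilities.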
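The "group relative" part of GRPO can be sketched in a few lines: several responses are sampled for one prompt, and each response's advantage is its reward standardized against the group. The function name, the use of the population standard deviation, and the epsilon are assumptions for illustration.

```python
import statistics

def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Standardize each reward against the group sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, rewarded 1.0 if correct and 0.0 if not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers get positive advantage, wrong ones negative
```

Because the baseline comes from the group itself, no learned value model is needed to estimate it.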


Practical Fine-Tuning Stack (2026)

For most teams, the practical stack looks like:

Model: LLaMA 3.1 8B or Mistral 7B v0.3 (good balance of size and capability)
Method: QLoRA with r=64, alpha=128
Framework: Unsloth (advertised as about 2× faster than standard training) or Axolotl
Hardware: 1–4× A100 80GB or H100
Data: 500–10K high-quality instruction pairs
Time: 2–12 hours depending on dataset size
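For orientation, here is roughly how the method line above maps onto configuration objects. This sketch uses Hugging Face transformers and peft directly rather than Unsloth or Axolotl, which wrap similar settings. The model ID, the dropout value, the double-quantization flag, and the target module list are assumptions, not recommendations from this guide; nothing is downloaded until load_qlora_model is called.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.1-8B"  # assumed Hub ID; gated, requires access

# 4-bit NF4 base weights, 16-bit compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # assumption: also quantize the scales
)

# r and alpha come from the stack above; the rest are illustrative defaults.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,  # assumption
    target_modules=[    # attention and FFN projections in Llama-style models
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

def load_qlora_model(model_id: str = MODEL_ID):
    """Load the quantized base model and attach trainable LoRA adapters."""
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)
    return get_peft_model(model, lora_config)
```

With alpha=128 and r=64, the adapter update is scaled by alpha / r = 2.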

Tools worth knowing:


Common Fine-Tuning Mistakes

Catastrophic forgetting: Training too aggressively on a narrow dataset can degrade general capabilities. Keep LoRA rank moderate and learning rate low (1e-4 to 3e-4).

Data leakage: A validation set contaminated with training examples gives falsely optimistic metrics. Use held-out test sets from a different time period or source.
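A cheap first check is to look for validation examples that also appear in the training set. The sketch below catches only exact duplicates after lowercasing and whitespace normalization; the function names are hypothetical, and near-duplicates or paraphrases would need fuzzy matching that is not shown here.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_leaks(train_texts: list, val_texts: list) -> list:
    """Return validation examples whose fingerprint also occurs in training."""
    train_prints = {fingerprint(t) for t in train_texts}
    return [t for t in val_texts if fingerprint(t) in train_prints]

train = ["Summarize  this contract.", "Translate this sentence."]
val = ["summarize this contract.", "Write a haiku about autumn."]
print(find_leaks(train, val))  # the first validation example is a leak
```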

Not evaluating on real tasks: Fine-tuning metrics (training loss, SFT eval loss) don’t always correlate with downstream task performance. Always evaluate on actual use cases.

Over-training: More epochs aren’t always better. Monitor validation loss carefully and stop when it plateaus or begins to rise.
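The stopping rule can be written as a small patience check over the validation-loss history. This is a minimal sketch; the function name, the patience of 2 evaluations, and the min_delta threshold are assumptions to tune for a given run.

```python
def should_stop(val_losses: list, patience: int = 2,
                min_delta: float = 1e-4) -> bool:
    """True once the last `patience` evaluations failed to improve on the
    earlier best validation loss by at least `min_delta`."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss > best_before - min_delta for loss in recent)

print(should_stop([1.0, 0.8, 0.7]))               # still improving
print(should_stop([1.0, 0.8, 0.7, 0.71, 0.72]))   # rising
print(should_stop([1.0, 0.7, 0.7, 0.7]))          # plateaued
```

In practice you would also keep the checkpoint with the best validation loss rather than the last one.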

Ignoring prompt format: The model needs to see the same prompt format during fine-tuning as it will during inference. A mismatch, such as a missing chat template or different role markers, typically degrades output quality in ways that are hard to debug.