Pre-Training
Pre-training is where a language model gets its capabilities. Before it knows how to follow instructions, write code, or reason through problems, a model first undergoes pre-training — learning the structure of language, facts about the world, and patterns across every domain imaginable, all from raw text.
The Core Objective: Next Token Prediction
The pre-training task is almost embarrassingly simple. Given a sequence of tokens, predict the next one.
Input: "The capital of France is"
Target: "Paris"

Input: "def fibonacci(n):\n if n <= 1:"
Target: "return"

Input: "The Pythagorean theorem states that a² + b² ="
Target: "c²"

There are no human labels. The training signal comes from the data itself: every document is both the input and the target, just shifted by one position. This is self-supervised learning at scale.
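The "shifted by one position" idea can be sketched in a few lines. This is a minimal illustration, assuming a toy whitespace tokenizer (real models use subword tokenizers); the helper name `make_training_pairs` is hypothetical.

```python
def make_training_pairs(token_ids):
    """Return (inputs, targets) where targets are the inputs shifted left by one."""
    inputs = token_ids[:-1]   # every token except the last
    targets = token_ids[1:]   # every token except the first
    return inputs, targets

# Toy whitespace "tokenizer" (an assumption for illustration only).
text = "The capital of France is Paris"
tokens = text.split()
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
ids = [vocab[t] for t in tokens]

inputs, targets = make_training_pairs(ids)

# At each position the model sees the prefix and must predict the next token.
for pos in range(len(inputs)):
    context = " ".join(tokens[: pos + 1])
    print(f"{context!r} -> {tokens[pos + 1]!r}")
```

A single document of n tokens therefore yields n - 1 prediction targets without any manual labeling.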
The genius of this objective is that to predict the next token well across all domains of human knowledge, the model must learn:
- Grammar and syntax
- Semantic relationships
- Factual knowledge
- Reasoning patterns
- Code execution behavior
- Mathematical structure
It must learn all of this to minimize prediction loss. Nothing is explicitly taught.
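The "prediction loss" here is typically the cross-entropy between the model's predicted distribution and the actual next token. The sketch below assumes a hypothetical 3-token vocabulary and hand-written logits; it is meant only to show how the loss rewards confident, correct predictions.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Average negative log-likelihood of the correct next token."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical logits for two positions over a 3-token vocabulary.
confident = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]  # high score on the right token
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # no preference at all
targets = [0, 1]

print(next_token_loss(confident, targets))  # small loss
print(next_token_loss(uniform, targets))    # ln(3), the loss of a uniform guess
```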
Training Data: The Foundation of Foundation Models
The quality and composition of training data arguably matter more than architecture choices for the resulting model’s capabilities.
What Goes In?
Common Crawl: A monthly snapshot of the web — petabytes of HTML. Raw quality is low (ads, spam, duplicate content), so it requires heavy filtering. Still the largest data source for most models.
High-quality web: Curated subsets of Common Crawl filtered by quality signals, for example pages linked from Wikipedia, educational domains, technical blogs, and programming forums.
Books: Project Gutenberg, Books3, The Pile’s book corpus. Rich vocabulary, coherent long-form reasoning.
Code: GitHub (after filtering out forks, low-quality repos, and auto-generated code). Models trained with more code tend to be better at structured reasoning, even on non-coding tasks.
Wikipedia & Wikidata: High-quality factual knowledge in 300+ languages.
Scientific papers: ArXiv, PubMed, Semantic Scholar. Crucial for STEM capabilities.
Curated conversational data: Dialogues, Q&A forums (StackExchange, Reddit AMA), instruction-response pairs.
Data Mixing Ratios
The proportion of each data source in the training mix significantly affects the resulting model’s strengths.
Example data mix (rough, illustrative figures in the spirit of LLaMA-family models):
- Web text (filtered Common Crawl): ~75%
- Code (GitHub): ~8%
- Books: ~5%
- Scientific papers: ~4%
- Wikipedia / Wikidata: ~3%
- Other curated: ~5%

Models with more code in their training mix tend to be better reasoners. Models with more books tend to produce smoother prose. Getting this ratio right is a key part of model development.
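One simple way to apply a mix like the one above is to sample the source of each training document in proportion to its weight. The sketch below is an assumption about how this could be done, not a description of any specific lab's pipeline; the source names and the `sample_sources` helper are hypothetical.

```python
import random
from collections import Counter

# Weights copied from the example mix above (percent of the training mix).
DATA_MIX = {
    "web": 75,
    "code": 8,
    "books": 5,
    "papers": 4,
    "wiki": 3,
    "other": 5,
}

def sample_sources(mix, n, seed=0):
    """Draw n source names, each with probability proportional to its weight."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

draws = sample_sources(DATA_MIX, 10_000)
counts = Counter(draws)

# The empirical proportions should land close to the configured weights.
for name in DATA_MIX:
    print(f"{name:>6}: {counts[name] / len(draws):.1%}")
```

Changing a single weight in `DATA_MIX` is all it takes to upsample or downsample a source, which is why the ratio is treated as a tunable part of model development.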
Data Quality Filtering
Raw web data is noisy. Standard filtering steps:
- Deduplication at URL, paragraph, and near-duplicate level (MinHash LSH)
- Language identification (langdetect / fastText)
- Quality scoring based on perplexity against a reference model
- Toxicity filtering using classifiers
- Personal information redaction (email addresses, phone numbers)
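The near-duplicate step can be illustrated with a small MinHash sketch. This is a simplified version under stated assumptions: 3-word shingles, 64 hash functions built from salted BLAKE2b, and no LSH banding (which production pipelines add so that candidate pairs can be found without comparing every pair). The function names are hypothetical.

```python
import hashlib

def shingles(text, k=3):
    """Set of k-word shingles from lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "pre-training data pipelines remove duplicated pages before tokenization"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
sim_ab = estimated_jaccard(sig_a, sig_b)  # near-duplicates: high similarity
sim_ac = estimated_jaccard(sig_a, sig_c)  # unrelated: low similarity
print(sim_ab, sim_ac)
```

A pipeline would then drop one document from any pair whose estimated similarity exceeds a chosen threshold.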
The Training Setup: Distributed Training at Scale
No single GPU can hold a frontier-scale model. Training requires distributing across hundreds or thousands of GPUs using a combination of strategies.
| Strategy | How it works |
|---|---|
| Data parallelism | Each GPU gets different batches of data; gradients are averaged |
| Tensor parallelism (model parallelism) | The weight matrices inside each layer are split across GPUs |
| Pipeline parallelism | Model stages (groups of layers) are split across GPUs, with micro-batch pipelining |
| Sequence parallelism | Long sequences are split across GPUs (for attention) |

State-of-the-art training runs combine all four. Meta’s LLaMA 3 405B used about 16,000 H100 GPUs for several months. Google’s Gemini Ultra reportedly used tens of thousands of TPUs.
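The data-parallel case can be simulated without any GPUs. The sketch below assumes a one-parameter model y = w * x with mean squared error and two equal-sized shards; it shows that averaging per-worker gradients gives the same result as computing the gradient on the full batch.

```python
def gradient(w, batch):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Toy dataset of (x, y) pairs; values are made up for illustration.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# Each simulated "GPU" holds the same weight but a different, equal-sized shard.
shards = [data[0:2], data[2:4]]
local_grads = [gradient(w, shard) for shard in shards]

# All-reduce step: average the per-worker gradients.
averaged = sum(local_grads) / len(local_grads)

# Reference: the gradient a single worker would compute on the whole batch.
full_batch = gradient(w, data)
print(averaged, full_batch)
```

The equality relies on the shards being the same size; with unequal shards the average would need to be weighted by shard size.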
Mixed Precision Training
Training in float16 or bfloat16 instead of float32 roughly halves the memory needed for weights and activations and can substantially raise throughput on hardware with 16-bit matrix units, with negligible accuracy loss when done correctly (with float16, loss scaling keeps small gradient values from underflowing to zero).
Most modern training uses bfloat16 (Brain Float 16). It has the same exponent range as float32, which makes it more numerically stable than float16 for training, at the cost of fewer mantissa bits.
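The difference in exponent range can be seen with the standard library alone. The sketch below assumes a made-up gradient value of 1e-8 and a loss scale of 1024, and it simulates bfloat16 by truncating a float32 to its top 16 bits (a simplification; real hardware usually rounds).

```python
import struct

def to_float16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bfloat16(x):
    """Simulate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    top = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top + b"\x00\x00")[0]

tiny_grad = 1e-8      # hypothetical small gradient value
loss_scale = 1024.0   # hypothetical loss-scaling factor

# float16 cannot represent 1e-8, so the gradient is lost entirely.
fp16_plain = to_float16(tiny_grad)

# Scaling before the cast and unscaling afterwards preserves the value.
fp16_scaled = to_float16(tiny_grad * loss_scale) / loss_scale

# bfloat16 shares float32's exponent range, so no scaling is needed.
bf16_plain = to_bfloat16(tiny_grad)

print(fp16_plain, fp16_scaled, bf16_plain)
```

This is why loss scaling is associated with float16 training and is generally unnecessary with bfloat16.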
Key Hyperparameters and Their Intuitions
| Hyperparameter | Typical Range | Effect |
|---|---|---|
| Learning rate | 1e-4 to 3e-4 | Speed vs. stability of learning |
| Batch size | 1M–4M tokens | Larger = more stable gradients |
| Sequence length | 4K–32K tokens | Longer = better long-range learning, more memory |
| Warmup steps | 1–2K steps | Gradually increase LR to avoid early instability |
| Weight decay | 0.1 | Regularization, prevents overfitting |
| Gradient clipping | 1.0 | Prevents exploding gradients |
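As an illustration of the last row, the sketch below clips gradients by their global L2 norm, assuming the 1.0 in the table is a threshold on that norm (a common convention, though the table does not say so explicitly). The function name is hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their combined L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm   # already small enough, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# A gradient with norm 5.0 is scaled down to norm 1.0, keeping its direction.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)

# A gradient that is already within the limit passes through untouched.
small, _ = clip_by_global_norm([0.1, 0.2], max_norm=1.0)
print(small)
```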
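The learning rate schedule is particularly important. Cosine decay is very widely used: after warmup, the learning rate starts at its maximum and follows a cosine curve down to a small fraction of the peak (often around 10%, sometimes near zero) by the end of training.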
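A warmup-plus-cosine schedule can be written as a single function of the step number. The values below are assumptions chosen to match the table's typical ranges: a peak of 3e-4, 2,000 warmup steps, 100,000 total steps, and a floor `min_lr` of 10% of the peak (set it to 0 to decay all the way to zero).

```python
import math

def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr if training runs past total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine

for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, f"{lr_at_step(step):.2e}")
```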
The Compute Equation: Chinchilla Scaling Laws
How do you decide how many parameters to train and how many tokens to train on? DeepMind’s 2022 Chinchilla paper gave a practical answer.
For optimal performance given a compute budget C (in FLOPs):
Optimal tokens ≈ 20 × parameters (approximately)

So a 7B parameter model is optimally trained on about 140B tokens. LLaMA 3’s 8B model was trained on 15T tokens, far beyond the “Chinchilla optimal” for that size. This was intentional: they wanted a smaller model that performs better at inference, accepting higher training cost to get better inference efficiency.
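The arithmetic can be made concrete with a short sketch. It combines the 20-tokens-per-parameter rule of thumb from the text with the commonly used approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6 × N × D); that approximation is an assumption added here, not something stated above.

```python
import math

TOKENS_PER_PARAM = 20       # rule of thumb from the text
FLOPS_PER_PARAM_TOKEN = 6   # assumed approximation: C ≈ 6 * N * D

def optimal_tokens(params):
    """Chinchilla-style token count for a given parameter count."""
    return TOKENS_PER_PARAM * params

def training_flops(params, tokens):
    """Approximate total training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens

def optimal_params_for_budget(flops):
    """Solve C = 6 * N * (20 * N) for N, giving N = sqrt(C / 120)."""
    return math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))

n = 7e9
d = optimal_tokens(n)
c = training_flops(n, d)
print(f"{d:.3g} tokens, {c:.3g} FLOPs")
print(f"{optimal_params_for_budget(c):.3g} parameters recovered from that budget")

# For comparison: tokens per parameter for an 8B model trained on 15T tokens.
print(f"{15e12 / 8e9:.0f} tokens per parameter")
```

The last line shows how far beyond the 20:1 ratio an 8B model trained on 15T tokens sits.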
This shift — training smaller models on more data for better inference economics — is a defining trend of 2024–2026.
What a Model Learns During Pre-Training
Pre-training is when a model acquires essentially all of its knowledge and most of its capabilities. Fine-tuning and alignment (RLHF, DPO) later shape how it expresses that knowledge, but they don’t add substantial new knowledge.
This has an important practical implication: if a model doesn’t know something after pre-training, instruction tuning is unlikely to teach it reliably. You need either retrieval-augmented generation (RAG), which retrieves the knowledge at inference time, or to include the knowledge in the pre-training data.
The pre-trained model is sometimes called the “base model” or “foundation model.” It’s a general-purpose pattern matcher over all of human knowledge in written form — extraordinarily capable, but not aligned to be helpful by default. That’s what fine-tuning is for.