Large Language Models Explained: Pretraining, Fine-Tuning, and Tokenization
Underneath their impressive capabilities, Large Language Models are transformer architectures (covered in Transformers) trained on text at enormous scale. Everything covered so far in this series (gradients, backpropagation, attention, optimizers) applies directly to how they’re built and trained. What makes them distinct is the training pipeline: tokenization, pretraining, and fine-tuning. Understanding that pipeline is what separates using an LLM from understanding one.
Tokenization: Converting Text Into Numbers a Model Can Process
Neural networks operate on numbers, not raw text — tokenization is the process of splitting text into discrete units (tokens) and mapping each to an integer ID.
    # Simplified conceptual tokenization (real tokenizers use subword algorithms like BPE)
    text = "deep learning is powerful"
    tokens = text.split()  # ["deep", "learning", "is", "powerful"] -- word-level, simplified

    vocab = {"deep": 1023, "learning": 892, "is": 15, "powerful": 4521}
    token_ids = [vocab[t] for t in tokens]  # [1023, 892, 15, 4521]

Modern LLMs use subword tokenization (Byte-Pair Encoding or similar algorithms) rather than whole-word tokenization, specifically to handle rare or unseen words gracefully: an uncommon word gets split into smaller, more common subword pieces rather than being treated as a single unknown, out-of-vocabulary token.
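The merging idea behind Byte-Pair Encoding can be sketched in a few lines. This is a toy illustration, not how any production tokenizer is implemented: the four-word corpus, its frequencies, and the helper names (`most_frequent_pair`, `merge_pair`, `learn_bpe`, `apply_merges`) are all assumptions made for the example, and details such as end-of-word markers and byte-level handling are left out.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties are broken by first occurrence, which keeps the toy example deterministic.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges, starting from single characters."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = {merge_pair(symbols, pair): freq for symbols, freq in words.items()}
    return merges

def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical corpus statistics: word -> how often it appears.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=4)

# "lowest" never appears in the corpus, yet it splits into two learned pieces.
print(apply_merges("lowest", merges))  # ['low', 'est']
```

The point of the sketch is the last line: a word the tokenizer never saw is still represented by pieces it did see, so nothing falls back to an unknown token.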
    # Subword tokenization handles rare words by splitting them
    "unbelievable" → ["un", "believ", "able"]  # three known subword pieces, not one unknown word

Embeddings: Turning Token IDs Into Meaningful Vectors
A token ID alone (like 1023) carries no meaning — an embedding layer maps each token ID to a dense, learned vector that captures semantic information, directly connecting to the categorical encoding discussion in Feature Engineering.
    import torch
    import torch.nn as nn

    vocab_size = 50000
    embedding_dim = 768

    embedding_layer = nn.Embedding(vocab_size, embedding_dim)
    token_ids = torch.tensor([1023, 892, 15, 4521])
    embeddings = embedding_layer(token_ids)  # shape (4, 768) -- one 768-dim vector per token

These embeddings are learned during training, not fixed in advance. Tokens used in similar contexts end up with similar embedding vectors, which is why embedding-space arithmetic and similarity comparisons (covered in Norms and Distance Metrics) tend to reflect genuine semantic relationships.
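A dependency-free sketch may help make "similar contexts, similar vectors" concrete. An embedding layer is essentially a lookup table indexed by token ID, and similarity between two tokens is usually measured with cosine similarity. The three-dimensional vectors below are hand-picked for illustration, not learned, and the token-to-word mapping in the comments is hypothetical.

```python
import math

# Hypothetical embedding table: token ID -> vector.
# Real models learn these values and use hundreds or thousands of dimensions.
embedding_table = {
    0: [0.9, 0.1, 0.0],  # imagine this is "cat"
    1: [0.8, 0.2, 0.1],  # imagine this is "dog"
    2: [0.0, 0.1, 0.9],  # imagine this is "car"
}

def embed(token_ids):
    """Look up one vector per token ID, as an embedding layer does by row index."""
    return [embedding_table[token_id] for token_id in token_ids]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cat, dog, car = embed([0, 1, 2])
sim_cat_dog = cosine_similarity(cat, dog)
sim_cat_car = cosine_similarity(cat, car)
print(f"cat vs dog: {sim_cat_dog:.3f}, cat vs car: {sim_cat_car:.3f}")
```

With these made-up vectors the first pair scores far higher than the second, which is the pattern training is expected to produce for tokens that appear in similar contexts.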
Pretraining: Learning Language From Raw, Unlabeled Text
Pretraining is a self-supervised process — the model is trained on a simple, automatically-generatable objective (predict the next token, given everything before it) applied to enormous amounts of raw, unlabeled text, connecting directly to the unsupervised/self-supervised learning discussion in Unsupervised Learning.
    # Conceptual next-token-prediction objective
    # (pseudocode: `model` and `cross_entropy` are placeholders, not defined here)
    text_so_far = "The quick brown fox jumps over the lazy"
    # The model is trained to predict "dog" given everything before it,
    # using cross-entropy loss over the entire vocabulary as possible next tokens

    target_next_token = "dog"
    predicted_distribution = model(text_so_far)  # a probability distribution over the whole vocabulary
    loss = cross_entropy(predicted_distribution, target_next_token)

This objective requires no manual labeling whatsoever. Every sentence in a massive text corpus automatically provides its own training signal (predict each next token from the tokens before it), which is why pretraining can scale to trillions of tokens without a correspondingly enormous manual labeling effort.
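The two ideas above, that raw text labels itself and that the loss is cross-entropy over the vocabulary, can be sketched without any framework. The token IDs, the four-token vocabulary, and the logit values are made up for illustration; a real model would produce the logits.

```python
import math

def next_token_pairs(token_ids):
    """Every prefix of a sequence yields one (context, target) training example."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def softmax(logits):
    """Turn raw scores into a probability distribution (shifted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_id):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target_id])

# A 4-token sequence provides 3 training examples with no manual labels.
pairs = next_token_pairs([5, 2, 7, 1])
print(pairs)  # [([5], 2), ([5, 2], 7), ([5, 2, 7], 1)]

# Toy vocabulary of 4 tokens; the correct next token has ID 3.
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target_id=3)    # model has no idea
confident_loss = cross_entropy([0.0, 0.0, 0.0, 5.0], target_id=3)  # model favors the right token
print(f"uniform: {uniform_loss:.3f}, confident: {confident_loss:.3f}")
```

A model that spreads probability evenly pays a loss of log(vocabulary size); training pushes the loss down by shifting probability toward the token that actually came next.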
Fine-Tuning: Specializing a Pretrained Model
After pretraining produces a model with broad, general language understanding, fine-tuning adapts it to a specific task or behavior using a much smaller, often labeled dataset.
    # Fine-tuning on a smaller, task-specific labeled dataset
    # (load_pretrained_model, fine_tuning_dataset, compute_loss, and optimizer are placeholders)
    model = load_pretrained_model("base-llm")

    for batch in fine_tuning_dataset:  # much smaller than the pretraining corpus
        inputs, labels = batch
        optimizer.zero_grad()  # clear gradients left over from the previous step
        predictions = model(inputs)
        loss = compute_loss(predictions, labels)
        loss.backward()
        optimizer.step()

This two-stage approach (broad pretraining, then narrow fine-tuning) is directly analogous to transfer learning in computer vision, where a CNN is pretrained on a large general image dataset and then fine-tuned on a smaller specific one. In the language domain the same idea is applied at a much larger scale.
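For a version of the fine-tuning loop that actually runs, here is a deliberately tiny stand-in: a two-parameter linear model instead of an LLM. The "pretrained" values, the task dataset, and the learning rate are all invented for the example. What it is meant to show is only the core idea: fine-tuning continues gradient descent from existing parameters on a small task-specific dataset, rather than starting from scratch.

```python
def mse_and_grads(w, b, xs, ys):
    """Return mean squared error and its gradients with respect to w and b."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    loss = sum(e * e for e in errors) / n
    grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2.0 * sum(errors) / n
    return loss, grad_w, grad_b

def fine_tune(w, b, xs, ys, learning_rate=0.05, steps=50):
    """Continue gradient descent from the given (pretrained) parameters."""
    for _ in range(steps):
        # Gradients are recomputed from scratch each step, so nothing accumulates.
        _, grad_w, grad_b = mse_and_grads(w, b, xs, ys)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Pretend these parameters came out of a large pretraining run.
pretrained_w, pretrained_b = 2.0, 0.0

# Small task-specific dataset, generated from y = 2x + 1.
task_xs = [0.0, 1.0, 2.0, 3.0]
task_ys = [1.0, 3.0, 5.0, 7.0]

loss_before, _, _ = mse_and_grads(pretrained_w, pretrained_b, task_xs, task_ys)
tuned_w, tuned_b = fine_tune(pretrained_w, pretrained_b, task_xs, task_ys)
loss_after, _, _ = mse_and_grads(tuned_w, tuned_b, task_xs, task_ys)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

The pretrained slope is already right for the new task, so only a small adjustment is needed. That is the intuition behind why fine-tuning can get by with far less data than pretraining.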
Instruction fine-tuning further adapts a pretrained model to follow instructions and answer questions helpfully, using datasets of example instructions and desired responses. RLHF (Reinforcement Learning from Human Feedback), covered in Reinforcement Learning Basics, refines the model further using human preference judgments as a reward signal, an additional stage beyond standard supervised fine-tuning.
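One common way to turn human preference judgments into a training signal is to fit a reward model with a pairwise loss: given two responses to the same prompt, the one the human preferred should receive the higher score. The sketch below shows only that loss, under the assumption of a Bradley-Terry style formulation. The reward values are made up, RLHF pipelines differ in their details, and the reinforcement learning step that uses the reward model is not shown.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    Small when the preferred response scores well above the rejected one,
    large when the reward model ranks the pair the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) is the same as log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))

# Hypothetical reward-model scores for (preferred, rejected) response pairs.
agrees = preference_loss(2.0, 0.0)     # reward model agrees with the human
undecided = preference_loss(0.0, 0.0)  # reward model cannot tell them apart
disagrees = preference_loss(0.0, 2.0)  # reward model prefers the rejected response
print(f"{agrees:.3f} < {undecided:.3f} < {disagrees:.3f}")
```

Minimizing this loss over many labeled pairs trains a scorer that approximates human preferences, which can then stand in for a human when scoring new model outputs.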
The Full Pipeline, End to End
1. Collect a massive raw text corpus (books, web pages, code, etc.)
2. Tokenize the corpus into subword tokens
3. Pretrain: next-token prediction, self-supervised, at enormous scale
4. Fine-tune: instruction-following behavior on curated examples
5. RLHF (optional but common): align outputs with human preferences
6. Deploy: serve the model for inference, covered in Deep Learning Deployment

Each stage builds directly on architecture and training concepts covered earlier in this series. The transformer blocks, attention mechanism, AdamW optimizer (covered in Optimizers), and learning rate warmup/decay schedules (covered in Learning Rate Scheduling) are all used essentially unchanged, just applied at a scale of billions of parameters and trillions of training tokens.
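As a reminder of what a warmup/decay schedule looks like in code, here is a minimal sketch of linear warmup followed by cosine decay. The choice of cosine decay, the function name `lr_at_step`, and the numbers in the usage lines are assumptions for illustration; the text above does not say which schedule any particular model uses.

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative values only: 10 warmup steps out of 100 total.
for step in (0, 5, 10, 55, 100):
    print(step, f"{lr_at_step(step, max_lr=1e-3, warmup_steps=10, total_steps=100):.6f}")
```

The learning rate climbs to its peak at the end of warmup and then falls smoothly toward `min_lr` by the final step.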
Context Window: A Practical Constraint Worth Understanding
Every LLM has a maximum context window: the number of tokens it can process in a single input. The limit is tied to the self-attention computation covered in Transformers, where every token attends to every other token, so computational cost grows quadratically with sequence length. This is a genuinely practical constraint: a document longer than the context window can’t be processed in a single pass and has to be chunked, summarized in pieces, or handled through techniques like retrieval-augmented generation instead. Context window size has grown substantially across model generations as researchers have found more computationally efficient variants of attention, but the fundamental tradeoff between context length and computational cost remains a real consideration when choosing a model or designing a system around one.
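Both points in this section, quadratic cost and chunking, are easy to sketch. The helper names (`attention_score_count`, `chunk_tokens`), the overlap parameter, and the sizes are assumptions for the example; real systems usually chunk on sentence or section boundaries and count tokens with the model's own tokenizer.

```python
def attention_score_count(sequence_length):
    """Full self-attention compares every token with every token: n * n scores."""
    return sequence_length * sequence_length

def chunk_tokens(token_ids, context_window, overlap=0):
    """Split a long token sequence into chunks that each fit the context window.

    `overlap` tokens are repeated between neighboring chunks so that text at a
    boundary keeps some of its surrounding context.
    """
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    if not 0 <= overlap < context_window:
        raise ValueError("overlap must be non-negative and smaller than context_window")
    stride = context_window - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + context_window])
        if start + context_window >= len(token_ids):
            break  # this chunk already reaches the end of the document
    return chunks

# Doubling the sequence length quadruples the number of attention scores.
print(attention_score_count(2048) // attention_score_count(1024))  # 4

# A 10-token "document" with a 4-token window and 1 token of overlap.
document = list(range(10))
print(chunk_tokens(document, context_window=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk can then be processed in its own pass, with the overlap reducing the chance that a sentence is cut off from the context it depends on.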
Summary
| Stage | Purpose |
|---|---|
| Tokenization | Converts text into a sequence of discrete, numeric tokens |
| Embeddings | Maps tokens to dense, meaningful vector representations |
| Pretraining | Self-supervised learning of broad language understanding from raw text |
| Fine-tuning | Specializes the pretrained model for specific tasks or behaviors |
LLMs aren’t a fundamentally different kind of neural network requiring an entirely new theoretical framework — they’re transformers, trained via the same core mechanisms covered throughout this series, distinguished mainly by their training pipeline’s scale and the specific self-supervised objective that makes learning from raw, unlabeled text at that scale possible.