Large Language Models Explained: Pretraining, Fine-Tuning, and Tokenization

How LLMs actually work — tokenization, embeddings, pretraining on raw text, and fine-tuning — explained for practitioners, not just users.

Underneath the impressive capabilities, Large Language Models are transformer architectures (covered in Transformers) trained at enormous scale on text. Everything covered so far in this series (gradients, backpropagation, attention, optimizers) applies directly to how they’re built and trained. What makes them distinct is the training pipeline of tokenization, pretraining, and fine-tuning, and understanding that pipeline is what separates using an LLM from understanding one.


Tokenization: Converting Text Into Numbers a Model Can Process

Neural networks operate on numbers, not raw text — tokenization is the process of splitting text into discrete units (tokens) and mapping each to an integer ID.

# Simplified conceptual tokenization (real tokenizers use subword algorithms like BPE)
text = "deep learning is powerful"
tokens = text.split() # ["deep", "learning", "is", "powerful"] -- word-level, simplified
vocab = {"deep": 1023, "learning": 892, "is": 15, "powerful": 4521}
token_ids = [vocab[t] for t in tokens] # [1023, 892, 15, 4521]

Modern LLMs use subword tokenization (Byte-Pair Encoding or similar algorithms) rather than whole-word tokenization so that rare or unseen words are handled gracefully: an uncommon word is split into smaller, more common subword pieces instead of being mapped to a single unknown, out-of-vocabulary token.
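
To make the splitting idea concrete, here is a minimal sketch of how Byte-Pair Encoding merges could be learned and applied. It is a toy, not a production tokenizer: it assumes a hypothetical word-frequency table, starts from single characters, and omits details real tokenizers add (byte-level fallback, end-of-word markers, special tokens). The function names are illustrative only.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    """Learn an ordered list of merges, starting from single characters."""
    words = {tuple(word): count for word, count in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges

def apply_bpe(word, merges):
    """Tokenize a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)

# Hypothetical word frequencies, for illustration only
corpus_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus_counts, num_merges=10)
print(apply_bpe("lowest", merges))  # "lowest" never appears in the corpus
```

Because an unseen word is built from pieces learned on other words, it never collapses into a single unknown token: in the worst case it falls back to individual characters.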

# Subword tokenization handles rare words by splitting them
"unbelievable" → ["un", "believ", "able"] # three known subword pieces, not one unknown word

Embeddings: Turning Token IDs Into Meaningful Vectors

A token ID alone (like 1023) carries no meaning — an embedding layer maps each token ID to a dense, learned vector that captures semantic information, directly connecting to the categorical encoding discussion in Feature Engineering.

import torch
import torch.nn as nn

vocab_size = 50000
embedding_dim = 768
embedding_layer = nn.Embedding(vocab_size, embedding_dim)  # one learnable row per token ID
token_ids = torch.tensor([1023, 892, 15, 4521])
embeddings = embedding_layer(token_ids)  # shape (4, 768) -- one 768-dim vector per token

These embeddings are learned during training, not fixed in advance — words used in similar contexts end up with similar embedding vectors, which is exactly why embedding-space arithmetic and similarity comparisons (covered in Norms and Distance Metrics) tend to reflect genuine semantic relationships.
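
As a rough illustration of what "similar embedding vectors" means in practice, the sketch below computes cosine similarity between a few vectors. The three-dimensional vectors are hand-picked, hypothetical values chosen only to show the mechanics. Real embeddings are learned and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (NOT learned values), for illustration only
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def nearest(word, table):
    """Return the other word whose vector is most similar to `word`'s vector."""
    others = [w for w in table if w != word]
    return max(others, key=lambda w: cosine_similarity(table[word], table[w]))

print(nearest("cat", toy_embeddings))
```

With trained embeddings, the same nearest-neighbor lookup is what surfaces words that appear in similar contexts.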


Pretraining: Learning Language From Raw, Unlabeled Text

Pretraining is a self-supervised process — the model is trained on a simple, automatically-generatable objective (predict the next token, given everything before it) applied to enormous amounts of raw, unlabeled text, connecting directly to the unsupervised/self-supervised learning discussion in Unsupervised Learning.

# Conceptual next-token-prediction objective (pseudocode: `model` and `cross_entropy` are placeholders)
text_so_far = "The quick brown fox jumps over the lazy"
# The model is trained to predict "dog" given everything before it,
# using cross-entropy loss over the entire vocabulary as possible next tokens
target_next_token = "dog"
predicted_distribution = model(text_so_far)  # a probability distribution over the whole vocabulary
loss = cross_entropy(predicted_distribution, target_next_token)

This objective requires no manual labeling whatsoever. Every sentence in a massive text corpus automatically provides its own training signal (predict each next token from the tokens before it), which is exactly why pretraining can scale to trillions of tokens without a correspondingly enormous manual labeling effort.
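
The "text labels itself" point can be sketched in a few lines. The example below is a simplified illustration, assuming token IDs are already available: it turns one sequence into (context, target) pairs and computes cross-entropy from raw scores (logits). The helper names are hypothetical, and real training computes all positions in parallel rather than one prefix at a time.

```python
import math

def next_token_examples(token_ids):
    """Every prefix of the sequence yields one (context, target) training pair."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def cross_entropy(logits, target_index):
    """Negative log-probability of the target under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

token_ids = [1023, 892, 15, 4521]  # IDs from the word-level tokenization example
for context, target in next_token_examples(token_ids):
    print(context, "->", target)
```

A sequence of n tokens yields n - 1 training examples with no human annotation, and the loss shrinks as the model assigns a higher score to the token that actually came next.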


Fine-Tuning: Specializing a Pretrained Model

After pretraining produces a model with broad, general language understanding, fine-tuning adapts it to a specific task or behavior using a much smaller, often labeled dataset.

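# Fine-tuning on a smaller, task-specific labeled dataset
# (load_pretrained_model, fine_tuning_dataset, compute_loss, and optimizer are placeholders)
model = load_pretrained_model("base-llm")
for batch in fine_tuning_dataset:  # much smaller than the pretraining corpus
    inputs, labels = batch
    optimizer.zero_grad()  # clear gradients left over from the previous step
    predictions = model(inputs)
    loss = compute_loss(predictions, labels)
    loss.backward()   # backpropagate the loss
    optimizer.step()  # update the weights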

This two-stage approach — broad pretraining, then narrow fine-tuning — is directly analogous to transfer learning in computer vision (pretraining a CNN on a large general image dataset, then fine-tuning on a smaller specific dataset), applied at a much larger scale in the language domain.

Instruction fine-tuning further adapts a pretrained model to follow instructions and answer questions helpfully, using datasets of example instructions and desired responses. RLHF (Reinforcement Learning from Human Feedback), covered in Reinforcement Learning Basics, refines the model further using human preference judgments as a reward signal, an additional stage beyond standard supervised fine-tuning.
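
To show how a human preference judgment can become a training signal, here is a minimal sketch of a pairwise preference loss. It assumes a reward model that outputs one scalar score per response and uses a pairwise logistic (Bradley-Terry style) formulation. That is one commonly used choice, not necessarily what any particular system uses. The function name and scores are hypothetical.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(reward_chosen - reward_rejected)), computed stably.

    The loss is small when the preferred response scores higher than
    the rejected one, and large when the ordering is reversed.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(diff)) == log(1 + exp(-diff)); branch to avoid overflow
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Hypothetical reward-model scores for two responses to the same prompt
print(pairwise_preference_loss(2.0, 0.0))  # ordering agrees with the human label
print(pairwise_preference_loss(0.0, 2.0))  # ordering disagrees with the human label
```

Minimizing this loss pushes the reward model to score human-preferred responses higher, and the resulting scores can then serve as the reward signal in the reinforcement learning stage.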


The Full Pipeline, End to End

1. Collect a massive raw text corpus (books, web pages, code, etc.)
2. Tokenize the corpus into subword tokens
3. Pretrain: next-token prediction, self-supervised, at enormous scale
4. Fine-tune: instruction-following behavior on curated examples
5. RLHF (optional but common): align outputs with human preferences
6. Deploy: serve the model for inference, covered in Deep Learning Deployment

Each stage builds directly on architecture and training concepts covered earlier in this series — the transformer blocks, attention mechanism, AdamW optimizer (covered in Optimizers), and learning rate warmup/decay schedules (covered in Learning Rate Scheduling) are all used essentially unchanged, just applied at a scale of billions of parameters and trillions of training tokens.
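
As a reminder of what a warmup/decay schedule looks like, the sketch below implements linear warmup followed by cosine decay. This is one common shape, assumed here for illustration. The text does not say which schedule any given model uses, and the step counts and peak rate are hypothetical.

```python
import math

def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so early, noisy gradients take small steps
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr past the end of training
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

# Hypothetical settings, for illustration only
for step in (0, 50, 99, 100, 550, 1000):
    print(step, lr_at_step(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000))
```

The same function works whether the training run is a thousand steps or a million. Only the arguments change, which is the sense in which these techniques carry over unchanged to LLM scale.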

Context Window: A Practical Constraint Worth Understanding

Every LLM has a maximum context window: the number of tokens it can process in a single input. The limit is tied to the self-attention computation covered in Transformers, where every token attends to every other token, so computational cost grows quadratically with sequence length. This is a practical constraint: a document longer than the context window can’t be processed in a single pass and must instead be chunked, summarized in pieces, or handled through techniques like retrieval-augmented generation. Context window size has grown substantially across model generations as researchers have found more computationally efficient variants of attention, but the tradeoff between context length and computational cost remains a real consideration when choosing a model or designing a system around one.
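
Chunking is the simplest of these workarounds, and a minimal sketch looks like the following. It assumes the document has already been tokenized into a list of IDs, and the window size and overlap are hypothetical. The second helper counts token pairs in full self-attention to make the quadratic growth visible.

```python
def chunk_token_ids(token_ids, max_len, overlap=0):
    """Split a token sequence into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that text near a
    boundary appears with some surrounding context in both chunks.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    stride = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):  # this window reached the end
            break
    return chunks

def attention_pair_count(seq_len):
    """Number of (query, key) pairs in full self-attention over seq_len tokens."""
    return seq_len * seq_len

document = list(range(10))  # stand-in for 10 token IDs
print(chunk_token_ids(document, max_len=4, overlap=1))
print(attention_pair_count(1000), attention_pair_count(2000))
```

Doubling the sequence length quadruples the number of attention pairs, which is why splitting a long document into several shorter windows can be much cheaper than one long pass, at the cost of the model never seeing the whole document at once.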

Summary

Stage | Purpose
Tokenization | Converts text into a sequence of discrete, numeric tokens
Embeddings | Maps tokens to dense, meaningful vector representations
Pretraining | Self-supervised learning of broad language understanding from raw text
Fine-tuning | Specializes the pretrained model for specific tasks or behaviors

LLMs aren’t a fundamentally different kind of neural network requiring an entirely new theoretical framework — they’re transformers, trained via the same core mechanisms covered throughout this series, distinguished mainly by their training pipeline’s scale and the specific self-supervised objective that makes learning from raw, unlabeled text at that scale possible.