Large Language Models (LLMs)

You’ve used them. You may have built products on them. But do you actually understand what happens when you send a message to Claude or GPT-4? This guide explains how LLMs work: not the hype, not the dismissals, just the mechanics.

What Makes a Language Model “Large”?

The word “large” is relative, but in practice it refers to models with billions (sometimes hundreds of billions) of parameters trained on trillions of tokens of text.

Model	Parameters	Training Data	Context Window (tokens)
GPT-2 (2019)	1.5B	~40 GB of text (WebText)	1,024
GPT-3 (2020)	175B	~300B tokens	2,048
GPT-4 (2023)	Undisclosed (rumored >1T)	Undisclosed	8K / 32K at launch; 128K with GPT-4 Turbo
Llama 3.1 70B (2024)	70B	~15T tokens	128K
Claude 3.5 Sonnet (2024)	Undisclosed	Undisclosed	200K
Gemini 1.5 Pro (2024)	Undisclosed	Undisclosed	1M+
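
To make those parameter counts concrete, here is a rough sketch of how much memory the weights alone would occupy. It assumes 2 bytes per parameter (16-bit floats), which is common but not universal, and it ignores activations, the KV cache, and optimizer state. The function name weight_memory_gb is made up for this illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Parameter counts for the models whose sizes are publicly known.
models = {"GPT-2": 1.5e9, "GPT-3": 175e9, "70B-class model": 70e9}

for name, n in models.items():
    # Assumption: 16-bit weights, nothing else resident in memory.
    print(f"{name}: ~{weight_memory_gb(n):.0f} GB at 16-bit precision")
```

Under these assumptions a 70B-parameter model needs about 140 GB just for weights, which is why such models are usually quantized or split across several GPUs.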

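The jump from GPT-2 to GPT-3 was more than a roughly 100× increase in parameters: GPT-3 showed capabilities, such as learning a task from a few examples in the prompt, that smaller models largely lacked. The bet that capabilities keep improving as parameters, data, and compute grow is known as the scaling hypothesis, and it has driven much of the LLM industry.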

Under the Hood: What an LLM Actually Does

An LLM is, at its core, a next-token predictor. Given a sequence of tokens, it outputs a probability distribution over all possible next tokens.

Input:  "The best way to learn programming is to"
Model:  [Computes probability over 50,000+ vocabulary tokens]
Top-5:  "practice" (0.31), "build" (0.22), "write" (0.18), "just" (0.09), "actually" (0.07)
Output: "practice" ← sampled based on temperature setting

That’s it. Append the chosen token to the input, repeat hundreds or thousands of times, and you get paragraphs, essays, code, or conversations.
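
The sampling step can be sketched in a few lines. This is a minimal illustration, not how any particular model is implemented: the vocabulary and the raw scores (logits) below are invented, and real models work over tens of thousands of tokens. It shows how temperature reshapes the distribution before a token is drawn.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Toy vocabulary and scores (invented for illustration, not real model output).
vocab = ["practice", "build", "write", "just", "actually"]
logits = [2.0, 1.65, 1.45, 0.75, 0.5]

print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=0.2))  # far more peaked on "practice"
print(sample_next_token(vocab, logits, temperature=0.8, rng=random.Random(0)))
```

As the temperature approaches zero, sampling approaches always picking the highest-scoring token; higher temperatures spread probability onto less likely tokens.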

The model isn’t “thinking” in any philosophical sense — it’s performing an extraordinarily sophisticated pattern completion operation. But at enough scale, that pattern completion produces outputs that are genuinely useful, creative, and sometimes surprising.

The Training Process (Condensed)

LLM training happens in stages:

Stage 1: Pre-training

The model reads a massive corpus — Common Crawl, GitHub, Wikipedia, books, scientific papers — and learns to predict the next token. No labels needed. This is where the model learns grammar, facts, code syntax, reasoning patterns, and world knowledge.

Text: "The Eiffel Tower is located in ___"
Model predicts: "Paris" (from seeing this pattern millions of times)

Pre-training takes weeks to months on thousands of GPUs and costs tens of millions of dollars for frontier models.
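
The "no labels needed" point is easier to see in code. The sketch below swaps the neural network for a simple bigram counter (a deliberate simplification, not how LLMs are built) and uses whitespace splitting as a stand-in tokenizer. What carries over is the objective: every training target is just the next token in the raw text, and quality is measured by the average negative log-probability assigned to it.

```python
import math
from collections import Counter, defaultdict

corpus = "the eiffel tower is located in paris . the louvre is located in paris ."
tokens = corpus.split()  # whitespace "tokenizer", a simplification

# Each (previous token, next token) pair comes straight from the text: no human labels.
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    """Probability of each token that followed `prev` in the corpus."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def cross_entropy(seq):
    """Average negative log-probability of each actual next token."""
    losses = [-math.log(next_token_probs(p)[n]) for p, n in zip(seq, seq[1:])]
    return sum(losses) / len(losses)

print(next_token_probs("in"))   # {'paris': 1.0}
print(next_token_probs("the"))  # split evenly between 'eiffel' and 'louvre'
print(cross_entropy(tokens))    # the quantity that training pushes down
```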

Stage 2: Supervised Fine-Tuning (SFT)

Human-written demonstrations of good assistant behavior are used to fine-tune the pre-trained model. This teaches it to follow instructions, structure responses, and behave like a helpful assistant rather than just completing text.

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

Human raters compare pairs of model outputs and pick the better one. These preferences train a reward model that predicts which outputs humans would prefer. The LLM is then optimized with reinforcement learning (commonly PPO) to generate outputs the reward model scores highly.

Output A: "I can help you with that! Here's a Python function..."
Output B: "Sure! def calculate(x): ..."
Human: A is better (more structured, explains context)
Reward model: learns this preference
LLM: fine-tuned to generate A-style responses
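
How does a reward model "learn this preference"? One common formulation, sketched here under the assumption of a Bradley-Terry style pairwise loss, penalizes the reward model whenever it scores the rejected output at or above the chosen one. The reward values below are made-up numbers; in practice they come from a neural network.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # Equivalent to log(1 + exp(-margin)) but safe for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical reward scores for Output A (chosen) and Output B (rejected).
print(pairwise_preference_loss(2.0, 0.5))  # small loss: model agrees with the human
print(pairwise_preference_loss(0.5, 2.0))  # larger loss: model disagrees
print(pairwise_preference_loss(1.0, 1.0))  # log(2): model is indifferent
```

Minimizing this loss over many labeled pairs pushes the reward model to rank outputs the way the raters did.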

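Stage 3 (alternative): Direct Preference Optimization (DPO)

DPO, introduced in 2023, is increasingly used in place of RLHF for preference tuning. It optimizes the model directly on preference pairs, without training a separate reward model or running a reinforcement learning loop. This makes it simpler and usually more stable, and it is widely adopted in open-weight models.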
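
For a single preference pair, the DPO objective can be sketched as below. The inputs are summed log-probabilities of the chosen and rejected responses under the model being trained (the policy) and under a frozen reference copy; the numbers in the example are invented, and beta = 0.1 is only an assumed, commonly used setting.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_gain - rejected_gain)))."""
    # How much more (or less) likely each response became relative to the reference.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # Numerically stable log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: loss is log(2), no preference learned yet.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has raised the chosen response and lowered the rejected one: loss drops.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The preference signal acts on the model's own log-probabilities, so no separate reward network appears anywhere in the computation.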

Emergent Capabilities

Something strange happens at scale: models develop capabilities they were never explicitly trained for. These are called emergent abilities.

Chain-of-thought reasoning: In early experiments, step-by-step prompting helped mainly at large scale (on the order of 100B parameters), while smaller models could not do this reliably unless specifically trained for it
In-context learning: The ability to learn a new task from a few examples provided in the prompt — without any weight updates
Instruction following: Understanding nuanced, multi-part instructions
Code generation: Writing syntactically correct code in dozens of languages

Nobody fully understands why emergence happens. The leading hypothesis is that at some scale threshold, the model builds sufficiently rich internal representations to support these higher-order operations.
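
In-context learning, from the list above, needs nothing more than a carefully assembled prompt. The helper below is a hypothetical sketch: the "Input:/Output:" layout and the function name build_few_shot_prompt are assumptions for illustration, not a format any provider requires.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a prompt that shows the task through examples, then asks a new question."""
    blocks = [instruction] if instruction else []
    for text, label in examples:
        blocks.append(f"Input: {text}\nOutput: {label}")
    # Leave the final output blank so the model completes it.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

examples = [
    ("The battery died after two days.", "negative"),
    ("Setup took thirty seconds and it just works.", "positive"),
]
prompt = build_few_shot_prompt(
    examples,
    "Screen is gorgeous but the speakers are tinny.",
    instruction="Classify the sentiment of each review.",
)
print(prompt)
```

The model's weights never change; the task is specified entirely by the examples in the prompt.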

What LLMs Are Good At (and Bad At)

Strong Areas

Summarization and information extraction
Code generation and explanation
Translation across 100+ languages
Creative writing and ideation
Structured data extraction from text
Question answering over provided context

Known Weaknesses

Hallucination: Confidently generating false information. Especially problematic for specific numbers, dates, citations, and medical/legal facts.
Arithmetic: While reasoning models (o3, Gemini Thinking) are better, general LLMs struggle with multi-step numerical computation. Use a calculator tool.
Knowledge cutoff: No awareness of events after the training data cutoff, unless that information is supplied in the prompt or through a search tool.
Sycophancy: Tendency to agree with the user’s framing even when wrong.
Long-context retrieval: Finding a specific fact in a 100K-token context is harder than it sounds (the “lost in the middle” problem).
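
For the arithmetic weakness, the usual fix is to let the model hand the expression to a tool rather than compute it token by token. Below is a minimal sketch of such a calculator, assuming the model emits a plain arithmetic string. The function name calculate is hypothetical, and wiring it into a specific provider's tool-calling API is left out because those interfaces differ.

```python
import ast
import operator

# Only these operations are allowed; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def calculate(expression: str):
    """Evaluate +, -, *, / on numbers without calling eval() on untrusted text."""
    def _eval(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval").body)

print(calculate("1234 * 5678 + 91"))  # 7006743
```

Walking the syntax tree instead of calling eval() means function calls, attribute access, and names are refused outright, which matters when the input comes from a model.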

The 2025–2026 Frontier

The frontier is moving fast. Key developments:

Reasoning Models: OpenAI o3, Gemini 2.0 Flash Thinking, and Claude 3.7 Sonnet can all use extended “thinking” before answering, generating internal chain-of-thought tokens that are often hidden from or only summarized for the user. They are dramatically better at math, science, and multi-step logic.

Multimodal LLMs: GPT-4o, Claude 3.5, and Gemini 1.5 all natively understand images, PDFs, and in some cases audio. The distinction between “language model” and “foundation model” is blurring.

Small but Mighty: Phi-4 (14B), Gemma 3 (12B), and LLaMA 3.2 (3B) are showing that small models trained on high-quality data and then instruction-tuned can match earlier large models on many tasks.

Open Source Catching Up: As of 2025, open-weight models (LLaMA 3.1 405B, Qwen 2.5 72B, Mistral Large) match or beat GPT-3.5 on most benchmarks and are approaching GPT-4 on specialized tasks.

Picking the Right LLM for Your Use Case

Need low latency + low cost?      → Gemini Flash, Claude Haiku, LLaMA 3.1 8B
Need best reasoning?              → o3, Claude 3.7 Sonnet, Gemini 2.0 Pro
Need very long context?           → Gemini 1.5 Pro (1M), Claude 3.5 (200K)
Need to run locally?              → LLaMA 3.2, Mistral 7B, Phi-4
Need best code generation?        → GPT-4o, Claude 3.5 Sonnet, DeepSeek-Coder-V2
Need multilingual?                → Qwen 2.5, Aya (Cohere), mT5

No single model is best at everything. The right choice depends on your latency budget, cost per token, privacy requirements, and task type.
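
If you want that decision table in code, one option is to keep it as plain data and look up shortlists by requirement. This is only a sketch: the requirement keys and the function name candidates_for are invented here, and the model lists are copied from the table above rather than benchmarked.

```python
# Shortlists copied from the table above; a starting point, not a ranking.
SHORTLISTS = {
    "low_latency": ["Gemini Flash", "Claude Haiku", "LLaMA 3.1 8B"],
    "reasoning": ["o3", "Claude 3.7 Sonnet", "Gemini 2.0 Pro"],
    "long_context": ["Gemini 1.5 Pro", "Claude 3.5"],
    "local": ["LLaMA 3.2", "Mistral 7B", "Phi-4"],
    "code": ["GPT-4o", "Claude 3.5 Sonnet", "DeepSeek-Coder-V2"],
    "multilingual": ["Qwen 2.5", "Aya (Cohere)", "mT5"],
}

def candidates_for(*needs):
    """Return the shortlist for each requested requirement, failing loudly on typos."""
    unknown = [n for n in needs if n not in SHORTLISTS]
    if unknown:
        raise KeyError(f"Unknown requirement(s): {unknown}")
    return {need: SHORTLISTS[need] for need in needs}

print(candidates_for("local", "code"))
```

Keeping the table as data makes it easy to update as models change, and to evaluate each shortlisted model against your own latency, cost, and privacy constraints.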