Generative AI 26 guides · updated 2026

From transformer foundations to production RAG, tool-using agents, and the Model Context Protocol — the GenAI stack as it's actually being built in 2026.

Large Language Models (LLMs)

You’ve used them. You’ve maybe built products on them. But do you actually understand what’s happening when you send a message to Claude or GPT-4? This guide gives you a real understanding of how LLMs work — not the hype, not the dismissals, just the mechanics.


What Makes a Language Model “Large”?

The word “large” is relative, but in practice it refers to models with billions (sometimes hundreds of billions) of parameters trained on trillions of tokens of text.

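Model                    | Parameters              | Training Tokens                            | Context Window (tokens)
GPT-2 (2019)             | 1.5B                    | Not published (~40 GB of text)             | 1,024
GPT-3 (2020)             | 175B                    | ~300B                                      | 2,048
LLaMA 3 70B (2024)       | 70B                     | 15T                                        | 8K (128K from Llama 3.1)
GPT-4 (2023)             | ~1T (est., undisclosed) | >10T (est.)                                | 8K–32K at launch (128K with GPT-4 Turbo)
Claude 3.5 Sonnet (2024) | Unknown                 | Unknown                                    | 200K
Gemini 1.5 Pro (2024)    | Unknown                 | Unknown                                    | 1M+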

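The jump from GPT-2 to GPT-3 was more than a roughly 100× increase in parameters: GPT-3 showed capabilities, such as few-shot learning from examples placed in the prompt, that smaller models largely lacked. That observation underpins the scaling hypothesis, the bet that more parameters, data, and compute keep producing more capable models, which drove the entire LLM industry.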


Under the Hood: What an LLM Actually Does

An LLM is, at its core, a next-token predictor. Given a sequence of tokens, it outputs a probability distribution over all possible next tokens.

Input: "The best way to learn programming is to"
Model: [Computes probability over 50,000+ vocabulary tokens]
Top-5: "practice" (0.31), "build" (0.22), "write" (0.18), "just" (0.09), "actually" (0.07)
Output: "practice" ← sampled based on temperature setting

That’s it. Repeat this thousands of times and you get paragraphs, essays, code, or conversations.
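The sampling step in the example above can be sketched in a few lines. This is a minimal illustration, not how any particular model implements it: the logits are hypothetical values derived from the five example probabilities (renormalized over just those five tokens), and the function names are made up for this sketch.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities. Lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical five-token vocabulary taken from the example above.
vocab = ["practice", "build", "write", "just", "actually"]
# Assumed logits: the log of each illustrative probability.
logits = [math.log(p) for p in (0.31, 0.22, 0.18, 0.09, 0.07)]

for t in (0.2, 1.0, 2.0):
    top = softmax_with_temperature(logits, t)[0]
    print(f"temperature={t}: P('practice') = {top:.2f}")
```

At low temperature the top token dominates and output becomes nearly deterministic; at high temperature the distribution flattens and less likely tokens get picked more often.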

The model isn’t “thinking” in any philosophical sense — it’s performing an extraordinarily sophisticated pattern completion operation. But at enough scale, that pattern completion produces outputs that are genuinely useful, creative, and sometimes surprising.
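The "repeat this" loop is what people mean by autoregressive generation: each sampled token is appended to the context and fed back in. The sketch below stands in a tiny hand-written lookup table for the model, which is purely an assumption for illustration; a real LLM computes the distribution with a neural network conditioned on the whole context.

```python
import random

# Hypothetical toy "model": next-token probabilities that depend only on the last token.
BIGRAMS = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def next_token_distribution(context):
    """Return {token: probability} for the next position given the context so far."""
    return BIGRAMS[context[-1]]

def generate(prompt, max_new_tokens=10, seed=0):
    """Sample one token at a time, appending each to the context, until end-of-sequence."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist)
        weights = [dist[c] for c in candidates]
        token = rng.choices(candidates, weights=weights, k=1)[0]
        tokens.append(token)
        if token == "</s>":  # stop token ends generation early
            break
    return tokens

print(" ".join(generate(["<s>"])))
```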


The Training Process (Condensed)

LLM training happens in stages:

Stage 1: Pre-training

The model reads a massive corpus — Common Crawl, GitHub, Wikipedia, books, scientific papers — and learns to predict the next token. No labels needed. This is where the model learns grammar, facts, code syntax, reasoning patterns, and world knowledge.

Text: "The Eiffel Tower is located in ___"
Model predicts: "Paris" (from seeing this pattern millions of times)

Pre-training takes weeks to months on thousands of GPUs and costs tens of millions of dollars for frontier models.
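The "predict the next token" objective is usually implemented as a cross-entropy loss: the model is penalized by the negative log of the probability it assigned to the token that actually came next. A minimal sketch, with hypothetical probability values chosen only to show the effect:

```python
import math

def next_token_loss(predicted_probs, target_token):
    """Cross-entropy at one position: -log of the probability given to the true next token."""
    return -math.log(predicted_probs[target_token])

def sequence_loss(per_position_probs, targets):
    """Average the per-position losses over a sequence."""
    losses = [next_token_loss(p, t) for p, t in zip(per_position_probs, targets)]
    return sum(losses) / len(losses)

# Hypothetical predictions for "The Eiffel Tower is located in ___".
confident = {"Paris": 0.90, "London": 0.05, "Rome": 0.05}
unsure = {"Paris": 0.40, "London": 0.35, "Rome": 0.25}

print(f"confident model loss: {next_token_loss(confident, 'Paris'):.3f}")
print(f"unsure model loss:    {next_token_loss(unsure, 'Paris'):.3f}")
```

Training adjusts the weights to push this loss down across the whole corpus, which is why no human labels are needed: the text itself supplies the targets.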

Stage 2: Supervised Fine-Tuning (SFT)

Human-written demonstrations of good assistant behavior are used to fine-tune the pre-trained model. This teaches it to follow instructions, structure responses, and behave like a helpful assistant rather than just completing text.
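SFT generally reuses the same next-token loss, applied to prompt/response demonstrations. One common convention, assumed here rather than taken from this guide, is to compute the loss only on the response tokens so the model learns to answer rather than to reproduce prompts. The whitespace "tokenizer" and function names below are hypothetical simplifications.

```python
def build_sft_example(prompt, response):
    """Concatenate prompt and response; mark only response tokens as contributing to the loss."""
    prompt_tokens = prompt.split()      # stand-in for a real tokenizer
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

def masked_mean(losses, mask):
    """Average per-token losses over the positions where mask == 1."""
    count = sum(mask)
    if count == 0:
        return 0.0
    return sum(l * m for l, m in zip(losses, mask)) / count

tokens, mask = build_sft_example(
    "User: add two numbers",
    "Assistant: def add(a, b): return a + b",
)
print(list(zip(tokens, mask)))
```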

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

Human raters compare pairs of model outputs and pick the better one. These preferences train a reward model, which assigns a scalar score to any output. The LLM is then optimized with reinforcement learning (commonly PPO) to generate outputs the reward model scores highly.

Output A: "I can help you with that! Here's a Python function..."
Output B: "Sure! def calculate(x): ..."
Human: A is better (more structured, explains context)
Reward model: learns this preference
LLM: fine-tuned to generate A-style responses
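The "reward model learns this preference" step is commonly framed as a pairwise (Bradley-Terry style) loss: the reward for the preferred output should exceed the reward for the rejected one. The sketch below assumes that formulation; the scalar rewards are hypothetical numbers standing in for a real reward model's outputs.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: small when the chosen output is scored well above the rejected one."""
    return -log_sigmoid(reward_chosen - reward_rejected)

# Hypothetical scores for Output A (chosen) and Output B (rejected).
print(f"agrees with human:    {preference_loss(2.0, 0.0):.3f}")
print(f"no preference yet:    {preference_loss(1.0, 1.0):.3f}")
print(f"disagrees with human: {preference_loss(0.0, 2.0):.3f}")
```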

Stage 3 alternative: Direct Preference Optimization (DPO)

Introduced in 2023, DPO is increasingly used in place of RLHF for preference tuning. It optimizes the model directly on preference pairs, without training a separate reward model or running an RL loop, which makes it simpler and more stable. It is widely adopted in open-source models.
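As a rough sketch of the DPO objective for a single preference pair: the loss rewards the model for raising the log-probability of the preferred response, relative to a frozen reference model, more than it raises that of the rejected response. The log-probability values and the default `beta` below are illustrative assumptions, not settings from any specific training run.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp        # gain on preferred response
    rejected_shift = policy_rejected_logp - ref_rejected_logp  # gain on rejected response
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))

# Hypothetical sequence log-probabilities.
start = dpo_loss(-12.0, -11.0, -12.0, -11.0)   # policy identical to the reference
better = dpo_loss(-9.0, -14.0, -12.0, -11.0)   # policy now favors the chosen response
print(f"at initialization: {start:.3f}")
print(f"after improvement: {better:.3f}")
```

Note that no reward model appears anywhere: the preference signal is expressed directly through the policy and reference log-probabilities.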


Emergent Capabilities

Something strange happens at scale: models develop capabilities they were never explicitly trained for. These are called emergent abilities.

Nobody fully understands why emergence happens. The leading hypothesis is that at some scale threshold, the model builds sufficiently rich internal representations to support these higher-order operations.


What LLMs Are Good At (and Bad At)

Strong Areas

Known Weaknesses


The 2025–2026 Frontier

The frontier is moving fast. Key developments:

Reasoning Models: OpenAI o3, Gemini 2.0 Flash Thinking, and Claude 3.7 Sonnet all use extended “thinking” before answering, generating intermediate chain-of-thought tokens that, depending on the product, are hidden, summarized, or shown to the user. These models are markedly better at math, science, and multi-step logic.

Multimodal LLMs: GPT-4o, Claude 3.5, and Gemini 1.5 all natively understand images, PDFs, and in some cases audio. The distinction between “language model” and “foundation model” is blurring.

Small but Mighty: Phi-4 (14B), Gemma 3 (12B), and LLaMA 3.2 (3B) are showing that small models trained on high-quality data and then instruction-tuned can match earlier large models on many tasks.

Open Source Catching Up: As of 2025, open-source models (LLaMA 3.1 405B, Qwen 2.5 72B, Mistral Large) are competitive with or better than GPT-3.5 on most benchmarks, and approaching GPT-4 on specialized tasks.


Picking the Right LLM for Your Use Case

Need low latency + low cost? → Gemini Flash, Claude Haiku, LLaMA 3.1 8B
Need best reasoning? → o3, Claude 3.7 Sonnet, Gemini 2.0 Pro
Need very long context? → Gemini 1.5 Pro (1M), Claude 3.5 (200K)
Need to run locally? → LLaMA 3.2, Mistral 7B, Phi-4
Need best code generation? → GPT-4o, Claude 3.5 Sonnet, DeepSeek-Coder-V2
Need multilingual? → Qwen 2.5, Aya (Cohere), mT5

No single model is best at everything. The right choice depends on your latency budget, cost per token, privacy requirements, and task type.
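If you want the decision list above in code, a plain lookup table is enough to start with. This is only a sketch: the keys and the `shortlist` helper are hypothetical names, the model lists are copied from this guide, and they will go stale as new models ship.

```python
# Shortlists copied from the decision list in this guide (not a benchmark result).
SHORTLISTS = {
    "low_latency_low_cost": ["Gemini Flash", "Claude Haiku", "LLaMA 3.1 8B"],
    "best_reasoning": ["o3", "Claude 3.7 Sonnet", "Gemini 2.0 Pro"],
    "long_context": ["Gemini 1.5 Pro", "Claude 3.5"],
    "run_locally": ["LLaMA 3.2", "Mistral 7B", "Phi-4"],
    "code_generation": ["GPT-4o", "Claude 3.5 Sonnet", "DeepSeek-Coder-V2"],
    "multilingual": ["Qwen 2.5", "Aya (Cohere)", "mT5"],
}

def shortlist(*needs):
    """Return candidate models for the given needs, in first-seen order without duplicates."""
    candidates = []
    for need in needs:
        if need not in SHORTLISTS:
            raise KeyError(f"unknown need: {need!r}")
        for model in SHORTLISTS[need]:
            if model not in candidates:
                candidates.append(model)
    return candidates

print(shortlist("run_locally", "multilingual"))
```

Treat the result as a list of candidates to evaluate against your own latency, cost, and privacy constraints, not as a final answer.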