Reinforcement Learning
Unlike supervised learning (learn from labeled data) or unsupervised learning (find patterns), reinforcement learning is about learning through doing. An agent takes actions in an environment, receives rewards or penalties, and gradually learns which actions lead to better outcomes.
It’s how AlphaGo learned to beat world champions. How robots learn to walk. And increasingly, how LLMs are aligned to be helpful.
Core Components
           ┌──────────────────┐
State  ──▶ │                  │ ──▶ Action
           │      AGENT       │
Reward ──▶ │                  │
           └──────────────────┘
                    │
                    │ takes action
                    ▼
           ┌──────────────────┐
           │   ENVIRONMENT    │ ──▶ New State
           │                  │ ──▶ Reward
           └──────────────────┘

- State (s): The agent’s current observation of the environment
- Action (a): What the agent does (move, buy, accept, rotate)
- Reward (r): Scalar feedback (positive = good, negative = bad)
- Policy (π): The mapping from states to actions that the agent learns
- Value function (V): Expected cumulative future reward from a state
The goal: learn a policy that maximizes cumulative reward over time.
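"Cumulative reward" is usually made precise as a discounted return, where a reward received t steps in the future is weighted by gamma to the power t. The sketch below is one minimal way to compute it for a finished episode; the function name `discounted_return` and the example rewards are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma=0.95):
    """Return sum over t of gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: two neutral steps, then a reward of 1.
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.5))  # 0.25
```

A gamma close to 1 makes the agent far-sighted; a small gamma makes it care mostly about immediate reward.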
The Exploration-Exploitation Dilemma
Every RL agent faces a fundamental tension:
Exploration: Try new actions to discover if they’re better
Exploitation: Stick with what you know works
Go too heavy on exploitation and you get stuck in local optima. Go too heavy on exploration and you never leverage what you’ve learned.
ε-greedy strategy: With probability ε, take a random action; otherwise take the best known action. Decay ε over time as the agent becomes more confident.
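A minimal sketch of ε-greedy with decay, using only the standard library. The exponential schedule and its constants (`start`, `end`, `decay`) are assumptions chosen for illustration; linear schedules are just as common.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Exponentially decay epsilon per episode, never going below `end`."""
    return max(end, start * decay ** episode)


q_values = [0.1, 0.7, 0.3]  # made-up action-value estimates for one state
for episode in (0, 100, 1000):
    eps = decayed_epsilon(episode)
    print(episode, round(eps, 3), epsilon_greedy(q_values, eps))
```

Keeping a floor on ε (here 0.05) means the agent never stops exploring entirely, which helps if the environment changes.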
Key Algorithms
Q-Learning (Value-Based)
Learn the value of (state, action) pairs. The Q-table stores: “if I’m in state s and take action a, what’s the expected total reward?”
import gymnasium as gym
import numpy as np

# Simple Q-learning on a grid world.
# Assumption: FrozenLake-v1 (4x4 grid, 16 states, 4 actions) stands in for the unspecified env.
env = gym.make("FrozenLake-v1")
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for episode in range(10000):
    state, _ = env.reset()
    done = False
    while not done:
        # ε-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # explore
        else:
            action = int(np.argmax(Q[state]))  # exploit

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-update (Bellman equation)
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state

Deep Q-Network (DQN): Replace the Q-table with a neural network when the state space is too large (e.g., raw pixel input from Atari games). In 2015, DeepMind reported that DQN reached performance comparable to a professional human tester across a set of 49 Atari games.
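DQN trains its network toward the same Bellman target as the tabular update, computed over a batch of past transitions sampled from a replay buffer. The sketch below shows only that target computation, not a full DQN. The names `td_targets` and `fake_next_q` are hypothetical, and the hard-coded Q-values stand in for the output of a target network.

```python
import random
from collections import deque

# Replay buffer of (state, action, reward, next_state, done) transitions.
replay_buffer = deque(maxlen=10_000)


def td_targets(batch, next_q_values, gamma=0.95):
    """Compute r + gamma * max_a Q_target(s', a), with no bootstrap at terminal states."""
    targets = []
    for (_, _, reward, _, done), q_next in zip(batch, next_q_values):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(reward + bootstrap)
    return targets


# Made-up transitions; states are plain integers here instead of pixel frames.
replay_buffer.append((0, 1, 0.0, 1, False))
replay_buffer.append((1, 0, 1.0, 2, True))

batch = random.sample(list(replay_buffer), k=2)
# Stand-in for target_network(next_state): one list of Q-values per transition.
fake_next_q = [[0.2, 0.8] for _ in batch]
print(td_targets(batch, fake_next_q))
```

In a real implementation the network's prediction for the taken action is regressed toward these targets, and the target network's weights are refreshed only periodically to keep the targets from shifting too fast.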
Policy Gradient Methods
Instead of learning action values, directly learn the policy — a probability distribution over actions given a state.
REINFORCE: Increase the probability of actions that led to high rewards, decrease those that led to low rewards.
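To make the REINFORCE idea concrete, here is a small sketch on a two-armed bandit, the simplest possible "environment". Everything in it is an assumption for illustration: the payout probabilities, the learning rate, and the softmax policy with one logit per action. It omits the baseline that practical implementations subtract from the reward to reduce variance.

```python
import math
import random


def softmax(prefs):
    """Turn a list of logits into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(steps=2000, lr=0.1, seed=0):
    """Run REINFORCE on a two-armed bandit and return the final policy."""
    rng = random.Random(seed)
    prefs = [0.0, 0.0]         # policy parameters: one logit per action
    mean_rewards = [0.2, 0.8]  # hypothetical payout probabilities; arm 1 is better
    for _ in range(steps):
        probs = softmax(prefs)
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if rng.random() < mean_rewards[action] else 0.0
        # Gradient of log pi(action) w.r.t. logit k is 1[k == action] - probs[k].
        for k in range(2):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            prefs[k] += lr * reward * grad_log_pi
    return softmax(prefs)


print(train_bandit())  # probability mass should shift toward arm 1
```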
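PPO (Proximal Policy Optimization): One of the most widely used policy gradient algorithms in practice. It limits how far each update can move the policy, which makes training comparatively stable and robust to hyperparameter choices, and it is reasonably sample-efficient for an on-policy method. OpenAI has used it for robotics and, via RLHF, for training ChatGPT.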
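The core of PPO is its clipped surrogate objective, which stops a single update from moving the policy too far from the one that collected the data. A minimal sketch of that objective for a batch follows. The function name is an assumption, and `clip_eps=0.2` is a commonly used default rather than a required value.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)


# ratio = pi_new(a|s) / pi_old(a|s); the values below are made up.
print(ppo_clipped_objective([1.5, 0.5], [1.0, -1.0]))
```

In the first sample the ratio 1.5 is clipped to 1.2, so the policy gets no extra credit for moving further in an already-favored direction. In the second, the more pessimistic of the two terms is kept. Training maximizes this objective.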
Actor-Critic Methods
Combine value-based and policy gradient approaches: an “actor” selects actions, and a “critic” evaluates them. Because the critic’s value estimate acts as a baseline for the actor’s updates, variance is lower than with pure policy gradient.
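One way to see how the actor and critic interact is a single one-step update on tabular parameters. This is a sketch under simplifying assumptions (a table of logits for the actor, a table of state values for the critic, made-up learning rates), not a production algorithm.

```python
import math


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(logits, values, state, action, reward, next_state, done,
                      gamma=0.95, actor_lr=0.1, critic_lr=0.1):
    """Apply one actor-critic update in place and return the TD error."""
    # Critic: the TD error says how much better the outcome was than expected.
    bootstrap = 0.0 if done else gamma * values[next_state]
    td_error = reward + bootstrap - values[state]
    values[state] += critic_lr * td_error
    # Actor: shift probability toward the taken action in proportion to td_error.
    probs = softmax(logits[state])
    for a in range(len(probs)):
        grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
        logits[state][a] += actor_lr * td_error * grad_log_pi
    return td_error


logits = [[0.0, 0.0], [0.0, 0.0]]  # actor: one row of action logits per state
values = [0.0, 0.0]                # critic: V(s) per state
delta = actor_critic_step(logits, values, state=0, action=1, reward=1.0,
                          next_state=1, done=False)
print(delta, values[0], logits[0])
```

The actor is scaled by the TD error rather than the raw reward, which is where the variance reduction comes from.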
RL in Practice: Environments
import gymnasium as gym

# Classic RL environments
env = gym.make("CartPole-v1")     # Balance a pole on a cart
env = gym.make("LunarLander-v2")  # Land a spacecraft (newer Gymnasium releases register this as LunarLander-v3)
env = gym.make("Humanoid-v4")     # Train a walking humanoid

# Training loop structure (agent is a placeholder for your own implementation)
obs, _ = env.reset()
for step in range(1000):
    action = agent.select_action(obs)
    next_obs, reward, terminated, truncated, info = env.step(action)
    # Pass the full transition so the agent can learn from it
    agent.update(obs, action, reward, next_obs, terminated)
    obs = next_obs
    if terminated or truncated:
        obs, _ = env.reset()

RLHF: Reinforcement Learning from Human Feedback
Arguably the most impactful application of RL in 2024–2026 is aligning large language models. RLHF is a key part of how ChatGPT, Claude, and Gemini went from raw text predictors to helpful assistants:
1. SFT phase: Fine-tune the LLM on human demonstrations
2. Reward model: Train a model to predict human preferences (human raters compare pairs of outputs)
3. PPO phase: Use RL to fine-tune the LLM to maximize the reward model’s score while staying close to the SFT model

DPO (Direct Preference Optimization) has become a popular alternative to PPO for LLM alignment because it is simpler: it skips the separate reward model and RL loop and optimizes directly on preference pairs. The RL framing remains foundational.
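The objectives behind steps 2 and 3, and behind DPO, are compact enough to write out. The sketch below uses scalar inputs for one preference pair; real implementations work on batches of sequence log-probabilities. The function names and `beta=0.1` are assumptions for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def reward_model_loss(score_chosen, score_rejected):
    """Step 2: pairwise (Bradley-Terry style) loss for one preference pair."""
    return -log_sigmoid(score_chosen - score_rejected)


def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Step 3: reward model score minus a penalty for drifting from the SFT model."""
    return rm_score - beta * (logp_policy - logp_sft)


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: the same preference loss applied to policy/reference log-ratios."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)


# Made-up numbers: the reward model already prefers the chosen answer.
print(reward_model_loss(score_chosen=2.0, score_rejected=0.0))
print(kl_shaped_reward(rm_score=1.0, logp_policy=-1.0, logp_sft=-2.0))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The reward model loss shrinks as the chosen output is scored further above the rejected one. DPO applies the same idea directly to the policy, using log-probability ratios against a frozen reference model in place of reward scores.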
When to Use RL
RL is the right tool when:
- There’s no labeled dataset (no one has catalogued “right” actions)
- The task involves sequential decisions
- You can define a reward function
- You can simulate or interact with the environment cheaply
It’s usually the wrong tool when:
- You have labeled training data (use supervised learning)
- Simulation is expensive or impossible
- The reward signal is sparse or delayed (hard to train)
- You need interpretable decisions
Real-world RL deployments are rarer than the hype suggests: simulation-to-reality gaps, reward hacking, and sample inefficiency are genuine challenges. But for game AI, robotics, recommendation systems, and LLM alignment, RL delivers results that are hard to match with other approaches.