Reinforcement Learning: Learning Through Rewards and Punishments

Understand reinforcement learning — agents, environments, rewards, Q-learning, policy gradients, and how RL powers game AI, robotics, and LLM alignment in 2026.

Reinforcement Learning

Unlike supervised learning (learn from labeled data) or unsupervised learning (find patterns), reinforcement learning is about learning through doing. An agent takes actions in an environment, receives rewards or penalties, and gradually learns which actions lead to better outcomes.

It’s how AlphaGo learned to beat world champions. How robots learn to walk. And increasingly, how LLMs are aligned to be helpful.


Core Components

              ┌──────────────────┐
   State ───▶ │                  │ ───▶ Action
              │      AGENT       │
   Reward ──▶ │                  │
              └──────────────────┘
                       │ takes action
                       ▼
              ┌──────────────────┐
              │   ENVIRONMENT    │ ───▶ New State
              │                  │ ───▶ Reward
              └──────────────────┘

  • State (s): The agent’s current observation of the environment
  • Action (a): What the agent does (move, buy, accept, rotate)
  • Reward (r): Scalar feedback (positive = good, negative = bad)
  • Policy (π): The mapping from states to actions that the agent learns
  • Value function (V): Expected cumulative future reward from a state

The goal: learn a policy that maximizes cumulative reward over time.
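
"Cumulative reward" is usually discounted, so that near-term rewards count more than distant ones. As a small sketch, assuming a discount factor gamma between 0 and 1 and a hypothetical helper named `discounted_return`:

```python
def discounted_return(rewards, gamma=0.95):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode."""
    total = 0.0
    # Work backwards so each step needs only one multiplication by gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A gamma close to 1 makes the agent far-sighted; a gamma close to 0 makes it care mostly about the immediate reward.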


The Exploration-Exploitation Dilemma

Every RL agent faces a fundamental tension:

Exploration: Try new actions to discover if they’re better
Exploitation: Stick with what you know works

Go too heavy on exploitation and you get stuck in local optima. Go too heavy on exploration and you never leverage what you’ve learned.

ε-greedy strategy: With probability ε, take a random action; otherwise take the best known action. Decay ε over time as the agent becomes more confident.
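
A minimal sketch of ε-greedy selection with decay. The exponential schedule and the names `epsilon_greedy` and `decayed_epsilon` are illustrative assumptions; linear decay or a fixed ε are equally common choices.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Exponential decay from `start` towards a floor of `end`."""
    return max(end, start * decay ** episode)


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])
action = epsilon_greedy(q_values, epsilon=decayed_epsilon(500), rng=rng)
```

The floor `end` keeps a small amount of exploration alive even late in training.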


Key Algorithms

Q-Learning (Value-Based)

Learn the value of (state, action) pairs. The Q-table stores: “if I’m in state s and take action a, what’s the expected total reward?”

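import gymnasium as gym
import numpy as np

# Simple Q-learning on a grid world.
# Assumption: FrozenLake's default 4x4 map (16 states, 4 actions) stands in
# for the grid world; any environment with discrete states and actions works.
env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for episode in range(10000):
    state, _ = env.reset()
    done = False
    while not done:
        # ε-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # explore
        else:
            action = int(np.argmax(Q[state]))  # exploit
        next_state, reward, terminated, truncated, _ = env.step(action)
        # Q-update (Bellman equation); no bootstrapping from a terminal state
        best_next = 0.0 if terminated else np.max(Q[next_state])
        Q[state, action] += alpha * (
            reward + gamma * best_next - Q[state, action]
        )
        state = next_state
        done = terminated or truncated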

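Deep Q-Network (DQN): Replace the Q-table with a neural network when the state space is too large to enumerate (e.g., raw pixel input from Atari games). In 2015, DeepMind evaluated DQN on 49 Atari games and reported human-level or better scores on more than half of them, using the same architecture and hyperparameters for every game.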
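
The core idea, swapping the table lookup for a parameterized function, can be sketched without a deep-learning library. The snippet below uses a linear model as a stand-in for the neural network. That is an assumption made for brevity: a real DQN adds a deep network, an experience replay buffer, and a separate target network. The names `q_values` and `td_update` are illustrative.

```python
import numpy as np

n_features, n_actions = 4, 2
alpha, gamma = 0.1, 0.95

# One weight vector per action: Q(s, a) = W[a] . s
W = np.zeros((n_actions, n_features))


def q_values(state):
    """Approximate Q(s, .) for a feature-vector state."""
    return W @ state


def td_update(state, action, reward, next_state, terminated):
    """One semi-gradient Q-learning step on the linear approximator."""
    bootstrap = 0.0 if terminated else gamma * np.max(q_values(next_state))
    td_error = reward + bootstrap - q_values(state)[action]
    # The gradient of Q(s, a) with respect to W[a] is the state vector itself.
    W[action] += alpha * td_error * state
    return td_error


s = np.array([1.0, 0.0, 0.5, 0.0])
s_next = np.array([0.0, 1.0, 0.0, 0.5])
err = td_update(s, action=1, reward=1.0, next_state=s_next, terminated=False)
```

Because the weights are shared across states, one update changes the estimate for every state with overlapping features, which is what lets function approximation generalize where a table cannot.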

Policy Gradient Methods

Instead of learning action values, directly learn the policy — a probability distribution over actions given a state.

REINFORCE: Increase the probability of actions that led to high rewards, decrease those that led to low rewards.
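
A toy illustration of the REINFORCE update, assuming a softmax policy over two actions and one-step episodes (a two-armed bandit), so the return is just the immediate reward. The reward rule and hyperparameters are made up for the example, and no baseline is subtracted.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)   # one logit per action
lr = 0.1


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


for step in range(2000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = 1.0 if action == 1 else 0.0   # toy reward: arm 1 always pays
    # Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    # REINFORCE: step along reward * grad log pi(action)
    theta += lr * reward * grad_log_pi

print(softmax(theta))  # probability mass shifts towards arm 1
```

In multi-step tasks the same update is applied at every time step, with the reward replaced by the return from that step onward.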

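PPO (Proximal Policy Optimization): One of the most widely used policy gradient algorithms in practice. Its clipped objective keeps each update close to the previous policy, which makes training comparatively stable and forgiving of hyperparameter choices, although as an on-policy method it still needs many samples. OpenAI has used it for robotics and for training ChatGPT (via RLHF).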
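
The clipped surrogate objective at the heart of PPO can be written in a few lines. This is a sketch of the objective only, assuming log-probabilities and advantage estimates are already available; the value loss, entropy bonus, and optimizer loop are omitted, and the function name is illustrative.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum removes any incentive to push the ratio
    # outside [1 - clip_eps, 1 + clip_eps].
    return float(np.mean(np.minimum(unclipped, clipped)))


logp_old = np.log(np.array([0.5, 0.5]))
logp_new = np.log(np.array([0.75, 0.25]))   # probability ratios 1.5 and 0.5
advantages = np.array([1.0, -1.0])
print(ppo_clip_objective(logp_new, logp_old, advantages))  # approximately 0.2
```

In the example, both samples have moved further than the clip range allows, so each contributes its clipped value (1.2 and -0.8) rather than the raw one.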

Actor-Critic Methods

Combine value-based and policy-gradient ideas: an “actor” selects actions and a “critic” estimates how good they were. Using the critic's estimate in place of raw returns reduces the variance of the policy-gradient update.
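
A tabular sketch of one actor-critic update, assuming a softmax actor and a one-step TD critic. The table sizes, learning rates, and function names are illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 3, 2
gamma, actor_lr, critic_lr = 0.95, 0.1, 0.5

theta = np.zeros((n_states, n_actions))  # actor: softmax logits per state
V = np.zeros(n_states)                   # critic: state-value estimates


def policy(state):
    z = np.exp(theta[state] - np.max(theta[state]))
    return z / z.sum()


def actor_critic_step(state, action, reward, next_state, terminated):
    """Update critic and actor from a single transition."""
    target = reward + (0.0 if terminated else gamma * V[next_state])
    td_error = target - V[state]          # critic's verdict on the action
    probs = policy(state)                 # action probabilities before the update
    V[state] += critic_lr * td_error      # critic update
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0            # one_hot(action) - probs
    theta[state] += actor_lr * td_error * grad_log_pi   # actor update
    return td_error


delta = actor_critic_step(state=0, action=1, reward=1.0, next_state=1, terminated=False)
```

The TD error plays the role that the raw return plays in REINFORCE: a positive value makes the chosen action more likely, a negative value less likely.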


RL in Practice: Environments

import gymnasium as gym

# Classic RL environments (pick one; each call builds a separate environment)
env = gym.make("CartPole-v1")        # Balance a pole on a cart
# env = gym.make("LunarLander-v3")   # Land a spacecraft (needs Box2D; "v2" on Gymnasium < 1.0)
# env = gym.make("Humanoid-v4")      # Train a walking humanoid (needs MuJoCo)

# Training loop structure (`agent` is a placeholder for your own learner)
obs, _ = env.reset()
for step in range(1000):
    action = agent.select_action(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    agent.update(obs, reward, terminated)
    if terminated or truncated:
        obs, _ = env.reset()
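
The loop above assumes an `agent` object exposing `select_action` and `update`. A minimal stand-in, assuming a discrete action space, could look like this (the class name `RandomAgent` is hypothetical and not part of Gymnasium):

```python
import numpy as np


class RandomAgent:
    """Hypothetical stand-in: acts uniformly at random and learns nothing."""

    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self.rng = np.random.default_rng(seed)
        self.total_reward = 0.0

    def select_action(self, obs):
        # Ignores the observation; a learning agent would use it here.
        return int(self.rng.integers(self.n_actions))

    def update(self, obs, reward, terminated):
        # Only tracks reward; a learning agent would adjust its policy here.
        self.total_reward += float(reward)


agent = RandomAgent(n_actions=2)  # CartPole-v1 has two discrete actions
```

A random agent is a useful baseline: any learning algorithm plugged into the same interface should beat its average reward.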

RLHF: Reinforcement Learning from Human Feedback

One of the most visible applications of RL in 2024–2026 is aligning large language models. RLHF, along with closely related preference-based methods, is how assistants such as ChatGPT, Claude, and Gemini went from raw text predictors to helpful assistants:

1. SFT phase: Fine-tune the LLM on human demonstrations
2. Reward model: Train a model to predict human preferences
   (human raters compare pairs of outputs and pick the better one)
3. PPO phase: Use RL to fine-tune the LLM to maximize the
   reward model's score, while staying close to the SFT model
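
Steps 2 and 3 can be sketched numerically. The snippet assumes a Bradley-Terry style pairwise loss for the reward model and a per-sample log-ratio penalty for "staying close to the SFT model". Both are common formulations but not the only ones, and the function names and the value of `beta` are illustrative.

```python
import numpy as np


def pairwise_preference_loss(r_chosen, r_rejected):
    """Reward-model loss: -log sigmoid(r_chosen - r_rejected), batch mean."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    # log(1 + exp(-margin)), computed stably via logaddexp
    return float(np.mean(np.logaddexp(0.0, -margin)))


def kl_penalized_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus a penalty for drifting from the SFT model."""
    return rm_score - beta * (logp_policy - logp_sft)


print(pairwise_preference_loss([2.0], [0.0]))   # small: chosen output already scores higher
print(pairwise_preference_loss([0.0], [0.0]))   # log(2): model cannot tell the pair apart
```

The penalty term is what keeps the policy from drifting into outputs that exploit flaws in the reward model.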

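DPO (Direct Preference Optimization) has become a popular alternative to PPO for LLM alignment because it is simpler: it trains directly on preference pairs, with no separate reward model and no RL sampling loop. PPO-style RL is still in use, and the RL framing remains foundational.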
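
For comparison, the DPO loss can be sketched as follows, assuming the summed log-probabilities of each response under the policy and under a frozen reference model have already been computed. The argument names and the default `beta` are illustrative.

```python
import numpy as np


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a batch of (chosen, rejected) response pairs."""
    chosen_logratio = np.asarray(logp_chosen) - np.asarray(ref_logp_chosen)
    rejected_logratio = np.asarray(logp_rejected) - np.asarray(ref_logp_rejected)
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(logits), computed stably via logaddexp
    return float(np.mean(np.logaddexp(0.0, -logits)))


# A policy identical to the reference gives log(2); raising the chosen
# response's log-probability relative to the reference lowers the loss.
print(dpo_loss([-5.0], [-6.0], [-5.0], [-6.0]))
print(dpo_loss([-4.0], [-6.0], [-5.0], [-6.0]))
```

The loss is an ordinary supervised objective over preference pairs, which is why no sampling from the policy is needed during training.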


When to Use RL

RL is the right tool when:

  • There’s no labeled dataset (no one has catalogued “right” actions)
  • The task involves sequential decisions
  • You can define a reward function
  • You can simulate or interact with the environment cheaply

It’s usually the wrong tool when:

  • You have labeled training data (use supervised learning)
  • Simulation is expensive or impossible
  • The reward signal is very sparse or long-delayed (credit assignment becomes hard and training is slow)
  • You need interpretable decisions

Real-world RL deployments are rarer than the hype suggests: simulation-to-reality gaps, reward hacking, and sample inefficiency are genuine challenges. But for game AI, robotics, recommendation systems, and LLM alignment, RL has delivered results that are hard to obtain any other way.