Reinforcement Learning Basics: Agents, Rewards, and Why It’s Different

Supervised and unsupervised learning are both a poor fit for how a game-playing AI, a robot, or a recommendation system that adapts to sequential user interactions needs to learn. There is no fixed dataset of “correct answers” to train on, because the right action depends on a sequence of decisions and their consequences over time. Reinforcement learning (RL) is the branch of machine learning built for this kind of problem: learning through trial, error, and delayed feedback.

The Core Framework: Agent, Environment, Actions, Rewards

Reinforcement learning is structured around an agent taking actions in an environment, receiving a reward signal, and observing a new state.

        ┌──────────┐   action    ┌─────────────┐
        │  Agent   │────────────▶│ Environment │
        │          │◀────────────│             │
        └──────────┘  reward +   └─────────────┘
                       new state

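# `env` and `agent` stand in for any environment/agent pair exposing these
# methods; neither is defined here, so this is a schematic of the loop.
state = env.reset()
for step in range(max_steps):
    action = agent.choose_action(state)             # agent picks an action for the current state
    next_state, reward, done = env.step(action)     # environment returns the consequences
    agent.learn(state, action, reward, next_state)  # agent updates from this one transition
    state = next_state
    if done:                                        # episode has ended
        break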

This loop — act, observe reward, update, repeat — is fundamentally different from supervised learning’s “here’s the input, here’s the correct output” structure. There is no fixed label for “the correct action in this state”; the agent has to discover good actions purely from the reward signals it receives, often long after the action that actually caused a given outcome.
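To make the loop above concrete, here is a minimal runnable sketch. `CorridorEnv`, `RandomAgent`, and `run_episode` are hypothetical names invented for this illustration, and the three-value `step()` return simply mirrors the loop above rather than any particular library's API. The agent here does not learn anything yet; it only records transitions, which is enough to show the flow of state, action, and reward.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: start at cell 0, reach the last cell."""
    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the left wall clamps position at 0
        self.position = max(0, self.position + action)
        done = self.position >= self.length - 1
        reward = 1.0 if done else 0.0  # reward arrives only on reaching the goal
        return self.position, reward, done

class RandomAgent:
    """Placeholder agent: acts at random and only records what it sees."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.history = []

    def choose_action(self, state):
        return self.rng.choice([-1, 1])

    def learn(self, state, action, reward, next_state):
        # A real agent would update a policy or value estimate here.
        self.history.append((state, action, reward, next_state))

def run_episode(env, agent, max_steps=10_000):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state)
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward

env, agent = CorridorEnv(), RandomAgent()
total = run_episode(env, agent)
print("total reward:", total, "steps taken:", len(agent.history))
```

Every transition except possibly the last carries a reward of zero, which is a small-scale version of the delayed-feedback problem described in the next section.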

Why This Is Harder Than Supervised Learning

Delayed reward (the credit assignment problem). A chess move made early in a game might pay off, or backfire, only many moves later. Working out which of many past actions deserves credit for an eventual win or loss is a hard problem that supervised learning never has to solve, since every supervised example comes with its answer attached.

Exploration vs. exploitation. An agent has to balance trying new, potentially better actions (exploration) against repeating actions already known to work reasonably well (exploitation) — too much exploration wastes time on bad actions, too little means the agent never discovers better strategies.

import random

def epsilon_greedy_action(state, q_values, epsilon=0.1):
    # q_values maps each state to a dict of {action: estimated value}
    if random.random() < epsilon:
        return random.choice(list(q_values[state]))   # explore: any action, uniformly at random
    else:
        return max(q_values[state], key=q_values[state].get)  # exploit best known action
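The trade-off is easiest to see on a problem with no states at all. The sketch below is a multi-armed bandit, a simplified setting chosen for this illustration and not one the article itself uses; `run_bandit` and the arm means are made up. Each arm pays a noisy reward around a hidden mean, and the agent keeps a running average per arm while choosing epsilon-greedily.

```python
import random

def run_bandit(epsilon, true_means, steps=2000, seed=0):
    """Epsilon-greedy on a hypothetical bandit; returns (average reward, pull counts)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms   # running average reward per arm
    counts = [0] * n_arms        # how often each arm was pulled
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)                  # noisy payoff
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return total_reward / steps, counts

for eps in (0.0, 0.1, 1.0):
    avg, counts = run_bandit(eps, true_means=[0.1, 0.5, 0.9])
    print(f"epsilon={eps}: average reward {avg:.3f}, pulls per arm {counts}")
```

With `epsilon=0.0` the agent never tries an arm it has not already judged best, and with `epsilon=1.0` it never uses what it has learned. The exact numbers depend on the seed, so none are quoted here.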

Policies and Value Functions: What the Agent Is Actually Learning

A policy is the agent’s strategy — a mapping from states to actions. A value function estimates how good a given state (or state-action pair) is, in terms of expected future reward.

# A simple Q-value table: expected future reward for each (state, action) pair
q_table = {
    "state_1": {"left": 0.2, "right": 0.8, "jump": -0.1},
    "state_2": {"left": 0.5, "right": 0.3, "jump": 0.9},
}
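The two ideas connect directly: once Q-values are available, a policy can be read off by picking the highest-valued action in each state. The sketch below does this for the same table; `greedy_policy` and `state_values` are helper names invented for this example.

```python
q_table = {
    "state_1": {"left": 0.2, "right": 0.8, "jump": -0.1},
    "state_2": {"left": 0.5, "right": 0.3, "jump": 0.9},
}

def greedy_policy(q_table):
    """Map each state to the action with the highest Q-value."""
    return {state: max(actions, key=actions.get) for state, actions in q_table.items()}

def state_values(q_table):
    """Value of a state under the greedy policy: its best action's Q-value."""
    return {state: max(actions.values()) for state, actions in q_table.items()}

print(greedy_policy(q_table))  # {'state_1': 'right', 'state_2': 'jump'}
print(state_values(q_table))   # {'state_1': 0.8, 'state_2': 0.9}
```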

Q-learning, one of the foundational RL algorithms, iteratively updates these Q-values from observed rewards. In the tabular setting, provided every state-action pair keeps being tried and the learning rate is decayed appropriately, the estimates converge toward the true value of each action in each state.
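A single step of the standard tabular Q-learning update can be sketched as follows. The function name, the `done` flag, and the default `alpha` (learning rate) and `gamma` (the discount factor discussed later in this guide) are choices made for this example, not values taken from the article.

```python
def q_learning_update(q_table, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Apply one Q-learning update in place and return the new Q-value."""
    # Value of the best action available from the next state (0 if the episode ended).
    best_next = 0.0 if done else max(q_table[next_state].values())
    td_target = reward + gamma * best_next         # what the estimate "should" have been
    td_error = td_target - q_table[state][action]  # how far off the current estimate is
    q_table[state][action] += alpha * td_error     # move a fraction alpha toward the target
    return q_table[state][action]

q_table = {
    "state_1": {"left": 0.2, "right": 0.8, "jump": -0.1},
    "state_2": {"left": 0.5, "right": 0.3, "jump": 0.9},
}

# Taking "right" in state_1 yields reward 1.0 and lands in state_2.
new_q = q_learning_update(q_table, "state_1", "right", 1.0, "state_2", done=False)
print(round(new_q, 4))  # 0.9091
```

The target here is 1.0 + 0.99 × 0.9 = 1.891, so the old estimate of 0.8 moves one tenth of the way toward it.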

Deep Reinforcement Learning: Neural Networks as Function Approximators

For problems with enormous or continuous state spaces (raw pixel input from a video game, a robot’s continuous joint angles), a lookup table like the one above becomes infeasible — there are simply too many possible states to store individually. Deep reinforcement learning replaces the table with a neural network that approximates the value function or policy directly from raw input.

import torch.nn as nn

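# A deep Q-network: approximates Q-values directly from raw pixel input.
# The 64 * 9 * 9 size below assumes input of shape (batch, 4, 84, 84),
# e.g. 4 stacked 84x84 grayscale frames; other sizes need a different fc layer.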
class DQN(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.fc = nn.Linear(64 * 9 * 9, num_actions)

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)

This combination of deep networks (specifically CNNs, covered in Convolutional Neural Networks) approximating value functions from raw pixels is the approach behind DeepMind’s DQN, which reached human-level or better scores on many Atari games. The same underlying pattern extends to robotics control and game-playing systems at much larger scale.
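The `64 * 9 * 9` input size of the final linear layer in the network above is not arbitrary; it follows from the convolution arithmetic. The sketch below checks it in plain Python, assuming 84x84 input frames (an assumption, since the article does not state the input size) and no padding, which is the `nn.Conv2d` default. `conv_output_size` is a helper written for this example.

```python
def conv_output_size(size, kernel_size, stride, padding=0):
    """Output width/height of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1

side = 84                                                   # assumed input height and width
side = conv_output_size(side, kernel_size=8, stride=4)      # first conv layer
print(side)                                                 # 20
side = conv_output_size(side, kernel_size=4, stride=2)      # second conv layer
print(side)                                                 # 9

flattened = 64 * side * side   # 64 output channels from the second conv layer
print(flattened)               # 5184, i.e. 64 * 9 * 9
```

If the frames were a different size, the same calculation gives the number to use in place of `64 * 9 * 9`.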

Where Reinforcement Learning Connects to Modern LLMs

Reinforcement Learning from Human Feedback (RLHF) is a critical part of how modern large language models are refined after initial pretraining. Human preferences between different model outputs are turned into a reward signal, typically by training a reward model on those comparisons, and the model’s policy (its text generation behavior) is adjusted via RL to produce outputs humans rate more highly. This connects directly to the fine-tuning stage covered in Large Language Models: RL isn’t limited to games and robotics, it’s a core technique in how today’s most widely used language models are trained beyond their initial pretraining.
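One common way to turn pairwise preferences into a trainable signal is a pairwise (Bradley-Terry style) loss on a reward model's scores. The article does not say which formulation it has in mind, so treat the sketch below as an assumption; `preference_loss` and the example scores are hypothetical, and a real system would compute the scores with a neural network over full model outputs.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Scores a hypothetical reward model assigned to two responses to the same prompt.
print(preference_loss(2.0, 0.0))  # small loss: model agrees with the human preference
print(preference_loss(0.0, 2.0))  # larger loss: model ranks the rejected response higher
```

Minimizing this loss pushes the reward model to score the human-preferred output higher, and that learned score then plays the role of the reward in the RL step.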

The Discount Factor: Weighing Immediate vs. Future Rewards

A detail central to how RL agents actually evaluate long-term outcomes: future rewards are typically “discounted” by a factor (commonly denoted gamma, between 0 and 1) that reduces how much a distant future reward counts compared to an immediate one — reflecting both the genuine uncertainty of far-future outcomes and a practical need to keep the math well-behaved over long or infinite time horizons.

def discounted_return(rewards, gamma=0.99):
    total = 0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

A gamma close to 1 means the agent cares almost as much about distant future rewards as immediate ones (useful for tasks requiring long-term planning); a gamma closer to 0 makes the agent much more short-sighted, prioritizing immediate reward heavily. Choosing this value is itself a meaningful design decision, directly shaping what kind of behavior the trained agent ultimately learns.
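A short worked example shows how strongly gamma changes the picture. The reward sequence below is made up for illustration: ten steps of nothing followed by a single reward of 1.0. The function is restated so the snippet runs on its own.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each weighted by gamma raised to its time step."""
    return sum((gamma ** t) * reward for t, reward in enumerate(rewards))

delayed = [0.0] * 10 + [1.0]   # one reward of 1.0, arriving at time step 10

for gamma in (0.5, 0.9, 0.99):
    print(gamma, round(discounted_return(delayed, gamma), 4))
# 0.5 0.001
# 0.9 0.3487
# 0.99 0.9044
```

At gamma 0.5 the delayed reward is worth about a thousandth of its face value, so an agent would barely register it. At gamma 0.99 it retains roughly 90 percent, which is what makes long-horizon planning worthwhile.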

Summary

Concept	Meaning
Agent / Environment	The learner and the world it acts within
Reward	The feedback signal guiding what “good” behavior means
Policy	The agent’s learned strategy — states to actions
Deep RL	Using neural networks to approximate policies/values at scale

Reinforcement learning solves a genuinely different problem than supervised or unsupervised learning — sequential decision-making under delayed, sparse feedback — which is exactly why it needs its own distinct framework, vocabulary, and algorithms rather than being a variant of the other two.

Written by NPBlue Engineering Team: practitioners who write every guide from hands-on production experience, not paraphrased documentation.

Reviewed for technical accuracy.