Neural Networks

Neural networks are the mathematical backbone of every major AI system you’ve heard of — GPT, Stable Diffusion, AlphaFold, Sora. Understanding them isn’t about memorizing formulas. It’s about building a mental model of how a system can learn from examples and generalize to new situations.

The Biological Inspiration (and Why It Only Partially Applies)

The name comes from neurons in the brain. Biological neurons receive signals through dendrites, process them in the cell body, and fire an output signal through the axon if the accumulated input exceeds a threshold.

Artificial neurons borrow this metaphor:

Inputs arrive with weights (how much each input matters)
They’re summed up and passed through an activation function
The output is either sent to the next layer or used as a prediction

Input 1 ──(w₁)──┐
Input 2 ──(w₂)──┤── Σ(inputs × weights) + bias → Activation → Output
Input 3 ──(w₃)──┘

That’s it. One artificial neuron. The magic comes from connecting thousands of them.
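To make the diagram concrete, here is a minimal sketch of one artificial neuron in plain Python. The sigmoid activation and the specific input, weight, and bias values are assumptions chosen for illustration, not anything prescribed above.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Illustrative values only: three inputs, three weights, one bias.
# The weighted sum works out to roughly zero, so the sigmoid returns about 0.5.
output = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=-0.3)
print(output)
```

Swapping `activation` for a different function changes the neuron's behavior without touching the weighted-sum logic.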

From One Neuron to a Network

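A single neuron can only model simple relationships: even with an activation function, it separates its inputs with a single straight line (or hyperplane). Stack layers of neurons with non-linear activations, and with enough hidden units you can approximate arbitrarily complex functions.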

Input Layer        Hidden Layer 1    Hidden Layer 2    Output Layer
    x₁ ────────────── h₁₁ ──────────── h₂₁ ──────────── ŷ₁
    x₂ ─────────────╱ h₁₂ ─────────╱── h₂₂ ────────────╱
    x₃ ────────────── h₁₃ ──────────── h₂₃

Each arrow is a weight. The toy network above has only a handful, but real networks have millions to billions of these weights (parameters). Training is the process of finding good values for all of them.
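As a rough sketch, a forward pass through a 3-3-3-1 network like the one in the diagram is just the single-neuron computation repeated layer by layer. The weights below are made up, and the helper names (`dense`, `count_parameters`) are hypothetical, not from any library.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]

def count_parameters(layer_sizes):
    """Weights plus biases for a stack of dense layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Made-up weights for a 3-3-3-1 network.
W1 = [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.1], [0.5, -0.5, 0.0]]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
W3 = [[0.5, 0.5, 0.5]]
zeros = [0.0, 0.0, 0.0]

x = [1.0, 2.0, 3.0]
h1 = dense(x, W1, zeros, relu)          # hidden layer 1
h2 = dense(h1, W2, zeros, relu)         # hidden layer 2
y_hat = dense(h2, W3, [0.0], identity)  # output layer

print(y_hat)
print(count_parameters([3, 3, 3, 1]))   # 12 + 12 + 4 = 28
```

Even this tiny network has 28 parameters. The count grows with the product of adjacent layer widths, which is how real models reach millions or billions.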

Why Multiple Layers?

Each layer tends to learn a different level of abstraction. For a face recognition task, a rough picture looks like this:

Early layers: Detect edges and gradients
Middle layers: Combine edges into eyes, noses, mouths
Later layers: Recognize complete facial structures

This hierarchical representation is what makes deep networks so powerful — they don’t need manually engineered features.

Activation Functions: Adding Non-Linearity

Without activation functions, stacking layers would be pointless — the whole network would collapse to a single linear transformation. Activations introduce non-linearity, which is what allows networks to learn complex patterns.
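The collapse is easy to check by hand. In the sketch below (plain Python, with arbitrary 2×2 weight matrices and biases left out for brevity), two stacked linear layers give exactly the same output as one layer whose weights are the product of the two. Adding a ReLU in between breaks that equivalence.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def relu_vec(v):
    return [max(0.0, t) for t in v]

# Arbitrary example weights.
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[3.0, 1.0], [2.0, 0.0]]
x = [1.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))          # layer 2 applied after layer 1
one_layer = matvec(matmul(W2, W1), x)           # a single merged layer
with_relu = matvec(W2, relu_vec(matvec(W1, x))) # non-linearity in between

print(two_layers, one_layer)  # identical: [-7.0, -6.0] both times
print(with_relu)              # different: [2.0, 0.0]
```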

Function	Formula	Use Case
ReLU	max(0, x)	Default for hidden layers — fast, simple
GELU	x · Φ(x), where Φ is the standard normal CDF	Transformers (GPT, BERT)
Sigmoid	1/(1+e⁻ˣ)	Binary classification outputs
Softmax	eˣⁱ / Σeˣʲ	Multi-class probability outputs
SiLU/Swish	x × sigmoid(x)	LLaMA, Mistral hidden layers

ReLU is by far the most common in practice for hidden layers. It simply says: pass positive values through, kill negative ones. Surprisingly effective.

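ReLU:                     GELU (smoother):
  2 │             /         2 │             /
  1 │          /            1 │          /
  0 │────────               0 │──.__.──
    └──────────────           └──────────────
     -2 -1  0  1  2            -2 -1  0  1  2

ReLU is exactly zero for negative inputs. GELU follows the same overall shape but dips slightly below zero for small negative inputs (about -0.16 at x = -1).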
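The functions in the table are short enough to write out directly. This is a sketch using only Python's `math` module. The GELU here is the exact form based on the normal CDF (some libraries use a tanh approximation instead), and the max-subtraction in `softmax` is a common numerical-stability trick rather than part of the formula itself.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    """SiLU / Swish: x times sigmoid(x)."""
    return x * sigmoid(x)

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1.

    Subtracting the max first avoids overflow without changing the result.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, relu(x), round(gelu(x), 3), round(silu(x), 3))
print(softmax([1.0, 2.0, 3.0]))
```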

How a Network Learns: Backpropagation

Training a neural network boils down to one loop:

Forward pass — Feed input through the network, get a prediction
Compute loss — Measure how wrong the prediction was (e.g., cross-entropy for classification, MSE for regression)
Backward pass — Use calculus (chain rule) to compute how much each weight contributed to the error
Update weights — Nudge every weight slightly in the direction that reduces the loss

Forward pass:  x → [layers] → ŷ → Loss(ŷ, y)
Backward pass: ∂Loss/∂weights ← chain rule all the way back
Update:        w = w - lr × ∂Loss/∂w
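Here is a deliberately tiny sketch of that loop: a network with one input, one tanh hidden unit, and one output, trained with squared-error loss. The architecture, the parameter values, and the function names are all assumptions made to keep the chain rule visible. Real frameworks compute these gradients automatically.

```python
import math

def forward(params, x):
    """hidden = tanh(w1*x + b1), prediction = w2*hidden + b2."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    return h, w2 * h + b2

def loss_and_grads(params, x, y):
    """Squared-error loss and its gradient for each parameter via the chain rule."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)          # forward pass
    loss = (y_hat - y) ** 2                # compute loss
    d_yhat = 2.0 * (y_hat - y)             # dLoss/dy_hat
    d_w2, d_b2 = d_yhat * h, d_yhat        # output layer gradients
    d_h = d_yhat * w2                      # error passed back to the hidden unit
    d_z = d_h * (1.0 - h * h)              # tanh'(z) = 1 - tanh(z)^2
    d_w1, d_b1 = d_z * x, d_z              # hidden layer gradients
    return loss, [d_w1, d_b1, d_w2, d_b2]

def train(params, data, lr=0.05, epochs=500):
    """Plain gradient descent: w = w - lr * dLoss/dw for every parameter."""
    for _ in range(epochs):
        for x, y in data:
            _, grads = loss_and_grads(params, x, y)
            params = [p - lr * g for p, g in zip(params, grads)]
    return params

start = [0.3, -0.1, 0.7, 0.2]              # arbitrary starting weights
trained = train(start, [(1.5, 0.4)])
print(loss_and_grads(start, 1.5, 0.4)[0], loss_and_grads(trained, 1.5, 0.4)[0])
```

A useful sanity check for hand-written gradients is to compare them against finite differences: nudge one parameter by a tiny amount and see whether the loss changes by the amount the gradient predicts.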

The learning rate (lr) controls step size. Too large and you overshoot; too small and training takes forever. Modern optimizers like Adam adapt the effective step size per parameter, which is why they often converge faster than vanilla gradient descent.
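To see what "per parameter" means, here is a compact sketch of the Adam update next to a plain gradient step. The hyperparameter defaults (0.001, 0.9, 0.999, 1e-8) are the commonly used ones, and the class is a simplified illustration rather than any library's implementation.

```python
import math

def sgd_step(w, grad, lr=0.01):
    """Vanilla gradient descent: the step is proportional to the raw gradient."""
    return w - lr * grad

class Adam:
    """Adam for a flat list of parameters."""

    def __init__(self, n_params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * n_params   # running mean of gradients
        self.v = [0.0] * n_params   # running mean of squared gradients
        self.t = 0                  # step counter for bias correction

    def step(self, params, grads):
        self.t += 1
        new_params = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)   # bias correction
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            new_params.append(p - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return new_params

# Two parameters whose gradients differ by five orders of magnitude.
opt = Adam(n_params=2)
updated = opt.step([1.0, 1.0], [100.0, 0.001])
print(updated)
```

With plain gradient descent, the first parameter would move 100,000 times further than the second. On the first Adam step, both move by roughly `lr`, because each gradient is divided by its own running magnitude.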

Key Architectural Components

Fully Connected Layers (Dense)

Every neuron connects to every neuron in the next layer. High expressiveness, but expensive and prone to overfitting on structured data like images.

Convolutional Layers (CNNs)

Share weights across spatial positions — a kernel slides over the input. Dramatically fewer parameters, excellent for images and audio spectrograms.

Image patch → Convolution kernel → Feature map
[3×3 kernel scans the entire image, detecting the same feature everywhere]
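A minimal sketch of that sliding-kernel operation, in plain Python with no padding and a stride of 1. The image and the vertical-edge kernel are made-up examples.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A 4x6 image: dark on the left, bright on the right.
image = [[0, 0, 0, 1, 1, 1]] * 4
# A hypothetical 3x3 kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1]] * 3

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)   # [0, 3, 3, 0] on both rows
```

The feature map is non-zero only where the kernel overlaps the edge, and the same 9 weights are reused at every position. A dense layer mapping the 24 input pixels to the 8 outputs would need 192 weights instead.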

Recurrent Layers (RNNs / LSTMs)

Process sequences step by step, maintaining a hidden state. Superseded by transformers for most NLP tasks but still used in time-series applications.
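The "hidden state" idea fits in a few lines. This sketch uses a scalar vanilla RNN with tanh and arbitrary weights (`w_x`, `w_h`, and `b` are illustrative names and values). LSTMs add gating on top of the same step-by-step pattern.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step: the new state depends on the current input and the old state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Process a sequence one element at a time, carrying the hidden state along."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# A single pulse followed by silence: the state still remembers the pulse.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

The state after the second and third steps is still positive even though those inputs are zero, which is the memory effect the hidden state provides.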

Transformer Blocks (Modern Default)

We’ll cover these in depth in the next article. Built from attention + feedforward layers, they’re the architecture behind every major LLM and most modern vision models too.

Common Challenges

Overfitting

The model memorizes training data instead of learning general patterns. The telltale sign: training loss keeps going down while validation loss starts going up.

Remedies: Dropout (randomly zero out neurons during training), weight decay (penalize large weights), more data, early stopping.
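Two of these remedies are simple enough to sketch directly. The dropout below is the "inverted" variant, which rescales surviving activations during training so nothing needs to change at inference time. The early-stopping helper and its `patience` parameter are an illustrative formulation, not a specific library's API.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def early_stopping_epoch(val_losses, patience=2):
    """Return the best epoch, stopping once validation loss has failed
    to improve `patience` times in a row."""
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_epoch

rng = random.Random(0)   # fixed seed so the example is repeatable
print(dropout([1.0, 1.0, 1.0, 1.0], p=0.5, rng=rng))
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 0.6]))   # 2
```

In the early-stopping example, training halts after two consecutive epochs without improvement, so the later 0.6 is never seen. That is the usual trade-off of choosing a small patience.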

Vanishing / Exploding Gradients

In very deep networks, gradients can shrink to near-zero (vanishing) or grow uncontrollably (exploding) during backpropagation. Vanishing gradients stall training in the early layers; exploding gradients make it unstable.

Remedies: Batch normalization, residual connections (skip connections), careful weight initialization (Xavier, He), and gradient clipping for the exploding case.
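A toy calculation shows why depth causes trouble. The sketch below assumes a chain of identical scalar sigmoid layers, all evaluated at the same pre-activation, which is a strong simplification. Under that assumption, backpropagation multiplies the gradient by the same local factor once per layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(depth, weight, z=0.0):
    """Gradient magnitude through `depth` scalar sigmoid layers.

    Each layer multiplies the gradient by weight * sigmoid'(z).
    """
    s = sigmoid(z)
    local = weight * s * (1.0 - s)     # sigmoid'(z) peaks at 0.25 when z = 0
    return abs(local) ** depth

def residual_chain_gradient(depth, weight, z=0.0):
    """Same chain with skip connections: each block computes x + f(x),
    so its local gradient is 1 + f'(x)."""
    s = sigmoid(z)
    return abs(1.0 + weight * s * (1.0 - s)) ** depth

print(chain_gradient(20, weight=1.0))            # 0.25**20, about 9e-13: vanishing
print(chain_gradient(20, weight=8.0))            # 2.0**20 = 1048576.0: exploding
print(residual_chain_gradient(20, weight=1.0))   # no longer collapses toward zero
```

In this toy setting the skip connection adds a constant 1 to each local factor, so the product can no longer shrink toward zero. Initialization schemes like Xavier and He aim at the same problem from the other side, by keeping the per-layer factor close to 1 at the start of training.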

Dead ReLU Neurons

Neurons that permanently output zero because their inputs are always negative. They never recover because the gradient through ReLU is zero when the input is negative, so their weights stop updating.

Remedy: Use Leaky ReLU or GELU instead, or monitor activation statistics during training.
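The difference comes down to the gradient on the negative side. The sketch below also includes a hypothetical `dead_fraction` helper showing one simple way to monitor activation statistics. The 0.01 slope is a common default for Leaky ReLU, not a requirement.

```python
def relu_grad(x):
    """ReLU passes no gradient for non-positive inputs."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def leaky_relu_grad(x, slope=0.01):
    """Leaky ReLU keeps a small gradient alive for negative inputs."""
    return 1.0 if x > 0 else slope

def dead_fraction(batch):
    """Share of units whose pre-activation is non-positive for every example.

    `batch` is a list of examples, each a list of per-unit pre-activations.
    """
    n_units = len(batch[0])
    dead = sum(all(example[u] <= 0 for example in batch) for u in range(n_units))
    return dead / n_units

# Two examples, three units: units 0 and 2 never fire on this batch.
batch = [[-1.0, 0.5, -2.0],
         [-0.3, -0.1, -4.0]]
print(relu_grad(-1.0), leaky_relu_grad(-1.0))
print(dead_fraction(batch))
```

A unit that looks dead on one batch may still fire on others, so in practice this kind of statistic is worth tracking over many batches before drawing conclusions.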

The Modern Picture: Scale Changes Everything

Something surprising happens when you make neural networks very large and train them on very large datasets: they start exhibiting abilities that small networks never show. These are often described as emergent abilities.

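As a rough illustration: models around 7 billion parameters can write decent code, models around 70 billion handle multi-step problems noticeably better, and the largest frontier models can pass professional exams. These thresholds are approximate and shift with training data and technique, not parameter count alone.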

Nobody fully understands why scale produces these capabilities. But it’s the empirical basis for the entire LLM industry — keep scaling compute and data, and new capabilities keep appearing.

Parameters (log scale, rough illustration)
  1B ──── Decent chatbot
  7B ──── Reasoning, code, multilingual
 70B ──── Near-human on many benchmarks
700B ──── Frontier-model territory (exact sizes of models like GPT-4 and Claude 3 Opus are not public)

What This Means for Practitioners

You don’t need to implement a neural network from scratch to work with them effectively. But understanding the fundamentals pays off:

Seeing layers as feature extractors helps you debug model behavior
Knowing your loss functions helps you frame your problem correctly
Recognizing overfitting and underfitting guides your data collection and augmentation strategy
Choosing the right architecture (CNN vs. Transformer vs. RNN) saves you from spending weeks on the wrong approach

The next step: understanding the specific architecture that changed everything — the Transformer.