Neural Networks
Neural networks are the mathematical backbone of every major AI system you’ve heard of — GPT, Stable Diffusion, AlphaFold, Sora. Understanding them isn’t about memorizing formulas. It’s about building a mental model of how a system can learn from examples and generalize to new situations.
The Biological Inspiration (and Why It Only Partially Applies)
The name comes from neurons in the brain. Biological neurons receive signals through dendrites, process them in the cell body, and fire an output signal through the axon if the accumulated input exceeds a threshold.
Artificial neurons borrow this metaphor:
- Inputs arrive with weights (how much each input matters)
- The weighted inputs are summed, a bias is added, and the result is passed through an activation function
- The output is either sent to the next layer or used as a prediction
    Input 1 ──(w₁)──┐
    Input 2 ──(w₂)──┤── Σ(inputs × weights) + bias → Activation → Output
    Input 3 ──(w₃)──┘
That’s it. One artificial neuron. The magic comes from connecting thousands of them.
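The neuron described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the sigmoid activation and the specific input, weight, and bias values are assumptions chosen only to make the arithmetic easy to follow.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    """One artificial neuron: weighted sum plus bias, then an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


# Hypothetical values, chosen only for illustration.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -0.25, 0.1]
bias = -0.3

output = neuron(inputs, weights, bias)
print(output)
```

With these values the weighted sum plus bias works out to roughly zero, so the sigmoid returns a value very close to 0.5.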
From One Neuron to a Network
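A single neuron can only draw a linear decision boundary: it computes one weighted sum, so the most its activation can do is squash or threshold that sum. Stack layers of neurons with non-linear activations, and the network can approximate arbitrarily complex functions.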
    Input Layer      Hidden Layer 1   Hidden Layer 2   Output Layer
    x₁ ────────────── h₁₁ ──────────── h₂₁ ──────────── ŷ₁
    x₂ ─────────────╱ h₁₂ ─────────╱── h₂₂ ────────────╱
    x₃ ────────────── h₁₃ ──────────── h₂₃
Each arrow is a weight. A real network has millions of these weights (parameters). Training is the process of finding the best values for all of them.
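To make the diagram concrete, here is a rough sketch of a forward pass through a network with the same layer widths (3, 3, 3, 1). It assumes the layers are fully connected, uses ReLU in the hidden layers, and initializes weights randomly; the helper names `dense`, `init_layer`, and `forward` are made up for this example.

```python
import random


def relu(z: float) -> float:
    return max(0.0, z)


def identity(z: float) -> float:
    return z


def dense(inputs, weights, biases, activation):
    """Fully connected layer: each row of `weights` belongs to one neuron."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (an arbitrary illustrative init)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)   # fixed seed so runs are repeatable
sizes = [3, 3, 3, 1]     # layer widths taken from the diagram
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]


def forward(x):
    """Run the input through every layer; the last layer stays linear."""
    for index, (weights, biases) in enumerate(layers):
        activation = identity if index == len(layers) - 1 else relu
        x = dense(x, weights, biases, activation)
    return x


# Count every weight and bias in the network.
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
prediction = forward([0.5, -1.2, 2.0])
print(n_params, prediction)
```

This toy network has 28 parameters (12 + 12 + 4). Production networks follow the same pattern with far wider and deeper layers.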
Why Multiple Layers?
Each layer learns a different level of abstraction. For a face recognition task:
- Layer 1: Detects edges and gradients
- Layer 3: Combines edges into eyes, noses, mouths
- Layer 6: Recognizes complete facial structures
This hierarchical representation is what makes deep networks so powerful — they don’t need manually engineered features.
Activation Functions: Adding Non-Linearity
Without activation functions, stacking layers would be pointless — the whole network would collapse to a single linear transformation. Activations introduce non-linearity, which is what allows networks to learn complex patterns.
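The collapse described above can be checked numerically. The sketch below uses two small weight matrices with made-up values and leaves out biases for brevity; with biases, the same argument holds and the result is a single affine transformation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def relu_vec(vector):
    return [max(0.0, v) for v in vector]


# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 outputs.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]
x = [1.0, 2.0]

# Two stacked linear layers with no activation in between...
two_layers = matvec(W2, matvec(W1, x))
# ...give the same result as one layer with the combined matrix W2 @ W1.
one_layer = matvec(matmul(W2, W1), x)

# Putting a ReLU between the layers breaks that equivalence.
with_relu = matvec(W2, relu_vec(matvec(W1, x)))

print(two_layers, one_layer, with_relu)
```

Here the two linear versions agree, while the ReLU version differs because one hidden unit goes negative and gets zeroed out.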
| Function | Formula | Use Case |
|---|---|---|
| ReLU | max(0, x) | Default for hidden layers — fast, simple |
| GELU | x × Φ(x), where Φ is the standard normal CDF (a smooth ReLU) | Transformers (GPT, BERT) |
| Sigmoid | 1/(1+e⁻ˣ) | Binary classification outputs |
| Softmax | eˣⁱ / Σeˣʲ | Multi-class probability outputs |
| SiLU/Swish | x × sigmoid(x) | LLaMA, Mistral hidden layers |
ReLU is by far the most common in practice for hidden layers. It simply says: pass positive values through, kill negative ones. Surprisingly effective.
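For reference, the activations in the table can be written with nothing but the standard library. These are scalar sketches for illustration; real frameworks apply the same formulas element-wise over tensors. The GELU here uses the exact form based on the error function, and the softmax subtracts the maximum before exponentiating, a common trick to avoid overflow.

```python
import math


def relu(x: float) -> float:
    """Pass positive values through, zero out negative ones."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    """SiLU / Swish: x * sigmoid(x)."""
    return x * sigmoid(x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(xs: list[float]) -> list[float]:
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(xs)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(relu(-3.0), relu(2.0), gelu(-1.0), silu(1.0), probs)
```

Note that GELU and SiLU return small negative values for negative inputs instead of a hard zero, which is the "smoother" behavior the table refers to.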
[Figure: side-by-side plots of ReLU and GELU (smoother), x from -2 to 2]
How a Network Learns: Backpropagation
Training a neural network boils down to one loop:
- Forward pass — Feed input through the network, get a prediction
- Compute loss — Measure how wrong the prediction was (e.g., cross-entropy for classification, MSE for regression)
- Backward pass — Use calculus (chain rule) to compute how much each weight contributed to the error
- Update weights — Nudge every weight slightly in the direction that reduces the loss
    Forward pass:  x → [layers] → ŷ → Loss(ŷ, y)
    Backward pass: ∂Loss/∂weights ← chain rule all the way back
    Update:        w = w - lr × ∂Loss/∂w
The learning rate (lr) controls step size. Too large and you overshoot; too small and training takes forever. Modern optimizers like Adam adapt the step size per parameter, which is why they often converge faster than vanilla gradient descent.
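The four steps of the loop can be sketched on the smallest possible model: a single linear neuron with one weight and one bias. Everything here is an assumption made for illustration, including the toy data (generated from the rule y = 2x + 1), the learning rate, and the number of epochs. The gradients are derived by hand for MSE, which is what a framework's automatic differentiation would compute for you.

```python
# Toy dataset generated from the hypothetical rule y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

w, b = 0.0, 0.0   # start from an uninformed guess
lr = 0.1          # learning rate (step size)
losses = []

for epoch in range(200):
    loss = 0.0
    grad_w = 0.0
    grad_b = 0.0
    for x, y in data:
        y_hat = w * x + b                        # 1. forward pass
        error = y_hat - y
        loss += error ** 2 / len(data)           # 2. loss (mean squared error)
        grad_w += 2.0 * error * x / len(data)    # 3. backward pass: dLoss/dw
        grad_b += 2.0 * error / len(data)        #    and dLoss/db via chain rule
    w -= lr * grad_w                             # 4. update: w = w - lr * dLoss/dw
    b -= lr * grad_b
    losses.append(loss)

print(w, b, losses[0], losses[-1])
```

After training, `w` and `b` land very close to 2 and 1, and the recorded loss falls steadily from its starting value.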
Key Architectural Components
Fully Connected Layers (Dense)
Every neuron connects to every neuron in the next layer. High expressiveness, but expensive and prone to overfitting on structured data like images.
Convolutional Layers (CNNs)
Share weights across spatial positions — a kernel slides over the input. Dramatically fewer parameters, excellent for images and audio spectrograms.
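The weight sharing described above is easy to see in a hand-written convolution. This is a sketch under simple assumptions: no padding, stride 1, a single channel, and a made-up 5×5 image with a vertical edge. The function name `conv2d` is chosen for this example and is not tied to any library.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 5x5 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# A 3x3 kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

feature_map = conv2d(image, kernel)

# The same 9 weights are reused at every position.
conv_params = len(kernel) * len(kernel[0])
# A dense layer mapping 25 pixels to the same 9 outputs needs one weight per pair.
dense_params = (5 * 5) * (3 * 3)

print(feature_map, conv_params, dense_params)
```

The feature map is large where the kernel sits on the edge and zero where the image is flat, and the layer gets there with 9 weights instead of the 225 a dense layer would need for the same input and output sizes.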
    Image patch → Convolution kernel → Feature map
    [3×3 kernel scans the entire image, detecting the same feature everywhere]
Recurrent Layers (RNNs / LSTMs)
Process sequences step by step, maintaining a hidden state. Superseded by transformers for most NLP tasks but still used in time-series applications.
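A minimal sketch of the step-by-step idea, assuming a scalar input, a scalar hidden state, a tanh activation, and hand-picked weights. LSTMs add gating on top of this basic recurrence, which is not shown here.

```python
import math


def rnn_step(x: float, h: float, w_x: float, w_h: float, b: float) -> float:
    """One recurrent step: mix the new input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h + b)


def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Process a sequence one element at a time, carrying the hidden state along."""
    h = 0.0  # the hidden state starts empty
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# A single "pulse" followed by silence.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

Even though the second and third inputs are zero, the hidden state stays positive and fades gradually: the network is carrying a memory of the first input forward through the sequence.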
Transformer Blocks (Modern Default)
We’ll cover these in depth in the next article. Built from attention + feedforward layers, they’re the architecture behind every major LLM and most modern vision models too.
Common Challenges
Overfitting
The model memorizes training data instead of learning general patterns. Signs: training loss goes down, validation loss goes up.
Remedies: Dropout (randomly zero out neurons during training), weight decay (penalize large weights), more data, early stopping.
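As one example of these remedies, dropout can be sketched as follows. This assumes the common "inverted" variant, in which surviving activations are scaled up during training so nothing needs to change at inference time. The function signature is invented for this example.

```python
import random


def dropout(activations, p, training, rng):
    """Inverted dropout: zero each value with probability p, scale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)  # at inference time, pass values through unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(42)   # fixed seed so runs are repeatable
hidden = [1.0] * 1000     # hypothetical hidden-layer activations

dropped = dropout(hidden, p=0.5, training=True, rng=rng)
unchanged = dropout(hidden, p=0.5, training=False, rng=rng)

mean_after = sum(dropped) / len(dropped)
print(mean_after, unchanged == hidden)
```

Roughly half the values are zeroed and the rest are doubled, so the average activation stays close to its original level. That scaling is what lets the same weights be used unchanged once dropout is switched off.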
Vanishing / Exploding Gradients
In very deep networks, gradients can shrink to near-zero (vanishing) or grow uncontrollably (exploding) during backpropagation. Vanishing gradients stall training because the early layers barely update; exploding gradients make it unstable.
Remedies: Batch normalization, residual connections (skip connections), careful weight initialization (Xavier, He).
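A deliberately simplified illustration of why depth causes trouble and why skip connections help. It assumes a chain of scalar sigmoid layers with weight 1, each sitting at pre-activation 0 where the sigmoid's slope is at its maximum of 0.25. Real networks are messier, but the multiplicative effect is the same.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid; never larger than 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)


def chain_gradient(depth: int, z: float = 0.0, residual: bool = False) -> float:
    """Gradient of the output with respect to the input through `depth` layers.

    Each layer contributes one factor (chain rule). With a residual
    connection, y = x + f(x), the factor becomes 1 + f'(x).
    """
    grad = 1.0
    for _ in range(depth):
        local = sigmoid_grad(z)
        grad *= (1.0 + local) if residual else local
    return grad


plain = chain_gradient(20)
skip = chain_gradient(20, residual=True)
print(plain, skip)
```

Through 20 plain layers the gradient shrinks to below 1e-12, effectively zero. With the identity path of a residual connection, every factor is at least 1, so the signal no longer vanishes. In this toy setup it actually grows, which is part of why normalization and careful initialization are used alongside skip connections.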
Dead ReLU Neurons
Neurons that permanently output zero because their inputs are always negative. They never recover because the gradient through ReLU is zero when the input is negative, so their weights stop updating.
Remedy: Use Leaky ReLU or GELU instead, or monitor activation statistics during training.
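The sketch below shows both halves of that remedy: the gradient a stuck neuron would receive under ReLU versus Leaky ReLU, and a simple way to monitor for dead neurons. The slope of 0.01, the helper `dead_fraction`, and the recorded activations are all assumptions made for the example.

```python
def relu_grad(z: float) -> float:
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z: float, slope: float = 0.01) -> float:
    """Gradient of Leaky ReLU: a small non-zero slope for negative inputs."""
    return 1.0 if z > 0 else slope


def dead_fraction(activations_per_neuron) -> float:
    """Share of neurons whose recorded activations are all exactly zero."""
    dead = sum(1 for acts in activations_per_neuron if all(a == 0.0 for a in acts))
    return dead / len(activations_per_neuron)


# Hypothetical pre-activations of a neuron whose inputs are always negative.
pre_activations = [-3.0, -0.5, -1.2]
relu_grads = [relu_grad(z) for z in pre_activations]
leaky_grads = [leaky_relu_grad(z) for z in pre_activations]

# Hypothetical ReLU outputs recorded for 4 neurons over a batch of 3 inputs.
recorded = [
    [0.0, 0.0, 0.0],
    [0.2, 0.0, 1.1],
    [0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0],
]
print(relu_grads, leaky_grads, dead_fraction(recorded))
```

Under ReLU the stuck neuron receives no gradient at all, while Leaky ReLU still lets a small signal through. Tracking something like `dead_fraction` over a batch is one rough way to spot the problem during training.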
The Modern Picture: Scale Changes Everything
Something surprising happens when you make neural networks very large and train them on very large datasets: they start exhibiting abilities that small networks never show. This is called emergent behavior.
As a rough illustration: a 7-billion-parameter model can write decent code, a 70-billion-parameter model can reason about multi-step problems, and a 700-billion-parameter model can pass professional exams.
Nobody fully understands why scale produces these capabilities. But it’s the empirical basis for the entire LLM industry — keep scaling compute and data, and new capabilities keep appearing.
    Parameters (log scale)
      1B ──── Decent chatbot
      7B ──── Reasoning, code, multilingual
     70B ──── Near-human on many benchmarks
    700B ──── GPT-4 / Claude 3 Opus territory
What This Means for Practitioners
You don’t need to implement a neural network from scratch to work with them effectively. But understanding a few core ideas pays off:
- Treating layers as feature extractors helps you debug model behavior
- Knowing your loss functions helps you frame your problem correctly
- Recognizing overfitting and underfitting guides your data collection and augmentation strategy
- Choosing the right architecture (CNN vs. Transformer vs. RNN) saves you from spending weeks on the wrong approach
The next step: understanding the specific architecture that changed everything — the Transformer.