Optimization Basics: Objective Functions, Cost Functions, and Convexity
Training a neural network is, mathematically, an optimization problem — search for the weights that minimize a specific number. Every architectural decision, every training trick, every hyperparameter exists in service of that single search. Understanding optimization at a conceptual level explains why training deep networks is genuinely hard, in a way that’s fundamentally different from optimizing a simple function.
Objective Functions: Defining “Good”
An objective function is the quantity an optimization process is trying to maximize or minimize — it formally defines what “good” means for a given problem. In deep learning, the objective is almost always something to minimize: prediction error.
```python
def objective(predictions, targets):
    # Mean squared error, a common objective
    return ((predictions - targets) ** 2).mean()
```
The entire training process (every gradient computed, every weight updated) exists purely to make this one number smaller. Choosing the right objective function is arguably the single most consequential decision in a machine learning project, because the model will faithfully optimize whatever objective you give it, including ones that don’t actually capture what you care about.
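As a rough illustration of an objective that misses what you care about, consider a hypothetical imbalanced dataset. The labels and the `accuracy` and `recall` helpers below are assumptions made up for this sketch, not part of any library. A degenerate model that always predicts the majority class scores well on accuracy while missing every positive case, so an objective built on accuracy alone would reward it.

```python
# Hypothetical imbalanced dataset: 5 positive cases out of 100.
labels = [1] * 5 + [0] * 95

# A degenerate "model" that always predicts the majority class.
predictions = [0] * len(labels)


def accuracy(predictions, labels):
    """Fraction of predictions that match their label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def recall(predictions, labels):
    """Fraction of positive cases that were predicted positive."""
    positives = sum(y == 1 for y in labels)
    true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    return true_positives / positives


acc = accuracy(predictions, labels)
rec = recall(predictions, labels)
print(f"accuracy = {acc:.2f}, recall = {rec:.2f}")  # accuracy = 0.95, recall = 0.00
```

If finding the positive cases is the real goal, the high accuracy here is misleading, which is why the choice of objective deserves care.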
Cost Functions vs. Loss Functions: A Small but Useful Distinction
These terms are frequently used interchangeably, but a useful distinction: a loss function measures error for a single training example, while a cost function is the average loss across an entire batch or dataset.
```python
import numpy as np

def loss(prediction, target):
    return (prediction - target) ** 2  # single example

def cost(predictions, targets):
    return np.mean((predictions - targets) ** 2)  # averaged across a batch
```
In practice, most frameworks and papers use “loss” for both. Still, training minimizes a quantity averaged across a batch, not a single example’s error, and that matters for reasoning about why batch size affects gradient noise and stability, covered in Epochs, Batch Size, and Iterations.
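A small worked example may help here. The sketch below uses plain Python lists and made-up numbers, and the `batch_costs` helper is a hypothetical name introduced only for illustration. It shows that the cost is the mean of the per-example losses, and that costs computed on smaller batches scatter more widely around the full-dataset value. That scatter is a simple stand-in for the gradient noise mentioned above, not a measurement of it.

```python
def loss(prediction, target):
    """Squared error for one training example."""
    return (prediction - target) ** 2


def cost(predictions, targets):
    """Mean of the per-example losses over a batch."""
    losses = [loss(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)


def batch_costs(predictions, targets, batch_size):
    """Cost of each consecutive mini-batch of the given size."""
    return [
        cost(predictions[i:i + batch_size], targets[i:i + batch_size])
        for i in range(0, len(predictions), batch_size)
    ]


# Made-up data: the per-example losses come out to 1, 0, 4 and 9.
predictions = [2.0, 0.0, 3.0, 1.0]
targets = [1.0, 0.0, 5.0, 4.0]

print(batch_costs(predictions, targets, 1))  # [1.0, 0.0, 4.0, 9.0]
print(batch_costs(predictions, targets, 2))  # [0.5, 6.5]
print(batch_costs(predictions, targets, 4))  # [3.5]
```

With a batch size of 1 the cost swings between 0 and 9. With a batch size of 2 it ranges from 0.5 to 6.5, and the full batch gives the single value 3.5.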
Convex vs. Non-Convex Optimization: Why Deep Learning Is Hard
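A convex function is shaped like a bowl: any local minimum is also a global minimum, and a strictly convex function has at most one minimizer. An optimization algorithm that consistently moves downhill with a suitably small step size therefore converges toward the global minimum, when one exists, no matter where it starts.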
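The claim about starting points can be checked on a toy case. The sketch below is a minimal one-dimensional illustration, assuming the hypothetical function f(w) = (w - 3)^2 and a fixed learning rate of 0.1. It is not a general-purpose optimizer.

```python
def f(w):
    """A convex, bowl-shaped function with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad_f(w):
    """Derivative of f with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=200):
    """Repeatedly step downhill, against the gradient."""
    for _ in range(steps):
        w = w - learning_rate * grad_f(w)
    return w


starts = [-10.0, 0.0, 25.0]
finals = [gradient_descent(w0) for w0 in starts]
for w0, w in zip(starts, finals):
    print(f"start {w0:6.1f} -> final w = {w:.4f}, f(w) = {f(w):.6f}")
```

All three runs end at essentially the same point, w = 3, which is the behavior convexity guarantees for a sensible step size.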
```
Convex function: a single, guaranteed-findable minimum

\                /
 \              /
  \            /
   \__________/
        ▼
  global minimum
```
A non-convex function has multiple local minima, saddle points, and a much more complex “landscape”. This is exactly what a deep neural network’s loss function looks like, because of the way weights combine nonlinearly across many layers.
```
Non-convex function: multiple local minima, no simple guarantee

\      /\        /\      /
 \    /  \      /  \    /
  \  /    \    /    \  /
   \/      \  /      \/
local min   \/   local min
local min (maybe global?)
```
There’s no guarantee that gradient descent finds the global minimum in a non-convex landscape: it might get stuck in a local minimum, or slow down dramatically near a saddle point where gradients are near zero in most directions but not all.
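The dependence on the starting point can be seen in one dimension. The sketch below assumes a hypothetical quartic, f(w) = w^4 - 3w^2 + w, chosen only because it has two separate minima. Real loss landscapes have vastly more dimensions.

```python
def f(w):
    """A non-convex function with two separate minima."""
    return w**4 - 3.0 * w**2 + w


def grad_f(w):
    """Derivative of f with respect to w."""
    return 4.0 * w**3 - 6.0 * w + 1.0


def gradient_descent(w, learning_rate=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        w = w - learning_rate * grad_f(w)
    return w


left = gradient_descent(-2.0)   # settles near w = -1.30, the deeper minimum
right = gradient_descent(2.0)   # settles near w = 1.13, a shallower minimum

print(f"from -2.0: w = {left:.3f}, f(w) = {f(left):.3f}")
print(f"from  2.0: w = {right:.3f}, f(w) = {f(right):.3f}")
```

Both runs follow the same rule, yet the one started at 2.0 ends in the shallower minimum and never discovers the deeper one on the left.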
Why This Matters in Practice, Not Just Theory
The non-convexity of deep learning’s loss landscape is the direct motivation behind several practices covered later in this series:
- Momentum-based optimizers (Optimizers) help the optimization process “roll through” small local dips rather than getting stuck in them.
- Multiple random weight initializations are sometimes used specifically because different starting points on a non-convex landscape can converge to meaningfully different final solutions.
- Learning rate scheduling (Learning Rate Scheduling) helps navigate a complex landscape — large steps early to escape flat regions, smaller steps later to settle precisely into a good minimum.
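Two of the practices listed above can be sketched in a few lines. The momentum update below is the classical "heavy-ball" form, and `step_decay` is a hypothetical schedule written for this example. Frameworks differ in the exact formulas they use, so treat the names and default values as assumptions. The toy problem is a nearly flat slope with a constant gradient of 0.01.

```python
def plain_step(w, grad, learning_rate):
    """Plain gradient descent: move against the current gradient."""
    return w - learning_rate * grad


def momentum_step(w, velocity, grad, learning_rate, beta=0.9):
    """Heavy-ball momentum: keep a fraction of the previous velocity."""
    velocity = beta * velocity - learning_rate * grad
    return w + velocity, velocity


def step_decay(initial_rate, epoch, drop=0.5, epochs_per_drop=10):
    """Hypothetical schedule: scale the rate by `drop` every `epochs_per_drop` epochs."""
    return initial_rate * drop ** (epoch // epochs_per_drop)


# A nearly flat slope: the gradient is the same small constant everywhere.
flat_gradient = 0.01

w_plain = 0.0
w_momentum, velocity = 0.0, 0.0
for _ in range(100):
    w_plain = plain_step(w_plain, flat_gradient, learning_rate=0.1)
    w_momentum, velocity = momentum_step(
        w_momentum, velocity, flat_gradient, learning_rate=0.1
    )

print(f"plain: {w_plain:.3f}, momentum: {w_momentum:.3f}")
print([step_decay(0.1, epoch) for epoch in (0, 10, 25)])
```

After 100 steps the momentum run has travelled roughly nine times farther across the flat slope than plain gradient descent, because velocity accumulates while the gradient keeps pointing the same way. The schedule returns 0.1, 0.05, and 0.025 for epochs 0, 10, and 25.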
A Practical Intuition: The Loss Landscape Isn’t as Scary as It Sounds
Despite the theoretical difficulty of non-convex optimization, deep learning works remarkably well in practice. A well-known empirical finding is that in very high-dimensional spaces (millions of weights), most local minima found by gradient descent tend to have similar, good-enough loss values, and truly “bad” local minima are rarer than the pure math might suggest. This doesn’t mean non-convexity is irrelevant: training instability, sensitivity to initialization, and the entire field of optimizer research exist because of it. But it does explain why gradient descent, despite lacking convexity’s guarantees, remains the dominant and effective approach.
Local Minima vs. Saddle Points: A Correction to Common Intuition
Early deep learning research assumed that getting “stuck” during training was primarily a local-minima problem — the optimizer settling into a suboptimal dip it can’t escape. More recent theoretical and empirical work suggests that in the very high-dimensional parameter spaces typical of deep networks, saddle points (where the gradient is near zero but the point isn’t actually a minimum in every direction — it’s a minimum in some directions and a maximum in others) are actually far more common obstacles than genuine local minima. This distinction matters practically: escaping a saddle point often just requires enough noise or momentum to nudge the optimizer past the near-zero-gradient region, which is part of why momentum-based optimizers (covered in Optimizers) tend to handle these plateaus more gracefully than plain gradient descent.
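A two-dimensional toy makes the saddle-point behavior visible. The sketch below assumes the textbook saddle f(x, y) = x^2 - y^2, whose gradient is zero at the origin even though the origin is a minimum along x and a maximum along y. The tiny starting offset in y stands in for the noise or momentum described above. Since this toy function is unbounded below, "escaping" here just means moving away from the saddle.

```python
def f(x, y):
    """Saddle at the origin: a minimum along x, a maximum along y."""
    return x**2 - y**2


def grad_f(x, y):
    """Gradient of f as a (df/dx, df/dy) pair."""
    return 2.0 * x, -2.0 * y


def gradient_descent(x, y, learning_rate=0.1, steps=100):
    """Plain gradient descent in two dimensions."""
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - learning_rate * gx, y - learning_rate * gy
    return x, y


# Starting exactly on the x-axis: the optimizer slides into the saddle and stalls.
stuck_x, stuck_y = gradient_descent(1.0, 0.0)

# A tiny offset in y, standing in for noise, grows until the optimizer escapes.
escaped_x, escaped_y = gradient_descent(1.0, 1e-6)

print(f"no offset:   x = {stuck_x:.2e}, y = {stuck_y:.2e}")
print(f"tiny offset: x = {escaped_x:.2e}, y = {escaped_y:.2e}")
```

Without the offset, the run converges to the origin and stays there because the gradient vanishes. With it, the y-coordinate grows by a constant factor each step and the optimizer leaves the saddle.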
Summary
| Concept | Meaning |
|---|---|
| Objective function | The quantity training is trying to minimize or maximize |
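| Loss function | The error for a single training example |
| Cost function | The average of the per-example losses across a batch or dataset |
| Convex optimization | Every local minimum is global, so moving downhill is enough; rare in deep learning |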
| Non-convex optimization | Multiple minima and saddle points — the reality of neural network training |
Every optimizer, initialization strategy, and learning rate schedule covered later in this series exists specifically to navigate the non-convex landscape described here. Understanding this shape is what makes those techniques feel like solutions to a real problem, rather than arbitrary tricks.