Optimization Basics: Objective Functions, Cost Functions, and Convexity

How objective functions and cost functions define what a model is trying to learn, and why convex vs non-convex optimization changes everything.

Optimization Basics: Objective Functions, Cost Functions, and Convexity

Training a neural network is, mathematically, an optimization problem — search for the weights that minimize a specific number. Every architectural decision, every training trick, every hyperparameter exists in service of that single search. Understanding optimization at a conceptual level explains why training deep networks is genuinely hard, in a way that’s fundamentally different from optimizing a simple function.


Objective Functions: Defining “Good”

An objective function is the quantity an optimization process is trying to maximize or minimize — it formally defines what “good” means for a given problem. In deep learning, the objective is almost always something to minimize: prediction error.

def objective(predictions, targets):
return ((predictions - targets) ** 2).mean() # mean squared error, a common objective

The entire training process — every gradient computed, every weight updated — exists purely to make this one number smaller. Choosing the right objective function is arguably the single most consequential decision in a machine learning project, because the model will faithfully optimize whatever objective you give it, including ones that don’t actually capture what you care about.


Cost Functions vs. Loss Functions: A Small but Useful Distinction

These terms are frequently used interchangeably, but a useful distinction: a loss function measures error for a single training example, while a cost function is the average loss across an entire batch or dataset.

def loss(prediction, target):
return (prediction - target) ** 2 # single example
def cost(predictions, targets):
return np.mean((predictions - targets) ** 2) # averaged across a batch

In practice, most frameworks and papers use “loss” for both, but understanding that training minimizes an averaged quantity across a batch — not a single example’s error — matters for reasoning about why batch size affects gradient noise and stability, covered in Epochs, Batch Size, and Iterations.


Convex vs. Non-Convex Optimization: Why Deep Learning Is Hard

A convex function has exactly one minimum — imagine a bowl shape. Any optimization algorithm that consistently moves downhill is guaranteed to eventually find the single global minimum, no matter where it starts.

Convex function: a single, guaranteed-findable minimum
\ /
\ /
\ /
\__________/
global minimum

A non-convex function has multiple local minima, saddle points, and a much more complex “landscape” — and this is exactly what a deep neural network’s loss function looks like, because of the way weights combine nonlinearly across many layers.

Non-convex function: multiple local minima, no simple guarantee
\ /\ /\ /
\ / \ / \ /
\ / \ / \ /
\/ \ / \ /
local min \/ local min
local min (maybe global?)

There’s no guarantee that gradient descent finds the global minimum in a non-convex landscape — it might get stuck in a local minimum, or slow down dramatically near a saddle point where gradients are near zero in most directions but not all.


Why This Matters in Practice, Not Just Theory

The non-convexity of deep learning’s loss landscape is the direct motivation behind several practices covered later in this series:

  • Momentum-based optimizers (Optimizers) help the optimization process “roll through” small local dips rather than getting stuck in them.
  • Multiple random weight initializations are sometimes used specifically because different starting points on a non-convex landscape can converge to meaningfully different final solutions.
  • Learning rate scheduling (Learning Rate Scheduling) helps navigate a complex landscape — large steps early to escape flat regions, smaller steps later to settle precisely into a good minimum.

A Practical Intuition: The Loss Landscape Isn’t as Scary as It Sounds

Despite the theoretical difficulty of non-convex optimization, deep learning works remarkably well in practice — a well-known empirical finding is that in very high-dimensional spaces (millions of weights), most local minima found by gradient descent tend to have similar, good-enough loss values, and true “bad” local minima are rarer than the pure math might suggest. This doesn’t mean non-convexity is irrelevant — training instability, sensitivity to initialization, and the entire field of optimizer research exist because of it — but it explains why gradient descent, despite lacking convexity’s guarantees, remains the dominant and effective approach.

Local Minima vs. Saddle Points: A Correction to Common Intuition

Early deep learning research assumed that getting “stuck” during training was primarily a local-minima problem — the optimizer settling into a suboptimal dip it can’t escape. More recent theoretical and empirical work suggests that in the very high-dimensional parameter spaces typical of deep networks, saddle points (where the gradient is near zero but the point isn’t actually a minimum in every direction — it’s a minimum in some directions and a maximum in others) are actually far more common obstacles than genuine local minima. This distinction matters practically: escaping a saddle point often just requires enough noise or momentum to nudge the optimizer past the near-zero-gradient region, which is part of why momentum-based optimizers (covered in Optimizers) tend to handle these plateaus more gracefully than plain gradient descent.

Summary

ConceptMeaning
Objective functionThe quantity training is trying to minimize or maximize
Cost functionThe objective, averaged across a batch or dataset
Convex optimizationSingle guaranteed minimum — rare in deep learning
Non-convex optimizationMultiple minima and saddle points — the reality of neural network training

Every optimizer, initialization strategy, and learning rate schedule covered later in this series exists specifically to navigate the non-convex landscape described here. Understanding this shape is what makes those techniques feel like solutions to a real problem, rather than arbitrary tricks.