Norms and Distance Metrics: L1, L2, Euclidean, and Cosine Similarity
“How similar are these two embeddings?” and “how large are these weights?” are two of the most common questions asked in deep learning, and both are answered using norms and distance metrics. These aren’t separate concepts — a norm measures the “size” of a single vector, and a distance metric typically measures the size of the difference between two vectors. Together, they underpin regularization, embedding search, and similarity comparison across nearly every deep learning application.
L1 Norm: Sum of Absolute Values
The L1 norm of a vector is the sum of the absolute values of its components.
import numpy as np
v = np.array([3, -4, 2])
l1_norm = np.sum(np.abs(v))  # 3 + 4 + 2 = 9

In deep learning, the L1 norm is most commonly seen as L1 regularization: adding the sum of absolute weight values to the loss function, which pushes many weights toward exactly zero and produces sparse models. This is covered practically in Regularization.
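As a minimal sketch of that penalty term, the snippet below assumes hypothetical placeholder values for `weights`, `data_loss`, and `lambda_reg` (none of them come from a real model). It also shows that `np.linalg.norm` with `ord=1` matches the manual sum of absolute values for a 1-D array:

```python
import numpy as np

# Hypothetical placeholder values, chosen only for illustration.
weights = np.array([0.5, -1.5, 0.0, 2.0])
data_loss = 0.8
lambda_reg = 0.01

# For a 1-D array, ord=1 gives the sum of absolute values (here 4.0).
l1_penalty = lambda_reg * np.linalg.norm(weights, ord=1)
total_loss = data_loss + l1_penalty

# The built-in norm agrees with the manual computation.
assert np.isclose(np.linalg.norm(weights, ord=1), np.sum(np.abs(weights)))
```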
L2 Norm: The Familiar “Length” of a Vector
The L2 norm — also called the Euclidean norm — is the square root of the sum of squared components, matching the everyday geometric notion of a vector’s length.
l2_norm = np.sqrt(np.sum(v ** 2))  # sqrt(9 + 16 + 4) = sqrt(29) ≈ 5.39

L2 regularization (closely related to weight decay, and equivalent to it under plain SGD) adds the sum of squared weights to the loss, penalizing large weights smoothly rather than pushing them to exactly zero the way L1 does. This distinction, with L1 producing sparsity and L2 producing smaller weights overall, is one of the most practically important norm-related decisions when configuring a training run.
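One way to see the difference is to compare a single idealized shrinkage step under each penalty. The sketch below assumes a hypothetical weight vector and step size. It uses soft-thresholding (the proximal step for an L1 penalty) next to multiplicative shrinkage (the effect of a gradient step on an L2 penalty). It illustrates the two behaviors and is not a full training loop:

```python
import numpy as np

# Hypothetical weights and shrinkage strength, for illustration only.
weights = np.array([0.05, -0.02, 1.50, -0.80])
step = 0.1

# L1-style step (soft-thresholding): move each weight toward zero by `step`,
# clamping at zero, so weights smaller than `step` become exactly 0.
l1_shrunk = np.sign(weights) * np.maximum(np.abs(weights) - step, 0.0)

# L2-style step: scale every weight by the same factor, so weights get
# smaller but nonzero weights stay nonzero.
l2_shrunk = weights * (1.0 - step)

print(l1_shrunk)
print(l2_shrunk)
```

In this sketch the two smallest weights end up at zero under the L1-style step, while the L2-style step leaves all four weights nonzero.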
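# L2 regularization added directly to the loss
# `weights` and `data_loss` are illustrative placeholders, not values from a real model
weights = np.array([0.5, -1.5, 2.0])
data_loss = 0.8
lambda_reg = 0.01
l2_penalty = lambda_reg * np.sum(weights ** 2)
total_loss = data_loss + l2_penalty

Euclidean Distance: How Far Apart Are Two Points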
Euclidean distance is simply the L2 norm applied to the difference between two vectors — the straight-line distance between two points in space.
a = np.array([1, 2, 3])
b = np.array([4, 6, 3])
euclidean_distance = np.sqrt(np.sum((a - b) ** 2))  # 5.0

This is the default distance metric for many nearest-neighbor algorithms and clustering methods, and it’s a natural first choice for comparing two embedding vectors. As covered next, though, it isn’t always the right choice for high-dimensional embeddings.
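Because Euclidean distance is the L2 norm of a difference, it can also be written with `np.linalg.norm`. A small check, reusing the same example vectors:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 6, 3])

# Manual formula: square root of the sum of squared differences.
manual = np.sqrt(np.sum((a - b) ** 2))

# For a 1-D array, np.linalg.norm defaults to the L2 norm.
via_norm = np.linalg.norm(a - b)

assert np.isclose(manual, via_norm)
```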
Cosine Similarity: Measuring Direction, Not Magnitude
Cosine similarity measures the angle between two vectors, ignoring their magnitude entirely — it answers “do these vectors point in the same direction,” not “are these vectors close together in absolute terms.”
def cosine_similarity(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

similarity = cosine_similarity(a, b)  # ranges from -1 (opposite) to 1 (identical direction)

This distinction matters enormously for text and image embeddings from deep learning models. Two sentence embeddings might have very different magnitudes (influenced by sentence length or specific wording) while still pointing in a nearly identical semantic direction. Cosine similarity correctly identifies them as similar, while Euclidean distance might not, because it’s sensitive to magnitude differences that don’t actually reflect a difference in meaning.
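The magnitude point can be checked with a small sketch. The vectors `u` and `scaled` below are hypothetical: `scaled` points in exactly the same direction as `u` but is ten times longer.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical vectors: same direction, very different magnitudes.
u = np.array([1.0, 2.0, 3.0])
scaled = 10.0 * u

cos_sim = cosine_similarity(u, scaled)  # 1.0 up to floating-point error
dist = np.linalg.norm(u - scaled)       # 9 * ||u||, far from zero

print(cos_sim, dist)
```

Cosine similarity treats the two vectors as identical in direction, while the Euclidean distance between them grows with the scale factor.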
Choosing the Right Metric for Embedding Search
| Situation | Preferred metric | Why |
|---|---|---|
| Comparing weight vector magnitudes for regularization | L1 or L2 norm | Directly measures weight size |
| Semantic similarity between text embeddings | Cosine similarity | Magnitude-invariant, captures direction/meaning |
| Nearest-neighbor search in low-dimensional space | Euclidean distance | Intuitive, computationally simple |
| Nearest-neighbor search over normalized embeddings | Cosine similarity (gives the same ranking as Euclidean distance on unit vectors) | Standard for most modern embedding models |
Many production embedding models (used in semantic search, recommendation systems, and retrieval-augmented generation) normalize their output vectors to unit length before comparison. At that point Euclidean distance becomes a monotonic function of cosine similarity, so ranking neighbors by either metric gives the same order. This is why many vector databases let you choose either metric once embeddings are normalized.
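For unit-length vectors, the link between the two metrics is the identity ||a − b||² = 2 − 2·cos(a, b). A minimal numerical check, assuming a hypothetical `normalize` helper and two example vectors:

```python
import numpy as np

def normalize(x):
    """Scale a vector to unit L2 length (hypothetical helper)."""
    return x / np.linalg.norm(x)

a_unit = normalize(np.array([1.0, 2.0, 3.0]))
b_unit = normalize(np.array([4.0, 6.0, 3.0]))

# On unit vectors, cosine similarity reduces to a dot product.
cos_sim = np.dot(a_unit, b_unit)
sq_dist = np.sum((a_unit - b_unit) ** 2)

# Squared Euclidean distance is fully determined by cosine similarity.
assert np.isclose(sq_dist, 2.0 - 2.0 * cos_sim)
```

Since a higher cosine similarity always corresponds to a smaller distance, a nearest-neighbor search returns the same neighbors in the same order under either metric.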
Norms as a Debugging Tool
Beyond regularization and similarity, tracking the L2 norm of a model’s gradients during training is a genuinely useful diagnostic — a gradient norm that’s exploding toward very large values is a direct, quantifiable symptom of the Exploding Gradient Problem, and gradient clipping (limiting the gradient’s norm to a maximum value before applying an update) is a standard technique built entirely on this measurement.
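The clipping step itself is small enough to sketch in NumPy. The helper name `clip_by_global_norm` and the gradient values are hypothetical; the sketch shows the general idea of rescaling all gradients by one shared factor, not any particular library's implementation:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Only shrink: gradients already below max_norm are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Hypothetical gradients for two parameter tensors.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(norm_before, norm_after)
```

Logging `norm_before` at every step gives the diagnostic described above, and the rescaling keeps the direction of the update while bounding its size.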
import torch
# Call after loss.backward() and before optimizer.step(); `model` is an existing torch.nn.Module
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Manhattan Distance: A Less Common but Occasionally Useful Alternative
Beyond Euclidean and cosine, Manhattan distance (the L1 norm applied to the difference between two vectors) measures distance as the sum of absolute differences along each dimension, rather than the straight-line distance.
manhattan_distance = np.sum(np.abs(a - b))

This is occasionally preferred over Euclidean distance in high-dimensional spaces, where Euclidean distance can become less discriminative (a well-documented phenomenon sometimes called the “curse of dimensionality,” where distances between points tend to become more uniform as dimensionality grows). Manhattan distance is generally less affected by this effect, making it a reasonable alternative worth testing when working with very high-dimensional feature vectors and Euclidean-based similarity search isn’t producing clearly differentiated results.
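One rough way to test this on your own data is to compare how spread out the distances are under each metric. The sketch below uses synthetic points drawn uniformly from the unit hypercube, which is an assumption made only for illustration, and measures the relative spread (standard deviation divided by mean) of distances from a single query:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_points = 512, 2000

# Synthetic data: points and a query drawn uniformly from the unit hypercube.
points = rng.random((n_points, dim))
query = rng.random(dim)

diffs = points - query
l1_dists = np.sum(np.abs(diffs), axis=1)        # Manhattan distances
l2_dists = np.sqrt(np.sum(diffs ** 2, axis=1))  # Euclidean distances

# Relative spread (std / mean): smaller values mean distances look more alike.
l1_spread = l1_dists.std() / l1_dists.mean()
l2_spread = l2_dists.std() / l2_dists.mean()

print(f"L1 relative spread: {l1_spread:.4f}")
print(f"L2 relative spread: {l2_spread:.4f}")
```

On this kind of synthetic uniform data the Manhattan spread tends to come out somewhat larger than the Euclidean spread, and both shrink as `dim` grows. Real embeddings may behave differently, so it is worth running the same measurement on your own vectors.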
Summary
| Metric | Measures | Common use |
|---|---|---|
| L1 norm | Sum of absolute values | Sparsity-inducing regularization |
| L2 norm | Euclidean “length” | Weight decay, gradient clipping |
| Euclidean distance | Straight-line distance between points | Nearest-neighbor, clustering |
| Cosine similarity | Angle between vectors, ignoring magnitude | Semantic embedding comparison |
| Manhattan distance | Sum of absolute differences between points | Alternative to Euclidean distance in very high dimensions |
Choosing the right norm or distance metric isn’t a minor implementation detail — using Euclidean distance where cosine similarity was needed (or vice versa) is a common, subtle bug that silently degrades a similarity search or a regularization scheme without ever throwing an error.