Norms and Distance Metrics: L1, L2, Euclidean, and Cosine Similarity

How L1 and L2 norms, Euclidean distance, and cosine similarity are used for regularization, embeddings, and measuring similarity in deep learning.

“How similar are these two embeddings?” and “how large are these weights?” are two of the most common questions asked in deep learning, and both are answered using norms and distance metrics. These aren’t separate concepts — a norm measures the “size” of a single vector, and a distance metric typically measures the size of the difference between two vectors. Together, they underpin regularization, embedding search, and similarity comparison across nearly every deep learning application.


L1 Norm: Sum of Absolute Values

The L1 norm of a vector is the sum of the absolute values of its components.

import numpy as np
v = np.array([3, -4, 2])
l1_norm = np.sum(np.abs(v)) # 3 + 4 + 2 = 9

In deep learning, the L1 norm is most commonly seen as L1 regularization — adding the sum of absolute weight values to the loss function, which pushes many weights toward exactly zero, producing sparse models. This is covered practically in Regularization.
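
To make the sparsity effect concrete, here is a minimal sketch of the L1 penalty term together with soft-thresholding, the shrinkage step commonly used to apply an L1 penalty. The function names, the weight values, and the threshold are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def l1_penalty(weights, lambda_reg):
    """L1 regularization term: lambda times the sum of absolute weights."""
    return lambda_reg * np.sum(np.abs(weights))

def soft_threshold(weights, threshold):
    """Shrink each weight toward zero by `threshold`; any weight whose
    magnitude is at most `threshold` becomes exactly zero."""
    return np.sign(weights) * np.maximum(np.abs(weights) - threshold, 0.0)

weights = np.array([0.8, -0.05, 0.02, -1.5])    # illustrative values
penalty = l1_penalty(weights, lambda_reg=0.01)  # 0.01 * 2.37
shrunk = soft_threshold(weights, threshold=0.1)
print(penalty, shrunk)
```

The two small weights end up at exactly zero while the large ones are only reduced, which is the sparsity behavior described above.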


L2 Norm: The Familiar “Length” of a Vector

The L2 norm — also called the Euclidean norm — is the square root of the sum of squared components, matching the everyday geometric notion of a vector’s length.

l2_norm = np.sqrt(np.sum(v ** 2)) # sqrt(9 + 16 + 4) = sqrt(29) ≈ 5.39

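L2 regularization adds the sum of squared weights to the loss, penalizing large weights smoothly rather than pushing them to exactly zero the way L1 does. Under plain SGD this is equivalent to weight decay, which is why the two terms are often used interchangeably; with adaptive optimizers such as Adam the two behave differently, which is the motivation for decoupled weight decay (AdamW). The distinction between L1 producing sparsity and L2 producing smoothly smaller weights overall is one of the most practically important norm-related decisions when configuring a training run.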

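# L2 regularization added directly to the loss
weights = np.array([0.5, -2.0, 1.0])  # example values; in practice, the model's parameters
data_loss = 0.35                      # example value; in practice, the loss on the current batch
lambda_reg = 0.01
l2_penalty = lambda_reg * np.sum(weights ** 2)  # 0.01 * 5.25 = 0.0525
total_loss = data_loss + l2_penalty             # 0.4025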
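
The link between this penalty and the name "weight decay" can be checked numerically. The sketch below assumes plain SGD with a fixed learning rate; the weights and the data-loss gradient are made-up values for illustration only.

```python
import numpy as np

lambda_reg = 0.01
learning_rate = 0.1
weights = np.array([0.5, -2.0, 1.0])      # illustrative values
data_grad = np.array([0.1, -0.2, 0.05])   # hypothetical gradient of the data loss

# The gradient of lambda * sum(w**2) with respect to w is 2 * lambda * w.
l2_grad = 2 * lambda_reg * weights

# Form 1: one SGD step on (data loss + L2 penalty).
step_with_penalty = weights - learning_rate * (data_grad + l2_grad)

# Form 2: shrink ("decay") the weights, then take a plain SGD step on the data loss.
decay_factor = 1 - 2 * learning_rate * lambda_reg
step_with_decay = decay_factor * weights - learning_rate * data_grad

print(np.allclose(step_with_penalty, step_with_decay))
```

Both forms give the same updated weights here. With optimizers that rescale gradients per parameter, the penalty gradient is rescaled too, so this equivalence should not be assumed.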

Euclidean Distance: How Far Apart Are Two Points

Euclidean distance is simply the L2 norm applied to the difference between two vectors — the straight-line distance between two points in space.

a = np.array([1, 2, 3])
b = np.array([4, 6, 3])
euclidean_distance = np.sqrt(np.sum((a - b) ** 2)) # 5.0

This is the default distance metric for many nearest-neighbor algorithms and clustering methods, and it’s a natural first choice for comparing two embedding vectors — though, as covered next, it isn’t always the right choice for high-dimensional embeddings specifically.
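
As a rough illustration of the nearest-neighbor use case, the sketch below does a brute-force search with Euclidean distance. The `nearest_neighbor` helper and the sample points are hypothetical; real systems typically use an indexed search rather than comparing against every point.

```python
import numpy as np

def nearest_neighbor(query, points):
    """Return (index, distance) of the row in `points` closest to `query`."""
    # Broadcasting subtracts `query` from every row; axis=1 gives one norm per row.
    distances = np.linalg.norm(points - query, axis=1)
    best = int(np.argmin(distances))
    return best, float(distances[best])

points = np.array([[1.0, 2.0, 3.0],
                   [4.0, 6.0, 3.0],
                   [0.0, 0.0, 0.0]])
query = np.array([1.0, 2.0, 4.0])
index, distance = nearest_neighbor(query, points)
print(index, distance)
```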


Cosine Similarity: Measuring Direction, Not Magnitude

Cosine similarity is the cosine of the angle between two vectors, so it ignores their magnitude entirely: it answers “do these vectors point in the same direction,” not “are these vectors close together in absolute terms.”

def cosine_similarity(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

similarity = cosine_similarity(a, b) # ranges from -1 (opposite) to 1 (identical direction)

This distinction matters enormously for text and image embeddings from deep learning models. Two sentence embeddings might have very different magnitudes (influenced by sentence length or specific wording) while still pointing in a nearly identical semantic direction — cosine similarity correctly identifies them as similar, while Euclidean distance might not, because it’s sensitive to magnitude differences that don’t actually reflect a difference in meaning.
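
A small numeric sketch of this effect, using made-up vectors rather than real embeddings: `b` is `a` scaled by 10 (same direction, much larger magnitude), while `c` points in a slightly different direction but sits close to `a`.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                      # same direction as a, ten times the magnitude
c = np.array([1.5, 1.5, 3.0])     # nearby point, slightly different direction

cos_ab, cos_ac = cosine_similarity(a, b), cosine_similarity(a, c)
dist_ab, dist_ac = np.linalg.norm(a - b), np.linalg.norm(a - c)
print(cos_ab, cos_ac, dist_ab, dist_ac)
```

Cosine similarity ranks `b` as the closer match to `a`, while Euclidean distance ranks `c` as closer, so the two metrics can disagree whenever magnitudes vary.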


| Situation | Preferred metric | Why |
|---|---|---|
| Comparing weight vector magnitudes for regularization | L1 or L2 norm | Directly measures weight size |
| Semantic similarity between text embeddings | Cosine similarity | Magnitude-invariant, captures direction/meaning |
| Nearest-neighbor search in low-dimensional space | Euclidean distance | Intuitive, computationally simple |
| Nearest-neighbor search over normalized embeddings | Cosine similarity (gives the same ranking as Euclidean distance on unit vectors) | Standard for most modern embedding models |

Many production embedding models (used in semantic search, recommendation systems, and retrieval-augmented generation) normalize their output vectors to unit length before comparison. For unit vectors, squared Euclidean distance equals 2 − 2 × cosine similarity, so the two metrics always produce the same nearest-neighbor ranking even though their numeric values differ. This is why many vector databases let you choose either metric once embeddings are normalized.
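
The relationship between the two metrics on unit vectors can be verified directly. The sketch below uses random vectors as stand-ins for embeddings (an assumption for illustration) and checks that squared Euclidean distance equals 2 − 2 × cosine similarity, which implies identical rankings.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))   # stand-in for model embeddings
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

query, candidates = unit[0], unit[1:]
cosine = candidates @ query                           # dot product is the cosine for unit vectors
sq_dist = np.sum((candidates - query) ** 2, axis=1)   # squared Euclidean distance

identity_holds = np.allclose(sq_dist, 2 - 2 * cosine)
# Highest cosine first should match smallest distance first.
same_ranking = np.array_equal(np.argsort(-cosine), np.argsort(sq_dist))
print(identity_holds, same_ranking)
```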


Norms as a Debugging Tool

Beyond regularization and similarity, tracking the L2 norm of a model’s gradients during training is a genuinely useful diagnostic — a gradient norm that’s exploding toward very large values is a direct, quantifiable symptom of the Exploding Gradient Problem, and gradient clipping (limiting the gradient’s norm to a maximum value before applying an update) is a standard technique built entirely on this measurement.

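import torch
# Call after loss.backward() and before optimizer.step(); `model` is your torch.nn.Module.
# The return value is the total gradient norm measured before clipping, which is worth logging.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)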
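
To show what clipping by norm does to the numbers, here is a NumPy sketch of the idea. It is not PyTorch's implementation; the `clip_by_global_norm` helper, the small epsilon guard, and the gradient values are assumptions for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient arrays so their combined L2 norm is at most max_norm.
    Returns (clipped_grads, total_norm_before_clipping)."""
    # Treat all gradients as one long vector and take its L2 norm.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Only ever scale down; the epsilon avoids division by zero.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], float(total_norm)

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]   # illustrative gradients
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Every gradient is scaled by the same factor, so the update direction is preserved and only its overall size is limited.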

Manhattan Distance: A Less Common but Occasionally Useful Alternative

Beyond Euclidean and cosine, Manhattan distance (the L1 norm applied to the difference between two vectors) measures distance as the sum of absolute differences along each dimension, rather than the straight-line distance.

manhattan_distance = np.sum(np.abs(a - b))

This is occasionally preferred over Euclidean distance in high-dimensional spaces, where Euclidean distance can become less discriminative: as dimensionality grows, distances between points tend to become more uniform, a well-documented aspect of the “curse of dimensionality.” Manhattan distance tends to be less affected by this, though it is not immune, making it a reasonable alternative worth testing when working with very high-dimensional feature vectors and Euclidean-based similarity search isn’t producing clearly differentiated results.
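
One way to check this on your own data is to measure how spread out the distances from a query actually are. The sketch below uses uniformly random points as a stand-in (an assumption; real feature vectors may behave differently) and a hypothetical `relative_contrast` helper defined as (max − min) / min over the distances.

```python
import numpy as np

def relative_contrast(points, query, order):
    """(max distance - min distance) / min distance from `query` to `points`,
    using the L1 norm (order=1) or the L2 norm (order=2)."""
    distances = np.linalg.norm(points - query, ord=order, axis=1)
    return (distances.max() - distances.min()) / distances.min()

rng = np.random.default_rng(0)
results = {}
for dim in (2, 1000):
    points = rng.uniform(size=(1000, dim))
    query = rng.uniform(size=dim)
    results[dim] = {
        "manhattan": relative_contrast(points, query, order=1),
        "euclidean": relative_contrast(points, query, order=2),
    }

for dim, contrast in results.items():
    print(dim, contrast)
```

A lower relative contrast means the nearest and farthest points are harder to tell apart. Comparing the two metrics at the dimensionality you care about gives a quick indication of whether switching is worth a full evaluation.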

Summary

| Metric | Measures | Common use |
|---|---|---|
| L1 norm | Sum of absolute values | Sparsity-inducing regularization |
| L2 norm | Euclidean “length” | Weight decay, gradient clipping |
| Euclidean distance | Straight-line distance between points | Nearest-neighbor, clustering |
| Cosine similarity | Angle between vectors, ignoring magnitude | Semantic embedding comparison |

Choosing the right norm or distance metric isn’t a minor implementation detail — using Euclidean distance where cosine similarity was needed (or vice versa) is a common, subtle bug that silently degrades a similarity search or a regularization scheme without ever throwing an error.