Cosine Similarity in NLP

Cosine similarity measures the cosine of the angle between two vectors, ignoring their magnitudes. Vectors pointing in the same direction score 1.0, perpendicular vectors score 0.0, and vectors pointing in opposite directions score -1.0. In NLP, it’s the standard metric for comparing documents, sentences, or word vectors.

The Math

cos(θ) = (A · B) / (||A|| × ||B||)

where:
  A · B = dot product of vectors A and B
  ||A|| = magnitude (Euclidean norm) of A
  ||B|| = magnitude (Euclidean norm) of B

Because the formula divides by both vector norms, cosine similarity depends only on direction: two documents are similar if they use the same vocabulary in the same proportions, regardless of document length.
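
As a quick worked example of the formula, the sketch below computes the similarity by hand for two small vectors and then doubles the length of one of them. The vectors and the helper name `cosine` are made up for illustration; the point is only that rescaling a vector leaves the score unchanged.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [3, 4]        # ||a|| = 5
b = [4, 3]        # ||b|| = 5
b_long = [8, 6]   # same direction as b, twice the magnitude (||b_long|| = 10)

sim_ab = cosine(a, b)            # 24 / (5 * 5)
sim_ab_long = cosine(a, b_long)  # 48 / (5 * 10)
print(sim_ab, sim_ab_long)       # both are 0.96
```

Doubling `b` doubles both the dot product and the denominator, so the ratio stays the same.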

Computing Cosine Similarity in Python

import numpy as np

def cosine_similarity(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)

    if norm_a == 0 or norm_b == 0:
        return 0.0

    return dot_product / (norm_a * norm_b)

# Example with simple vectors
vec_a = np.array([1, 2, 3, 4])
vec_b = np.array([1, 2, 3, 4])  # identical
vec_c = np.array([4, 3, 2, 1])  # reversed
vec_d = np.array([0, 0, 0, 0])  # zero vector

print(f"A vs B (identical): {cosine_similarity(vec_a, vec_b):.4f}")  # 1.0000
print(f"A vs C (reversed):  {cosine_similarity(vec_a, vec_c):.4f}")  # 0.6667 (dot = 20, norms = sqrt(30) each)
print(f"A vs D (zero):      {cosine_similarity(vec_a, vec_d):.4f}")  # 0.0000
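
When there are many vectors to compare, calling a two-vector function in a loop gets slow. One common alternative, sketched below under the assumption that all vectors fit in a single NumPy array, is to normalize every row once and take a matrix product. The name `pairwise_cosine` is hypothetical, and zero rows are given a score of 0.0 to match the convention used by the function above.

```python
import numpy as np

def pairwise_cosine(X):
    """Return the matrix of cosine similarities between all rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Avoid dividing by zero: zero rows stay all-zero and therefore
    # score 0.0 against every row, including themselves.
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = X / safe_norms
    return unit @ unit.T

vectors = np.array([
    [1, 2, 3, 4],
    [4, 3, 2, 1],
    [0, 0, 0, 0],
])
sims = pairwise_cosine(vectors)
print(np.round(sims, 4))
```

Entry `sims[i, j]` holds the similarity between row `i` and row `j`, so the matrix is symmetric.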

Cosine Similarity with TF-IDF Vectors

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sklearn_cosine

documents = [
    "Python programming for data science and machine learning",
    "Data science uses Python libraries like pandas and scikit-learn",
    "Machine learning models process numerical features",
    "The history of ancient Rome began with the founding of the city"
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(documents)

# Pairwise cosine similarity matrix
sim_matrix = sklearn_cosine(tfidf)
print("Cosine Similarity Matrix (TF-IDF):")
for i, row in enumerate(sim_matrix):
    print(f"Doc {i+1}: {[round(float(s), 2) for s in row]}")  # float() keeps the printed list free of np.float64(...) wrappers

# Query a new document against the corpus
query = "machine learning with Python"
query_vec = vectorizer.transform([query])
scores = sklearn_cosine(query_vec, tfidf).flatten()
print(f"\nQuery scores: {[round(float(s), 4) for s in scores]}")
print(f"Best match: Doc {scores.argmax() + 1}")
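
To see what a term-based comparison does under the hood, here is a dependency-free sketch that uses raw term counts instead of TF-IDF weights (a deliberate simplification, with no stop-word removal or IDF weighting). The helper name `bow_cosine` and the example strings are assumptions made for illustration. It shows why lexical vectors score exactly 0.0 when two texts share no terms.

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between raw term-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = "machine learning with python"
overlap = bow_cosine(query, "python programming for data science and machine learning")
no_overlap = bow_cosine(query, "the history of ancient rome")
print(f"{overlap:.4f} {no_overlap:.4f}")
```

The first pair shares three terms and gets a positive score; the second pair shares none, so the dot product and the similarity are both zero.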

Cosine Similarity with Dense Embeddings

from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "How do I reverse a list in Python?",
    "What's the Python syntax for flipping an array?",
    "How do neural networks learn from data?",
    "What is backpropagation in deep learning?",
]

embeddings = model.encode(sentences, convert_to_tensor=True)

# All-pairs similarity
cos_sim = util.cos_sim(embeddings, embeddings)
print("Dense Embedding Cosine Similarity:")
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        score = cos_sim[i][j].item()
        print(f"  [{score:.4f}] '{sentences[i][:40]}' vs '{sentences[j][:40]}'")

# Semantically related pairs score high even with different wording
# 'How do I reverse a list in Python?' vs 'What's the Python syntax for flipping...' → ~0.82
# 'How do neural networks learn...' vs 'What is backpropagation...' → ~0.79

Finding Most Similar Documents

from sentence_transformers import SentenceTransformer, util
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

knowledge_base = [
    "Transformers use self-attention to process sequences in parallel.",
    "RNNs and LSTMs process text sequentially, maintaining hidden state.",
    "BERT is a bidirectional transformer encoder trained with masked language modeling.",
    "GPT uses causal (unidirectional) attention for autoregressive generation.",
    "Fine-tuning adapts pretrained models to specific downstream NLP tasks.",
]

kb_embeddings = model.encode(knowledge_base)

def find_top_k(query, k=2):
    q_emb = model.encode([query])
    scores = util.cos_sim(q_emb, kb_embeddings)[0].numpy()
    top_k = np.argsort(scores)[::-1][:k]
    return [(knowledge_base[i], float(scores[i])) for i in top_k]

query = "What architecture does BERT use?"
results = find_top_k(query, k=2)
for doc, score in results:
    print(f"[{score:.4f}] {doc}")
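
For larger knowledge bases, one option is to normalize the stored vectors once so that each query needs only a matrix-vector product, and to use a partial sort for the top k. The sketch below assumes small hypothetical 3-dimensional vectors as stand-ins for real sentence embeddings; `normalize_rows` and `top_k_indices` are illustrative names, not part of any library.

```python
import numpy as np

def normalize_rows(X):
    """Scale each row to unit L2 norm (assumes no all-zero rows)."""
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def top_k_indices(query_vec, unit_matrix, k=2):
    """Indices of the k rows most similar to query_vec, best first."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    scores = unit_matrix @ q  # dot product equals cosine for unit vectors
    k = min(k, len(scores))
    candidates = np.argpartition(-scores, k - 1)[:k]     # unordered top k
    order = candidates[np.argsort(-scores[candidates])]  # sort only those k
    return order, scores[order]

# Hypothetical stand-ins for real sentence embeddings.
kb_vectors = normalize_rows([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
idx, top_scores = top_k_indices([1.0, 0.2, 0.0], kb_vectors, k=2)
print(idx, top_scores)
```

`np.argpartition` avoids fully sorting all scores, which matters only once the knowledge base is large; for a handful of documents a full `argsort` is just as good.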

Cosine vs Other Distance Metrics

Metric	What it measures	Sensitive to magnitude	Range	Best for
Cosine similarity	angle between vectors	No	[-1, 1]	Most NLP tasks
Euclidean distance	straight-line distance	Yes	[0, ∞)	Vectors of comparable magnitude
Dot product	magnitude + angle	Yes	(-∞, ∞)	When magnitude matters
Manhattan distance	sum of absolute differences	Yes	[0, ∞)	Rare in NLP

Cosine similarity’s length-independence is its key advantage: a short document and a long document can still score 1.0 if their vectors point in the same direction, that is, if they use the same terms in the same proportions. This makes it the default choice for document comparison in NLP.
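
To make the comparison concrete, the sketch below evaluates all four metrics on a vector and a copy of it scaled by three. The vectors are made up for illustration; only cosine similarity treats the pair as identical.

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])  # ||a|| = 3
b = 3 * a                      # same direction, three times the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
dot = np.dot(a, b)
manhattan = np.abs(a - b).sum()

print(f"cosine:    {cosine:.4f}")     # 1.0000
print(f"euclidean: {euclidean:.4f}")  # 6.0000
print(f"dot:       {dot:.4f}")        # 27.0000
print(f"manhattan: {manhattan:.4f}")  # 10.0000
```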

Threshold Interpretation

def interpret_similarity(score):
    if score >= 0.95:
        return "Near-identical"
    elif score >= 0.85:
        return "Very similar"
    elif score >= 0.70:
        return "Similar topic"
    elif score >= 0.50:
        return "Some overlap"
    else:
        return "Different"

These cutoffs are illustrative only. Score distributions vary significantly with the embedding model and domain, so always calibrate thresholds on labeled data specific to your use case.
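
One simple way to calibrate, sketched here, is to try each observed score as a cutoff and keep the one that maximizes F1 on labeled pairs. The `labeled_pairs` data below is invented purely for illustration (label 1 means the pair is a true match), and F1 is just one possible objective; precision or recall targets may suit some applications better.

```python
def f1_at_threshold(pairs, threshold):
    """F1 score when pairs scoring >= threshold are predicted as matches."""
    tp = sum(1 for score, label in pairs if score >= threshold and label == 1)
    fp = sum(1 for score, label in pairs if score >= threshold and label == 0)
    fn = sum(1 for score, label in pairs if score < threshold and label == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def calibrate_threshold(pairs):
    """Try every observed score as a cutoff and keep the one with the best F1."""
    candidates = sorted({score for score, _ in pairs}, reverse=True)
    return max(candidates, key=lambda t: f1_at_threshold(pairs, t))

# Made-up (similarity score, is_match) pairs standing in for real labeled data.
labeled_pairs = [
    (0.92, 1), (0.81, 1), (0.77, 0), (0.74, 1),
    (0.66, 0), (0.58, 1), (0.41, 0), (0.30, 0),
]

best_threshold = calibrate_threshold(labeled_pairs)
best_f1 = f1_at_threshold(labeled_pairs, best_threshold)
print(f"Best threshold: {best_threshold:.2f} (F1 = {best_f1:.2f})")
```

On this toy data the best cutoff is 0.58, well below the 0.70 "Similar topic" boundary above, which shows how far a calibrated threshold can drift from generic defaults.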