Cosine Similarity in NLP
Cosine similarity measures the angle between two vectors. Two vectors pointing in the same direction have cosine similarity of 1.0; perpendicular vectors have 0.0; opposite vectors have -1.0. In NLP, it’s the standard metric for comparing documents, sentences, or word vectors.
The Math
cos(θ) = (A · B) / (||A|| × ||B||)
where:
A · B = dot product of vectors A and B
||A|| = magnitude (Euclidean norm) of A
||B|| = magnitude (Euclidean norm) of B

By normalizing for vector length, cosine similarity captures directional similarity: two documents are similar if they use the same vocabulary in the same proportions, regardless of document length.
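To make the length normalization concrete, here is a minimal sketch in plain Python (standard library only). The term-count vectors are made up for illustration: `long_doc` is simply `short_doc` with every count tripled, as if the same text were repeated three times. The helper name `cosine` is likewise an assumption of this sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity per the formula above; assumes equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical term counts over the vocabulary ["python", "data", "model"]
short_doc = [2, 1, 0]
long_doc = [6, 3, 0]    # same proportions, three times the length
other_doc = [0, 1, 2]   # different emphasis

print(round(cosine(short_doc, long_doc), 4))   # 1.0
print(round(cosine(short_doc, other_doc), 4))  # 0.2
```

Tripling every count scales the dot product and the norm by the same factor, so the ratio is unchanged.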
Computing Cosine Similarity in Python
import numpy as np

def cosine_similarity(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)

    # Cosine similarity is undefined for a zero vector; return 0.0 by convention
    if norm_a == 0 or norm_b == 0:
        return 0.0

    return dot_product / (norm_a * norm_b)

# Example with simple vectors
vec_a = np.array([1, 2, 3, 4])
vec_b = np.array([1, 2, 3, 4])  # identical
vec_c = np.array([4, 3, 2, 1])  # reversed
vec_d = np.array([0, 0, 0, 0])  # zero vector

print(f"A vs B (identical): {cosine_similarity(vec_a, vec_b):.4f}")  # 1.0000
print(f"A vs C (reversed): {cosine_similarity(vec_a, vec_c):.4f}")  # 0.6667 (dot = 20, norms = sqrt(30) each)
print(f"A vs D (zero): {cosine_similarity(vec_a, vec_d):.4f}")  # 0.0000

Cosine Similarity with TF-IDF Vectors
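As rough intuition for what a TF-IDF vectorizer produces, the following sketch builds TF-IDF weights by hand with the standard library and compares documents with cosine similarity. It assumes a simple weighting (raw term count times log(N / df)) and whitespace tokenization; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ. The three-document corpus and the helper names `tfidf_vectors` and `cosine` are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, one TF-IDF vector per document) using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

# Toy corpus (an assumption of this sketch)
docs = [
    "python data science",
    "python machine learning",
    "ancient rome history",
]
vocab, vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

The first two documents share only the term "python", so their similarity is small but positive; the third shares no terms with the first and scores exactly 0.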
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sklearn_cosine

documents = [
    "Python programming for data science and machine learning",
    "Data science uses Python libraries like pandas and scikit-learn",
    "Machine learning models process numerical features",
    "The history of ancient Rome began with the founding of the city",
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(documents)

# Pairwise cosine similarity matrix
sim_matrix = sklearn_cosine(tfidf)
print("Cosine Similarity Matrix (TF-IDF):")
for i, row in enumerate(sim_matrix):
    # float() keeps the printed list free of NumPy scalar wrappers
    print(f"Doc {i+1}: {[round(float(s), 2) for s in row]}")

# Query a new document against the corpus
query = "machine learning with Python"
query_vec = vectorizer.transform([query])
scores = sklearn_cosine(query_vec, tfidf).flatten()
print(f"\nQuery scores: {[round(float(s), 4) for s in scores]}")
print(f"Best match: Doc {scores.argmax() + 1}")

Cosine Similarity with Dense Embeddings
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "How do I reverse a list in Python?",
    "What's the Python syntax for flipping an array?",
    "How do neural networks learn from data?",
    "What is backpropagation in deep learning?",
]

embeddings = model.encode(sentences, convert_to_tensor=True)

# All-pairs similarity
cos_sim = util.cos_sim(embeddings, embeddings)
print("Dense Embedding Cosine Similarity:")
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        score = cos_sim[i][j].item()
        print(f"  [{score:.4f}] '{sentences[i][:40]}' vs '{sentences[j][:40]}'")

# Semantically related pairs score high even with different wording
# 'How do I reverse a list in Python?' vs 'What's the Python syntax for flipping...' → ~0.82
# 'How do neural networks learn...' vs 'What is backpropagation...' → ~0.79

Finding Most Similar Documents
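Retrieval by cosine similarity comes down to scoring one query vector against every stored vector and keeping the highest scores. A minimal NumPy sketch of that idea follows, using toy 3-dimensional vectors as stand-ins for real embeddings. The helper name `top_k_cosine` and the data are illustrative assumptions, and the sketch assumes no vector is all zeros.

```python
import numpy as np

def top_k_cosine(query, matrix, k=2):
    """Return (row index, score) pairs for the k rows of `matrix` most similar to `query`."""
    # L2-normalize so that a plain dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for real embeddings (assumption of this sketch)
stored = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
])
query_vec = np.array([2.0, 2.0, 0.0])

results = top_k_cosine(query_vec, stored, k=2)
for idx, score in results:
    print(idx, round(score, 4))
# 2 1.0
# 0 0.7071
```

Normalizing the stored vectors once up front means each later query costs a single matrix-vector product.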
from sentence_transformers import SentenceTransformer, util
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

knowledge_base = [
    "Transformers use self-attention to process sequences in parallel.",
    "RNNs and LSTMs process text sequentially, maintaining hidden state.",
    "BERT is a bidirectional transformer encoder trained with masked language modeling.",
    "GPT uses causal (unidirectional) attention for autoregressive generation.",
    "Fine-tuning adapts pretrained models to specific downstream NLP tasks.",
]

kb_embeddings = model.encode(knowledge_base)

def find_top_k(query, k=2):
    q_emb = model.encode([query])
    scores = util.cos_sim(q_emb, kb_embeddings)[0].numpy()
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [(knowledge_base[i], float(scores[i])) for i in top_k]

query = "What architecture does BERT use?"
results = find_top_k(query, k=2)
for doc, score in results:
    print(f"[{score:.4f}] {doc}")

Cosine vs Other Distance Metrics
| Metric | What it measures | Sensitive to length | Range | Best for |
|---|---|---|---|---|
| Cosine similarity | angle between vectors | No | [-1, 1] | Most NLP tasks |
| Euclidean distance | straight-line distance | Yes | [0, ∞) | Vectors of comparable magnitude (e.g. normalized) |
| Dot product | magnitude + angle | Yes | (-∞, ∞) | When magnitude matters |
| Manhattan distance | sum of absolute differences | Yes | [0, ∞) | Rare in NLP |
Cosine similarity’s length-independence is its key advantage: a short document and a long document can still score 1.0 if their vectors point in the same direction, that is, if they use the same terms in the same proportions. This makes it a common default for document comparison in NLP.
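The length-sensitivity column of the table can be checked directly. The sketch below, in plain Python with hypothetical helper names, computes all four metrics for a vector and a copy of it scaled by ten; the vectors are chosen only so that the arithmetic comes out exact.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Assumes non-zero vectors
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

a = [1.0, 2.0, 2.0]
b = [10.0, 20.0, 20.0]  # same direction as a, ten times the magnitude

print(f"cosine:    {cosine(a, b):.4f}")     # 1.0000
print(f"euclidean: {euclidean(a, b):.4f}")  # 27.0000
print(f"dot:       {dot(a, b):.4f}")        # 90.0000
print(f"manhattan: {manhattan(a, b):.4f}")  # 45.0000
```

Only cosine similarity reports the two vectors as identical; the other three grow with the difference in magnitude.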
Threshold Interpretation
def interpret_similarity(score):
    if score >= 0.95:
        return "Near-identical"
    elif score >= 0.85:
        return "Very similar"
    elif score >= 0.70:
        return "Similar topic"
    elif score >= 0.50:
        return "Some overlap"
    else:
        return "Different"

These thresholds vary significantly based on the embedding model and domain. Always calibrate thresholds on labeled data specific to your use case.
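One possible way to calibrate a single cutoff on labeled pairs is sketched below: try every observed score as a threshold and keep the one with the best F1. The scores and labels are made up for illustration, the helper names are hypothetical, and F1 is only one of several reasonable selection criteria.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when pairs with score >= threshold are predicted as matches."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def calibrate_threshold(scores, labels):
    """Try each observed score as a cutoff and keep the one with the best F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))

# Made-up similarity scores and human labels (1 = same meaning, 0 = different)
scores = [0.91, 0.84, 0.77, 0.62, 0.58, 0.40, 0.33]
labels = [1, 1, 1, 0, 1, 0, 0]

best = calibrate_threshold(scores, labels)
best_f1 = f1_at_threshold(scores, labels, best)
print(best, round(best_f1, 3))  # 0.58 0.889
```

On this toy data the best cutoff is 0.58, well below the 0.70 "Similar topic" boundary above, which illustrates why fixed thresholds should not be carried over between models or domains.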