Document Similarity in NLP
Document similarity measures how alike two pieces of text are. It powers search engines, recommendation systems, plagiarism detection, deduplication, and semantic clustering. The right similarity method depends on whether you care about exact word overlap or semantic meaning.
Similarity Approaches Overview
| Method | Based on | Captures semantics | Speed |
|---|---|---|---|
| Jaccard similarity | Word overlap | No | Very fast |
| Cosine (TF-IDF) | Weighted word counts | Partial | Fast |
| BM25 | Probabilistic ranking | Partial | Fast |
| Cosine (embeddings) | Dense vectors | Yes | Medium |
| Cross-encoder | Transformer pairs | Yes (best) | Slow |
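BM25 appears in the table but has no example of its own in this article, so here is a minimal pure-Python sketch of the usual Okapi BM25 scoring formula. The names `tokenize` and `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration. A production system would more likely rely on a search engine or a dedicated library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only (drops punctuation)
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    if not documents:
        return []
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF; the "+ 1" keeps the value positive for common terms
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates via k1 and is normalized by length via b
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "Python libraries like spaCy and NLTK support natural language processing.",
    "Computer vision focuses on processing image data.",
    "Natural language processing includes text classification.",
]
scores = bm25_scores("natural language", docs)
for doc, score in sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```

The first and third documents contain both query terms once, but the third is shorter, so length normalization ranks it higher. The second document contains neither term and scores zero.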
Jaccard Similarity
The simplest measure: the number of unique words the two documents share, divided by the number of unique words that appear in either document.
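Written as a formula, with $A$ and $B$ standing for the sets of unique words in the two documents (symbols introduced here for illustration):

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1
$$

For example, if $A = \{\text{the}, \text{cat}, \text{sat}\}$ and $B = \{\text{the}, \text{cat}, \text{ran}\}$, two words are shared out of four distinct words, so $J(A, B) = 2/4 = 0.5$.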
```python
def jaccard_similarity(doc1, doc2):
    # Compare the sets of unique, lowercased, whitespace-separated tokens.
    # Punctuation stays attached, so "vectors." and "vectors" count as different words.
    words1 = set(doc1.lower().split())
    words2 = set(doc2.lower().split())
    intersection = words1 & words2
    union = words1 | words2
    # Two empty documents have an empty union; return 0.0 rather than divide by zero
    return len(intersection) / len(union) if union else 0.0

doc_a = "Machine learning models transform text data into numerical representations."
doc_b = "Deep learning models process text and transform it into numerical vectors."
doc_c = "The French Revolution began in 1789 with the storming of the Bastille."

print(f"A vs B: {jaccard_similarity(doc_a, doc_b):.4f}")  # 0.4286 (6 shared words out of 14)
print(f"A vs C: {jaccard_similarity(doc_a, doc_c):.4f}")  # 0.0000 (no shared words)
```

Cosine Similarity with TF-IDF
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

documents = [
    "Python is widely used for natural language processing and data science.",
    "NLP tasks include text classification, summarization, and named entity recognition.",
    "Python libraries like spaCy and NLTK support natural language processing.",
    "Computer vision focuses on processing and analyzing image data.",
    "Medical imaging uses convolutional neural networks to detect diseases.",
]

# Build TF-IDF vectors, dropping common English stop words
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)

# Pairwise cosine similarity between all documents
similarity_matrix = cosine_similarity(tfidf_matrix)

print("TF-IDF Cosine Similarity Matrix:")
print(np.round(similarity_matrix, 2))

# Report every pair of documents whose score exceeds 0.2
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        score = similarity_matrix[i][j]
        if score > 0.2:
            print(f"\nDocs {i+1} & {j+1} (score: {score:.3f}):")
            print(f"  {documents[i][:60]}...")
            print(f"  {documents[j][:60]}...")
```

Semantic Similarity with Sentence Embeddings
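TF-IDF compares surface tokens, so it misses semantic connections: “automobile” and “car” share no term and therefore look unrelated. Dense sentence embeddings address this by mapping texts with similar meaning to nearby vectors, so synonyms and paraphrases can still score highly.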
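To make the limitation concrete, the sketch below computes a term-based cosine similarity by hand for two sentences that mean nearly the same thing. It uses raw word counts rather than TF-IDF weights, and the tiny `STOP_WORDS` set and the helper names are assumptions for illustration. The conclusion carries over to TF-IDF: reweighting terms cannot create overlap between two documents that share no content words.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {"the", "a", "an", "of", "in", "through", "across"}


def bag_of_words(text):
    # Lowercase, keep alphabetic runs, and drop stop words
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


def cosine(a, b):
    # Dot product over shared terms, divided by the product of vector lengths
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc1 = "The vehicle sped through the city streets."
doc2 = "The automobile raced across the urban road."

overlap = cosine(bag_of_words(doc1), bag_of_words(doc2))
print(f"Term-based cosine: {overlap:.3f}")  # 0.000, no shared content words
```

The sentence-embedding example that follows scores the same pair of sentences with a model that does not depend on shared vocabulary.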
```python
from sentence_transformers import SentenceTransformer, util
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

docs = [
    "The vehicle sped through the city streets.",
    "The automobile raced across the urban road.",
    "Scientists discovered a new species of deep-sea fish.",
    "Researchers found an unknown organism in the ocean depths.",
]

embeddings = model.encode(docs)
sim_matrix = util.cos_sim(embeddings, embeddings).numpy()

print("Semantic Similarity Matrix:")
print(np.round(sim_matrix, 3))

# Docs 1 & 2 (vehicle/automobile) → high score despite zero shared keywords
# Docs 3 & 4 (deep-sea discovery) → high score
# Docs 1 & 3 → low score (unrelated topics)
```

Near-Duplicate Detection
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def find_near_duplicates(texts, threshold=0.92):
    embeddings = model.encode(texts, convert_to_tensor=True)
    # Compute all pairwise cosine similarities in one call
    scores = util.cos_sim(embeddings, embeddings)
    duplicates = []

    # Visit each unordered pair once (j > i) and keep those at or above the threshold
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            score = scores[i][j].item()
            if score >= threshold:
                duplicates.append((i, j, score))

    return duplicates

articles = [
    "OpenAI launched its latest GPT model with improved reasoning.",
    "OpenAI released a new GPT version featuring enhanced reasoning capabilities.",
    "Apple unveiled its new MacBook lineup at the WWDC event.",
    "Google DeepMind published research on protein structure prediction.",
    "A new GPT model from OpenAI was released with better reasoning skills.",
]

# A lower threshold than the default catches looser paraphrases
dupes = find_near_duplicates(articles, threshold=0.88)
for i, j, score in dupes:
    print(f"Duplicate (score {score:.3f}):")
    print(f"  {articles[i]}")
    print(f"  {articles[j]}\n")
```

Document Clustering by Similarity
```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

docs = [
    "Python is a popular programming language.",
    "JavaScript powers web applications and browsers.",
    "PyTorch is used for deep learning research.",
    "React is a JavaScript framework for building UIs.",
    "TensorFlow provides a platform for machine learning.",
    "Vue.js is a progressive JavaScript framework.",
]

embeddings = model.encode(docs)
# n_init is set explicitly so the result does not depend on the scikit-learn version's default
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)

print("Clusters:")
for cluster_id in sorted(set(labels)):
    print(f"\nCluster {cluster_id}:")
    for i, label in enumerate(labels):
        if label == cluster_id:
            print(f"  - {docs[i]}")

# Expected grouping (cluster ids are arbitrary):
# one cluster with the Python/ML docs, the other with the JavaScript/web docs
```

Cross-Encoder for High-Accuracy Pairs
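When reranking a small set of retrieved candidates, a cross-encoder is usually more accurate than comparing independently computed embeddings, because it reads the query and the candidate together: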
```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, candidate) pair jointly in one forward pass
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "What is transfer learning in NLP?"
candidates = [
    "Transfer learning uses pre-trained models adapted to new tasks.",
    "NLP processing pipelines include tokenization and embedding steps.",
    "Fine-tuning a BERT model involves training on labeled data for a specific task.",
]

scores = cross_encoder.predict([(query, c) for c in candidates])
# Sort by score only, highest first
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
for score, doc in ranked:
    print(f"Score {score:.3f}: {doc}")
```

Cross-encoders are typically used as rerankers: a bi-encoder (sentence-transformer) quickly retrieves the top 100 candidates, then a cross-encoder reranks them for precision. The split exists because a cross-encoder has to run once per query-candidate pair and cannot reuse precomputed document embeddings, which makes it too slow to apply to an entire corpus.
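The retrieve-then-rerank flow described above can be sketched without downloading any model. In the sketch below, `word_overlap` and `char_ratio` are hypothetical stand-ins for the bi-encoder and the cross-encoder, and `retrieve_then_rerank` is an illustrative name rather than a library function. Only the control flow is the point: the cheap scorer sees the whole corpus, the expensive one sees only the shortlist.

```python
import re
from difflib import SequenceMatcher


def word_overlap(query, doc):
    # Cheap stage-1 stand-in (bi-encoder role): Jaccard overlap of word sets
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    d = set(re.findall(r"[a-z0-9]+", doc.lower()))
    return len(q & d) / len(q | d) if q | d else 0.0


def char_ratio(query, doc):
    # Costlier stage-2 stand-in (cross-encoder role): character-level match ratio
    return SequenceMatcher(None, query.lower(), doc.lower()).ratio()


def retrieve_then_rerank(query, corpus, fast_scorer, slow_scorer, top_k=100, top_n=3):
    # Stage 1: score every document cheaply and keep the top_k candidates
    shortlist = sorted(corpus, key=lambda doc: fast_scorer(query, doc), reverse=True)[:top_k]
    # Stage 2: run the expensive scorer on the shortlist only
    reranked = sorted(
        ((slow_scorer(query, doc), doc) for doc in shortlist),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return reranked[:top_n]


corpus = [
    "Transfer learning uses pre-trained models adapted to new tasks.",
    "NLP processing pipelines include tokenization and embedding steps.",
    "Fine-tuning a BERT model involves training on labeled data for a specific task.",
    "Computer vision focuses on processing and analyzing image data.",
    "The French Revolution began in 1789 with the storming of the Bastille.",
]

results = retrieve_then_rerank(
    "What is transfer learning in NLP?", corpus, word_overlap, char_ratio, top_k=3, top_n=2
)
for score, doc in results:
    print(f"Score {score:.3f}: {doc}")
```

With real models, stage 1 would rank by cosine similarity between sentence embeddings and stage 2 would call `CrossEncoder.predict` on the shortlisted pairs. The function's structure would stay the same.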