Document Similarity in NLP

Document similarity measures how alike two pieces of text are. It powers search engines, recommendation systems, plagiarism detection, deduplication, and semantic clustering. The right similarity method depends on whether you care about exact word overlap or semantic meaning.

Similarity Approaches Overview

Method	Based on	Captures semantics	Speed
Jaccard similarity	Word overlap	No	Very fast
Cosine (TF-IDF)	Weighted word counts	Partial	Fast
BM25	Probabilistic ranking	Partial	Fast
Cosine (embeddings)	Dense vectors	Yes	Medium
Cross-encoder	Transformer pairs	Yes (best)	Slow

Jaccard Similarity

The simplest measure: the number of distinct words two documents share, divided by the number of distinct words that appear in either document.

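def jaccard_similarity(doc1, doc2):
    # Lowercase and split on whitespace; punctuation stays attached to words
    words1 = set(doc1.lower().split())
    words2 = set(doc2.lower().split())
    intersection = words1 & words2
    union = words1 | words2
    if not union:
        return 0.0  # both documents are empty; avoid dividing by zero
    return len(intersection) / len(union)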

doc_a = "Machine learning models transform text data into numerical representations."
doc_b = "Deep learning models process text and transform it into numerical vectors."
doc_c = "The French Revolution began in 1789 with the storming of the Bastille."

print(f"A vs B: {jaccard_similarity(doc_a, doc_b):.4f}")  # 0.4286: 6 shared tokens out of 14 distinct
print(f"A vs C: {jaccard_similarity(doc_a, doc_c):.4f}")  # 0.0000: no shared tokens, unrelated topics
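
Because split() leaves punctuation attached, "representations." and "representations" count as different tokens in the function above. A minimal sketch of a punctuation-aware variant follows. The tokenize helper, its regex, and the name jaccard_tokens are illustrative assumptions, not part of any library:

```python
import re

def tokenize(text):
    # Lowercase, then keep runs of letters, digits, and apostrophes.
    # The pattern is an assumption for illustration; adapt it to your data.
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard_tokens(doc1, doc2):
    words1, words2 = tokenize(doc1), tokenize(doc2)
    union = words1 | words2
    if not union:
        return 0.0  # convention chosen here: two empty documents score 0
    return len(words1 & words2) / len(union)

# The trailing period no longer creates a mismatch.
print(jaccard_tokens("Text data.", "text data"))  # 1.0
```

With plain split() the same pair would score 1/3, since "data." and "data" would be treated as different words.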

Cosine Similarity with TF-IDF

TF-IDF weights each word by how often it occurs in a document and how rare it is across the corpus. Cosine similarity then measures the angle between two such weighted vectors, so documents that share distinctive terms score higher.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

documents = [
    "Python is widely used for natural language processing and data science.",
    "NLP tasks include text classification, summarization, and named entity recognition.",
    "Python libraries like spaCy and NLTK support natural language processing.",
    "Computer vision focuses on processing and analyzing image data.",
    "Medical imaging uses convolutional neural networks to detect diseases."
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)

similarity_matrix = cosine_similarity(tfidf_matrix)

print("TF-IDF Cosine Similarity Matrix:")
print(np.round(similarity_matrix, 2))

# Print every pair whose similarity exceeds 0.2
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        score = similarity_matrix[i, j]
        if score > 0.2:
            print(f"\nDocs {i+1} & {j+1} (score: {score:.3f}):")
            print(f"  {documents[i][:60]}...")
            print(f"  {documents[j][:60]}...")
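
The overview table also lists BM25, which scores documents against a query rather than comparing two documents symmetrically. Below is a minimal pure-Python sketch of one common formulation (Okapi BM25 with a non-negative IDF variant). The function name bm25_scores, the whitespace tokenization, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration; production systems normally rely on a search library instead.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    # Whitespace tokenization keeps the sketch short (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: longer documents are penalized via b.
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores

corpus = [
    "python is used for natural language processing",
    "computer vision focuses on image data",
    "python libraries support natural language processing",
]
scores = bm25_scores("natural language processing", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(scores, best)
```

The second document shares no query terms and scores zero. The third document matches the same terms as the first but is shorter, so length normalization ranks it higher.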

Semantic Similarity with Sentence Embeddings

TF-IDF misses semantic connections: to a bag-of-words model, “automobile” and “car” are unrelated tokens. Dense embeddings address this by mapping texts with similar meaning to nearby vectors:

from sentence_transformers import SentenceTransformer, util
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

docs = [
    "The vehicle sped through the city streets.",
    "The automobile raced across the urban road.",
    "Scientists discovered a new species of deep-sea fish.",
    "Researchers found an unknown organism in the ocean depths.",
]

embeddings = model.encode(docs)
sim_matrix = util.cos_sim(embeddings, embeddings).numpy()

print("Semantic Similarity Matrix:")
print(np.round(sim_matrix, 3))

# Docs 1 & 2 (vehicle/automobile) → expected high score despite sharing no content words
# Docs 3 & 4 (deep-sea discovery) → expected high score
# Docs 1 & 3 → expected low score (unrelated topics)
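
For reference, the cosine score itself is the dot product of two vectors divided by the product of their lengths. Here is a minimal pure-Python sketch of that calculation, using made-up toy vectors rather than real embeddings. The function name cosine is illustrative:

```python
import math

def cosine(u, v):
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Toy vectors (not real embeddings).
print(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction, so close to 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))            # orthogonal, so 0.0
```

Because only direction matters, scaling a vector does not change its score, which is why documents of different lengths remain comparable.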

Near-Duplicate Detection

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

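def find_near_duplicates(texts, threshold=0.92):
    """Return (i, j, score) for every pair with cosine similarity >= threshold."""
    embeddings = model.encode(texts, convert_to_tensor=True)
    # Compute the full pairwise matrix once instead of one call per pair
    scores = util.cos_sim(embeddings, embeddings)
    duplicates = []

    # Visit each unordered pair once (upper triangle of the matrix)
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            score = scores[i][j].item()
            if score >= threshold:
                duplicates.append((i, j, score))

    return duplicates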

articles = [
    "OpenAI launched its latest GPT model with improved reasoning.",
    "OpenAI released a new GPT version featuring enhanced reasoning capabilities.",
    "Apple unveiled its new MacBook lineup at the WWDC event.",
    "Google DeepMind published research on protein structure prediction.",
    "A new GPT model from OpenAI was released with better reasoning skills.",
]

# Override the 0.92 default with a looser threshold
dupes = find_near_duplicates(articles, threshold=0.88)
for i, j, score in dupes:
    print(f"Duplicate (score {score:.3f}):")
    print(f"  {articles[i]}")
    print(f"  {articles[j]}\n")
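
Pairs are often only an intermediate result: for deduplication you usually want groups, so that one representative per group can be kept. The following is a minimal sketch that merges (i, j, score) pairs into groups with a union-find structure. The name group_duplicates and the example pairs are hypothetical, and the scores are made up rather than produced by the model:

```python
def group_duplicates(n_items, pairs):
    # parent[i] points toward the representative of i's group.
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the two groups of every reported pair.
    for i, j, _score in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    groups = {}
    for idx in range(n_items):
        groups.setdefault(find(idx), []).append(idx)
    # Items with no duplicate form singleton groups; drop those.
    return [members for members in groups.values() if len(members) > 1]

# Hypothetical pairs in the (i, j, score) shape used above.
example_pairs = [(0, 1, 0.91), (1, 4, 0.90)]
print(group_duplicates(5, example_pairs))  # [[0, 1, 4]]
```

Merging is transitive, so items 0 and 4 end up in the same group even though no pair links them directly. With a loose threshold this can chain together texts that are not close to each other.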

Document Clustering by Similarity

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('all-MiniLM-L6-v2')

docs = [
    "Python is a popular programming language.",
    "JavaScript powers web applications and browsers.",
    "PyTorch is used for deep learning research.",
    "React is a JavaScript framework for building UIs.",
    "TensorFlow provides a platform for machine learning.",
    "Vue.js is a progressive JavaScript framework."
]

embeddings = model.encode(docs)
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)

print("Clusters:")
for cluster_id in sorted(set(labels)):
    print(f"\nCluster {cluster_id}:")
    for i, label in enumerate(labels):
        if label == cluster_id:
            print(f"  - {docs[i]}")

# Expected grouping: Python/ML docs in one cluster, JavaScript/web docs in the other.
# Which group receives ID 0 or 1 is arbitrary.

Cross-Encoder for High-Accuracy Pairs

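For reranking a small set of retrieved candidates, cross-encoders are usually more accurate than comparing independently computed embeddings, because the model reads the query and the candidate together. The cost is one forward pass per pair, so they do not scale to scoring an entire corpus: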

from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, candidate) pair in a single forward pass
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "What is transfer learning in NLP?"
candidates = [
    "Transfer learning uses pre-trained models adapted to new tasks.",
    "NLP processing pipelines include tokenization and embedding steps.",
    "Fine-tuning a BERT model involves training on labeled data for a specific task.",
]

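# predict() returns one relevance score per pair; higher means more relevant
scores = cross_encoder.predict([(query, c) for c in candidates])
# Sort by score only, so ties never fall back to comparing the texts
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)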
for score, doc in ranked:
    print(f"Score {score:.3f}: {doc}")

Cross-encoders are typically used as rerankers: a bi-encoder (such as a sentence-transformers embedding model) first retrieves a shortlist quickly, for example the top 100 candidates, and the cross-encoder then rescores only that shortlist for precision.
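
The two-stage pattern can be sketched independently of any model. In the minimal example below, retrieve_then_rerank, word_overlap, and overlap_ratio are hypothetical names. The two scorers are cheap stand-ins so the sketch runs without downloading anything: in a real pipeline the first would be a bi-encoder cosine score and the second a cross-encoder score, and both would be computed in batches.

```python
def retrieve_then_rerank(query, corpus, cheap_score, precise_score,
                         top_k=100, final_k=3):
    # Stage 1: score every document with the cheap function, keep top_k.
    shortlist = sorted(corpus, key=lambda doc: cheap_score(query, doc),
                       reverse=True)[:top_k]
    # Stage 2: apply the expensive function to the shortlist only.
    reranked = sorted(shortlist, key=lambda doc: precise_score(query, doc),
                      reverse=True)
    return reranked[:final_k]

# Stand-in scorers (assumptions for illustration, not real models).
def word_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

def overlap_ratio(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

corpus = [
    "transfer learning adapts pre-trained models",
    "tokenization splits text into tokens",
    "transfer learning in nlp reuses pre-trained language models",
    "a recipe for cooking pasta",
]
query = "transfer learning in nlp"
result = retrieve_then_rerank(query, corpus, word_overlap, overlap_ratio,
                              top_k=2, final_k=1)
print(result)
```

The expensive scorer runs on at most top_k documents regardless of corpus size, which is the point of the design: top_k trades recall of the first stage against the cost of the second.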