Document Similarity in NLP

Document similarity measures how alike two pieces of text are. It powers search engines, recommendation systems, plagiarism detection, deduplication, and semantic clustering. The right similarity method depends on whether you care about exact word overlap or semantic meaning.


Similarity Approaches Overview

Method              | Based on              | Captures semantics | Speed
Jaccard similarity  | Word overlap          | No                 | Very fast
Cosine (TF-IDF)     | Weighted word counts  | Partial            | Fast
BM25                | Probabilistic ranking | Partial            | Fast
Cosine (embeddings) | Dense vectors         | Yes                | Medium
Cross-encoder       | Transformer pairs     | Yes (best)         | Slow
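
The table lists BM25, which none of the worked examples in this guide cover. As a rough illustration, the sketch below scores documents against a query with the Okapi BM25 formula in plain Python. The function name `bm25_scores`, the whitespace tokeniser, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for this example, not a reference implementation; libraries differ in the exact IDF variant they use.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative)."""
    # Naive tokenisation: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        score = 0.0
        for term in set(query.lower().split()):
            tf = term_counts[term]
            if tf == 0:
                continue
            # Non-negative IDF variant: ln(1 + (N - df + 0.5) / (df + 0.5)).
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "python is used for natural language processing",
    "computer vision processes image data",
    "natural language processing with python libraries",
]
for doc, score in zip(corpus, bm25_scores("python language", corpus)):
    print(f"{score:.3f}  {doc}")
```

Documents that contain none of the query terms score zero. Among matching documents, shorter ones score slightly higher because of the length normalisation controlled by `b`.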

Jaccard Similarity

The simplest measure: the number of distinct words two documents share, divided by the number of distinct words that appear in either.

def jaccard_similarity(doc1, doc2):
    """Return the share of distinct words that both documents have in common."""
    # Lowercase and split on whitespace; punctuation stays attached to words.
    words1 = set(doc1.lower().split())
    words2 = set(doc2.lower().split())
    union = words1 | words2
    if not union:
        # Two empty documents: avoid dividing by zero.
        return 0.0
    return len(words1 & words2) / len(union)


doc_a = "Machine learning models transform text data into numerical representations."
doc_b = "Deep learning models process text and transform it into numerical vectors."
doc_c = "The French Revolution began in 1789 with the storming of the Bastille."

print(f"A vs B: {jaccard_similarity(doc_a, doc_b):.4f}")  # 0.4286: 6 shared words out of 14 distinct
print(f"A vs C: {jaccard_similarity(doc_a, doc_c):.4f}")  # 0.0000: no shared words

Cosine Similarity with TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

documents = [
    "Python is widely used for natural language processing and data science.",
    "NLP tasks include text classification, summarization, and named entity recognition.",
    "Python libraries like spaCy and NLTK support natural language processing.",
    "Computer vision focuses on processing and analyzing image data.",
    "Medical imaging uses convolutional neural networks to detect diseases.",
]

# Build TF-IDF vectors, dropping common English stop words.
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)

# Pairwise cosine similarity between all document vectors.
similarity_matrix = cosine_similarity(tfidf_matrix)
print("TF-IDF Cosine Similarity Matrix:")
print(np.round(similarity_matrix, 2))

# Print every pair of documents whose similarity exceeds 0.2.
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        score = similarity_matrix[i][j]
        if score > 0.2:
            print(f"\nDocs {i+1} & {j+1} (score: {score:.3f}):")
            print(f" {documents[i][:60]}...")
            print(f" {documents[j][:60]}...")
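
To show what the vectoriser and `cosine_similarity` compute, the sketch below reproduces the idea in plain Python. It assumes raw term counts, the smoothed IDF ln((1 + n) / (1 + df)) + 1 (the form scikit-learn uses when `smooth_idf=True`), and whitespace tokenisation with no stop-word removal, so its numbers will not match the matrix above. The helper names `tfidf_vectors` and `cosine` are made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (illustrative helper)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Weight = term count * smoothed inverse document frequency.
        vectors.append({
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        })
    return vectors


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


texts = [
    "python supports natural language processing",
    "natural language processing with python",
    "medical imaging detects diseases",
]
vecs = tfidf_vectors(texts)
print(f"0 vs 1: {cosine(vecs[0], vecs[1]):.3f}")
print(f"0 vs 2: {cosine(vecs[0], vecs[2]):.3f}")
```

Because cosine similarity divides by both vector lengths, explicit L2 normalisation of the TF-IDF vectors is not needed here. Texts with no terms in common score exactly zero, which is the limitation the next section addresses.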

Semantic Similarity with Sentence Embeddings

TF-IDF only matches on shared terms, so synonyms such as “automobile” and “car” look unrelated. Dense embeddings map text with similar meaning to nearby vectors, which captures these connections:

from sentence_transformers import SentenceTransformer, util
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

docs = [
    "The vehicle sped through the city streets.",
    "The automobile raced across the urban road.",
    "Scientists discovered a new species of deep-sea fish.",
    "Researchers found an unknown organism in the ocean depths.",
]

# Encode each document into a dense vector, then compare all pairs.
embeddings = model.encode(docs)
sim_matrix = util.cos_sim(embeddings, embeddings).numpy()
print("Semantic Similarity Matrix:")
print(np.round(sim_matrix, 3))

# Expected pattern (exact values depend on the model version):
# Docs 1 & 2 (vehicle/automobile) → high score despite sharing no content words
# Docs 3 & 4 (deep-sea discovery) → high score
# Docs 1 & 3 → low score (unrelated topics)

Near-Duplicate Detection

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')


def find_near_duplicates(texts, threshold=0.92):
    """Return (i, j, score) for every pair whose cosine similarity meets the threshold."""
    embeddings = model.encode(texts, convert_to_tensor=True)
    # Compute the full similarity matrix once instead of one call per pair.
    scores = util.cos_sim(embeddings, embeddings)
    duplicates = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            score = scores[i][j].item()
            if score >= threshold:
                duplicates.append((i, j, score))
    return duplicates


articles = [
    "OpenAI launched its latest GPT model with improved reasoning.",
    "OpenAI released a new GPT version featuring enhanced reasoning capabilities.",
    "Apple unveiled its new MacBook lineup at the WWDC event.",
    "Google DeepMind published research on protein structure prediction.",
    "A new GPT model from OpenAI was released with better reasoning skills.",
]

# A lower threshold than the default catches looser paraphrases.
dupes = find_near_duplicates(articles, threshold=0.88)
for i, j, score in dupes:
    print(f"Duplicate (score {score:.3f}):")
    print(f" {articles[i]}")
    print(f" {articles[j]}\n")
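
`find_near_duplicates` returns pairs, but deduplication usually needs groups: if A matches B and B matches C, all three are often treated as one item. One possible way to merge pairs into groups is a small union-find structure, sketched here in plain Python. The name `group_duplicates` and the hard-coded example pairs are hypothetical; in practice the pairs would come from `find_near_duplicates`.

```python
def group_duplicates(n_items, pairs):
    """Merge (i, j, score) pairs into groups of transitively linked items."""
    parent = list(range(n_items))

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the two sides of every duplicate pair.
    for i, j, _score in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    groups = {}
    for item in range(n_items):
        groups.setdefault(find(item), []).append(item)
    # Keep only groups with more than one member, i.e. actual duplicates.
    return [members for members in groups.values() if len(members) > 1]


# Made-up pairs for illustration: items 0, 1 and 4 are linked transitively.
example_pairs = [(0, 1, 0.93), (1, 4, 0.90)]
print(group_duplicates(5, example_pairs))  # [[0, 1, 4]]
```

Transitive merging can chain together items that are not directly similar to each other, so a stricter threshold may be appropriate when grouping this way.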

Document Clustering by Similarity

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('all-MiniLM-L6-v2')

docs = [
    "Python is a popular programming language.",
    "JavaScript powers web applications and browsers.",
    "PyTorch is used for deep learning research.",
    "React is a JavaScript framework for building UIs.",
    "TensorFlow provides a platform for machine learning.",
    "Vue.js is a progressive JavaScript framework.",
]

embeddings = model.encode(docs)

# n_init is set explicitly because its default differs across scikit-learn versions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)

print("Clusters:")
for cluster_id in sorted(set(labels)):
    print(f"\nCluster {cluster_id}:")
    for i, label in enumerate(labels):
        if label == cluster_id:
            print(f" - {docs[i]}")

# Expected grouping (cluster numbering is arbitrary):
# one cluster with the Python/ML docs, the other with the JavaScript/web docs

Cross-Encoder for High-Accuracy Pairs

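For reranking a small set of retrieved candidates, cross-encoders are more accurate than comparing independently computed embeddings, because the model reads the query and the candidate together. The cost is one forward pass per pair: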

from sentence_transformers import CrossEncoder

# A cross-encoder encodes both texts jointly and outputs one relevance score per pair.
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "What is transfer learning in NLP?"
candidates = [
    "Transfer learning uses pre-trained models adapted to new tasks.",
    "NLP processing pipelines include tokenization and embedding steps.",
    "Fine-tuning a BERT model involves training on labeled data for a specific task.",
]

scores = cross_encoder.predict([(query, c) for c in candidates])

# Sort by score only, highest first.
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
for score, doc in ranked:
    print(f"Score {score:.3f}: {doc}")

Cross-encoders are typically used as rerankers: a bi-encoder (sentence-transformer) quickly retrieves a shortlist, for example the top 100 candidates, and a cross-encoder then rescores only that shortlist for precision.
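
The two-stage pattern can be sketched without committing to particular models. In the example below, `cheap_score` and `precise_score` are stand-ins introduced for illustration: in a real system the first would typically be a bi-encoder cosine similarity and the second a cross-encoder score. Simple word-overlap functions are used here so the sketch runs without downloading a model; they are not meant as realistic scorers.

```python
def retrieve_then_rerank(query, corpus, cheap_score, precise_score, top_k=100):
    """Two-stage search: fast shortlist first, slower rerank second."""
    # Stage 1: score every document with the cheap function, keep the top_k.
    shortlist = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:top_k]
    # Stage 2: run the expensive scorer only on the shortlist.
    reranked = [(precise_score(query, doc), doc) for doc in shortlist]
    reranked.sort(key=lambda pair: pair[0], reverse=True)
    return reranked


def word_overlap(query, doc):
    """Stand-in cheap scorer: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def overlap_ratio(query, doc):
    """Stand-in precise scorer: shared words divided by document length."""
    words = doc.lower().split()
    return word_overlap(query, doc) / len(words) if words else 0.0


corpus = [
    "transfer learning adapts pre-trained models to new tasks",
    "tokenization splits text into units",
    "transfer learning in nlp reuses language models and saves a lot of training time",
    "image classification with convolutional networks",
]
results = retrieve_then_rerank(
    "transfer learning in nlp", corpus, word_overlap, overlap_ratio, top_k=2
)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

The value of `top_k` trades recall for cost: a larger shortlist gives the second stage more chances to surface a relevant document, but every extra candidate costs one more call to the expensive scorer.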