📘 Understanding Document Similarity in NLP: Concepts and Python Examples
In Natural Language Processing (NLP), understanding how similar two pieces of text are is one of the most important tasks. This task is known as Document Similarity.
Whether you're building a plagiarism checker, a search engine, a recommendation system, or a chatbot, document similarity plays a key role in helping machines compare and understand human-written content.
🔍 What Is Document Similarity?
Document similarity measures how alike two or more pieces of text are. The goal is to compute a score that represents how close their meanings are, not just whether the same words are used.
This can be done in various ways, from simple keyword matching to advanced models that capture sentence meaning and context.
📌 Why Is It Important?
Here are a few real-world uses:
- Plagiarism detection
- Duplicate detection in content or questions
- Semantic search (matching queries with documents)
- Recommender systems based on text similarity
- Chatbots identifying similar user inputs
🔢 Popular Techniques to Measure Document Similarity
- Cosine Similarity with TF-IDF
- Jaccard Similarity
- Semantic Similarity using Sentence Transformers (BERT)
Weโll explore each with a hands-on Python example.
🧪 Example 1: Document Similarity using TF-IDF + Cosine Similarity
TF-IDF (Term Frequency-Inverse Document Frequency) weights each word by how often it appears in a document relative to how common it is across the whole corpus, so distinctive words score higher. Cosine similarity then measures how aligned the resulting vectors are.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
doc1 = "Artificial intelligence is changing the world."
doc2 = "Machine learning and AI are transforming industries."
doc3 = "Cats and dogs are common household pets."

# Vectorize the documents into TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([doc1, doc2, doc3])

# Calculate cosine similarity of doc1 against all documents
cos_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)

print("Cosine Similarity with doc1:")
for i, score in enumerate(cos_sim[0]):
    print(f"Document {i+1}: {score:.2f}")
```
The output will show that doc1 is more similar to doc2 than to doc3.
✅ Pros: Fast and interpretable
❌ Cons: Doesn't understand context or synonyms
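To demystify the score, here is a minimal sketch that recomputes the doc1 vs doc2 similarity by hand, assuming the `tfidf_matrix` from the example above is still in scope: cosine similarity is just the dot product of the two vectors divided by the product of their norms.

```python
import numpy as np

# Dense copies of the TF-IDF vectors for doc1 and doc2
a = tfidf_matrix[0].toarray().ravel()
b = tfidf_matrix[1].toarray().ravel()

# Cosine similarity: dot product divided by the product of the norms
manual_score = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Manual cosine similarity (doc1 vs doc2): {manual_score:.2f}")
```

This should match the second value printed by `cosine_similarity` above.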
🧪 Example 2: Document Similarity using Jaccard Similarity
Jaccard similarity counts the words two documents have in common and divides by the total number of unique words across both: the size of the intersection over the size of the union.
```python
def jaccard_similarity(doc1, doc2):
    # Lowercase and split each document into a set of unique words
    words_doc1 = set(doc1.lower().split())
    words_doc2 = set(doc2.lower().split())
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return len(intersection) / len(union)

# Test documents
doc1 = "AI is powerful and smart"
doc2 = "AI is smart and useful"
doc3 = "I love hiking and nature"

# Compare
print("Doc1 vs Doc2 Jaccard:", jaccard_similarity(doc1, doc2))
print("Doc1 vs Doc3 Jaccard:", jaccard_similarity(doc1, doc3))
```
The output shows a higher score for doc1 vs doc2 than for doc1 vs doc3.
✅ Pros: Simple and interpretable
❌ Cons: Ignores word order and meaning
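One caveat with the whitespace split above: "smart" and "smart." count as different words, so punctuation can quietly lower scores. Here is a minimal sketch of a more forgiving variant, assuming a simple regex tokenizer is acceptable for your text:

```python
import re

def jaccard_similarity_tokens(doc1, doc2):
    # Extract lowercase word tokens, ignoring punctuation
    tokens1 = set(re.findall(r"[a-z0-9']+", doc1.lower()))
    tokens2 = set(re.findall(r"[a-z0-9']+", doc2.lower()))
    return len(tokens1 & tokens2) / len(tokens1 | tokens2)

print(jaccard_similarity_tokens("AI is smart.", "AI is smart"))  # 1.0
```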
🧪 Example 3: Semantic Document Similarity using Sentence-BERT
This method uses transformer-based models like BERT to understand the semantic meaning of the whole document.
Install the library first:
```bash
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer, util

# Load a pretrained sentence-embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample documents
docs = [
    "The economy is experiencing a recession due to inflation.",
    "Economic downturns are caused by inflation and high interest rates.",
    "Soccer is a popular sport around the world."
]

# Get embeddings for all documents at once
embeddings = model.encode(docs, convert_to_tensor=True)

# Compare doc 0 with every document (including itself)
similarities = util.cos_sim(embeddings[0], embeddings)

print("Semantic Similarity to doc1:")
for i, score in enumerate(similarities[0]):
    print(f"Document {i+1}: {score.item():.2f}")
```
This technique understands context and synonyms, and produces high-quality similarity scores.
✅ Pros: Captures deep semantic meaning
❌ Cons: Requires more memory and compute
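Pairwise scoring is fine for a handful of documents, but retrieval tasks usually need the top matches from a larger corpus. Here is a minimal sketch using the library's `util.semantic_search` helper, which returns the top-k corpus hits per query; the corpus and query strings are made up for illustration, and it reuses the `model` loaded above:

```python
corpus = [
    "Inflation is driving the economy into a recession.",
    "Central banks raise interest rates to fight inflation.",
    "The soccer world cup final drew a record audience."
]
query = "Why is the economy shrinking?"

# Embed the corpus once, then embed each incoming query
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the 2 most similar corpus documents for the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.2f}  {corpus[hit['corpus_id']]}")
```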
🧠 When to Use What?
| Technique | Best For | Limitation |
|---|---|---|
| TF-IDF + Cosine | Keyword-based similarity, fast tasks | Ignores meaning |
| Jaccard | Short texts, simple use cases | No semantic understanding |
| BERT/SBERT | High-accuracy, semantic tasks | Slower, more resource-intensive |
✨ Real-World Applications
| Application | How Similarity Helps |
|---|---|
| FAQ Bots | Match user questions to existing answers |
| Search Engines | Rank documents based on relevance |
| News Aggregators | Cluster similar stories together |
| Content Recommendation Systems | Recommend articles based on topic similarity |
| Legal & Medical NLP | Identify similar clauses or patient records |
📝 Tips for Better Document Similarity
- Clean and normalize your text (lowercase it and remove stopwords; see the sketch after this list)
- Use semantic methods (SBERT) for long or complex texts
- Combine similarity scores with business rules for robust systems
- Evaluate results with real-world examples, not just math
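As a starting point for the first tip, here is a minimal normalization sketch; it assumes scikit-learn's built-in English stopword list fits your domain:

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def normalize(text):
    # Lowercase, keep only word characters, drop English stopwords
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in ENGLISH_STOP_WORDS)

print(normalize("The economy IS experiencing a recession!"))
# -> "economy experiencing recession"
```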
⚠️ Challenges in Document Similarity
- Long documents → a single embedding may miss context (one common mitigation is sketched below)
- Synonyms or paraphrasing → traditional methods fail
- Domain-specific texts → may require fine-tuning models
- Computational cost → large models are memory-intensive
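For the long-document challenge, one common mitigation is to split each document into chunks, embed each chunk, and use the best-matching chunk pair as the document score. This is a minimal sketch, not the only approach; it assumes the `model` from Example 3 is loaded and that a fixed word window is an acceptable chunking strategy:

```python
def chunk(text, size=100):
    # Naive fixed-size word windows; sentence-aware splitting is usually better
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)] or [""]

def max_chunk_similarity(doc_a, doc_b):
    emb_a = model.encode(chunk(doc_a), convert_to_tensor=True)
    emb_b = model.encode(chunk(doc_b), convert_to_tensor=True)
    # Best-matching chunk pair stands in for whole-document similarity
    return util.cos_sim(emb_a, emb_b).max().item()
```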
📊 Bonus: Visualizing Similar Documents with PCA
You can visualize similar documents in 2D using PCA!
```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Get embeddings (reuses the SentenceTransformer model from Example 3)
docs = ["AI in healthcare", "Machine learning for hospitals", "Soccer scores today"]
embeddings = model.encode(docs)

# Reduce the embeddings to 2 dimensions
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

# Plot each document as a labeled point
plt.scatter(reduced[:, 0], reduced[:, 1])
for i, doc in enumerate(docs):
    plt.annotate(f"Doc {i+1}", (reduced[i, 0], reduced[i, 1]))
plt.title("Document Similarity (PCA View)")
plt.show()
```
🧾 Final Thoughts
Document similarity is a core task in NLP that enables smarter apps, better recommendations, and more useful bots. Whether you're matching resumes to jobs or queries to documents, the right similarity technique can dramatically improve your results.
Start simple with TF-IDF, experiment with Jaccard, and scale up to Sentence Transformers when you need deeper semantic understanding.
With the code examples shared here, you're ready to build your own smart text comparison tools.