🔍 Cosine Similarity in NLP: An Easy Guide with Python Examples

In Natural Language Processing (NLP), we often need to measure how similar two pieces of text are. This is where Cosine Similarity comes into play. It’s a simple but powerful mathematical technique to determine how close two vectors (or texts) are in direction — even if they differ in length.

Let’s break down Cosine Similarity in plain English, explore how it works, and walk through three real Python examples to help you apply it in your NLP projects.


📘 What is Cosine Similarity?

Cosine Similarity is a metric used to measure how similar two vectors are by calculating the cosine of the angle between them. In text applications, these vectors typically represent word frequencies, TF-IDF scores, or word embeddings.

Formula:
Cosine Similarity = (A · B) / (||A|| × ||B||)

  • A ⋅ B is the dot product of the two vectors
  • ||A|| and ||B|| are the magnitudes (lengths) of the vectors

The output falls between -1 and 1:

  • 1 (exactly the same direction, i.e., identical content)
  • 0 (orthogonal vectors with no shared terms)
  • -1 (opposite directions, which is rare in NLP because term-count and TF-IDF vectors are non-negative)
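
For a quick sense of the numbers, take two small vectors A = [1, 2] and B = [2, 3]. Then A · B = 1×2 + 2×3 = 8, ||A|| = √(1² + 2²) = √5 ≈ 2.24, and ||B|| = √(2² + 3²) = √13 ≈ 3.61, so the cosine similarity is 8 / (2.24 × 3.61) ≈ 0.99: the two vectors point in nearly the same direction.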

🧠 Why Use Cosine Similarity in NLP?

  • It focuses on the direction (not magnitude), which makes it great for comparing documents of different lengths (see the short sketch after this list).
  • It’s fast, easy to implement, and effective for many applications.
  • Works well with TF-IDF, BoW, and word embeddings.
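
As a quick illustration of the first point, here is a minimal sketch (using scikit-learn's CountVectorizer; the toy sentences are made up): a document and that same document repeated twice produce different raw counts but a cosine similarity of 1.0.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

short_doc = "cats chase mice"
long_doc = "cats chase mice cats chase mice"  # same content, twice the length

counts = CountVectorizer().fit_transform([short_doc, long_doc])

# The count vectors differ in magnitude but point in the same direction
print(cosine_similarity(counts[0:1], counts[1:2])[0][0])  # ~1.0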

✅ Use Cases in NLP

  • Document and sentence similarity
  • Plagiarism detection
  • Search engine matching
  • Question-answer pair matching
  • Recommendation engines

🧪 Example 1: Comparing Documents Using TF-IDF + Cosine Similarity

Let’s start with a simple example using TfidfVectorizer from scikit-learn.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample texts
doc1 = "Natural Language Processing makes machines understand text."
doc2 = "Machines understand text using NLP techniques."
doc3 = "Pizza and pasta are Italian foods."

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([doc1, doc2, doc3])

# Calculate cosine similarity
cos_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)

# Display results
print("Similarity to doc1:")
for i, score in enumerate(cos_sim[0]):
    print(f"Doc{i+1}: {score:.2f}")

What You Learn: Documents 1 and 2 will have high similarity because they share words like "machines", "understand", and "text"; doc3 will score low since its content doesn't overlap. (The first score printed is doc1 compared with itself, which is always 1.00.)
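
Building on the same idea, here is a hedged sketch of the search-engine use case: transform a new query with the already-fitted vectorizer and rank the documents by their similarity to it (the query string is made up, and the snippet reuses vectorizer and tfidf_matrix from above).

# Reuses vectorizer and tfidf_matrix from the example above
query = "How do machines process natural language?"
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, tfidf_matrix)[0]
best = scores.argmax()
print(f"Most relevant: Doc{best + 1} (score: {scores[best]:.2f})")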


🧪 Example 2: Cosine Similarity with Word Embeddings (Using spaCy)

Use this approach when you want to capture semantic meaning with pretrained word vectors.

First, install spaCy and download the medium English model (run these commands in a terminal):

pip install spacy
python -m spacy download en_core_web_md

import spacy
from sklearn.metrics.pairwise import cosine_similarity

# Load medium English model
nlp = spacy.load('en_core_web_md')

# Texts
sentence1 = nlp("I like deep learning and artificial intelligence.")
sentence2 = nlp("I enjoy working with AI and neural networks.")
sentence3 = nlp("The weather is sunny today.")

# Get sentence vectors
vec1 = sentence1.vector.reshape(1, -1)
vec2 = sentence2.vector.reshape(1, -1)
vec3 = sentence3.vector.reshape(1, -1)

# Compute similarities
print("Sentence 1 vs Sentence 2:", cosine_similarity(vec1, vec2)[0][0])
print("Sentence 1 vs Sentence 3:", cosine_similarity(vec1, vec3)[0][0])

What You Learn: Even if the words differ, similar meaning yields a high similarity.
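
As a shortcut, spaCy Doc objects also provide a built-in similarity() method that computes the same cosine score over the document vectors, so the manual reshaping above is optional:

# Equivalent results using spaCy's built-in cosine similarity
print("Sentence 1 vs Sentence 2:", sentence1.similarity(sentence2))
print("Sentence 1 vs Sentence 3:", sentence1.similarity(sentence3))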


🧪 Example 3: Manual Cosine Similarity from Scratch

Great for understanding the math behind it. Let’s implement cosine similarity manually.

import numpy as np

def cosine_sim(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

# Vectors (could be from BoW, TF-IDF, etc.)
vec1 = np.array([1, 2, 3])
vec2 = np.array([2, 4, 6])
vec3 = np.array([0, 0, 1])

# Compare
print("vec1 vs vec2:", cosine_sim(vec1, vec2))
print("vec1 vs vec3:", cosine_sim(vec1, vec3))

What You Learn: vec2 is just vec1 scaled by 2, so the two vectors point in the same direction and the similarity is exactly 1. vec1 and vec3 overlap only in the third component, so their similarity is lower (about 0.80).
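
One practical caveat with the manual version: if either vector is all zeros (for example, a document with no in-vocabulary words), the denominator is zero and the result is not a valid similarity. A common guard, sketched below, is to return 0.0 in that case:

def cosine_sim_safe(a, b):
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat similarity with a zero vector as 0
    return np.dot(a, b) / (norm_a * norm_b)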


⚖️ Cosine Similarity vs. Other Metrics

| Metric | Best For | Weakness |
| --- | --- | --- |
| Cosine | General-purpose text comparison | Ignores word order |
| Euclidean Distance | Numeric features | Sensitive to magnitude |
| Jaccard | Set-based comparison | Not great for long text |
| BERT Similarity | Deep semantic understanding | Slower, resource-intensive |
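
To make the magnitude point from the table concrete, here is a small sketch: scaling a vector leaves its cosine similarity unchanged, while its Euclidean distance grows with the scale factor.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

v = np.array([[1.0, 2.0, 3.0]])
scaled = 3 * v  # same direction, three times the magnitude

print("Cosine:   ", cosine_similarity(v, scaled)[0][0])    # 1.0
print("Euclidean:", euclidean_distances(v, scaled)[0][0])  # increases with the scale factor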

📊 Real-World Application Scenarios

| Application | How Cosine Similarity Helps |
| --- | --- |
| Document Clustering | Groups similar topics together |
| Search Engines | Finds the most relevant docs for a search query |
| Chatbot Intents | Matches user input to known questions |
| Duplicate Detection | Checks for repeated questions or tickets |
| News Recommendation | Suggests similar articles |

🔧 Tips for Improving Similarity Accuracy

  1. Preprocess your text: Lowercase, remove stopwords/punctuation (see the sketch after this list).
  2. Use TF-IDF instead of BoW for more meaningful vectorization.
  3. Use sentence embeddings (like spaCy or BERT) for deep semantics.
  4. Normalize your vectors if you’re not using a library that does it for you.
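
Tips 1, 2, and 4 can often be handled by the vectorizer itself. As a minimal sketch: TfidfVectorizer lowercases by default, the stop_words="english" argument strips common English words, and each row is L2-normalized (norm="l2" is the default), so a plain dot product between rows already equals their cosine similarity.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Machines understand natural language.",
    "Machines process natural language text.",
]

# Lowercasing is on by default; stop_words drops common English words
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Rows are already L2-normalized, so the dot product matches cosine_similarity
print(cosine_similarity(tfidf[0:1], tfidf[1:2])[0][0])
print((tfidf[0:1] @ tfidf[1:2].T).toarray()[0][0])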

⚠️ Common Pitfalls

  • Word order is ignored: Cosine similarity with TF-IDF won’t capture phrases like “dog bites man” vs “man bites dog”.
  • Doesn’t understand synonyms: “happy” vs “joyful” will be far apart in simple vector spaces.
  • Embedding choice matters: Choose between TF-IDF, Word2Vec, BERT based on your use case.

🎯 Conclusion

Cosine Similarity is a powerful and easy-to-use tool in NLP for measuring how similar texts or documents are. Whether you’re building a search engine, a chatbot, or a recommender system, knowing how to implement cosine similarity is a must-have skill in your NLP toolkit.

With the 3 examples shown above, you now have multiple ways to apply cosine similarity in your own projects — from scratch, with TF-IDF, or using semantic embeddings.