Natural Language Processing
Fundamental Concepts
- Tokenization
- Stemming
- Lemmatization
- POS Tagging
- Named Entity Recognition
- Stopword Removal
- Syntax
- Dependency Parsing
- Parsing
- Chunking
Text Processing & Cleaning
- Text Normalization
- Bag of Words
- TF-IDF
- N-grams
- Word Embeddings
- Sentence Embeddings
- Document Similarity
- Cosine Similarity
- Text Vectorization
- Noise Removal
Tools, Libraries & APIs
- NLTK
- spaCy
- TextBlob
- Hugging Face Transformers
- Gensim
- OpenAI
- CoreNLP
- FastText
- Flair NLP
- ElasticSearch + NLP
Programs
- Build a Chatbot Using NLP
- Extracting Meaning from Text Using NLP in Python
- Extracting Email Addresses Using NLP in Python
- Extracting Names of People, Cities, and Countries Using NLP
- Format Email Messages Using NLP
- N-gram program
- Resume Skill Extraction Using NLP
- Sentiment Analysis in NLP
- Optimizing Travel Routes Using NLP & TSP Algorithm in Python
Introduction to Gensim in NLP
Natural Language Processing (NLP) involves working with large amounts of text data and understanding language patterns. One popular Python library that excels in advanced NLP tasks like topic modeling and document similarity is Gensim. Developed by Radim Řehůřek, Gensim is known for its efficiency and scalability when handling large corpora.
This guide will walk you through the essential concepts of Gensim, including:
- What Gensim is
- Topic modeling using Latent Dirichlet Allocation (LDA)
- Document similarity using TF-IDF and Word2Vec
- Three example programs with explanations
Let’s dive in!
What is Gensim?
Gensim is a robust open-source Python library designed specifically for unsupervised topic modeling and document similarity in natural language processing. Its core strengths are memory independence (the full corpus never needs to fit in RAM), streamed processing, and easy integration with large text datasets.
Gensim supports models like:
- TF-IDF (Term Frequency-Inverse Document Frequency)
- LDA (Latent Dirichlet Allocation)
- Word2Vec, FastText
- Doc2Vec
Core Concepts
1. Topic Modeling
Topic modeling is an unsupervised learning technique that discovers abstract topics within a text corpus. Gensim's LDA implementation represents each document as a probabilistic mixture of topics, and each topic as a distribution over words.
2. Document Similarity
Document similarity involves comparing two or more documents to determine how alike they are. Gensim uses vector space models such as TF-IDF or Word2Vec to calculate cosine similarity between document vectors.
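Cosine similarity itself is just the normalized dot product of two vectors. A minimal NumPy sketch with two made-up term-count vectors shows the calculation Gensim performs under the hood:

```python
import numpy as np

# Hypothetical 3-term count vectors for two short documents
doc_a = np.array([2.0, 1.0, 0.0])
doc_b = np.array([1.0, 1.0, 0.0])

# cos(a, b) = a.b / (|a| * |b|); ranges from -1 to 1,
# where 1 means the vectors point in the same direction
cos = np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(round(cos, 3))  # → 0.949
```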
Example 1: Topic Modeling with LDA
from gensim import corpora, models
from pprint import pprint
# Sample documents
documents = [
    "Artificial intelligence and machine learning are revolutionizing technology",
    "The automotive industry is heavily investing in AI and self-driving cars",
    "Machine learning is a subset of artificial intelligence",
    "Cooking recipes and healthy meals are trending online"
]
# Tokenize and preprocess texts
texts = [[word.lower() for word in doc.split()] for doc in documents]
# Create dictionary and corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# Train LDA model
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
# Print topics
pprint(lda_model.print_topics())
Output (example):
[(0, '0.062*"artificial" + 0.061*"intelligence" + 0.060*"machine" + ...'),
(1, '0.078*"cooking" + 0.065*"recipes" + 0.061*"healthy" + ...')]
Example 2: Document Similarity Using TF-IDF
from gensim import similarities
# Continue from previous example
# Create TF-IDF model
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
# Create index for similarity comparison
index = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
# Query document similarity
query_doc = "machine learning and artificial intelligence"
query_bow = dictionary.doc2bow(query_doc.lower().split())
query_tfidf = tfidf[query_bow]
# Compute similarities
similarities_scores = index[query_tfidf]
print(list(enumerate(similarities_scores)))
Output (example):
[(0, 0.81), (1, 0.68), (2, 0.79), (3, 0.05)]
Example 3: Document Similarity Using Word2Vec
from gensim.models import Word2Vec
import numpy as np
tokens = [doc.lower().split() for doc in documents]
# Train Word2Vec model
model = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=1, workers=2)
# Function to compute average vector for a document
def document_vector(doc):
    doc = doc.lower().split()
    return np.mean([model.wv[word] for word in doc if word in model.wv], axis=0)
# Compute cosine similarity manually
def cosine_sim(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
vec_a = document_vector(documents[0])
vec_b = document_vector(documents[2])
print("Similarity Score:", cosine_sim(vec_a, vec_b))
Output (example; Word2Vec weights are randomly initialized, so the score varies between runs):
Similarity Score: 0.89
Benefits of Using Gensim
- Memory Efficient: Suitable for streaming large datasets.
- Unsupervised Learning: No need for labeled data.
- Well-documented: Easy to integrate and understand.
- Flexible Models: Choose from various similarity and topic modeling techniques.
Use Cases
- Topic detection in news articles or blogs
- Similar document retrieval in search engines
- Content recommendation systems
- Semantic clustering for analytics
Conclusion
Gensim is an indispensable tool for anyone working with textual data in NLP. Its strong capabilities in topic modeling and document similarity make it a go-to solution for building intelligent applications. Whether you’re analyzing customer reviews, classifying research papers, or building search engines, Gensim has you covered.
Explore Gensim’s documentation further and start integrating it into your NLP workflows today!