🧠 Bag of Words (BoW) in NLP: A Beginner’s Guide with Python Examples
Natural Language Processing (NLP) is all about teaching machines to understand and process human language. But there’s a catch: machines don’t understand words—they understand numbers.
So how do we convert text into numbers in a meaningful way?
Enter the Bag of Words (BoW) model.
BoW is one of the simplest and most widely used techniques for text vectorization. Despite its simplicity, it’s the foundation of many powerful models in NLP.
📘 What is Bag of Words (BoW)?
Bag of Words is a way to represent text data as a collection (bag) of its words, ignoring grammar and word order, but keeping track of word frequency.
In BoW:
- Each unique word in the dataset becomes a feature (column).
- Each document becomes a row vector with numbers showing how many times each word appears.
The result is a numerical representation of the text that can be used for machine learning.
🧾 Simple Example
Let’s say we have two sentences:
- “I love NLP”
- “NLP is fun”
Vocabulary = {I, love, NLP, is, fun}
Now we turn each sentence into a vector:
Sentence | I | love | NLP | is | fun |
---|---|---|---|---|---|
I love NLP | 1 | 1 | 1 | 0 | 0 |
NLP is fun | 0 | 0 | 1 | 1 | 1 |
This table is our Bag of Words representation.
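To make the idea concrete, here is a minimal from-scratch sketch of how you could build that table in plain Python (the variable names are just illustrative):

```python
# A from-scratch Bag of Words for the two example sentences.
sentences = ["I love NLP", "NLP is fun"]

# Build the vocabulary: every unique word, in order of first appearance.
vocab = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

# Turn each sentence into a vector of word counts over the vocabulary.
vectors = [[sentence.split().count(word) for word in vocab] for sentence in sentences]

print(vocab)    # ['I', 'love', 'NLP', 'is', 'fun']
print(vectors)  # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```

In practice you rarely write this by hand; libraries like scikit-learn do the same thing (plus tokenization and normalization) for you, as the examples below show.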
✅ Why Use Bag of Words?
- Simple to understand and implement
- Works well for small to medium datasets
- A good baseline for classification, clustering, and sentiment analysis
🔥 Limitations of BoW
- Ignores word order (e.g., “not good” and “good not” get identical vectors; see the sketch after this list)
- Doesn’t capture semantics or context
- Can result in sparse matrices (many zeros)
Still, it’s a fantastic starting point for anyone new to NLP.
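To see the word-order limitation in action, here is a small sketch using scikit-learn's CountVectorizer (introduced properly in the next section). Two phrases with opposite word order map to the same vector; passing ngram_range is one common mitigation:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good", "good not"]

# Plain BoW: both phrases map to the exact same vector.
vec = CountVectorizer()
print(vec.fit_transform(docs).toarray())
# [[1 1]
#  [1 1]]

# Adding bigrams preserves some local word order, so the rows now differ.
bigram_vec = CountVectorizer(ngram_range=(1, 2))
X = bigram_vec.fit_transform(docs)
print(bigram_vec.get_feature_names_out())  # ['good' 'good not' 'not' 'not good']
print(X.toarray())
# [[1 0 1 1]
#  [1 1 1 0]]
```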
🧪 Code Example 1: Basic BoW with CountVectorizer
```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
corpus = [
    "I love NLP",
    "NLP is fun",
    "I love machine learning"
]

# Initialize BoW transformer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Show feature names
print("Vocabulary:", vectorizer.get_feature_names_out())

# Convert to dense array
print("BoW Matrix:\n", X.toarray())
```

Output:

```
Vocabulary: ['fun' 'is' 'learning' 'love' 'machine' 'nlp']
BoW Matrix:
 [[0 0 0 1 0 1]
 [1 1 0 0 0 1]
 [0 0 1 1 1 0]]
```

Note that CountVectorizer lowercases text and, by default, ignores single-character tokens, which is why “I” is missing from the vocabulary.
📦 Code Example 2: BoW with Custom Preprocessing
You can preprocess text (lowercasing, stripping punctuation) before vectorizing.
```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Custom clean function: lowercase, then strip anything that isn't a letter or space
def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return text

docs = ["Wow!! I love NLP.", "NLP, is really COOL!", "love love love this field!"]
cleaned_docs = [preprocess(doc) for doc in docs]

# Create BoW matrix
vec = CountVectorizer()
matrix = vec.fit_transform(cleaned_docs)

print("Features:", vec.get_feature_names_out())
print("BoW Matrix:\n", matrix.toarray())
```

Output:

```
Features: ['cool' 'field' 'is' 'love' 'nlp' 'really' 'this' 'wow']
BoW Matrix:
 [[0 0 0 1 1 0 0 1]
 [1 0 1 0 1 1 0 0]
 [0 1 0 3 0 0 1 0]]
```
Notice how “love” appears three times in the third document, and the matrix reflects that with a count of 3.
📊 Code Example 3: Visualizing Word Frequencies
Let’s count word frequency from BoW and plot it using Matplotlib.
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

text = [
    "Text normalization is important in NLP",
    "NLP uses BoW for representation",
    "BoW represents text using word counts"
]

vec = CountVectorizer()
X = vec.fit_transform(text)

# Sum each column to get each word's total frequency across the corpus
word_freq = np.sum(X.toarray(), axis=0)

# Words and frequencies
words = vec.get_feature_names_out()
freq_dict = dict(zip(words, word_freq))

# Plot
plt.figure(figsize=(10, 5))
plt.bar(freq_dict.keys(), freq_dict.values(), color='skyblue')
plt.title("Word Frequencies (Bag of Words)")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```
This chart helps visualize which words are most common in the corpus.
🧠 When Should You Use BoW?
BoW is excellent for:
- Sentiment analysis
- Spam detection
- Document classification
- Text similarity (e.g., comparing job descriptions; see the sketch below)
It may not be the best choice when:
- Word order matters (e.g., chatbots, grammar analysis)
- You need to capture semantic meaning (try Word2Vec or BERT)
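As an illustration of the similarity use case, here is a small sketch that compares BoW vectors with cosine similarity; the "job description" snippets are invented for the example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy job-description snippets, made up for this example.
docs = [
    "python developer with NLP experience",
    "looking for a python NLP developer",
    "senior accountant with tax experience",
]

X = CountVectorizer().fit_transform(docs)
print(cosine_similarity(X).round(2))
# The two python/NLP documents (rows 0 and 1) score higher with
# each other than either does with the accounting one.
```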
🔄 BoW vs TF-IDF
Feature | Bag of Words | TF-IDF |
---|---|---|
Uses raw counts | ✅ | ❌ (re-weights counts) |
Captures word importance | ❌ | ✅ (downweights common words) |
Simplicity | ✅ Easy | Slightly more complex |
Since TF-IDF is essentially a re-weighted Bag of Words, it is an easy upgrade for more advanced NLP tasks; the sketch below contrasts the two on the same corpus.
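Here is a quick sketch of the difference, running both vectorizers on the corpus from Code Example 1 (TfidfVectorizer is scikit-learn's TF-IDF counterpart to CountVectorizer):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["I love NLP", "NLP is fun", "I love machine learning"]

# BoW: raw integer counts.
print(CountVectorizer().fit_transform(corpus).toarray())

# TF-IDF: the same counts, re-weighted so that words shared across
# documents (like "nlp" and "love") carry less weight than rarer ones.
print(TfidfVectorizer().fit_transform(corpus).toarray().round(2))
```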
✨ Summary
Feature | Explanation |
---|---|
What is BoW? | Text → numerical vector of word counts |
Ignores grammar? | Yes |
Context-aware? | No |
Benefits | Easy, fast, interpretable |
Limitations | Sparse, loses meaning/context |
✅ Final Thoughts
Bag of Words is one of the first steps in transforming messy, unstructured text into structured data. While it’s not the most advanced model, it’s powerful, easy to implement, and very effective for many NLP tasks.
Whether you’re working on a news classifier, a spam detector, or even a simple chatbot, Bag of Words gives you a great foundation. Learn it well, then explore TF-IDF, Word Embeddings, and Transformers later.