🧠 Bag of Words (BoW) in NLP: A Beginner’s Guide with Python Examples
Natural Language Processing (NLP) is all about teaching machines to understand and process human language. But there’s a catch: machines don’t understand words—they understand numbers.
So how do we convert text into numbers in a meaningful way?
Enter the Bag of Words (BoW) model.
BoW is one of the simplest and most widely used techniques for text vectorization. Despite its simplicity, it’s the foundation of many powerful models in NLP.
📘 What is Bag of Words (BoW)?
Bag of Words is a way to represent text data as a collection (bag) of its words, ignoring grammar and word order, but keeping track of word frequency.
In BoW:
- Each unique word in the dataset becomes a feature (column).
- Each document becomes a row vector with numbers showing how many times each word appears.
The result is a numerical representation of the text that can be used for machine learning.
🧾 Simple Example
Let’s say we have two sentences:
- “I love NLP”
- “NLP is fun”
Vocabulary = {I, love, NLP, is, fun}
Now we turn each sentence into a vector:
Sentence | I | love | NLP | is | fun |
---|---|---|---|---|---|
I love NLP | 1 | 1 | 1 | 0 | 0 |
NLP is fun | 0 | 0 | 1 | 1 | 1 |
This table is our Bag of Words representation.
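To make the idea concrete, here is a minimal from-scratch sketch of how you could build that table in plain Python (the variable names are just illustrative):

```python
# A from-scratch Bag of Words for the two example sentences.
sentences = ["I love NLP", "NLP is fun"]

# Build the vocabulary: every unique word, in order of first appearance.
vocab = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

# Turn each sentence into a vector of word counts over the vocabulary.
vectors = [[sentence.split().count(word) for word in vocab] for sentence in sentences]

print(vocab)    # ['I', 'love', 'NLP', 'is', 'fun']
print(vectors)  # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```

In practice you rarely write this by hand; libraries like scikit-learn do the same thing (plus tokenization and normalization) for you, as the examples below show.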
✅ Why Use Bag of Words?
- Simple to understand and implement
- Works well for small to medium datasets
- A good baseline for classification, clustering, and sentiment analysis
🔥 Limitations of BoW
- Ignores word order (e.g., “not good” and “good not” get identical vectors; see the sketch after this list)
- Doesn’t capture semantics or context
- Can result in sparse matrices (many zeros)
Still, it’s a fantastic starting point for anyone new to NLP.
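To see the word-order limitation in action, here is a small sketch using scikit-learn's CountVectorizer (introduced properly in the next section). Two phrases with opposite word order map to the same vector; passing ngram_range is one common mitigation:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good", "good not"]

# Plain BoW: both phrases map to the exact same vector.
vec = CountVectorizer()
print(vec.fit_transform(docs).toarray())
# [[1 1]
#  [1 1]]

# Adding bigrams preserves some local word order, so the rows now differ.
bigram_vec = CountVectorizer(ngram_range=(1, 2))
X = bigram_vec.fit_transform(docs)
print(bigram_vec.get_feature_names_out())  # ['good' 'good not' 'not' 'not good']
print(X.toarray())
# [[1 0 1 1]
#  [1 1 1 0]]
```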
🧪 Code Example 1: Basic BoW with CountVectorizer
```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
corpus = [
    "I love NLP",
    "NLP is fun",
    "I love machine learning"
]

# Initialize BoW transformer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Show feature names
print("Vocabulary:", vectorizer.get_feature_names_out())

# Convert to dense array
print("BoW Matrix:\n", X.toarray())
```

Output:

```
Vocabulary: ['fun' 'is' 'learning' 'love' 'machine' 'nlp']
BoW Matrix:
 [[0 0 0 1 0 1]
 [1 1 0 0 0 1]
 [0 0 1 1 1 0]]
```

Note that CountVectorizer lowercases text and, by default, ignores single-character tokens, which is why “I” is missing from the vocabulary.
📦 Code Example 2: BoW with Custom Preprocessing
You can preprocess text (lowercasing, stripping punctuation) before vectorizing.
```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Custom clean function: lowercase, then strip anything that isn't a letter or space
def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return text

docs = ["Wow!! I love NLP.", "NLP, is really COOL!", "love love love this field!"]
cleaned_docs = [preprocess(doc) for doc in docs]

# Create BoW matrix
vec = CountVectorizer()
matrix = vec.fit_transform(cleaned_docs)

print("Features:", vec.get_feature_names_out())
print("BoW Matrix:\n", matrix.toarray())
```

Output:

```
Features: ['cool' 'field' 'is' 'love' 'nlp' 'really' 'this' 'wow']
BoW Matrix:
 [[0 0 0 1 1 0 0 1]
 [1 0 1 0 1 1 0 0]
 [0 1 0 3 0 0 1 0]]
```
Notice how “love” appears three times in the third document, and the matrix reflects that with a count of 3.
📊 Code Example 3: Visualizing Word Frequencies
Let’s count word frequency from BoW and plot it using Matplotlib.
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

text = [
    "Text normalization is important in NLP",
    "NLP uses BoW for representation",
    "BoW represents text using word counts"
]

vec = CountVectorizer()
X = vec.fit_transform(text)

# Sum each column to get each word's total frequency across the corpus
word_freq = np.sum(X.toarray(), axis=0)

# Words and frequencies
words = vec.get_feature_names_out()
freq_dict = dict(zip(words, word_freq))

# Plot
plt.figure(figsize=(10, 5))
plt.bar(freq_dict.keys(), freq_dict.values(), color='skyblue')
plt.title("Word Frequencies (Bag of Words)")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```
This chart helps visualize which words are most common in the corpus.
🧠 When Should You Use BoW?
BoW is excellent for:
- Sentiment analysis
- Spam detection
- Document classification
- Text similarity (e.g., comparing job descriptions; see the sketch below)
It may not be the best choice when:
- Word order matters (e.g., chatbots, grammar analysis)
- You need to capture semantic meaning (try Word2Vec or BERT)
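As an illustration of the similarity use case, here is a small sketch that compares BoW vectors with cosine similarity; the "job description" snippets are invented for the example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy job-description snippets, made up for this example.
docs = [
    "python developer with NLP experience",
    "looking for a python NLP developer",
    "senior accountant with tax experience",
]

X = CountVectorizer().fit_transform(docs)
print(cosine_similarity(X).round(2))
# The two python/NLP documents (rows 0 and 1) score higher with
# each other than either does with the accounting one.
```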
🔄 BoW vs TF-IDF
Feature | Bag of Words | TF-IDF |
---|---|---|
Uses raw counts | ✅ | ❌ (re-weights counts) |
Captures word importance | ❌ | ✅ (downweights common words) |
Simplicity | ✅ Easy | Slightly more complex |
Since TF-IDF is essentially a re-weighted Bag of Words, it is an easy upgrade for more advanced NLP tasks; the sketch below contrasts the two on the same corpus.
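Here is a quick sketch of the difference, running both vectorizers on the corpus from Code Example 1 (TfidfVectorizer is scikit-learn's TF-IDF counterpart to CountVectorizer):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["I love NLP", "NLP is fun", "I love machine learning"]

# BoW: raw integer counts.
print(CountVectorizer().fit_transform(corpus).toarray())

# TF-IDF: the same counts, re-weighted so that words shared across
# documents (like "nlp" and "love") carry less weight than rarer ones.
print(TfidfVectorizer().fit_transform(corpus).toarray().round(2))
```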
✨ Summary
Feature | Explanation |
---|---|
What is BoW? | Text → numerical vector of word counts |
Ignores grammar? | Yes |
Context-aware? | No |
Benefits | Easy, fast, interpretable |
Limitations | Sparse, loses meaning/context |
✅ Final Thoughts
Bag of Words is one of the first steps in transforming messy, unstructured text into structured data. While it’s not the most advanced model, it’s powerful, easy to implement, and very effective for many NLP tasks.
Whether you’re working on a news classifier, a spam detector, or even a simple chatbot, Bag of Words gives you a great foundation. Learn it well, then explore TF-IDF, Word Embeddings, and Transformers later.