🧠 Bag of Words (BoW) in NLP: A Beginner’s Guide with Python Examples

Natural Language Processing (NLP) is all about teaching machines to understand and process human language. But there’s a catch: machines don’t understand words—they understand numbers.

So how do we convert text into numbers in a meaningful way?

Enter the Bag of Words (BoW) model.

BoW is one of the simplest and most widely used techniques for text vectorization. Despite its simplicity, it’s the foundation of many powerful models in NLP.


📘 What is Bag of Words (BoW)?

Bag of Words is a way to represent text data as a collection (bag) of its words, ignoring grammar and word order, but keeping track of word frequency.

In BoW:

  • Each unique word in the dataset becomes a feature (column).
  • Each document becomes a row vector with numbers showing how many times each word appears.

The result is a numerical representation of the text that can be used for machine learning.


🧾 Simple Example

Let’s say we have two sentences:

  1. “I love NLP”
  2. “NLP is fun”

Vocabulary = {I, love, NLP, is, fun}

Now we turn each sentence into a vector:

Sentence      I   love   NLP   is   fun
I love NLP    1   1      1     0    0
NLP is fun    0   0      1     1    1

This table is our Bag of Words representation.
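
A minimal pure-Python sketch makes the mechanics concrete: it builds exactly this table by hand, with no libraries.

sentences = ["I love NLP", "NLP is fun"]

# Build the vocabulary: every unique word, in order of first appearance
vocab = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

# Turn each sentence into a vector of word counts over the vocabulary
vectors = [[sentence.split().count(word) for word in vocab]
           for sentence in sentences]

print(vocab)    # ['I', 'love', 'NLP', 'is', 'fun']
print(vectors)  # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]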


✅ Why Use Bag of Words?

  • Simple to understand and implement
  • Works well for small to medium datasets
  • A good baseline for classification, clustering, and sentiment analysis

🔥 Limitations of BoW

  • Ignores word order (e.g., “not good” vs. “good”; see the quick demo below)
  • Doesn’t capture semantics or context
  • Can result in sparse matrices (many zeros)
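
To see the word-order problem concretely, here is a quick sanity check (it uses scikit-learn's CountVectorizer, introduced in the next section): two sentences with opposite meanings end up with identical vectors.

from sklearn.feature_extraction.text import CountVectorizer

# Opposite sentiment, same bag of words
docs = [
    "the movie was not good, it was bad",
    "the movie was good, it was not bad"
]
X = CountVectorizer().fit_transform(docs).toarray()
print((X[0] == X[1]).all())  # True: BoW cannot tell them apart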

Still, it’s a fantastic starting point for anyone new to NLP.


🧪 Code Example 1: Basic BoW with CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

# Sample data
corpus = [
    "I love NLP",
    "NLP is fun",
    "I love machine learning"
]

# Initialize BoW transformer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Show feature names
print("Vocabulary:", vectorizer.get_feature_names_out())

# Convert to dense array
print("BoW Matrix:\n", X.toarray())

Output:

Vocabulary: ['fun' 'is' 'learning' 'love' 'machine' 'nlp']
BoW Matrix:
[[0 0 0 1 0 1]
 [1 1 0 0 0 1]
 [0 0 1 1 1 0]]
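
One practical detail, continuing the snippet above: after fitting, you reuse the same vocabulary with transform(). Words the vectorizer never saw during fitting (here, “deep”) are silently dropped.

# Vectorize a new document against the fitted vocabulary
print(vectorizer.transform(["I love deep learning"]).toarray())
# [[0 0 1 1 0 0]] -> only "learning" and "love" are counted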

📦 Code Example 2: BoW with Custom Preprocessing

You can preprocess text (e.g., lowercase it and strip punctuation) before vectorizing.

import re
from sklearn.feature_extraction.text import CountVectorizer

# Custom clean function
def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return text

docs = ["Wow!! I love NLP.", "NLP, is really COOL!", "love love love this field!"]
cleaned_docs = [preprocess(doc) for doc in docs]

# Create BoW matrix
vec = CountVectorizer()
matrix = vec.fit_transform(cleaned_docs)

print("Features:", vec.get_feature_names_out())
print("BoW Matrix:\n", matrix.toarray())

Output:

Features: ['cool' 'field' 'is' 'love' 'nlp' 'really' 'this' 'wow']
BoW Matrix:
[[0 0 0 1 1 0 0 1]
 [1 0 1 0 1 1 0 0]
 [0 1 0 3 0 0 1 0]]

Notice how the repeated word “love” in the third document shows up with a count of 3.
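
As a side note, CountVectorizer can do much of this cleanup itself: it lowercases by default, its tokenizer already strips punctuation, and passing stop_words='english' filters common English words (such as “is” and “this”) using scikit-learn's built-in list.

# Built-in lowercasing, tokenization, and stop-word filtering
vec = CountVectorizer(stop_words='english')
matrix = vec.fit_transform(docs)  # the raw, uncleaned docs
print("Features:", vec.get_feature_names_out())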


📊 Code Example 3: Visualizing Word Frequencies

Let’s count word frequency from BoW and plot it using Matplotlib.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

text = ["Text normalization is important in NLP", "NLP uses BoW for representation", "BoW represents text using word counts"]

vec = CountVectorizer()
X = vec.fit_transform(text)
word_freq = np.sum(X.toarray(), axis=0)

# Words and frequencies
words = vec.get_feature_names_out()
freq_dict = dict(zip(words, word_freq))

# Plot
plt.figure(figsize=(10,5))
plt.bar(freq_dict.keys(), freq_dict.values(), color='skyblue')
plt.title("Word Frequencies (Bag of Words)")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

This chart helps visualize which words are most common in the corpus.
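
For a larger corpus the x-axis gets crowded, so a common tweak is to sort the counts and plot only the top N words:

# Keep only the 10 most frequent words for plotting
top_n = sorted(freq_dict.items(), key=lambda kv: kv[1], reverse=True)[:10]
words_top, counts_top = zip(*top_n)
plt.bar(words_top, counts_top, color='skyblue')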


🧠 When Should You Use BoW?

BoW is excellent for:

  • Sentiment analysis
  • Spam detection
  • Document classification
  • Text similarity (e.g., comparing job descriptions)
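
As a small sketch of that last use case (the job-ad snippets here are invented for illustration), BoW vectors pair naturally with cosine similarity:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

jobs = [
    "python developer, NLP experience required",
    "NLP engineer, python skills required",
    "senior accountant, payroll experience"
]
X = CountVectorizer().fit_transform(jobs)
print(cosine_similarity(X).round(2))
# The two NLP ads score much higher with each other
# than either does with the accountant ad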

It may not be the best choice when:

  • Word order matters (e.g., chatbots, grammar analysis)
  • You need to capture semantic meaning (try Word2Vec or BERT)

🔄 BoW vs TF-IDF

Feature                  Bag of Words   TF-IDF
Counts raw frequencies   ✅             ❌ (weights important words)
Captures importance      ❌             ✅
Simplicity               ✅ Easy        Slightly more complex

You can even combine BoW with TF-IDF or other models for more advanced NLP tasks.
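
Switching is nearly a one-line change: scikit-learn's TfidfVectorizer exposes the same fit_transform interface as CountVectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love NLP", "NLP is fun", "I love machine learning"]

# Same interface as CountVectorizer, but raw counts are
# re-weighted so words that appear in every document matter less
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))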


✨ Summary

Feature            Explanation
What is BoW?       Text → numerical vector of word counts
Ignores Grammar?   Yes
Context-aware?     No
Benefits           Easy, fast, interpretable
Limitations        Sparse, loses meaning/context

✅ Final Thoughts

Bag of Words is one of the first steps in transforming messy, unstructured text into structured data. While it’s not the most advanced model, it’s powerful, easy to implement, and very effective for many NLP tasks.

Whether you’re working on a news classifier, a spam detector, or even a simple chatbot, Bag of Words gives you a great foundation. Learn it well, then explore TF-IDF, Word Embeddings, and Transformers later.