Naive Bayes: Probabilistic Classification with Bayes' Theorem

Master Naive Bayes classifiers — Bayes' theorem, conditional independence, Gaussian and Multinomial variants, Laplace smoothing, and text classification applications.

Naive Bayes

Naive Bayes is one of the fastest classifiers in machine learning: training needs only a single pass over the data to collect counts or means, so fitting a model on 100,000 already-vectorized text documents typically takes on the order of a second or less. Despite its name, it’s not naive about performance: it’s still the baseline to beat in text classification and performs remarkably well even when its core assumption (feature independence) is clearly violated.


Bayes’ Theorem

The foundation:

P(Class | Features) = [ P(Features | Class) × P(Class) ] / P(Features)

Where:
P(Class | Features) = Posterior: what we want to compute
P(Features | Class) = Likelihood: how often these features appear in this class
P(Class) = Prior: how common this class is overall
P(Features) = Evidence: normalizing constant (same for all classes)

For classification, we just need the class with the highest posterior — so we ignore P(Features) and compute:

Predicted class = argmax_c [ P(Features | c) × P(c) ]
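
As a small worked example of the rule above, the following sketch plugs numbers into Bayes' theorem for a two-class problem. The class names and all probabilities are hypothetical values chosen only for illustration.

```python
# Hypothetical numbers for a two-class spam filter (illustration only).
priors = {"spam": 0.3, "ham": 0.7}        # P(Class)
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(Features | Class)

# Unnormalized scores: P(Features | c) * P(c)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# The evidence P(Features) is the sum of the scores over all classes.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

# The argmax is the same whether or not we divide by the evidence.
predicted = max(scores, key=scores.get)

print(posteriors)
print(predicted)
```

Dividing by the evidence turns the scores into probabilities that sum to 1, but it does not change which class wins, which is why the classifier can skip it.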

The “Naive” Assumption

Computing P(Features | Class) for all feature combinations is intractable. The “naive” assumption simplifies this by assuming features are conditionally independent given the class:

P(f₁, f₂, ..., fₙ | Class) ≈ P(f₁ | Class) × P(f₂ | Class) × ... × P(fₙ | Class)

This lets us estimate each feature’s probability separately from training data — simple and fast, even if the independence assumption is wrong.
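
The factorization can be sketched in a few lines. The per-feature likelihoods and priors below are hypothetical, and the sketch sums logarithms instead of multiplying raw probabilities, a common practice because a product of many small numbers can underflow to zero.

```python
import math

# Hypothetical per-feature likelihoods P(f_i | class) for one example.
feature_likelihoods = {
    "spam": [0.8, 0.6, 0.05],
    "ham": [0.1, 0.3, 0.4],
}
priors = {"spam": 0.3, "ham": 0.7}

def log_score(cls):
    """Return log P(c) + sum_i log P(f_i | c) under the naive assumption."""
    return math.log(priors[cls]) + sum(
        math.log(p) for p in feature_likelihoods[cls]
    )

log_scores = {c: log_score(c) for c in priors}
predicted = max(log_scores, key=log_scores.get)

print(log_scores)
print(predicted)
```

Here the third feature is rare in the spam class, so the ham class wins even though the first two features favor spam.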


Variants of Naive Bayes

Gaussian Naive Bayes

For continuous features — assumes each feature follows a normal distribution within each class.

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)        # hard class labels
y_prob = gnb.predict_proba(X_test)  # per-class posterior estimates
print(classification_report(y_test, y_pred))
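
To make the per-class normal-distribution assumption concrete, here is a minimal NumPy sketch of the same idea. The function names, the toy data, and the 1e-9 variance floor are assumptions of this sketch; it is not scikit-learn's implementation.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors plus a mean and variance per (class, feature)."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant keeps variances strictly positive (assumption of this sketch).
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class with the highest log prior + summed Gaussian log-likelihood."""
    diff = X[:, None, :] - means[None, :, :]            # shape (n, k, d)
    log_lik = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :]
        + diff ** 2 / variances[None, :, :]
    )
    log_post = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(log_post, axis=1)]

# Toy data: two well-separated clusters.
X_demo = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                   [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X_demo, y_demo)
preds = predict_gaussian_nb(np.array([[1.1, 2.1], [4.9, 5.9]]), *params)
print(preds)
```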

Multinomial Naive Bayes

For count-based features (e.g., word counts in text). The classic algorithm for text classification.

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('nb', MultinomialNB(alpha=1.0))  # alpha = Laplace smoothing
])
text_pipeline.fit(X_train_text, y_train)
y_pred = text_pipeline.predict(X_test_text)
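
For a version that runs without an external dataset and uses raw word counts, the sketch below fits the same kind of model on a four-document toy corpus. The documents and labels are made up for illustration, and CountVectorizer is swapped in for TfidfVectorizer so the features are plain counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: two spam and two ham messages.
docs = [
    "free prize money now",
    "win free money",
    "meeting agenda attached",
    "project meeting tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer produces word counts, the input Multinomial NB is built for.
toy_model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
toy_model.fit(docs, labels)

toy_pred = toy_model.predict(["free money", "meeting tomorrow"])
print(toy_pred)
```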

Bernoulli Naive Bayes

For binary features — models whether each feature is present or absent (not how many times).

from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB(alpha=1.0)
bnb.fit(X_train_binary, y_train)
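
A self-contained sketch with made-up presence/absence data may help here. The matrix, labels, and variable names below are hypothetical; each column stands for one vocabulary term.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Rows are documents, columns are hypothetical terms; 1 = term present, 0 = absent.
X_demo = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])
y_demo = np.array([1, 1, 0, 0])

bnb_demo = BernoulliNB(alpha=1.0)
bnb_demo.fit(X_demo, y_demo)

# Absent terms also contribute evidence, unlike in the multinomial model.
bern_pred = bnb_demo.predict(np.array([[1, 0, 0, 0], [0, 0, 1, 1]]))
print(bern_pred)
```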

Laplace Smoothing

Without smoothing, if a word never appears in a class during training, its probability is zero — and zero probability kills the entire product. Laplace smoothing adds a small count (alpha) to every feature:

P(word | class) = (count(word, class) + α) / (count(all words in class) + α × vocabulary_size)

When α = 0: No smoothing (dangerous, because unseen words get zero probability)
When α = 1: Laplace (add-one) smoothing, the classic choice
When 0 < α < 1: Lidstone smoothing
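
The formula is easy to check by hand. The sketch below applies it to hypothetical word counts for a single class; the helper name smoothed_prob and the counts are assumptions made for illustration.

```python
def smoothed_prob(word_count, total_count, vocab_size, alpha=1.0):
    """P(word | class) with additive smoothing, following the formula above."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# Hypothetical counts of each vocabulary word within one class.
counts_in_class = {"free": 3, "money": 2, "meeting": 0}
total = sum(counts_in_class.values())   # 5 words observed in this class
vocab_size = len(counts_in_class)       # 3 distinct words in the vocabulary

probs = {
    w: smoothed_prob(c, total, vocab_size, alpha=1.0)
    for w, c in counts_in_class.items()
}
unsmoothed_meeting = smoothed_prob(0, total, vocab_size, alpha=0.0)

print(probs)               # {'free': 0.5, 'money': 0.375, 'meeting': 0.125}
print(unsmoothed_meeting)  # 0.0
```

With α = 1 the unseen word "meeting" gets probability 1/8 instead of 0, and the smoothed probabilities still sum to 1.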

Text Classification Example

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
# Fetch newsgroups data (downloaded and cached on first use)
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')
X_train, y_train = train.data, train.target
X_test, y_test = test.data, test.target

# Build pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True, max_df=0.5, max_features=30000)),
    ('nb', MultinomialNB(alpha=0.1))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

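Multinomial NB typically reaches roughly 80–85% accuracy on 20 Newsgroups, close to what linear SVMs achieve at a fraction of the training time. Part of that score comes from headers, footers, and quoted replies, which fetch_20newsgroups keeps by default; passing remove=('headers', 'footers', 'quotes') gives a more realistic, lower estimate.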


Updating Incrementally (Online Learning)

Naive Bayes can be updated incrementally through partial_fit, which only a subset of scikit-learn estimators support:

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
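# Process data in batches (useful for large datasets that don't fit in memory).
# data_generator() and all_classes are placeholders: any iterable of (X, y) batches
# and an array of every label that can occur.
for X_batch, y_batch in data_generator():
    # classes is required on the first call so the model knows all labels up front
    nb.partial_fit(X_batch, y_batch, classes=all_classes)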
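
A runnable version of this loop needs a batch source. In the sketch below, data_generator is a hypothetical stand-in that yields synthetic count data; in practice it would read chunks from disk or a stream.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
all_classes = np.array([0, 1])

def data_generator(n_batches=5, batch_size=20):
    """Hypothetical batch loader: yields synthetic count features and labels."""
    for _ in range(n_batches):
        y_batch = rng.integers(0, 2, size=batch_size)
        # Class 0 favors the first two features, class 1 the last two.
        rates = np.where(y_batch[:, None] == 0, [5, 5, 1, 1], [1, 1, 5, 5])
        X_batch = rng.poisson(rates)
        yield X_batch, y_batch

nb_stream = MultinomialNB()
for X_batch, y_batch in data_generator():
    # Each call updates the running counts; earlier batches are not revisited.
    nb_stream.partial_fit(X_batch, y_batch, classes=all_classes)

stream_pred = nb_stream.predict(np.array([[6, 4, 0, 1], [0, 1, 5, 7]]))
print(stream_pred)
```

Because the model only stores per-class counts, memory use stays constant no matter how many batches are processed.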

Strengths and Limitations

Strengths:

  • Extremely fast to train and predict
  • Handles high-dimensional sparse data well (text, genes)
  • Works well with small training sets
  • Naturally handles multiclass problems
  • Online learning via partial_fit

Limitations:

  • Assumes feature independence (often violated)
  • Produces poorly calibrated probability estimates (pushed toward 0 or 1); wrap the model in CalibratedClassifierCV when you need reliable probabilities
  • Gaussian NB assumes each feature is normally distributed within each class
  • Usually not competitive with tree-based models on structured tabular data
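
The calibration point can be sketched as follows. The synthetic dataset, the choice of sigmoid calibration, and the use of the Brier score for comparison are assumptions of this example, not recommendations from the text; results will vary with the data.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant (correlated) features, which violate
# the independence assumption and tend to make NB overconfident.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    n_redundant=10, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)

# Brier score: mean squared error of predicted probabilities (lower is better).
raw_brier = brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1])
cal_brier = brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1])
print(f"raw: {raw_brier:.4f}  calibrated: {cal_brier:.4f}")
```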

Naive Bayes is ideal as a fast baseline and often the best choice when you have limited labeled data and high-dimensional text features.