Naive Bayes

Naive Bayes is one of the fastest classifiers in machine learning: training amounts to a single counting pass over the data, so fitting a model on 100,000 already-vectorized text documents takes a small fraction of the time most other classifiers need. Despite its name, it is not naive about performance. It remains a standard baseline in text classification and performs remarkably well even when its core assumption (feature independence) is clearly violated.

Bayes’ Theorem

The foundation:

P(Class | Features) = P(Features | Class) × P(Class)
                       ────────────────────────────────
                              P(Features)

Where:
  P(Class | Features) = Posterior: what we want to compute
  P(Features | Class) = Likelihood: how often these features appear in this class
  P(Class)            = Prior: how common this class is overall
  P(Features)         = Evidence: normalizing constant (same for all classes)

For classification, we just need the class with the highest posterior — so we ignore P(Features) and compute:

Predicted class = argmax_c [ P(Features | c) × P(c) ]
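As a small numeric illustration of why the evidence term can be dropped, the sketch below uses made-up priors and likelihoods for a two-class spam filter (all numbers and class names are hypothetical). The class that maximizes the unnormalized score is the same one that maximizes the full posterior.

```python
# Hypothetical numbers for a two-class problem and one observed feature set.
priors = {"spam": 0.3, "ham": 0.7}        # P(c)
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(Features | c)

# Unnormalized scores: P(Features | c) * P(c)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# The evidence P(Features) is the sum of those scores over all classes.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

# Dividing every score by the same constant does not change the argmax.
predicted = max(scores, key=scores.get)
predicted_from_posterior = max(posteriors, key=posteriors.get)

print(predicted, posteriors)
```

Normalizing is only needed when you want actual probabilities rather than just the winning class.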

The “Naive” Assumption

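Estimating P(Features | Class) directly would require a probability for every possible combination of feature values, and the number of combinations grows exponentially with the number of features. The “naive” assumption simplifies this by assuming features are conditionally independent given the class: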

P(f₁, f₂, ..., fₙ | Class) ≈ P(f₁ | Class) × P(f₂ | Class) × ... × P(fₙ | Class)

This lets us estimate each feature’s probability separately from training data — simple and fast, even if the independence assumption is wrong.
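The factorization above can be sketched in a few lines. The per-feature probabilities and priors here are invented for illustration, and the sum of logarithms replaces the product, a common practical choice because multiplying many small probabilities underflows in floating point.

```python
import math

# Hypothetical per-feature conditional probabilities P(f_i | c).
cond_probs = {
    "spam": {"free": 0.30, "meeting": 0.02, "offer": 0.25},
    "ham":  {"free": 0.05, "meeting": 0.20, "offer": 0.03},
}
priors = {"spam": 0.4, "ham": 0.6}  # hypothetical P(c)

def log_score(features, cls):
    """Return log P(c) + sum_i log P(f_i | c) for the observed features."""
    return math.log(priors[cls]) + sum(
        math.log(cond_probs[cls][f]) for f in features
    )

observed = ["free", "offer"]
scores = {c: log_score(observed, c) for c in priors}
predicted = max(scores, key=scores.get)

print(predicted, scores)
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest product.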

Variants of Naive Bayes

Gaussian Naive Bayes

For continuous features — assumes each feature follows a normal distribution within each class.

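from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Example data (Iris: four continuous features); substitute your own X and y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)        # hard class labels
y_prob = gnb.predict_proba(X_test)  # per-class posterior probabilities

print(classification_report(y_test, y_pred))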
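To make the normality assumption concrete, here is a minimal from-scratch sketch for a single continuous feature. The data, class names, and helper functions are hypothetical, and the sketch omits the small variance-smoothing term that scikit-learn's GaussianNB adds for numerical stability.

```python
import math
from statistics import mean, pvariance

# Hypothetical one-feature training data (heights in cm) per class.
train = {
    "adult": [170.0, 165.0, 180.0, 175.0],
    "child": [120.0, 130.0, 110.0, 125.0],
}

# Per-class mean and (population) variance of the feature.
params = {c: (mean(v), pvariance(v)) for c, v in train.items()}
total = sum(len(v) for v in train.values())
priors = {c: len(v) / total for c, v in train.items()}

def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(x):
    """Pick the class with the highest log prior plus log likelihood."""
    scores = {
        c: math.log(priors[c]) + log_gaussian(x, *params[c]) for c in params
    }
    return max(scores, key=scores.get)

print(predict(168.0), predict(128.0))
```

With several features, the per-feature log densities would simply be summed, following the independence assumption.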

Multinomial Naive Bayes

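For count-based features (e.g., word counts in text). The classic algorithm for text classification. The model is defined for integer counts, but fractional values such as the TF-IDF weights used here also work in practice.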

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('nb', MultinomialNB(alpha=1.0))  # alpha = Laplace smoothing
])

text_pipeline.fit(X_train_text, y_train)
y_pred = text_pipeline.predict(X_test_text)

Bernoulli Naive Bayes

For binary features — models whether each feature is present or absent (not how many times).

from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB(alpha=1.0)
bnb.fit(X_train_binary, y_train)
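If your features start out as counts, one way to get presence/absence features is BernoulliNB's binarize threshold. The sketch below uses a tiny made-up count matrix and checks that thresholding inside the model matches binarizing the data yourself.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy word-count matrix: rows are documents, columns are hypothetical terms.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 0],
    [0, 1, 2],
])
y = np.array([1, 1, 0, 0])

# binarize=0.0 maps every value greater than 0 to 1 before fitting.
bnb_counts = BernoulliNB(alpha=1.0, binarize=0.0).fit(X_counts, y)

# Alternative: binarize up front and tell the model the input is already binary.
X_binary = (X_counts > 0).astype(int)
bnb_binary = BernoulliNB(alpha=1.0, binarize=None).fit(X_binary, y)

same = np.allclose(bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_)
print("Same fitted parameters:", same)
```

Unlike the multinomial variant, the Bernoulli model also penalizes a class for terms that are absent from a document.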

Laplace Smoothing

Without smoothing, if a word never appears in a class during training, its probability is zero — and zero probability kills the entire product. Laplace smoothing adds a small count (alpha) to every feature:

P(word | class) = (count(word, class) + α)
                  ─────────────────────────────────
                  (count(all words in class) + α × vocabulary_size)

When α=0: No smoothing (dangerous — zero probabilities)
When α=1: Laplace/add-one smoothing (classic choice)
When 0<α<1: Lidstone smoothing (a lighter correction than add-one)
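The smoothing formula can be checked by hand on a toy example. The vocabulary, tokens, and helper function below are hypothetical, and every token is assumed to belong to the vocabulary.

```python
from collections import Counter

def smoothed_probs(tokens, vocabulary, alpha=1.0):
    """P(word | class) = (count + alpha) / (total tokens + alpha * |V|)."""
    counts = Counter(tokens)
    denom = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

vocabulary = ["free", "offer", "meeting", "report"]
# Tokens seen in one class; "meeting" and "report" never occur in it.
spam_tokens = ["free", "offer", "free", "offer", "free"]

unsmoothed = smoothed_probs(spam_tokens, vocabulary, alpha=0.0)
laplace = smoothed_probs(spam_tokens, vocabulary, alpha=1.0)

print("alpha=0:", unsmoothed)
print("alpha=1:", laplace)
```

With α=0 the unseen words get probability zero, while with α=1 each gets 1/9 and the distribution still sums to one.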

Text Classification Example

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Fetch the predefined train/test splits (downloaded on first use)
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')
X_train, y_train = train.data, train.target
X_test, y_test = test.data, test.target

# Build pipeline: TF-IDF features followed by Multinomial NB
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True, max_df=0.5, max_features=30000)),
    ('nb', MultinomialNB(alpha=0.1))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

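Multinomial NB typically achieves 80–85% accuracy on 20 Newsgroups when headers, footers, and quotes are left in the documents, which is competitive with linear SVMs at a much lower training cost. Stripping that metadata with remove=('headers', 'footers', 'quotes') gives lower but more realistic scores.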

Updating Incrementally (Online Learning)

Naive Bayes can be updated incrementally through partial_fit, which only a subset of scikit-learn estimators support:

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()

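# Process data in batches (useful for large datasets that don't fit in memory).
# data_generator() and all_classes are placeholders: supply your own batch
# iterator and the full array of class labels (required on the first call).
for X_batch, y_batch in data_generator():
    nb.partial_fit(X_batch, y_batch, classes=all_classes)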
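A self-contained version of the loop above, using synthetic count data and a hypothetical batches() helper, shows that incremental training ends up with the same parameters as a single fit on the full dataset. This is a sketch under those assumptions, not a recipe for a specific data source.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Synthetic count features and labels, standing in for a real data stream.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(300, 20))
y = rng.integers(0, 3, size=300)
all_classes = np.unique(y)  # every label must be known before the first batch

def batches(X, y, batch_size=100):
    """Yield consecutive (X_batch, y_batch) slices of the data."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

incremental = MultinomialNB()
for X_batch, y_batch in batches(X, y):
    incremental.partial_fit(X_batch, y_batch, classes=all_classes)

# Reference model trained on everything at once.
full = MultinomialNB().fit(X, y)

same_counts = np.allclose(incremental.feature_count_, full.feature_count_)
same_probs = np.allclose(incremental.feature_log_prob_, full.feature_log_prob_)
print(same_counts, same_probs)
```

The equivalence holds because the model only stores per-class counts, which add up across batches.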

Strengths and Limitations

Strengths:

Extremely fast to train and predict
Handles high-dimensional sparse data well (text, genes)
Works well with small training sets
Naturally handles multiclass problems
Online learning via partial_fit

Limitations:

Assumes feature independence (often violated)
Produces poorly calibrated probability estimates (often pushed toward 0 or 1); wrap the model in CalibratedClassifierCV when you need reliable probabilities
Gaussian NB assumes each feature is normally distributed within each class
Not competitive on structured tabular data vs. tree-based models
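The calibration limitation can be explored with a short sketch. It uses a synthetic dataset with deliberately redundant (correlated) features, which is an assumption chosen to violate independence, and compares Brier scores (lower is better) for a raw and a calibrated Gaussian NB. Results will vary with the data, so treat the comparison as illustrative.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary problem with redundant features (independence is violated).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, n_redundant=10,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_train, y_train)

# Isotonic calibration fitted with 5-fold cross-validation on the training set.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

raw_brier = brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1])
cal_brier = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])

print(f"Brier score, raw GaussianNB:        {raw_brier:.4f}")
print(f"Brier score, calibrated GaussianNB: {cal_brier:.4f}")
```

Calibration changes the probability values, not the underlying model, so it does not fix accuracy problems caused by the independence assumption.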

Naive Bayes is ideal as a fast baseline and often the best choice when you have limited labeled data and high-dimensional text features.