Naive Bayes
Naive Bayes is one of the fastest classifiers in machine learning: training is a single counting pass over the data, so fitting a model on 100,000 vectorized text documents typically takes a fraction of a second. Despite its name, it is not naive about performance: it remains a standard baseline in text classification and performs remarkably well even when its core assumption (feature independence) is clearly violated.
Bayes’ Theorem
The foundation:
P(Class | Features) = P(Features | Class) × P(Class) / P(Features)
Where:
P(Class | Features) = Posterior: what we want to compute
P(Features | Class) = Likelihood: how often these features appear in this class
P(Class) = Prior: how common this class is overall
P(Features) = Evidence: normalizing constant (same for all classes)
For classification, we just need the class with the highest posterior, so we ignore P(Features) and compute:
Predicted class = argmax_c [ P(Features | c) × P(c) ]
The “Naive” Assumption
Estimating P(Features | Class) jointly over every combination of feature values is intractable: the number of combinations grows exponentially with the number of features. The “naive” assumption sidesteps this by treating features as conditionally independent given the class:
P(f₁, f₂, ..., fₙ | Class) ≈ P(f₁ | Class) × P(f₂ | Class) × ... × P(fₙ | Class)
This lets us estimate each feature’s probability separately from training data, which is simple and fast even when the independence assumption is wrong.
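The argmax rule and the independence assumption can be traced by hand. The sketch below is a minimal illustration, not a production classifier: the two classes, the priors, and the per-word likelihoods are all invented numbers, and the helper names (`log_score`, `predict`, `posterior`) are hypothetical. It sums logarithms instead of multiplying probabilities, a common way to avoid numeric underflow when many features are involved.

```python
import math

# Hypothetical priors and per-word likelihoods for a two-class spam filter.
# All numbers are made up for illustration.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02, "winner": 0.20},
    "ham": {"free": 0.05, "meeting": 0.25, "winner": 0.01},
}


def log_score(features, cls):
    """Return log P(cls) + sum of log P(f | cls): the unnormalized log posterior."""
    score = math.log(priors[cls])
    for f in features:
        score += math.log(likelihoods[cls][f])
    return score


def predict(features):
    """Pick the class with the highest unnormalized log posterior."""
    return max(priors, key=lambda cls: log_score(features, cls))


def posterior(features):
    """Normalize the scores to recover P(class | features)."""
    scores = {cls: math.exp(log_score(features, cls)) for cls in priors}
    total = sum(scores.values())  # plays the role of P(Features)
    return {cls: s / total for cls, s in scores.items()}


message = ["free", "winner"]
print(predict(message))    # spam
print(posterior(message))  # spam gets most of the probability mass
```

Note that `predict` never computes P(Features): the normalizing constant only matters when the actual probabilities are needed, as in `posterior`.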
Variants of Naive Bayes
Gaussian Naive Bayes
For continuous features — assumes each feature follows a normal distribution within each class.
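As a rough sketch of what the normality assumption means in practice, the example below fits one mean and one variance per class for a single feature and classifies a new value by comparing Gaussian log densities. The data, the class names, and the equal-prior simplification are assumptions made for illustration; this shows the idea only and does not reproduce any library's implementation.

```python
import math
from statistics import mean, pvariance

# Hypothetical training values of one continuous feature (e.g. height in cm),
# grouped by class. The numbers are invented for illustration.
train = {
    "a": [150.0, 155.0, 160.0, 165.0],
    "b": [175.0, 180.0, 185.0, 190.0],
}

# "Fitting" is just estimating a mean and a (population) variance per class.
params = {cls: (mean(xs), pvariance(xs)) for cls, xs in train.items()}


def gaussian_log_pdf(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict(x):
    """With equal priors assumed, pick the class whose density at x is highest."""
    return max(params, key=lambda cls: gaussian_log_pdf(x, *params[cls]))


print(predict(158.0))  # a
print(predict(181.0))  # b
```

With several features, the per-feature log densities would simply be added together, which is the independence assumption at work.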
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
# X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)        # hard class labels
y_prob = gnb.predict_proba(X_test)  # per-class probabilities
print(classification_report(y_test, y_pred))
Multinomial Naive Bayes
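For count-based features (e.g., word counts in text). The classic algorithm for text classification. The model is defined for integer counts, but fractional values such as TF-IDF weights also work well in practice, which is why the pipeline below uses TfidfVectorizer.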
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('nb', MultinomialNB(alpha=1.0)),  # alpha = Laplace smoothing
])
text_pipeline.fit(X_train_text, y_train)
y_pred = text_pipeline.predict(X_test_text)
Bernoulli Naive Bayes
For binary features — models whether each feature is present or absent (not how many times).
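One consequence of modeling presence and absence is that a missing word counts as evidence too. The sketch below illustrates this with a two-word vocabulary; the classes, the probabilities, and the function name `bernoulli_log_likelihood` are hypothetical and chosen only to show the calculation.

```python
import math

# Hypothetical per-class probabilities that each word appears in a document.
p_present = {
    "sports": {"goal": 0.6, "election": 0.05},
    "politics": {"goal": 0.1, "election": 0.7},
}
vocabulary = ["goal", "election"]


def bernoulli_log_likelihood(doc_words, cls):
    """Add log P(present) for words in the doc and log P(absent) for the rest."""
    total = 0.0
    for word in vocabulary:
        p = p_present[cls][word]
        total += math.log(p) if word in doc_words else math.log(1.0 - p)
    return total


# Even an empty document carries information: the absence of "election"
# is far more likely under "sports" than under "politics".
empty = set()
print(bernoulli_log_likelihood(empty, "sports") > bernoulli_log_likelihood(empty, "politics"))
```

A multinomial model would ignore absent words entirely, which is the main practical difference between the two variants.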
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB(alpha=1.0)
bnb.fit(X_train_binary, y_train)
Laplace Smoothing
Without smoothing, if a word never appears in a class during training, its probability is zero — and zero probability kills the entire product. Laplace smoothing adds a small count (alpha) to every feature:
P(word | class) = (count(word, class) + α) / (count(all words in class) + α × vocabulary_size)
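The formula above can be checked with a few lines of arithmetic. This is a minimal sketch with made-up counts for a single class and a three-word vocabulary; `smoothed_prob` is a hypothetical helper, not a library function.

```python
from collections import Counter

# Hypothetical word counts for one class; "refund" was never seen in it.
counts = Counter({"free": 3, "winner": 1})
vocabulary = ["free", "winner", "refund"]


def smoothed_prob(word, alpha):
    """P(word | class) with additive smoothing, following the formula above."""
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * len(vocabulary))


print(smoothed_prob("refund", alpha=0.0))  # 0.0: an unseen word zeroes the product
print(smoothed_prob("refund", alpha=1.0))  # 1/7, roughly 0.143
```

Because α is added once per vocabulary entry in the denominator, the smoothed probabilities still sum to 1 across the vocabulary.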
When α = 0: no smoothing (dangerous: unseen words get zero probability)
When α = 1: Laplace (add-one) smoothing, the classic choice
When 0 < α < 1: Lidstone smoothing
Text Classification Example
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
# Fetch newsgroups data (downloaded on first use)
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')
X_train, y_train = train.data, train.target
X_test, y_test = test.data, test.target
# Build pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True, max_df=0.5, max_features=30000)),
    ('nb', MultinomialNB(alpha=0.1)),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
Multinomial NB typically achieves 80–85% accuracy on 20 Newsgroups when headers, footers, and quotes are left in (the default here), which is competitive with linear SVMs and far faster to train. Stripping that metadata with the remove parameter of fetch_20newsgroups gives a more realistic, noticeably lower score.
Updating Incrementally (Online Learning)
Naive Bayes can be updated incrementally, which only a minority of scikit-learn estimators support (those that expose partial_fit):
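from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# Process data in batches (useful for large datasets that don't fit in memory).
# data_generator() and all_classes are placeholders: the generator yields (X_batch, y_batch)
# pairs, and all_classes lists every label that can occur, which the first partial_fit call requires.
for X_batch, y_batch in data_generator():
    nb.partial_fit(X_batch, y_batch, classes=all_classes)
Strengths and Limitations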
Strengths:
- Extremely fast to train and predict
- Handles high-dimensional sparse data well (text, genes)
- Works well with small training sets
- Naturally handles multiclass problems
- Online learning via partial_fit
Limitations:
- Assumes feature independence (often violated)
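- Produces poorly calibrated probability estimates (often pushed toward 0 or 1); wrap the model in CalibratedClassifierCV when reliable probabilities are needed
- Gaussian NB assumes each feature is normally distributed within each class, which real data often violates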
- Usually less accurate than tree-based models on structured tabular data
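The link between violated independence and extreme probabilities can be seen in a small calculation. In this sketch, which uses invented numbers and a hypothetical helper `posterior_pos`, one informative feature is counted several times, as happens when features are strongly correlated. The evidence is the same, but the posterior moves toward 1 because each copy is treated as independent.

```python
# Hypothetical two-class setup with equal priors and one informative feature.
prior = {"pos": 0.5, "neg": 0.5}
likelihood = {"pos": 0.7, "neg": 0.3}  # P(feature present | class)


def posterior_pos(copies):
    """P(pos | feature) when the same feature is counted `copies` times."""
    scores = {c: prior[c] * likelihood[c] ** copies for c in prior}
    return scores["pos"] / sum(scores.values())


print(posterior_pos(1))  # the honest posterior given one observation
print(posterior_pos(5))  # five perfectly correlated copies: much closer to 1
```

The predicted class is unchanged in both cases, which is why accuracy often survives correlated features even though the probabilities do not.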
Naive Bayes is ideal as a fast baseline and is often a strong choice when you have limited labeled data and high-dimensional text features.