Categorical Encoding

Machine learning models require numerical inputs, but real-world data is full of categorical variables — text labels like “red”, “medium”, “New York”. Encoding converts these into numbers. The encoding choice matters significantly for both model performance and interpretation.

Types of Categorical Variables

Nominal (no order): color, city, product_type
  → One-hot encoding, target encoding, hashing

Ordinal (natural order): size (S/M/L), satisfaction (low/med/high), education level
  → Ordinal encoding with explicit rank

High cardinality nominal: zip_code, user_id, product_id (100+ unique values, often thousands or more)
  → Target encoding, frequency encoding, hashing, embeddings

Binary: yes/no, true/false, active/inactive
  → Simple 0/1 encoding (label encoding)

One-Hot Encoding

Creates a binary column for each category. Default choice for nominal categoricals with low cardinality (<20 values):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

encoder = OneHotEncoder(
    drop='first',              # Drop first column (avoid multicollinearity in linear models)
    sparse_output=False,       # Return dense array (parameter name used since sklearn 1.2)
    handle_unknown='ignore',   # Unseen categories encode as all zeros, same as the dropped category
    min_frequency=10           # Group categories seen fewer than 10 times into one infrequent column (sklearn 1.1+)
)

X_encoded = encoder.fit_transform(X_categorical_train)
X_test_encoded = encoder.transform(X_categorical_test)

# Feature names after encoding
feature_names = encoder.get_feature_names_out(input_features=categorical_cols)

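Warning: One-hot encoding a column with 500 unique values creates 500 binary columns (499 with drop='first'). Most of them are almost entirely zeros, which inflates dimensionality and memory use. For high-cardinality columns, prefer target, frequency, or hashing encodings.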
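
To make the mechanics concrete, here is a minimal pure-Python sketch of what a one-hot encoder does. It is not sklearn's implementation, and the helper names fit_categories and one_hot are made up for illustration. It also shows a side effect of dropping the first column while ignoring unknowns: an unseen category encodes to all zeros, exactly like the dropped baseline category.

```python
def fit_categories(values):
    """Learn the category list from training data (sorted for a stable column order)."""
    return sorted(set(values))


def one_hot(values, categories, drop_first=False):
    """Return one 0/1 row per value; unseen values become all zeros."""
    columns = categories[1:] if drop_first else categories
    return [[1 if value == col else 0 for col in columns] for value in values]


train_colors = ["red", "green", "blue", "green"]
categories = fit_categories(train_colors)  # ['blue', 'green', 'red']

# Columns are blue, green, red, so "red" -> [0, 0, 1]
full = one_hot(train_colors, categories)

# With the first column ('blue') dropped, the baseline and an unseen
# category ('purple') both encode to [0, 0].
dropped = one_hot(["blue", "purple"], categories, drop_first=True)

print(full)
print(dropped)
```

If unseen categories need to stay distinguishable from the baseline, one option is to keep all columns (no drop) for models that tolerate the redundancy, such as trees.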

Ordinal Encoding

For categories with a natural order:

from sklearn.preprocessing import OrdinalEncoder

# Explicit order
encoder = OrdinalEncoder(
    categories=[
        ['Low', 'Medium', 'High'],         # satisfaction_level: 0, 1, 2
        ['High School', 'Bachelor', 'Master', 'PhD']  # education: 0, 1, 2, 3
    ],
    handle_unknown='use_encoded_value',
    unknown_value=-1
)

X_ordinal = encoder.fit_transform(X[['satisfaction_level', 'education']])
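
The explicit category list matters because an encoder that infers categories from the data orders them by sorting, which for strings means alphabetically. The small pure-Python sketch below illustrates the difference; ordinal_encode is a hypothetical helper written for this example, not a library function.

```python
def ordinal_encode(values, order, unknown_value=-1):
    """Map each label to its position in `order`; unseen labels get `unknown_value`."""
    rank = {label: position for position, label in enumerate(order)}
    return [rank.get(value, unknown_value) for value in values]


labels = ["Low", "High", "Medium", "Very High"]

# Explicit order: Low < Medium < High
explicit = ordinal_encode(labels, ["Low", "Medium", "High"])

# Alphabetical order puts High before Low, which scrambles the ranking
alphabetical = ordinal_encode(labels, sorted(["Low", "Medium", "High"]))

print(explicit)      # [0, 2, 1, -1]
print(alphabetical)  # [1, 0, 2, -1]
```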

Target Encoding

Replaces each category with the mean target value for that category. Excellent for high-cardinality features:

# sklearn 1.3+ TargetEncoder
from sklearn.preprocessing import TargetEncoder

target_enc = TargetEncoder(
    cv=5,              # Cross-validation to avoid leakage within training set
    smooth='auto',     # Blends category mean with global mean for rare categories
    random_state=42
)

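# Must pass y during fit. fit_transform cross-fits on the training data;
# transform (for test data) applies encodings learned from the full training set,
# so fit(X, y).transform(X) is NOT equivalent to fit_transform(X, y).
X_encoded = target_enc.fit_transform(X[high_cardinality_cols], y)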

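# category_encoders library for more options.
# Note: its TargetEncoder smooths but does not cross-fit, so encoding the
# training rows with fit_transform can still leak the target.
from category_encoders import TargetEncoder as CE_TargetEncoder
enc = CE_TargetEncoder(cols=['zip_code', 'product_id'])
X_encoded_ce = enc.fit_transform(X, y)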

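Key risk: Target encoding without cross-fitting causes target leakage. If a category's mean is computed from the same rows it is then used to encode, each encoded value contains that row's own target, and the effect is strongest for rare categories. Always encode training rows with cross-fitting (sklearn's TargetEncoder.fit_transform does this) and fit the encoder on training data only.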
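
The cross-fitting idea can be sketched in a few lines of pure Python. This is a simplified illustration, not sklearn's algorithm: the helper names are hypothetical, folds are assigned by row index (which assumes the rows are already shuffled), and the additive smoothing formula is an assumption standing in for sklearn's smooth='auto'.

```python
from collections import defaultdict


def fit_target_map(cats, ys, smoothing=1.0):
    """Smoothed mean target per category: (sum_y + m * global_mean) / (n + m)."""
    global_mean = sum(ys) / len(ys)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(cats, ys):
        sums[cat] += y
        counts[cat] += 1
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def cross_fit_encode(cats, ys, n_folds=3, smoothing=1.0):
    """Encode each row using statistics from the other folds only."""
    n = len(cats)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Assumption: rows are already shuffled, so index % n_folds is a fair split.
        rest = [i for i in range(n) if i % n_folds != fold]
        mapping, global_mean = fit_target_map(
            [cats[i] for i in rest], [ys[i] for i in rest], smoothing
        )
        for i in range(fold, n, n_folds):
            # Categories absent from the other folds fall back to those folds' global mean.
            encoded[i] = mapping.get(cats[i], global_mean)
    return encoded


cats = ["a", "a", "b", "b", "a", "b"]
ys = [1, 0, 1, 1, 1, 0]
encoded = cross_fit_encode(cats, ys)

# Flipping row 0's target leaves row 0's encoding unchanged: no self-leakage.
flipped = cross_fit_encode(cats, [0] + ys[1:])
print(encoded[0] == flipped[0])
```

By contrast, calling fit_target_map on all rows with smoothing=0 maps a category that appears only once to exactly its own target value, which is the leakage described above.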

Frequency / Count Encoding

Replaces each category with how often it appears in the training set (a raw count in the code below). This preserves the rare vs. common signal, but categories with equal counts become indistinguishable:

def frequency_encode(df, col, train_df=None):
    """Encode by training-set count: fit on train, apply to test."""
    if train_df is None:
        train_df = df
    # Counts come from the training data only
    freq_map = train_df[col].value_counts().to_dict()
    # Categories unseen in training get a count of 0
    return df[col].map(freq_map).fillna(0).astype(int)

X_train['city_freq'] = frequency_encode(X_train, 'city')
X_test['city_freq'] = frequency_encode(X_test, 'city', train_df=X_train)

Hashing Trick

For extremely high cardinality (user IDs, product IDs with millions of values):

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=1024, input_type='string')

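# Converts categories to a fixed-size hash vector (returned as a sparse matrix).
# FeatureHasher is stateless, so transform alone is enough; each sample must be
# an iterable of strings, hence the one-element lists.
X_hashed = hasher.transform([[str(city)] for city in X_train['city']])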

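Collisions (two categories mapping to the same column) are the price of bounded memory: the model cannot tell colliding categories apart. Increasing n_features makes collisions rarer at the cost of a wider matrix.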
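
The idea behind the hashing trick fits in a short pure-Python sketch. It uses md5 from hashlib as a stand-in hash function (sklearn's FeatureHasher uses MurmurHash3 and also assigns a sign to each value), and the helper names hash_bucket and hash_encode are hypothetical.

```python
import hashlib


def hash_bucket(category, n_features):
    """Map a string to a column index in [0, n_features) with a stable hash."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features


def hash_encode(category, n_features):
    """Return a fixed-length row with a single 1 in the hashed column."""
    row = [0] * n_features
    row[hash_bucket(category, n_features)] = 1
    return row


cities = ["Tokyo", "Paris", "Lima", "Oslo", "Cairo"]

# No vocabulary is stored: any string, seen before or not, maps to some column.
row = hash_encode("Tokyo", n_features=16)

# With 5 categories and only 4 columns, at least two must share a column.
buckets = [hash_bucket(city, n_features=4) for city in cities]
has_collision = len(set(buckets)) < len(cities)

print(len(row), sum(row), has_collision)  # 16 1 True
```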

Embeddings for Very High Cardinality

For categorical variables with tens of thousands of unique values (neural network approach):

import torch
import torch.nn as nn

# Learned embedding — like Word2Vec for categorical features
embedding = nn.Embedding(
    num_embeddings=50000,  # Vocabulary size
    embedding_dim=16       # How many dimensions per category
)

# During forward pass: indices must be integers in [0, num_embeddings)
category_ids = torch.tensor([42, 156, 3], dtype=torch.long)  # Category indices
embedded = embedding(category_ids)  # Shape: (3, 16)
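
Two details are implicit in the snippet above: raw categories must first be mapped to integer indices, and an embedding layer is a lookup table, equivalent to multiplying a one-hot vector by a weight matrix. The pure-Python sketch below illustrates both without torch. The vocabulary helper, the reserved id 0 for unseen values, and the tiny sizes are assumptions made for the example.

```python
import random


def build_vocab(values):
    """Assign ids 1..N to training categories; id 0 is reserved for unseen values."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab) + 1
    return vocab


random.seed(0)
vocab = build_vocab(["tokyo", "paris", "tokyo", "lima"])
num_embeddings = len(vocab) + 1  # one extra row for the unknown id 0
embedding_dim = 3

# The embedding table: one vector per id (random here, learned during training).
table = [[random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
         for _ in range(num_embeddings)]


def embed(ids):
    """Embedding lookup: pick row `i` of the table for each id."""
    return [table[i] for i in ids]


def embed_via_one_hot(ids):
    """Same result computed as one_hot(id) @ table."""
    rows = []
    for i in ids:
        one_hot = [1.0 if j == i else 0.0 for j in range(num_embeddings)]
        rows.append([sum(one_hot[j] * table[j][k] for j in range(num_embeddings))
                     for k in range(embedding_dim)])
    return rows


ids = [vocab.get(city, 0) for city in ["paris", "oslo"]]  # [2, 0]
print(ids)
print(embed(ids))
```

The lookup form avoids ever materialising the one-hot vectors, which is why embeddings stay cheap even with very large vocabularies.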

Encoding Choice Guide

Category type or model   Cardinality       Best encoding
Nominal                  Low (<20)         One-Hot
Nominal                  Medium (20–100)   One-Hot or Target
Nominal                  High (100+)       Target, Frequency, or Hashing
Ordinal                  Any               Ordinal
Binary                   2                 Label (0/1)
For tree models          Any               Ordinal or Target (trees handle numbers well)
For linear/NN            Low               One-Hot
For linear/NN            High              Target or Embedding
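
The nominal, ordinal, and binary rows of the guide can be read as a simple decision rule. The sketch below encodes them in a hypothetical helper, suggest_encoding. The boundary handling (exactly 100 unique values counts as medium) is an assumption, and the model-specific rows are deliberately left out.

```python
def suggest_encoding(n_unique, ordered=False):
    """Return candidate encodings for a column, following the guide's thresholds."""
    if ordered:
        return ["ordinal"]
    if n_unique <= 2:
        return ["label (0/1)"]
    if n_unique < 20:
        return ["one-hot"]
    if n_unique <= 100:  # assumption: 100 itself is treated as medium
        return ["one-hot", "target"]
    return ["target", "frequency", "hashing"]


print(suggest_encoding(3))                # ['one-hot']
print(suggest_encoding(50))               # ['one-hot', 'target']
print(suggest_encoding(5000))             # ['target', 'frequency', 'hashing']
print(suggest_encoding(4, ordered=True))  # ['ordinal']
```

Treat the output as a starting point and validate the choice against the model family, as the last three rows of the guide indicate.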