Categorical Encoding: Converting Categories to Numbers for ML

Learn categorical encoding techniques — one-hot encoding, ordinal encoding, target encoding, frequency encoding, hashing, and handling high-cardinality categoricals.

Categorical Encoding

Machine learning models require numerical inputs, but real-world data is full of categorical variables — text labels like “red”, “medium”, “New York”. Encoding converts these into numbers. The encoding choice matters significantly for both model performance and interpretation.


Types of Categorical Variables

Nominal (no order): color, city, product_type
→ One-hot encoding, target encoding, hashing
Ordinal (natural order): size (S/M/L), satisfaction (low/med/high), education level
→ Ordinal encoding with explicit rank
High-cardinality nominal: zip_code, user_id, product_id (100+ unique values, often thousands or more)
→ Target encoding, frequency encoding, hashing, embeddings
Binary: yes/no, true/false, active/inactive
→ Simple 0/1 encoding (label encoding)

One-Hot Encoding

Creates a binary column for each category. Default choice for nominal categoricals with low cardinality (<20 values):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

encoder = OneHotEncoder(
    drop='first',             # Drop first column (avoid multicollinearity in linear models)
    sparse_output=False,      # Return dense array (sklearn 1.2+; older versions use sparse=False)
    handle_unknown='ignore',  # Encode unseen categories as all zeros at transform time
    min_frequency=10          # Group categories seen fewer than 10 times into one infrequent column (sklearn 1.1+)
)
# With drop='first', an all-zero row is also the encoding of the dropped category
X_encoded = encoder.fit_transform(X_categorical_train)
X_test_encoded = encoder.transform(X_categorical_test)

# Feature names after encoding
feature_names = encoder.get_feature_names_out(input_features=categorical_cols)

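Warning: One-hot encoding a column with 500 unique values creates up to 500 binary columns, almost all of them zero in any given row. That inflates dimensionality and memory use. For columns like this, prefer target encoding, frequency encoding, or hashing.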
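
To make the mechanics concrete, here is a minimal pure-Python sketch of what a one-hot encoder does. It is an illustration, not scikit-learn's implementation; the helper names `fit_one_hot` and `transform_one_hot` are hypothetical.

```python
def fit_one_hot(train_values, drop_first=False):
    """Learn the column layout from training data only."""
    categories = sorted(set(train_values))
    return categories[1:] if drop_first else categories


def transform_one_hot(values, columns):
    """Emit one 0/1 column per learned category; unseen values become all zeros."""
    return [[1 if value == col else 0 for col in columns] for value in values]


train = ["red", "green", "blue", "green"]
test = ["blue", "purple"]  # "purple" never appeared in training

columns = fit_one_hot(train)                        # one column per category
train_encoded = transform_one_hot(train, columns)
test_encoded = transform_one_hot(test, columns)     # "purple" -> all zeros

# Dropping the first category makes it share the all-zero row with unseen values
dropped_columns = fit_one_hot(train, drop_first=True)
ambiguous = transform_one_hot(test, dropped_columns)
```

In `ambiguous`, the known category "blue" and the unseen "purple" get the same all-zero row, which is worth keeping in mind when combining a dropped column with ignored unknowns.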


Ordinal Encoding

For categories with a natural order:

from sklearn.preprocessing import OrdinalEncoder

# Explicit order
encoder = OrdinalEncoder(
    categories=[
        ['Low', 'Medium', 'High'],                     # satisfaction_level: 0, 1, 2
        ['High School', 'Bachelor', 'Master', 'PhD']   # education: 0, 1, 2, 3
    ],
    handle_unknown='use_encoded_value',
    unknown_value=-1
)
X_ordinal = encoder.fit_transform(X[['satisfaction_level', 'education']])
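
The explicit order matters because an encoder left to infer categories typically sorts them, which for strings means alphabetically. The following pure-Python sketch (the helper `ordinal_encode` and the size labels are illustrative assumptions) shows how the two orderings diverge:

```python
SIZE_ORDER = ["S", "M", "L"]  # explicit rank, smallest to largest


def ordinal_encode(values, order, unknown_value=-1):
    """Map each value to its position in `order`; unseen values get unknown_value."""
    rank = {category: index for index, category in enumerate(order)}
    return [rank.get(value, unknown_value) for value in values]


sizes = ["M", "L", "S", "XL"]  # "XL" is not in the declared order

explicit = ordinal_encode(sizes, SIZE_ORDER)              # respects S < M < L
alphabetical = ordinal_encode(sizes, sorted(SIZE_ORDER))  # sorts to L, M, S
```

With the alphabetical order, "L" ends up with the lowest rank and "S" with the highest, so the numeric values no longer reflect the real ordering.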

Target Encoding

Replaces each category with the mean target value for that category. Excellent for high-cardinality features:

# sklearn 1.3+ TargetEncoder
from sklearn.preprocessing import TargetEncoder

target_enc = TargetEncoder(
    cv=5,            # fit_transform cross-fits over 5 folds to avoid leakage within the training set
    smooth='auto',   # Blends category mean with global mean for rare categories
    random_state=42
)
# Must pass y during fit; on training data use fit_transform, not fit followed by transform
X_encoded = target_enc.fit_transform(X[high_cardinality_cols], y)

# category_encoders library for more options
from category_encoders import TargetEncoder as CE_TargetEncoder
enc = CE_TargetEncoder(cols=['zip_code', 'product_id'])
# Note: this encoder smooths but does not cross-fit; apply it inside your own K-fold loop
X_encoded = enc.fit_transform(X, y)

Key risk: Target encoding without cross-validation causes target leakage — the category value encodes information about the target directly from the same sample. Always use cross-fitting or the sklearn TargetEncoder.
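
Cross-fitting can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the scikit-learn algorithm: folds are assigned by row index modulo `n_folds` rather than by shuffled K-fold, `smooth` is a fixed pseudo-count, and the function names are hypothetical.

```python
from collections import defaultdict


def smoothed_means(categories, targets, smooth):
    """Per-category target mean shrunk toward the global mean.

    encoding = (n * category_mean + smooth * global_mean) / (n + smooth)
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    # n * category_mean is simply the per-category sum of targets
    table = {
        category: (sums[category] + smooth * global_mean) / (counts[category] + smooth)
        for category in counts
    }
    return table, global_mean


def cross_fit_target_encode(categories, targets, n_folds=5, smooth=10.0):
    """Encode each row using statistics computed on the other folds only."""
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    encoded = [0.0] * len(categories)
    for fold in range(n_folds):
        held_out = [i for i in range(len(categories)) if i % n_folds == fold]
        fit_rows = [i for i in range(len(categories)) if i % n_folds != fold]
        table, global_mean = smoothed_means(
            [categories[i] for i in fit_rows],
            [targets[i] for i in fit_rows],
            smooth,
        )
        for i in held_out:
            # Categories absent from the fitting folds fall back to the global mean
            encoded[i] = table.get(categories[i], global_mean)
    return encoded


rows = ["x", "y", "y", "y"]
targets = [1, 0, 0, 0]
naive_table, _ = smoothed_means(rows, targets, smooth=0.0)
out_of_fold = cross_fit_target_encode(rows, targets, n_folds=2, smooth=0.0)
```

Here `naive_table["x"]` is 1.0, which is exactly the target of the only "x" row, while `out_of_fold[0]` is 0.0 because that row's own target never enters its encoding.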


Frequency / Count Encoding

Replaces category with its frequency in the training set. Preserves information about rare vs. common categories:

def frequency_encode(df, col, train_df=None):
    """Encode by frequency — fit on train, apply to test."""
    if train_df is None:
        train_df = df
    freq_map = train_df[col].value_counts().to_dict()
    return df[col].map(freq_map).fillna(0).astype(int)

X_train['city_freq'] = frequency_encode(X_train, 'city')
X_test['city_freq'] = frequency_encode(X_test, 'city', train_df=X_train)

Hashing Trick

For extremely high cardinality (user IDs, product IDs with millions of values):

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=1024, input_type='string')
# Converts categories to a fixed-size hash vector (returned as a SciPy sparse matrix)
# FeatureHasher is stateless, so there is no vocabulary to fit or store
X_hashed = hasher.fit_transform([[city] for city in X_train['city']])

Collisions (two categories mapping to the same hash) trade precision for bounded memory.
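
The idea behind the trick can be shown without any library. The sketch below uses MD5 from the standard library as a stand-in hash, which is an assumption made for illustration; scikit-learn's FeatureHasher uses MurmurHash3 and, by default, a sign bit. The function names are hypothetical.

```python
import hashlib


def hash_bucket(category, n_features):
    """Map a string to a stable column index in [0, n_features)."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features


def hash_encode(category, n_features):
    """One-hot over hash buckets: fixed width, no fitted vocabulary."""
    row = [0] * n_features
    row[hash_bucket(category, n_features)] = 1
    return row


user_ids = [f"user_{i}" for i in range(1000)]
buckets = {hash_bucket(user_id, 64) for user_id in user_ids}
# 1000 IDs squeezed into at most 64 columns, so many IDs must share a bucket
```

Memory stays fixed at `n_features` columns no matter how many distinct IDs appear, and a previously unseen ID still gets a valid column. The cost is that the model cannot tell colliding categories apart.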


Embeddings for Very High Cardinality

For categorical variables with tens of thousands of unique values (neural network approach):

import torch
import torch.nn as nn

# Learned embedding — like Word2Vec for categorical features
embedding = nn.Embedding(
    num_embeddings=50000,  # Vocabulary size
    embedding_dim=16       # How many dimensions per category
)

# During forward pass
category_ids = torch.tensor([42, 156, 3], dtype=torch.long)  # Category indices
embedded = embedding(category_ids)  # Shape: (3, 16)
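
The category indices fed to an embedding layer have to come from a mapping built on the training data. One common convention, sketched here in plain Python as an assumption rather than a requirement of PyTorch, is to reserve index 0 for categories not seen during training. The names `UNKNOWN_INDEX`, `build_vocab`, and `to_indices` are hypothetical.

```python
UNKNOWN_INDEX = 0  # reserved row for categories not seen during training


def build_vocab(train_values):
    """Assign indices 1..K to training categories in first-seen order."""
    vocab = {}
    for value in train_values:
        if value not in vocab:
            vocab[value] = len(vocab) + 1
    return vocab


def to_indices(values, vocab):
    """Look up each category, falling back to the reserved unknown index."""
    return [vocab.get(value, UNKNOWN_INDEX) for value in values]


vocab = build_vocab(["p_17", "p_03", "p_17", "p_99"])
indices = to_indices(["p_99", "p_03", "p_new"], vocab)
num_embeddings = len(vocab) + 1  # +1 for the reserved unknown row
```

The resulting `num_embeddings` is the value the embedding table would need, and `indices` is the kind of list that would be converted to a long tensor before the forward pass.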

Encoding Choice Guide

Category Type     Cardinality       Best Encoding
Nominal           Low (<20)         One-Hot
Nominal           Medium (20–100)   One-Hot or Target
Nominal           High (100+)       Target, Frequency, or Hashing
Ordinal           Any               Ordinal
Binary            2                 Label (0/1)
For tree models   Any               Ordinal or Target (trees handle numbers well)
For linear/NN     Low               One-Hot
For linear/NN     High              Target or Embedding
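
The guide can be read as a small decision function. The sketch below is one possible interpretation, not a definitive rule: the function name, the `model_family` labels, the precedence between rows, and the choice to suggest embeddings only for neural networks are all assumptions layered on top of the table's thresholds.

```python
def suggest_encoding(n_unique, ordered=False, model_family="linear"):
    """Return candidate encodings for one column.

    model_family is a hypothetical label: "tree", "linear", or "nn".
    """
    if n_unique <= 2:
        return ["label (0/1)"]
    if ordered:
        return ["ordinal"]
    if model_family == "tree":
        return ["ordinal", "target"]
    if n_unique < 20:
        return ["one-hot"]
    if n_unique <= 100:
        return ["one-hot", "target"]
    candidates = ["target", "frequency", "hashing"]
    if model_family == "nn":
        candidates.append("embedding")
    return candidates


low_card = suggest_encoding(5)
high_card_nn = suggest_encoding(50_000, model_family="nn")
```

Treat the output as a shortlist to validate empirically, since the best encoding for a given column still depends on the data and the model.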