Categorical Encoding
Machine learning models require numerical inputs, but real-world data is full of categorical variables — text labels like “red”, “medium”, “New York”. Encoding converts these into numbers. The encoding choice matters significantly for both model performance and interpretation.
Types of Categorical Variables
Nominal (no order): color, city, product_type → One-hot encoding, target encoding, hashing
Ordinal (natural order): size (S/M/L), satisfaction (low/med/high), education level → Ordinal encoding with explicit rank
High cardinality nominal: zip_code, user_id, product_id (1000+ unique values) → Target encoding, frequency encoding, hashing, embeddings
Binary: yes/no, true/false, active/inactive → Simple 0/1 encoding (label encoding)
One-Hot Encoding
Creates a binary column for each category. Default choice for nominal categoricals with low cardinality (<20 values):

    from sklearn.preprocessing import OneHotEncoder
    import pandas as pd

    encoder = OneHotEncoder(
        drop='first',             # Drop first column (avoid multicollinearity in linear models)
        sparse_output=False,      # Return dense array (this parameter name needs sklearn 1.2+)
        handle_unknown='ignore',  # Unseen categories become all zeros at test time;
                                  # with drop='first' that matches the dropped category
        min_frequency=10          # Pool categories seen fewer than 10 times into one
                                  # "infrequent" column (sklearn 1.1+)
    )

    X_encoded = encoder.fit_transform(X_categorical_train)
    X_test_encoded = encoder.transform(X_categorical_test)

    # Feature names after encoding
    feature_names = encoder.get_feature_names_out(input_features=categorical_cols)

Warning: One-hot encoding a column with 500 unique values creates roughly 500 binary columns, which invites the curse of dimensionality. For high-cardinality columns, use target encoding instead.
Ordinal Encoding
For categories with a natural order:

    from sklearn.preprocessing import OrdinalEncoder

    # Explicit order
    encoder = OrdinalEncoder(
        categories=[
            ['Low', 'Medium', 'High'],                     # satisfaction_level: 0, 1, 2
            ['High School', 'Bachelor', 'Master', 'PhD']   # education: 0, 1, 2, 3
        ],
        handle_unknown='use_encoded_value',
        unknown_value=-1
    )

    X_ordinal = encoder.fit_transform(X[['satisfaction_level', 'education']])

Target Encoding
Replaces each category with the mean target value for that category. Excellent for high-cardinality features:

    # sklearn 1.3+ TargetEncoder
    from sklearn.preprocessing import TargetEncoder

    target_enc = TargetEncoder(
        cv=5,            # Number of cross-fitting folds used by fit_transform
        smooth='auto',   # Blends category mean with global mean for rare categories
        random_state=42
    )

    # Must pass y during fit. fit_transform cross-fits; fit(...).transform(...) on the
    # same data does not, so use fit_transform for the training set.
    X_encoded = target_enc.fit_transform(X[high_cardinality_cols], y)

    # category_encoders library for more options
    from category_encoders import TargetEncoder as CE_TargetEncoder
    enc = CE_TargetEncoder(cols=['zip_code', 'product_id'])
    X_encoded = enc.fit_transform(X, y)

Key risk: Target encoding without cross-validation causes target leakage, because each row's encoded value is a category mean that includes that row's own target. Always use cross-fitting, for example through the sklearn TargetEncoder's fit_transform.
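The leakage risk is easiest to see on a category that occurs only once. The following is a hand-rolled sketch of cross-fitting with additive smoothing, not a description of how sklearn implements it. The helper name `oof_target_encode`, the `smoothing=10.0` default, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(cat, y, n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold target encoding with additive smoothing (illustrative only)."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        fit_cat, fit_y = cat.iloc[fit_idx], y.iloc[fit_idx]
        prior = fit_y.mean()  # global mean of the fitting folds
        stats = fit_y.groupby(fit_cat).agg(["sum", "count"])
        # Shrink each category mean toward the prior; rare categories shrink most.
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Rows in the held-out fold only see statistics from the other folds.
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(smoothed).fillna(prior).to_numpy()
    return encoded


# Toy data: the "rare" category appears once, with target 1.
cats = ["a"] * 10 + ["rare"]
target = [0] * 10 + [1]

naive = pd.Series(target, dtype=float).groupby(pd.Series(cats)).transform("mean")
encoded = oof_target_encode(cats, target)

print(naive.iloc[-1])    # 1.0: the naive mean echoes the row's own target
print(encoded.iloc[-1])  # 0.0: the out-of-fold value never sees that target
```

At prediction time, the test set would be encoded with statistics fitted on the full training set; only the training rows need the out-of-fold treatment.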
Frequency / Count Encoding
Replaces category with its frequency in the training set. Preserves information about rare vs. common categories:

    def frequency_encode(df, col, train_df=None):
        """Encode by frequency: fit on train, apply to test."""
        if train_df is None:
            train_df = df
        freq_map = train_df[col].value_counts().to_dict()
        return df[col].map(freq_map).fillna(0).astype(int)

    X_train['city_freq'] = frequency_encode(X_train, 'city')
    X_test['city_freq'] = frequency_encode(X_test, 'city', train_df=X_train)

Hashing Trick
For extremely high cardinality (user IDs, product IDs with millions of values):

    from sklearn.feature_extraction import FeatureHasher

    hasher = FeatureHasher(n_features=1024, input_type='string')

    # Converts categories to a fixed-size hash vector
    X_hashed = hasher.fit_transform([[city] for city in X_train['city']])

Collisions (two categories mapping to the same hash) trade precision for bounded memory.
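The collision trade-off can be observed directly by hashing more categories than there are buckets. This is a small sketch with made-up labels and a deliberately tiny `n_features=16` so that collisions are guaranteed; real settings would use a much larger width.

```python
from sklearn.feature_extraction import FeatureHasher

# 100 hypothetical category labels hashed into only 16 buckets.
categories = [f"city_{i}" for i in range(100)]
hasher = FeatureHasher(n_features=16, input_type="string")

# FeatureHasher is stateless, so transform can be called without fitting.
X_small = hasher.transform([[c] for c in categories])

# Each row has a single non-zero entry; its column index is the bucket.
_, bucket_ids = X_small.nonzero()
n_buckets_used = len(set(bucket_ids.tolist()))
n_collisions = len(categories) - n_buckets_used

print(X_small.shape)        # (100, 16)
print(n_collisions >= 84)   # True: at most 16 buckets for 100 labels
```

The output width stays at 16 no matter how many labels arrive, which is the bounded-memory side of the trade; the cost is that colliding labels become indistinguishable to the model.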
Embeddings for Very High Cardinality
For categorical variables with tens of thousands of unique values (neural network approach):

    import torch
    import torch.nn as nn

    # Learned embedding, like Word2Vec for categorical features
    embedding = nn.Embedding(
        num_embeddings=50000,  # Vocabulary size
        embedding_dim=16       # How many dimensions per category
    )

    # During forward pass
    category_ids = torch.LongTensor([42, 156, 3])  # Category indices
    embedded = embedding(category_ids)             # Shape: (3, 16)

Encoding Choice Guide
| Category Type / Model | Cardinality | Best Encoding |
|---|---|---|
| Nominal | Low (<20) | One-Hot |
| Nominal | Medium (20–100) | One-Hot or Target |
| Nominal | High (100+) | Target, Frequency, or Hashing |
| Ordinal | Any | Ordinal |
| Binary | 2 | Label (0/1) |
| For tree models | Any | Ordinal or Target (trees handle numbers well) |
| For linear/NN | Low | One-Hot |
| For linear/NN | High | Target or Embedding |
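The guide above can be sketched as a small lookup helper. The function `suggest_encoding` is hypothetical, and it makes several assumptions the table leaves open: model-specific rows take priority over the generic nominal rows, exactly 100 categories counts as medium, and "low" and "high" for linear/NN models use the same 20 and 100 thresholds.

```python
def suggest_encoding(kind, cardinality, model=None):
    """Return the table's suggestion as a string (illustrative helper).

    kind: "binary", "ordinal", or "nominal"
    cardinality: number of distinct categories
    model: None, "tree", "linear", or "nn"
    """
    if kind == "binary":
        return "label (0/1)"
    if kind == "ordinal":
        return "ordinal"

    # From here on the variable is treated as nominal.
    if model == "tree":
        return "ordinal or target"
    if model in ("linear", "nn"):
        if cardinality < 20:
            return "one-hot"
        if cardinality > 100:
            return "target or embedding"

    # Generic nominal rows (also the fallback for medium-cardinality linear/NN).
    if cardinality < 20:
        return "one-hot"
    if cardinality <= 100:
        return "one-hot or target"
    return "target, frequency, or hashing"


print(suggest_encoding("nominal", 5))                 # one-hot
print(suggest_encoding("nominal", 5000, model="nn"))  # target or embedding
```

A helper like this is only a starting point; the final choice would still be validated against held-out performance.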