Feature Engineering

There’s an old saying in machine learning: “garbage in, garbage out.” The flip side is equally true — better features produce better models, often more dramatically than switching to a fancier algorithm. Feature engineering is the craft of transforming raw data into meaningful inputs that help models learn faster and generalize better.

What Is a Feature?

A feature is any measurable property used as input to a model. Raw data is rarely ready to use as-is. Feature engineering transforms raw signals into representations the model can work with effectively.

Raw data:              After feature engineering:
──────────────────     ──────────────────────────────────────
timestamp              hour_of_day, day_of_week, is_weekend
user_id + item_id      user_avg_rating, item_popularity, user_item_category_match
text: "great product"  tf_idf_score, sentiment_score, word_count, has_exclamation
lat/lon coordinates    distance_to_nearest_store, city_population_density

Core Feature Engineering Techniques

Numerical Transformations

Log transform: Compress right-skewed distributions (income, prices, counts).

import numpy as np
df['log_price'] = np.log1p(df['price'])  # log(1 + price) handles zeros
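To see what the transform does to a skewed column, here is a minimal sketch using only the standard library. The sample prices are made up for illustration. `math.log1p` computes the same function as `np.log1p`, and `math.expm1` inverts it, which matters if the log-transformed column is later used as a prediction target.

```python
import math

# Hypothetical right-skewed prices, including a zero
prices = [0.0, 9.99, 25.0, 120.0, 15000.0]

# log1p(x) = log(1 + x): defined at zero, compresses large values
log_prices = [math.log1p(p) for p in prices]

# expm1 is the inverse of log1p, so the original scale is recoverable
recovered = [math.expm1(v) for v in log_prices]

raw_spread = max(prices) - min(prices)
log_spread = max(log_prices) - min(log_prices)
print(f"raw spread: {raw_spread:.1f}, log spread: {log_spread:.2f}")
```

Note that `log1p` is only defined for values greater than -1, so columns that can go negative (refunds, balances) need a different treatment.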

Polynomial features: Capture non-linear relationships.

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X[['age', 'income']])
# Creates (in this column order): age, income, age², age×income, income²
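The expansion itself is simple enough to write by hand, which makes it easier to see which columns come out and how fast their number grows. The sketch below is a plain-Python illustration, not scikit-learn's implementation; `poly_features` is a hypothetical helper and the input row is invented.

```python
from itertools import combinations_with_replacement

def poly_features(row, names, degree=2):
    """Return (feature_names, values) for all monomials up to `degree`, no bias term."""
    out_names, out_values = [], []
    for d in range(1, degree + 1):
        # Every multiset of d column indices is one monomial
        for combo in combinations_with_replacement(range(len(row)), d):
            value = 1.0
            for i in combo:
                value *= row[i]
            out_names.append("*".join(names[i] for i in combo))
            out_values.append(value)
    return out_names, out_values

# Hypothetical row: age 30, income 2 (in some scaled unit)
names, values = poly_features([30.0, 2.0], ["age", "income"])
print(dict(zip(names, values)))
```

Two inputs give five columns at degree 2, but ten inputs already give 65, so polynomial expansion is usually applied to a handful of hand-picked columns rather than the whole table.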

Binning / Discretization: Convert continuous to categorical.

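import pandas as pd
# include_lowest=True keeps age 0 in the first bin; ages above 100 still become NaN
df['age_group'] = pd.cut(df['age'], bins=[0,18,35,50,65,100], include_lowest=True,
                          labels=['teen','young','mid','senior','elder'])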
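Bin boundaries are where binning bugs hide. `pd.cut` uses right-closed intervals by default, so each bin includes its upper edge but not its lower one. The sketch below mimics that default rule in plain Python to show where boundary values land; `age_group` is a hypothetical helper, not part of pandas.

```python
from bisect import bisect_left

EDGES = [0, 18, 35, 50, 65, 100]
LABELS = ["teen", "young", "mid", "senior", "elder"]

def age_group(age):
    """Right-closed bins: (0, 18], (18, 35], ... Returns None outside the bins."""
    idx = bisect_left(EDGES, age) - 1
    if 0 <= idx < len(LABELS):
        return LABELS[idx]
    return None

# Probe the edges: lowest edge, an interior edge, the top edge, and out of range
groups = {age: age_group(age) for age in [0, 18, 19, 65, 100, 101]}
print(groups)
```

Under this rule a value equal to the lowest edge (here 0) gets no bin, and neither does anything above the top edge. In pandas, `include_lowest=True` closes the first gap; the second needs a wider top edge or explicit handling of missing values.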

Datetime Features

One of the most consistently valuable feature engineering moves:

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek   # 0=Monday
df['month'] = df['timestamp'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
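today = pd.Timestamp.today().normalize()  # reference date; pin a fixed date for reproducible features
df['days_since_last_purchase'] = (today - pd.to_datetime(df['last_purchase_date'])).dt.days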

# Cyclical encoding (hour 23 and hour 0 are neighbors)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
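A quick way to check that the cyclical encoding does what the comment claims is to measure distances between encoded hours. The sketch below uses only the standard library; `encode_hour` and `distance` are hypothetical helper names.

```python
import math

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return (math.sin(angle), math.cos(angle))

def distance(a, b):
    """Euclidean distance between two encoded hours."""
    return math.dist(encode_hour(a), encode_hour(b))

d_23_0 = distance(23, 0)   # adjacent across midnight
d_0_1 = distance(0, 1)     # adjacent within the day
d_0_12 = distance(0, 12)   # opposite sides of the clock
print(f"23 to 0: {d_23_0:.3f}, 0 to 1: {d_0_1:.3f}, 0 to 12: {d_0_12:.3f}")
```

Both columns are needed: the sine alone maps hour 3 and hour 9 to the same value, and only the cosine tells them apart.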

Interaction Features

Combine two features to capture relationships neither alone can express.

# Health: BMI = weight / height²
df['bmi'] = df['weight_kg'] / (df['height_m'] ** 2)

# Fraud: unusual purchase amount relative to personal average
df['amount_vs_avg_ratio'] = df['amount'] / (df['user_avg_amount'] + 1e-6)

# E-commerce: recency × frequency × monetary (RFM)
df['rfm_score'] = df['recency_score'] * df['frequency_score'] * df['monetary_score']
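The value of a ratio feature is easiest to see on two rows that look identical until the interaction is computed. The transactions below are invented for illustration, and plain dictionaries stand in for DataFrame rows.

```python
# Hypothetical transactions: same amount, very different spending histories
transactions = [
    {"user": "a", "amount": 500.0, "user_avg_amount": 50.0},
    {"user": "b", "amount": 500.0, "user_avg_amount": 480.0},
]

EPS = 1e-6  # guards against division by zero for users with no history

for t in transactions:
    t["amount_vs_avg_ratio"] = t["amount"] / (t["user_avg_amount"] + EPS)

ratios = [round(t["amount_vs_avg_ratio"], 2) for t in transactions]
print(ratios)
```

The raw amount cannot separate the two users, while the ratio flags user "a" as spending roughly ten times their norm. One caveat of the epsilon guard: when the average is exactly zero the ratio becomes enormous, so it may be worth clipping the result or treating first-time users as a separate case.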

Aggregation Features

Compute statistics across a related group (often very powerful for tabular data):

# User-level aggregations
user_stats = df.groupby('user_id').agg(
    user_avg_amount=('amount', 'mean'),
    user_transaction_count=('amount', 'count'),
    user_max_amount=('amount', 'max'),
    user_unique_merchants=('merchant_id', 'nunique')
).reset_index()

df = df.merge(user_stats, on='user_id', how='left')
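One caveat with aggregation features: if the statistics are computed over all rows, including those later used for validation or testing, information leaks from the evaluation set into the features. A minimal sketch of the usual remedy, assuming a simple train/test split and made-up rows, is to compute the statistics on training rows only and look them up for everything else. The fallback to a global average for unseen users is one possible choice among several.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (user_id, amount)
train_rows = [("u1", 10.0), ("u1", 30.0), ("u2", 100.0)]
test_rows = [("u1", 500.0), ("u3", 20.0)]  # u3 never appears in training

# 1. Fit: compute per-user statistics from training rows only
amounts_by_user = defaultdict(list)
for user_id, amount in train_rows:
    amounts_by_user[user_id].append(amount)
user_avg = {u: mean(a) for u, a in amounts_by_user.items()}
global_avg = mean(amount for _, amount in train_rows)

# 2. Transform: look up the stored statistic, falling back for unseen users
def add_user_avg(rows):
    return [(u, amt, user_avg.get(u, global_avg)) for u, amt in rows]

test_features = add_user_avg(test_rows)
print(test_features)
```

The 500.0 test transaction for "u1" does not influence that user's stored average, which is exactly the property that keeps evaluation honest.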

Domain-Specific Features: Where Real Gains Happen

As a rough rule of thumb (not a measured figure), generic features get you to perhaps 70–80% of achievable performance. Domain knowledge gets you the rest.

E-commerce fraud detection:

Time since account creation (new accounts are riskier)
Number of failed payment attempts in last 24h
Distance between billing and shipping addresses
Whether billing address was recently changed

Credit scoring:

Debt-to-income ratio
Utilization rate (balance / credit limit)
Payment behavior over last 12 months
Credit inquiry frequency

Medical diagnosis:

Rate of change in lab values (not just current value)
Ratio of correlated biomarkers
Days between symptoms and first visit
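As one worked example from the lists above, the distance between billing and shipping addresses can be sketched with the haversine formula. This assumes both addresses have already been geocoded to latitude and longitude. The coordinates and the `haversine_km` helper are illustrative, not taken from any particular fraud system.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical order: billing address in Berlin, shipping address in Paris
billing = (52.52, 13.405)
shipping = (48.8566, 2.3522)
billing_shipping_km = haversine_km(*billing, *shipping)
print(f"{billing_shipping_km:.0f} km")
```

The same function covers the `distance_to_nearest_store` idea from the table at the top: compute the distance to each store and take the minimum.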

Text-Derived Features

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Manual features from text
df['word_count'] = df['review'].str.split().str.len()
df['char_count'] = df['review'].str.len()
df['has_question'] = df['review'].str.contains('?', regex=False).astype(int)  # literal match, no regex escaping needed
df['exclamation_count'] = df['review'].str.count('!')
df['avg_word_length'] = df['review'].apply(lambda x: np.mean([len(w) for w in x.split()]) if x.split() else 0.0)  # 0.0 for empty reviews

# TF-IDF representation (in a real pipeline, fit on training data only, then transform validation/test data)
tfidf = TfidfVectorizer(max_features=500, ngram_range=(1,2))
tfidf_features = tfidf.fit_transform(df['review'])
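To make the TF-IDF numbers less of a black box, here is a minimal hand-rolled version. It follows the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 normalization that the scikit-learn documentation describes as the `TfidfVectorizer` defaults, but it is a sketch: tokenization is a bare whitespace split, and the three documents are invented.

```python
import math
from collections import Counter

docs = ["great product", "great price", "terrible product support"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Smoothed, L2-normalized TF-IDF weights for one document."""
    tf = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df_counts[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(tokenized[2])  # "terrible product support"
print(weights)
```

Terms that appear in fewer documents ("terrible", "support") end up with a higher weight than the more common "product", which is the whole point of the IDF factor.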

Automated Feature Engineering

Manual feature engineering is part art, part science, and some of it can be automated:

Featuretools: Deep Feature Synthesis — automatically creates features from relational tables.

import featuretools as ft

es = ft.EntitySet("ecommerce")
es.add_dataframe(dataframe_name="transactions", dataframe=transactions, index="id")
es.add_dataframe(dataframe_name="customers", dataframe=customers, index="customer_id")
es.add_relationship("customers", "customer_id", "transactions", "customer_id")

# dfs returns the computed feature matrix and the list of feature definitions
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="transactions",
                                      max_depth=2)

AutoFeat: Generates mathematical combinations of features and selects useful ones.

The Feature Importance Feedback Loop

After training, use feature importance to inform your next engineering iteration:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import matplotlib.pyplot as plt

model = RandomForestClassifier()
model.fit(X_train, y_train)

importance = pd.Series(model.feature_importances_, index=X_train.columns)
importance.nlargest(20).plot(kind='barh', figsize=(10, 8))
plt.title("Top 20 Feature Importances")
plt.show()

Low-importance features are usually cheap to remove, and dropping them can reduce noise, but confirm on a validation set that the score does not degrade. High-importance engineered features suggest directions for further engineering.
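The `feature_importances_` attribute of a random forest is impurity-based, and such scores can be inflated for high-cardinality or correlated columns. A common cross-check is permutation importance: shuffle one column and measure how much the score drops. The sketch below is a different technique from the one used above, written in plain Python with a fixed threshold rule standing in for a trained model; the data and the helper names are hypothetical. scikit-learn ships its own version as `sklearn.inspection.permutation_importance`.

```python
import random

# Hypothetical data: feature 0 determines the label, feature 1 is pure noise
rng = random.Random(0)
X = [[float(i), rng.random()] for i in range(20)]
y = [int(row[0] >= 10) for row in X]

def model_predict(row):
    """Stand-in for a trained model: a fixed threshold rule on feature 0."""
    return int(row[0] >= 10)

def accuracy(rows, labels):
    return sum(model_predict(r) == t for r, t in zip(rows, labels)) / len(labels)

def permutation_importance(rows, labels, col, n_repeats=10, seed=0):
    """Mean drop in accuracy when column `col` is shuffled."""
    shuffle_rng = random.Random(seed)
    baseline = accuracy(rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[col] for r in rows]
        shuffle_rng.shuffle(shuffled)
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(permuted, labels))
    return sum(drops) / n_repeats

imp_signal = permutation_importance(X, y, col=0)
imp_noise = permutation_importance(X, y, col=1)
print(f"signal: {imp_signal:.2f}, noise: {imp_noise:.2f}")
```

Shuffling the noise column changes nothing because the stand-in model never reads it, while shuffling the signal column costs accuracy. In practice the score should be measured on held-out data rather than on the training set.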

Good feature engineering is iterative, domain-driven, and validated empirically. The best features come from deep understanding of the problem, not from automated searches alone.