Feature Engineering
There’s an old saying in machine learning: “garbage in, garbage out.” The flip side is equally true — better features produce better models, often more dramatically than switching to a fancier algorithm. Feature engineering is the craft of transforming raw data into meaningful inputs that help models learn faster and generalize better.
What Is a Feature?
A feature is any measurable property used as input to a model. Raw data is rarely ready to use as-is. Feature engineering transforms raw signals into representations the model can work with effectively.
Raw data                  After feature engineering
──────────────────        ──────────────────────────────────────
timestamp                 hour_of_day, day_of_week, is_weekend
user_id + item_id         user_avg_rating, item_popularity, user_item_category_match
text: "great product"     tf_idf_score, sentiment_score, word_count, has_exclamation
lat/lon coordinates       distance_to_nearest_store, city_population_density

Core Feature Engineering Techniques
Numerical Transformations
Log transform: Compress right-skewed distributions (income, prices, counts).
    import numpy as np

    df['log_price'] = np.log1p(df['price'])  # log(1 + price) handles zeros

Polynomial features: Capture non-linear relationships.

    from sklearn.preprocessing import PolynomialFeatures

    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X[['age', 'income']])
    # Output columns, in order: age, income, age², age×income, income²

Binning / Discretization: Convert continuous to categorical.

    import pandas as pd

    df['age_group'] = pd.cut(
        df['age'],
        bins=[0, 18, 35, 50, 65, 100],
        labels=['teen', 'young', 'mid', 'senior', 'elder'],
        include_lowest=True,  # keep age 0 in the first bin instead of NaN
    )

Datetime Features
One of the most consistently valuable feature engineering moves:
    import numpy as np
    import pandas as pd

    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['hour'] = df['timestamp'].dt.hour
    df['day_of_week'] = df['timestamp'].dt.dayofweek  # 0=Monday
    df['month'] = df['timestamp'].dt.month
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

    # Reference date for recency; assumed here to be the current day
    today = pd.Timestamp.today().normalize()
    df['days_since_last_purchase'] = (today - pd.to_datetime(df['last_purchase_date'])).dt.days

    # Cyclical encoding (hour 23 and hour 0 are neighbors)
    df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
    df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

Interaction Features
Combine two features to capture relationships neither alone can express.
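To make that concrete, here is a minimal sketch on synthetic data. The arrays `x1` and `x2`, the target `y`, and the helper `r_squared` are all made up for illustration. A linear least-squares fit on the two raw columns cannot represent a target that depends on their product, while adding the product as a third column lets the same linear fit recover it almost exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
y = x1 * x2  # synthetic target that depends only on the interaction


def r_squared(X, y):
    """R² of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()


# Raw features only: the linear model has nothing to work with
r2_raw = r_squared(np.column_stack([x1, x2]), y)

# Raw features plus the engineered interaction column
r2_inter = r_squared(np.column_stack([x1, x2, x1 * x2]), y)

print(f"R² without interaction: {r2_raw:.3f}")
print(f"R² with interaction:    {r2_inter:.3f}")
```

Tree-based models can approximate such products through repeated splits, so the gain from an explicit interaction column is typically largest for linear models.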
    # Physical: BMI = weight / height²
    df['bmi'] = df['weight_kg'] / (df['height_m'] ** 2)

    # Fraud: unusual purchase amount relative to personal average
    # (the small constant avoids division by zero)
    df['amount_vs_avg_ratio'] = df['amount'] / (df['user_avg_amount'] + 1e-6)

    # E-commerce: recency × frequency × monetary (RFM)
    df['rfm_score'] = df['recency_score'] * df['frequency_score'] * df['monetary_score']

Aggregation Features
Compute statistics across a related group (often very powerful for tabular data):
    # User-level aggregations
    # Note: compute these on training data only, then merge onto validation
    # and test rows, so that no held-out information leaks into the features.
    user_stats = df.groupby('user_id').agg(
        user_avg_amount=('amount', 'mean'),
        user_transaction_count=('amount', 'count'),
        user_max_amount=('amount', 'max'),
        user_unique_merchants=('merchant_id', 'nunique'),
    ).reset_index()

    df = df.merge(user_stats, on='user_id', how='left')

Domain-Specific Features: Where Real Gains Happen
As a rough rule of thumb, generic features get you to perhaps 70–80% of achievable performance. Domain knowledge gets you the rest.
E-commerce fraud detection:
- Time since account creation (new accounts are riskier)
- Number of failed payment attempts in last 24h
- Distance between billing and shipping addresses
- Whether billing address was recently changed
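A minimal sketch of three of these fraud signals with pandas. The `orders` table, its column names, and the 7-day "recently changed" window are all assumptions made for illustration. Counting failed payment attempts would additionally need a payment-attempt log, which is not modeled here.

```python
import numpy as np
import pandas as pd

# Hypothetical order-level data; every column name here is illustrative
orders = pd.DataFrame({
    "order_time": pd.to_datetime(["2024-03-10 12:00", "2024-03-10 13:00"]),
    "account_created_at": pd.to_datetime(["2021-03-10 12:00", "2024-03-09 12:00"]),
    "billing_changed_at": pd.to_datetime(["2023-01-01 00:00", "2024-03-10 09:00"]),
    "billing_lat": [40.0, 40.0],
    "billing_lon": [-74.0, -74.0],
    "shipping_lat": [40.0, 41.0],
    "shipping_lon": [-74.0, -74.0],
})


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (mean Earth radius 6371 km)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


# Time since account creation, measured at order time
orders["account_age_days"] = (orders["order_time"] - orders["account_created_at"]).dt.days

# Distance between billing and shipping addresses
orders["billing_shipping_km"] = haversine_km(
    orders["billing_lat"], orders["billing_lon"],
    orders["shipping_lat"], orders["shipping_lon"],
)

# Billing address changed within the last 7 days (window is an assumption)
orders["billing_changed_recently"] = (
    (orders["order_time"] - orders["billing_changed_at"]) <= pd.Timedelta(days=7)
).astype(int)
```

Measuring every feature relative to `order_time`, rather than the current date, keeps the values identical at training and scoring time.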
Credit scoring:
- Debt-to-income ratio
- Utilization rate (balance / credit limit)
- Payment behavior over last 12 months
- Credit inquiry frequency
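The two ratio features in this list can be sketched as follows. The `applicants` table and its column names are hypothetical. Zero denominators are mapped to NaN here so that the ratios stay undefined rather than becoming infinite; whether NaN is the right treatment depends on the downstream model.

```python
import numpy as np
import pandas as pd

# Hypothetical applicant data; column names are illustrative
applicants = pd.DataFrame({
    "monthly_debt": [1500.0, 400.0, 0.0],
    "monthly_income": [5000.0, 4000.0, 0.0],
    "balance": [2500.0, 0.0, 300.0],
    "credit_limit": [10000.0, 5000.0, 0.0],
})

# Debt-to-income ratio; zero income yields NaN instead of inf
applicants["debt_to_income"] = (
    applicants["monthly_debt"] / applicants["monthly_income"].replace(0, np.nan)
)

# Utilization rate = balance / credit limit; zero limit yields NaN
applicants["utilization"] = (
    applicants["balance"] / applicants["credit_limit"].replace(0, np.nan)
)
```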
Medical diagnosis:
- Rate of change in lab values (not just current value)
- Ratio of correlated biomarkers
- Days between symptoms and first visit
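As an illustration of the first item, a rate-of-change feature can be computed per patient by differencing consecutive measurements. The `labs` table, its column names, and the choice of creatinine as the lab value are assumptions for this sketch.

```python
import pandas as pd

# Hypothetical lab results; column names and values are illustrative
labs = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "test_date": pd.to_datetime(
        ["2024-01-01", "2024-01-11", "2024-01-21", "2024-01-01", "2024-01-05"]
    ),
    "creatinine": [1.0, 1.5, 1.5, 0.9, 0.7],
})

# Order measurements chronologically within each patient
labs = labs.sort_values(["patient_id", "test_date"])
grouped = labs.groupby("patient_id")

# Change since the patient's previous test, and the days elapsed
delta_value = grouped["creatinine"].diff()
delta_days = grouped["test_date"].diff().dt.days

# Rate of change per day; NaN for each patient's first measurement
labs["creatinine_change_per_day"] = delta_value / delta_days
```

Dividing by elapsed days matters when tests are irregularly spaced: the same absolute change over four days is a steeper trend than over ten.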
Text-Derived Features
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Manual features from text
    df['word_count'] = df['review'].str.split().str.len()
    df['char_count'] = df['review'].str.len()
    # regex=False treats '?' as a literal character, so no escaping is needed
    df['has_question'] = df['review'].str.contains('?', regex=False).astype(int)
    df['exclamation_count'] = df['review'].str.count('!')
    # Guard against empty reviews, where the mean of no words is undefined
    df['avg_word_length'] = df['review'].apply(
        lambda x: np.mean([len(w) for w in x.split()]) if x.split() else 0.0
    )

    # TF-IDF representation (returns a sparse matrix)
    tfidf = TfidfVectorizer(max_features=500, ngram_range=(1, 2))
    tfidf_features = tfidf.fit_transform(df['review'])

Automated Feature Engineering
Manual feature engineering is art and science. Tools automate some of it:
Featuretools: Deep Feature Synthesis — automatically creates features from relational tables.
    import featuretools as ft

    es = ft.EntitySet(id="ecommerce")
    es.add_dataframe(dataframe_name="transactions", dataframe=transactions, index="id")
    es.add_dataframe(dataframe_name="customers", dataframe=customers, index="customer_id")
    # Arguments: parent dataframe, parent column, child dataframe, child column
    es.add_relationship("customers", "customer_id", "transactions", "customer_id")

    # Returns the feature matrix and the list of feature definitions
    features, feature_names = ft.dfs(
        entityset=es,
        target_dataframe_name="transactions",
        max_depth=2,
    )

AutoFeat: Generates mathematical combinations of features and selects useful ones.
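The general idea of generating candidate combinations and then selecting among them can be sketched by hand with pandas. This is not AutoFeat's actual algorithm or API. It is a simplified illustration on synthetic data that ranks candidates by absolute correlation with the target, and the frame `X`, target `y`, and candidate names are all made up.

```python
import itertools

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "a": rng.uniform(1, 10, 300),
    "b": rng.uniform(1, 10, 300),
    "c": rng.uniform(1, 10, 300),
})
# Synthetic target: a ratio of two columns plus a little noise
y = X["a"] / X["b"] + rng.normal(0, 0.01, 300)

# Step 1: generate candidate features from simple mathematical transforms
candidates = {}
for col in X.columns:
    candidates[f"log({col})"] = np.log(X[col])
    candidates[f"{col}^2"] = X[col] ** 2
for left, right in itertools.permutations(X.columns, 2):
    candidates[f"{left}/{right}"] = X[left] / X[right]
for left, right in itertools.combinations(X.columns, 2):
    candidates[f"{left}*{right}"] = X[left] * X[right]
candidate_df = pd.DataFrame(candidates)

# Step 2: rank candidates by absolute correlation with the target
scores = candidate_df.corrwith(y).abs().sort_values(ascending=False)
top_features = scores.head(3).index.tolist()
print(scores.head(3))
```

Scoring candidates on the same rows used for training invites overfitting as the candidate pool grows, so in practice the selection step would normally use held-out data or cross-validation.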
The Feature Importance Feedback Loop
After training, use feature importance to inform your next engineering iteration:
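    from sklearn.ensemble import RandomForestClassifier
    import pandas as pd
    import matplotlib.pyplot as plt

    model = RandomForestClassifier()
    model.fit(X_train, y_train)

    # Impurity-based importances, indexed by feature name
    importance = pd.Series(model.feature_importances_, index=X_train.columns)
    importance.nlargest(20).plot(kind='barh', figsize=(10, 8))
    plt.title("Top 20 Feature Importances")
    plt.show()

Features with consistently low importance are usually cheap to remove and removing them can reduce noise, though it is worth confirming on a validation set that the score does not drop. Keep in mind that impurity-based importances from tree ensembles tend to favor high-cardinality features. High-importance engineered features suggest directions for further engineering.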
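One way to check a pruning decision empirically is to compute permutation importance on held-out data, retrain on the top-ranked features, and compare validation scores. The sketch below uses a synthetic dataset from `make_classification`. The feature names `f0` to `f5` and the cutoff of two features are arbitrary assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first two columns are the informative ones
X_arr, y = make_classification(
    n_samples=600, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Permutation importance on validation data: the drop in score when one
# column is shuffled, averaged over several repeats
perm = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
perm_importance = pd.Series(perm.importances_mean, index=X.columns)
perm_importance = perm_importance.sort_values(ascending=False)

# Retrain on the top-ranked features only and compare validation accuracy
keep = perm_importance.index[:2].tolist()
reduced = RandomForestClassifier(n_estimators=100, random_state=0)
reduced.fit(X_train[keep], y_train)

full_score = model.score(X_val, y_val)
reduced_score = reduced.score(X_val[keep], y_val)
print(f"all features: {full_score:.3f}  top {len(keep)}: {reduced_score:.3f}")
```

If the reduced model scores about as well as the full one, the dropped features were probably contributing little. A clear drop suggests the cutoff was too aggressive.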
Good feature engineering is iterative, domain-driven, and validated empirically. The best features come from deep understanding of the problem, not from automated searches alone.