The core idea
Real-world data is categorical, conditional, and easy to overfit. Dummy variables encode categories as numeric indicators a regression can use. Interaction terms let a relationship differ by group: temperature might matter more in summer than in winter. And the train/validate/test split is the one discipline that separates models that fit yesterday's data from models that predict tomorrow's. — after Hastie, Tibshirani & Friedman
The hero diagram
Training vs validation error.
Training error (dashed) always falls as model complexity grows. Validation error (solid) falls, then rises once the model starts fitting noise. Stop at the bottom of the U.
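The U-shape can be reproduced numerically. A minimal NumPy sketch using polynomial degree as the complexity axis; the signal, noise level, and degree range are illustrative assumptions, not from the text:

```python
import numpy as np

# Illustrative setup (assumed): a smooth signal plus noise.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 3, 25))
x_val = np.sort(rng.uniform(0, 3, 25))
y_train = np.sin(2 * x_train) + rng.normal(0, 0.4, x_train.size)
y_val = np.sin(2 * x_val) + rng.normal(0, 0.4, x_val.size)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

degrees = list(range(1, 13))
train_err, val_err = [], []
for d in degrees:
    c = np.polyfit(x_train, y_train, d)         # fit on training data only
    train_err.append(mse(c, x_train, y_train))  # falls as d grows
    val_err.append(mse(c, x_val, y_val))        # falls, then rises

best_degree = degrees[int(np.argmin(val_err))]  # bottom of the U
```

Plotting `train_err` (dashed) and `val_err` (solid) against `degrees` reproduces the diagram; `best_degree` marks where to stop.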
The tools on the bench
Ideas that pay rent.
How to apply
Making a model generalise.
- Add dummies for all categorical features. Region, product line, month, gender. Drop one level per feature as the baseline to avoid the dummy-variable trap.
- Include interactions where business logic demands. "Temperature effect differs by season" — let the slope differ.
- Hold out a validation set. 20% of the data is typical. Never touch it during fitting; use it only to compare candidate models.
- Report on the test set once. That is the number that predicts real-world performance.
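The four steps above can be sketched end to end. A hypothetical NumPy/pandas example: the sales/temperature/season data, column names, and the 60/20/20 split are all invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "temperature": rng.uniform(0, 35, n),
    "season": rng.choice(["summer", "winter"], n),
})
# Hypothetical outcome: the temperature slope differs by season
# (3.0 in summer, 0.5 in winter), plus noise.
df["sales"] = np.where(df["season"] == "summer", 3.0, 0.5) * df["temperature"] \
              + rng.normal(0, 2, n)

# 1. Dummies: encode the categorical feature, dropping one baseline level.
X = pd.get_dummies(df[["temperature", "season"]],
                   columns=["season"], drop_first=True)
# 2. Interaction: let the temperature slope differ by season.
X["temp_x_winter"] = X["temperature"] * X["season_winter"]

# 3. Hold out validation and test sets before any fitting: 60/20/20.
idx = rng.permutation(n)
train, val, test = idx[:60], idx[60:80], idx[80:]

# Fit ordinary least squares on the training rows only.
A = np.column_stack([np.ones(train.size), X.values[train].astype(float)])
coef, *_ = np.linalg.lstsq(A, df["sales"].to_numpy()[train], rcond=None)
# coef = [intercept, temperature, season_winter, temp_x_winter]
```

The fitted `temp_x_winter` coefficient recovers the slope difference between seasons. Step 4, the single test-set report, would use the `test` rows exactly once, after model selection on `val`.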
Key reading · Session 4 · Christodoulou
The bias-variance trade-off.
Every model balances bias (missing the signal) against variance (fitting the noise). Adding complexity reduces bias but increases variance. The best model is not the most accurate on training data; it is the one that generalises best to new data.
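The trade-off can be simulated directly. A sketch with assumed specifics (truth y = x², noise 0.3, degrees 1 and 6): refit each model on many resampled training sets and measure, at one point, how far the average prediction misses (bias) and how much predictions scatter (variance):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 15)
x0, truth = 0.8, 0.8 ** 2            # evaluation point and its true value

preds = {1: [], 6: []}               # degree 1: too simple; degree 6: too flexible
for _ in range(300):                 # 300 fresh noisy training sets
    y = x ** 2 + rng.normal(0, 0.3, x.size)
    for d in preds:
        preds[d].append(np.polyval(np.polyfit(x, y, d), x0))

bias = {d: abs(float(np.mean(p)) - truth) for d, p in preds.items()}  # systematic miss
var = {d: float(np.var(p)) for d, p in preds.items()}                 # spread across fits
```

On this setup the degree-1 model misses the signal (large bias, small variance) while the degree-6 model tracks the noise (small bias, large variance); the model that generalises best sits between the two.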
Training error lies. Validation error tells the truth.