Ebrahim Alhamed · Frameworks Library

m.04 · II · Inference & Prediction · Predictive Analytics II

Dummies, Interactions & Validation

The shift from explaining yesterday to predicting tomorrow.

Real data is categorical, conditional, and easy to overfit. Dummy variables encode categories into the regression. Interaction terms let a relationship differ by group — temperature might matter more in summer than winter. And the train/validate/test split is the one discipline that separates models that fit yesterday from models that predict tomorrow. — after Hastie, Tibshirani & Friedman

Training vs validation error.

Training error (dashed) always falls. Validation error (solid) falls, then rises. Stop at the bottom of the U.

[Figure: U-curve. Training vs validation error against model complexity: training error falls continuously while validation error forms a U-shape, with a sweet spot marking its minimum. Axes: model complexity (x) vs error (y); series: training, validation.]
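
A minimal sketch of how this curve arises, assuming scikit-learn and synthetic data: sweep polynomial degree as the complexity axis and watch training error fall while validation error turns back up.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, 200).reshape(-1, 1)
    y = np.sin(x).ravel() + rng.normal(0, 0.3, 200)   # noisy signal
    x_tr, x_va, y_tr, y_va = train_test_split(x, y, test_size=0.3, random_state=0)

    for degree in range(1, 13):                       # the complexity axis
        poly = PolynomialFeatures(degree)
        model = LinearRegression().fit(poly.fit_transform(x_tr), y_tr)
        e_tr = mean_squared_error(y_tr, model.predict(poly.transform(x_tr)))
        e_va = mean_squared_error(y_va, model.predict(poly.transform(x_va)))
        print(f"degree {degree:2d}  train {e_tr:.3f}  validation {e_va:.3f}")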

Ideas that pay rent.

Dummy Variables · Categorical encoding
k categories → k−1 binary variables · reference category = all zeros · coefficient = gap vs reference
Interpret dummy coefficients as "difference from the reference group, other things equal".
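
A minimal sketch with pandas, on hypothetical toy data: three regions become two binaries, and the dropped category becomes the all-zeros reference.

    import pandas as pd

    df = pd.DataFrame({"region": ["North", "South", "East", "North"],
                       "sales": [120, 95, 110, 130]})

    # k = 3 categories -> k-1 = 2 binaries. drop_first drops "East"
    # (alphabetically first), so East is the reference: its rows are all
    # zeros, and each dummy coefficient in a regression reads as the gap
    # vs East, other things equal.
    dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True, dtype=int)
    print(dummies)
    #    region_North  region_South
    # 0             1             0
    # 1             0             1
    # 2             0             0
    # 3             1             0
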
Interaction Terms · Conditional effects
x₁ × x₂ · effect of x₁ depends on x₂
Include interactions when business logic says one variable modulates another.
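
A minimal sketch of "temperature matters more in summer", assuming statsmodels and simulated data; the temp:summer coefficient is the extra slope summer adds.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 200
    df = pd.DataFrame({"temp": rng.uniform(0, 35, n),
                       "summer": rng.integers(0, 2, n)})
    # True model: slope on temp is 2 in winter and 2 + 3 = 5 in summer.
    df["sales"] = (10 + 2 * df["temp"] + 3 * df["temp"] * df["summer"]
                   + rng.normal(0, 5, n))

    # "temp * summer" expands to temp + summer + temp:summer;
    # temp:summer estimates how much the temperature slope shifts in summer.
    fit = smf.ols("sales ~ temp * summer", data=df).fit()
    print(fit.params)
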
Overfit vs Underfit · Bias-variance tradeoff
Underfit: high bias, misses signal · Overfit: high variance, fits noise
Optimise validation error, never training error.
Train / Validate / Test · Model selection
Train ~60%: fit · Validate ~20%: tune · Test ~20%: final report
Never touch the test set until the end.
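
A minimal sketch of the 60/20/20 split with scikit-learn, done as two chained splits (0.25 of the remaining 80% is 20% of the whole):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.arange(100).reshape(-1, 1), np.arange(100)

    # First carve off the untouchable 20% test set...
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.20, random_state=0)
    # ...then split the remaining 80% into 60% train / 20% validate.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=0)

    print(len(X_train), len(X_val), len(X_test))   # 60 20 20
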
Lift & Gain Charts · Predictive evaluation
sort by predicted score · measure the positive rate in the top decile · lift = top-decile rate / base rate
Lift > 1 means your model beats random.
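
A minimal sketch of top-decile lift from raw scores and 0/1 outcomes, using NumPy only; the data here is simulated so that the score genuinely carries signal.

    import numpy as np

    def top_decile_lift(scores, outcomes):
        """Positive rate in the best-scored 10% divided by the overall base rate."""
        scores, outcomes = np.asarray(scores), np.asarray(outcomes)
        order = np.argsort(scores)[::-1]          # highest scores first
        k = max(1, len(scores) // 10)             # size of the top decile
        captured = outcomes[order[:k]].mean()     # positive rate in top decile
        return captured / outcomes.mean()         # > 1 beats random targeting

    rng = np.random.default_rng(2)
    scores = rng.uniform(size=1000)
    outcomes = (rng.uniform(size=1000) < 0.3 * scores).astype(int)
    print(f"top-decile lift: {top_decile_lift(scores, outcomes):.2f}")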

Making a model generalise.

  1. Add dummies for all categorical features. Region, product line, month, gender.
  2. Include interactions where business logic demands. "Temperature effect differs by season" — let the slope differ.
  3. Hold out a validation set. 20% of the data. Never touch it during fitting.
  4. Report on the test set once. That is the number that predicts real-world performance. A sketch combining all four steps follows this list.
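
A minimal end-to-end sketch of the four steps, assuming pandas and scikit-learn; the column names, the Ridge model, and the alpha grid are illustrative choices, not prescribed by the session.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(3)
    n = 500
    df = pd.DataFrame({"temp": rng.uniform(0, 35, n),
                       "season": rng.choice(["winter", "summer"], n),
                       "region": rng.choice(["North", "South", "East"], n)})
    df["sales"] = (20 + 1.5 * df["temp"]
                   + 2.5 * df["temp"] * (df["season"] == "summer")
                   + rng.normal(0, 8, n))

    # Step 1: dummies for every categorical feature (k-1 each via drop_first).
    X = pd.get_dummies(df[["temp", "season", "region"]], drop_first=True, dtype=float)
    # Step 2: interaction where business logic demands a season-specific slope.
    X["temp_x_summer"] = df["temp"] * (df["season"] == "summer").astype(float)
    y = df["sales"]

    # Step 3: hold out validation and test sets (60/20/20); tune on validation only.
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

    best_alpha, best_err = None, np.inf
    for alpha in [0.01, 0.1, 1.0, 10.0]:
        err = mean_squared_error(y_val, Ridge(alpha=alpha).fit(X_tr, y_tr).predict(X_val))
        if err < best_err:
            best_alpha, best_err = alpha, err

    # Step 4: refit on train + validate, then touch the test set exactly once.
    final = Ridge(alpha=best_alpha).fit(pd.concat([X_tr, X_val]), pd.concat([y_tr, y_val]))
    print("test MSE:", mean_squared_error(y_test, final.predict(X_test)))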

Key reading · Session 4 · Christodoulou

The bias-variance trade-off.

Every model balances bias (missing the signal) against variance (fitting the noise). Adding complexity reduces bias but increases variance. The best model is not the most accurate on training data; it is the one that generalises best to new data.
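
For squared error this balance has an exact form. With true function f, a model f̂ fitted over repeated training samples, and irreducible noise σ², the expected prediction error at a point x decomposes as:

    \mathbb{E}\big[(y - \hat{f}(x))^2\big]
      = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
      + \underbrace{\operatorname{Var}\big[\hat{f}(x)\big]}_{\text{variance}}
      + \underbrace{\sigma^2}_{\text{irreducible noise}}

Only the first two terms are under the modeller's control, and they move in opposite directions as complexity grows.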

Training error lies. Validation error tells the truth.
