The core idea
Real-world data is categorical, conditional, and easy to overfit. Dummy variables encode categories as numeric indicators a regression can use. Interaction terms let a relationship differ by group: temperature might matter more in summer than in winter. And the train/validate/test split is the one discipline that separates models that fit yesterday's data from models that predict tomorrow's. — after Hastie, Tibshirani & Friedman
The hero diagram
Training vs validation error.
Training error (dashed) always falls as model complexity grows. Validation error (solid) falls, then rises once the model starts fitting noise. Stop at the bottom of the U.
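The U-shape can be reproduced numerically. A minimal NumPy sketch using polynomial degree as the complexity axis; the signal, noise level, and degree range are illustrative assumptions, not from the text:

```python
import numpy as np

# Illustrative setup (assumed): a smooth signal plus noise.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 3, 25))
x_val = np.sort(rng.uniform(0, 3, 25))
y_train = np.sin(2 * x_train) + rng.normal(0, 0.4, x_train.size)
y_val = np.sin(2 * x_val) + rng.normal(0, 0.4, x_val.size)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

degrees = list(range(1, 13))
train_err, val_err = [], []
for d in degrees:
    c = np.polyfit(x_train, y_train, d)         # fit on training data only
    train_err.append(mse(c, x_train, y_train))  # falls as d grows
    val_err.append(mse(c, x_val, y_val))        # falls, then rises

best_degree = degrees[int(np.argmin(val_err))]  # bottom of the U
```

Plotting `train_err` (dashed) and `val_err` (solid) against `degrees` reproduces the diagram; `best_degree` marks where to stop.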
The tools on the bench
Ideas that pay rent.
How to apply
Making a model generalise.
- Add dummies for all categorical features. Region, product line, month, gender. Drop one level per feature as the baseline to avoid the dummy-variable trap.
- Include interactions where business logic demands. "Temperature effect differs by season" — let the slope differ.
- Hold out a validation set. 20% of the data is typical. Never touch it during fitting; use it only to compare candidate models.
- Report on the test set once. That is the number that predicts real-world performance.
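The four steps above can be sketched end to end. A hypothetical NumPy/pandas example: the sales/temperature/season data, column names, and the 60/20/20 split are all invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "temperature": rng.uniform(0, 35, n),
    "season": rng.choice(["summer", "winter"], n),
})
# Hypothetical outcome: the temperature slope differs by season
# (3.0 in summer, 0.5 in winter), plus noise.
df["sales"] = np.where(df["season"] == "summer", 3.0, 0.5) * df["temperature"] \
              + rng.normal(0, 2, n)

# 1. Dummies: encode the categorical feature, dropping one baseline level.
X = pd.get_dummies(df[["temperature", "season"]],
                   columns=["season"], drop_first=True)
# 2. Interaction: let the temperature slope differ by season.
X["temp_x_winter"] = X["temperature"] * X["season_winter"]

# 3. Hold out validation and test sets before any fitting: 60/20/20.
idx = rng.permutation(n)
train, val, test = idx[:60], idx[60:80], idx[80:]

# Fit ordinary least squares on the training rows only.
A = np.column_stack([np.ones(train.size), X.values[train].astype(float)])
coef, *_ = np.linalg.lstsq(A, df["sales"].to_numpy()[train], rcond=None)
# coef = [intercept, temperature, season_winter, temp_x_winter]
```

The fitted `temp_x_winter` coefficient recovers the slope difference between seasons. Step 4, the single test-set report, would use the `test` rows exactly once, after model selection on `val`.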
Key reading · Session 4 · Christodoulou
The bias-variance trade-off.
Every model balances bias (missing the signal) against variance (fitting the noise). Adding complexity reduces bias but increases variance. The best model is not the most accurate on training data; it is the one that generalises best to new data.
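The trade-off can be simulated directly. A sketch with assumed specifics (truth y = x², noise 0.3, degrees 1 and 6): refit each model on many resampled training sets and measure, at one point, how far the average prediction misses (bias) and how much predictions scatter (variance):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 15)
x0, truth = 0.8, 0.8 ** 2            # evaluation point and its true value

preds = {1: [], 6: []}               # degree 1: too simple; degree 6: too flexible
for _ in range(300):                 # 300 fresh noisy training sets
    y = x ** 2 + rng.normal(0, 0.3, x.size)
    for d in preds:
        preds[d].append(np.polyval(np.polyfit(x, y, d), x0))

bias = {d: abs(float(np.mean(p)) - truth) for d, p in preds.items()}  # systematic miss
var = {d: float(np.var(p)) for d, p in preds.items()}                 # spread across fits
```

On this setup the degree-1 model misses the signal (large bias, small variance) while the degree-6 model tracks the noise (small bias, large variance); the model that generalises best sits between the two.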
Training error lies. Validation error tells the truth.