Ebrahim Alhamed · Frameworks Library

m.03 · II · Inference & Prediction · Predictive Analytics I

Regression & Correlation

Correlation is direction; regression is magnitude.

Correlation tells you whether two variables move together, and how tightly. Regression quantifies by how much: the change in Y per unit of X, and in which direction. The regression coefficient is the single most useful number in business analytics: "for every unit change in X, Y changes by b units, holding the other variables in the model constant." With that, you can predict, compare, and (if you controlled for the right variables) start to argue about causation. (After Galton & Fisher.)

Scatter, line and residual.

The line is what you fit. The residuals are what you missed.

[Figure: scatter plot of y (salary) against x (years of experience) with an OLS regression line and sample residuals highlighted, showing a positive trend.]

Ideas that pay rent.

Simple Linear Regression · OLS
y = a + b·x + ε · intercept a · slope b · residual ε
Minimise the sum of squared residuals. Slope = change in y per unit x.
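
As a minimal sketch of the card above, the closed-form OLS fit for one predictor can be written in plain Python. The function names and the salary figures (in thousands) are made up for illustration; they are not taken from the course material.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept a, slope b) that minimise the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    b = sxy / sxx             # slope: change in y per unit x
    a = mean_y - b * mean_x   # intercept: the line passes through the means
    return a, b


def residuals(xs, ys, a, b):
    """What the line missed: observed y minus fitted y."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical data: years of experience and salary (thousands).
years = [1, 2, 3, 4, 5]
salary = [32, 35, 41, 44, 48]

a, b = fit_simple_ols(years, salary)
eps = residuals(years, salary, a, b)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
print("residuals:", [round(e, 2) for e in eps])
```

With an intercept in the model, the residuals always sum to zero (up to rounding), which is a quick sanity check on any hand-rolled fit.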
Correlation · Pearson
r ∈ [−1, +1] · direction + strength of a linear relationship
Correlation is not causation. Never.
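
One way to see the split between correlation and regression is to rescale y: the slope changes with the units, Pearson's r does not. The sketch below assumes small invented samples; `pearson_r` and `ols_slope` are illustrative helper names.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sxy / math.sqrt(sxx * syy)


def ols_slope(xs, ys):
    """Slope of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical sample, then the same y expressed in units ten times smaller.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
ys_rescaled = [10 * y for y in ys]

r_original = pearson_r(xs, ys)
r_rescaled = pearson_r(xs, ys_rescaled)
slope_original = ols_slope(xs, ys)
slope_rescaled = ols_slope(xs, ys_rescaled)

print(f"r:     {r_original:.3f} vs {r_rescaled:.3f}")        # unchanged
print(f"slope: {slope_original:.3f} vs {slope_rescaled:.3f}")  # ten times larger
```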
Multiple Regression · Multivariate OLS
control for confounders · each coefficient isolates one effect · adjusted R² penalises complexity
Add variables for business logic, not for R².
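
A rough sketch of multivariate OLS, assuming the normal equations (XᵀX)β = Xᵀy are solved directly by Gaussian elimination. In practice a library routine such as `numpy.linalg.lstsq` is the more robust choice; the helper names and the noise-free data here are invented so the recovered coefficients are easy to check.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix · beta = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution on the upper-triangular system.
    beta = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * beta[k] for k in range(row + 1, n))
        beta[row] = (aug[row][n] - known) / aug[row][row]
    return beta


def fit_ols(rows, ys):
    """rows: one list of predictor values per observation.
    Returns [intercept, b1, b2, ...]."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Invented data generated exactly as y = 1 + 2·x1 − 3·x2 (no noise).
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
outcome = [1 + 2 * x1 - 3 * x2 for x1, x2 in predictors]

coefficients = fit_ols(predictors, outcome)
print([round(c, 6) for c in coefficients])
```

Each slope is read with the other predictor held fixed: here a one-unit rise in x1 moves y by about 2 for any given x2.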
Model Quality Checklist · Regression diagnostics
Adjusted R² · p-values and t-stats · coefficient signs · residual patterns
High R² + sensible signs + random residuals = a model you can defend.
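
To make the first checklist item concrete, here is a small sketch of R² and adjusted R² computed from observed and fitted values, using the standard adjustment 1 − (1 − R²)(n − 1)/(n − p − 1). The function names and the four data points are illustrative assumptions.

```python
def r_squared(actual, fitted):
    """Share of the variation in y that the fitted values account for."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - f) ** 2 for y, f in zip(actual, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    if ss_tot == 0:
        raise ValueError("y has no variation, so R² is undefined")
    return 1 - ss_res / ss_tot


def adjusted_r_squared(actual, fitted, n_predictors):
    """R² penalised for the number of predictors (intercept not counted)."""
    n = len(actual)
    if n - n_predictors - 1 <= 0:
        raise ValueError("not enough observations for this many predictors")
    r2 = r_squared(actual, fitted)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Made-up observed and fitted values.
observed = [1, 2, 3, 4]
predicted = [1.5, 1.5, 3.5, 3.5]

plain = r_squared(observed, predicted)
adj_one = adjusted_r_squared(observed, predicted, n_predictors=1)
adj_two = adjusted_r_squared(observed, predicted, n_predictors=2)
print(f"R² = {plain:.2f}, adjusted (1 predictor) = {adj_one:.2f}, "
      f"adjusted (2 predictors) = {adj_two:.2f}")
```

The same fit scores lower once it is attributed to two predictors instead of one, which is the penalty for complexity the checklist refers to.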

Building a usable regression.

  1. Plot before you fit. Scatter plots reveal non-linearity, outliers, and clustering that equations hide.
  2. Start simple. One predictor, then add. Stop adding when Adjusted R² stops rising.
  3. Check the signs. If a coefficient's sign contradicts business sense, look for a confounder.
  4. Look at residuals. Patterns in residuals mean the model is missing something important.
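
Steps 1 and 4 can be illustrated with a deliberately wrong model: a straight line fitted to data that is actually curved. This is a sketch on invented data (y = x²), with `fit_line` as a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Invented, noise-free curved data: y = x squared.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
leftover = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Print residuals in x order; a systematic shape signals a missing term.
for x, e in zip(xs, leftover):
    print(f"x = {x}: residual = {e:+.1f}")
```

The residuals come out positive at both ends and negative in the middle. That U shape is the pattern step 4 warns about: the line is missing a curvature term, something a scatter plot would have shown before fitting.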

Key reading · Session 3 · Christodoulou

Confounding and omitted-variable bias.

If you regress exam score on study time alone, you might see a negative relationship, because smart students study less and still score higher. Add IQ as a control, and the real (positive) relationship appears. This is why multiple regression is essential for any causal claim drawn from observational data, and why the claim is only as strong as the controls you included.
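
The sign flip described above can be reproduced on a toy data set. The sketch below invents noise-free numbers in which each extra study hour adds 2 points and higher-IQ students study less. It controls for IQ by "partialling out": the IQ-explained part is removed from both score and hours, and the leftovers are regressed on each other, which yields the same study-time coefficient as a multiple regression that includes IQ. All names and values are assumptions for illustration.

```python
def slope_and_intercept(xs, ys):
    """Least-squares slope and intercept of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return b, mean_y - b * mean_x


def residualise(target, control):
    """Remove the part of target that a straight line in control explains."""
    b, a = slope_and_intercept(control, target)
    return [t - (a + b * c) for t, c in zip(target, control)]


# Invented students: higher IQ goes with fewer typical study hours.
iq, hours, score = [], [], []
for iq_level, typical_hours in [(90, 12), (100, 10), (110, 8), (120, 6), (130, 4)]:
    for extra in (-1, 0, 1):
        h = typical_hours + extra
        iq.append(iq_level)
        hours.append(h)
        # Assumed true model: +2 points per study hour, +0.8 per IQ point.
        score.append(2 * h + 0.8 * iq_level - 40)

# Naive model: score on study time alone.
naive_slope, _ = slope_and_intercept(hours, score)

# Controlled model: compare students after removing what IQ explains.
hours_given_iq = residualise(hours, iq)
score_given_iq = residualise(score, iq)
controlled_slope, _ = slope_and_intercept(hours_given_iq, score_given_iq)

print(f"study-time slope without IQ: {naive_slope:+.2f}")       # negative
print(f"study-time slope with IQ:    {controlled_slope:+.2f}")  # close to +2
```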

Correlation without controls is a ghost story.
