Estimation
In the real world, an outcome (like wages) is rarely explained by just one factor. Simple regression (Chapter 2) has a major drawback: it is hard to argue that a single explanatory variable x is uncorrelated with all of the other (unobserved) factors that affect y.
We extend the simple model by adding more explanatory variables (x's):
y = β0 + β1x1 + β2x2 + ... + βkxk + u
Example: A Better Wage Equation
Instead of just education, we can control for experience:
wage = β0 + β1educ + β2exper + u
Here, we can measure the effect of education on wage, holding experience fixed.
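A minimal sketch of estimating this wage equation with OLS. The data, variable ranges, and "true" coefficients below are simulated assumptions used purely for illustration, not real estimates.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
educ = rng.uniform(8, 20, n)                    # years of education (assumed range)
exper = rng.uniform(0, 30, n)                   # years of experience (assumed range)
u = rng.normal(0, 2, n)                         # unobserved factors
wage = 1.0 + 0.6 * educ + 0.1 * exper + u       # assumed "true" model

X = sm.add_constant(np.column_stack([educ, exper]))  # columns: [1, educ, exper]
results = sm.OLS(wage, X).fit()
print(results.params)   # beta0_hat, beta1_hat (educ), beta2_hat (exper)
```

With many observations the estimated slopes land close to the assumed 0.6 and 0.1, and the coefficient on educ is interpreted holding exper fixed.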
In multiple regression, the interpretation of each slope coefficient is a partial effect.
For the model y = β0 + β1x1 + β2x2 + u, the coefficient β1 measures the change in y from a one-unit increase in x1 while x2 is held fixed (Δy = β1Δx1 when Δx2 = 0); β2 is interpreted analogously.
This is the power of multiple regression: it mathematically isolates the effect of each variable, even when they are correlated in the real world.
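The sketch below illustrates this "partialling out" idea with simulated data (an assumed data-generating process): the multiple-regression slope on x1 equals the slope from regressing y on the part of x1 that is not explained by x2, even though x1 and x2 are correlated.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)          # x1 and x2 are correlated by construction
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

# Full multiple regression of y on x1 and x2
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Step 1: regress x1 on x2 and keep the residuals (x1 with x2 "netted out")
r1 = sm.OLS(x1, sm.add_constant(x2)).fit().resid

# Step 2: regress y on those residuals -- the slope matches beta1_hat above
partial = sm.OLS(y, sm.add_constant(r1)).fit()

print(full.params[1], partial.params[1])    # essentially identical slopes
```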
You estimate the following model for house prices:
log(price) = β̂0 + 0.3 log(sqrft) + 0.05 bdrms
What is the correct interpretation of the coefficient 0.05 on `bdrms` (number of bedrooms)?
Holding square footage (`sqrft`) constant, one additional bedroom is associated with an approximate 5% increase in house price.
(Because the dependent variable `log(price)` is in logs while `bdrms` is in levels, the coefficient is a semi-elasticity: multiplying by 100 gives the approximate percentage change in price, 100 × 0.05 = 5%.)
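A quick numeric check of that approximation, using the 0.05 coefficient from the question: 100·β gives the approximate percent change, while 100·(exp(β) − 1) gives the exact one.

```python
import numpy as np

beta_bdrms = 0.05
approx = 100 * beta_bdrms                  # ~5.00% (log approximation)
exact = 100 * (np.exp(beta_bdrms) - 1)     # ~5.13% (exact percentage change)
print(f"approx: {approx:.2f}%, exact: {exact:.2f}%")
```

For small coefficients the two are nearly identical, which is why the 100·β shortcut is standard.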
What happens if we leave out a variable that truly belongs in the model?
Suppose the true model is y = β0 + β1x1 + β2x2 + u, but we only estimate y = β̃0 + β̃1x1 + v.
Our estimate β̃1 will be biased if the omitted variable (x2) is correlated with our included variable (x1).
Bias(β̃1) = β2 δ̃1, where δ̃1 is the slope from regressing the omitted x2 on the included x1 (its sign matches the sign of Corr(x1, x2)).
The bias is the product of the true effect of the omitted variable on y (β2) and the relationship between the included and omitted variables.
| | Corr(x1, x2) > 0 | Corr(x1, x2) < 0 |
|---|---|---|
| β2 > 0 | Positive Bias | Negative Bias |
| β2 < 0 | Negative Bias | Positive Bias |
You run a simple regression of wage on educ, but you omit the variable ability. Ability almost certainly has a positive effect on wage (β2 > 0) and is positively correlated with educ, so by the table above the estimated return to education suffers from positive (upward) bias.
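A simulated illustration of this bias, under an assumed data-generating process in which ability raises wages and is positively correlated with education. Leaving ability out pushes the education coefficient above its true value.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5000
ability = rng.normal(size=n)
educ = 12 + 2 * ability + rng.normal(size=n)      # educ positively correlated with ability
wage = 1 + 0.5 * educ + 1.0 * ability + rng.normal(size=n)   # true educ effect = 0.5

short = sm.OLS(wage, sm.add_constant(educ)).fit()             # omits ability
long = sm.OLS(wage, sm.add_constant(np.column_stack([educ, ability]))).fit()

print("short (biased) educ slope:", short.params[1])   # noticeably above 0.5
print("long (unbiased) educ slope:", long.params[1])   # close to 0.5
```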
The variance of an OLS estimator tells us about its precision (a smaller variance is better). For any slope coefficient β̂j, the variance is Var(β̂j) = σ² / [SSTj(1 − Rj²)], which depends on three ingredients (a numerical sketch follows the list below):
σ² (Error Variance): More "noise" or unexplained variation in the model increases the variance.
SSTj (Total Sample Variation in xj): More variation in our explanatory variable decreases the variance. More data helps!
Rj² (Multicollinearity): This is the R-squared from regressing xj on all the other x's. If the other variables explain xj well (Rj² is high), the variance increases.
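A small numeric sketch of the formula Var(β̂j) = σ² / [SSTj(1 − Rj²)], using made-up values for σ² and SSTj to show how each piece moves the variance as Rj² grows.

```python
sigma2 = 4.0      # error variance: more noise -> larger variance
sst_j = 500.0     # total sample variation in x_j: more variation -> smaller variance

for r2_j in (0.0, 0.5, 0.9, 0.99):   # R-squared from regressing x_j on the other x's
    var_bj = sigma2 / (sst_j * (1 - r2_j))
    print(f"R_j^2 = {r2_j:.2f}  ->  Var(beta_j_hat) = {var_bj:.4f}")
```

Holding σ² and SSTj fixed, pushing Rj² from 0 to 0.99 multiplies the variance by 100.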
Multicollinearity occurs when two or more independent variables in a regression are highly correlated. It doesn't violate any core assumptions, but it can be a practical problem.
The term 1 / (1 − Rj²) is called the Variance Inflation Factor (VIF). Look what happens as Rj² gets close to 1:
As Rj² (the R-squared from regressing xj on the other explanatory variables) approaches 1, the VIF shoots toward infinity, inflating the variance of β̂j and making the estimate very imprecise.
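A sketch of that explosion on simulated data: as an assumed correlation between x1 and x2 is pushed toward 1, the R-squared from regressing x1 on x2 rises and the VIF blows up.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000
x2 = rng.normal(size=n)

for rho in (0.0, 0.5, 0.9, 0.99):
    # build x1 with (approximately) correlation rho with x2
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    r2_j = sm.OLS(x1, sm.add_constant(x2)).fit().rsquared   # R_j^2
    vif = 1 / (1 - r2_j)
    print(f"corr ~ {rho:.2f}:  R_j^2 = {r2_j:.3f},  VIF = {vif:.1f}")
```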
You are trying to explain a person's weight and include the following variables in your regression:
weight = β0 + β1height_inches + β2height_cm + u
Why will your software likely give you an error? Which assumption is violated?
This is a case of perfect multicollinearity. Since height in centimeters is just height in inches multiplied by a constant (2.54), one variable is an exact linear function of the other.
This violates Assumption MLR.3 (No Perfect Collinearity). The model cannot be estimated because it's impossible to distinguish the individual effects of `height_inches` and `height_cm` when they always move together perfectly.
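A sketch of why estimation breaks down, with simulated heights: once `height_cm` is an exact multiple of `height_inches`, the design matrix loses full column rank, so (X'X) is singular and the OLS formula has no unique solution.

```python
import numpy as np

rng = np.random.default_rng(4)
height_inches = rng.uniform(58, 78, size=100)
height_cm = 2.54 * height_inches                 # exact linear function of height_inches

X = np.column_stack([np.ones(100), height_inches, height_cm])
print(np.linalg.matrix_rank(X))                  # 2, not 3: the columns are linearly dependent
print(np.linalg.cond(X.T @ X))                   # enormous (or inf): X'X is numerically singular
```

Dropping either one of the two height variables restores full rank and the model estimates normally.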
The Gauss-Markov theorem gives us a powerful reason to use OLS. It states that under the five Gauss-Markov assumptions (MLR.1 - MLR.5):
The OLS estimator is the Best Linear Unbiased Estimator (BLUE).
Best: it has the smallest variance among all linear unbiased estimators.
Linear: it is a linear function of the dependent variable, y.
Unbiased: on average, it gives you the right answer (E(β̂j) = βj).
In short: if the assumptions hold, you can't find a better linear unbiased estimator than OLS.
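A small Monte Carlo sketch of the "unbiased" part, under an assumed simple data-generating process: across many random samples drawn from the same model, the OLS slope estimates average out to the true coefficient.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
true_beta1 = 1.5
estimates = []
for _ in range(2000):                                # 2000 simulated samples
    x = rng.normal(size=200)
    y = 0.5 + true_beta1 * x + rng.normal(size=200)
    estimates.append(sm.OLS(y, sm.add_constant(x)).fit().params[1])

print(np.mean(estimates))   # very close to the true value of 1.5
```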
You've now mastered the fundamentals of multiple regression!