Chapter 3: Multiple Regression Analysis

Estimation

Why Do We Need More Than One 'x'?

In the real world, an outcome (like wages) is rarely explained by just one factor. Simple regression (Chapter 2) has a major drawback: it is hard to defend the claim that your single 'x' is uncorrelated with all the other factors affecting y, because everything you leave out ends up in the error term.

The Power of Multiple Regression:

  • It allows us to control for other observed factors by including them in the model.
  • This helps us get closer to a ceteris paribus interpretation (holding other factors fixed).
  • This is crucial for inferring causality when we can't run a controlled experiment.

The Multiple Regression Model

We just extend the simple model by adding more explanatory variables (x's):

y = β0 + β1x1 + β2x2 + ... + βkxk + u

Example: A Better Wage Equation

Instead of just education, we can control for experience:

wage = β0 + β1educ + β2exper + u

Here, we can measure the effect of education on wage, holding experience fixed.
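
To make this concrete, here is a minimal sketch that fits such a model with statsmodels on simulated data (every number below is invented for illustration, not taken from any real wage dataset):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1_000
educ = rng.normal(13, 2, n)                  # years of education
exper = rng.normal(10, 4, n)                 # years of experience
u = rng.normal(0, 2, n)                      # unobserved factors
wage = 1.0 + 0.6 * educ + 0.1 * exper + u    # "true" model (made-up coefficients)

# OLS with both regressors, plus an intercept
X = sm.add_constant(np.column_stack([educ, exper]))
print(sm.OLS(wage, X).fit().params)          # close to [1.0, 0.6, 0.1]
```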

The Magic Words: "Holding Other Factors Fixed"

In multiple regression, the interpretation of each slope coefficient is a partial effect.

For the model: y = β0 + β1x1 + β2x2 + u

  • β1 measures the change in y for a one-unit change in x1, holding x2 constant.
  • β2 measures the change in y for a one-unit change in x2, holding x1 constant.

This is the power of multiple regression: it mathematically isolates the effect of each variable, even when they are correlated in the real world.
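
This claim is easy to check numerically. In the sketch below (all coefficients invented), x2 is built to be strongly correlated with x1, yet OLS still recovers each partial effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)           # Corr(x1, x2) is about 0.62
y = 2.0 + 3.0 * x1 - 1.5 * x2 + rng.normal(size=n)

# OLS via least squares on [1, x1, x2]
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to [2.0, 3.0, -1.5] despite the correlation
```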

Check Your Understanding

You estimate the following model for house prices:

log(price) = β̂0 + 0.3 log(sqrft) + 0.05 bdrms

What is the correct interpretation of the coefficient 0.05 on `bdrms` (number of bedrooms)?

Answer:

Holding square footage (`sqrft`) constant, one additional bedroom is associated with an approximate 5% increase in house price.

(Because the dependent variable `log(price)` is in logs, we interpret the coefficient as a percentage change: 100 * 0.05 = 5%).
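
The 100 × β̂ reading is only an approximation; the exact percentage change implied by a log model is 100 × (e^β̂ − 1), which is very close for small coefficients. A quick check:

```python
import numpy as np

b = 0.05                        # coefficient on bdrms
print(100 * b)                  # approximate effect: 5.0%
print(100 * (np.exp(b) - 1))    # exact effect: about 5.13%
```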

The Big Danger: Omitted Variable Bias

What happens if we leave out a variable that truly belongs in the model?

Suppose the true model is y = β0 + β1x1 + β2x2 + u, but we only estimate y = β̃0 + β̃1x1 + v.

Our estimate β̃1 will be biased if the omitted variable x2 (a) truly affects y (β2 ≠ 0) and (b) is correlated with our included variable x1.

Bias(β̃1) = β2 δ̃1

where δ̃1 is the slope from a simple regression of x2 on x1; its sign matches the sign of Corr(x1, x2).

The bias is the product of the true effect of the omitted variable on y (β2) and the strength of the relationship between the included and omitted variables (δ̃1).

            Corr(x1, x2) > 0    Corr(x1, x2) < 0
β2 > 0      Positive Bias       Negative Bias
β2 < 0      Negative Bias       Positive Bias
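
The sign logic above is easy to verify by simulation (a sketch with invented numbers; here β2 > 0 and Corr(x1, x2) > 0, so we expect positive bias):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)                    # Corr(x1, x2) > 0
y = 1.0 + 2.0 * x1 + 4.0 * x2 + rng.normal(size=n)    # beta1 = 2, beta2 = 4

# Short regression: omit x2
X_short = np.column_stack([np.ones(n), x1])
b_tilde = np.linalg.lstsq(X_short, y, rcond=None)[0][1]

# delta1: slope from regressing x2 on x1 (close to 0.5 by construction)
delta1 = np.linalg.lstsq(X_short, x2, rcond=None)[0][1]

print(b_tilde)             # close to 2.0 + 4.0 * 0.5 = 4.0 (positive bias)
print(2.0 + 4.0 * delta1)  # matches: beta1 + beta2 * delta1
```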

Guess the Bias!

You run a simple regression of wage on educ, but you omit the variable ability.

  1. What is the likely sign of the effect of `ability` on `wage`? (β2)
  2. What is the likely correlation between `educ` and `ability`? (Corr(x1, x2))
  3. Based on this, what is the direction of the omitted variable bias in your estimate for the return to education?

Answer:

  1. The effect of `ability` on `wage` is almost certainly positive (β2 > 0).
  2. People with higher ability tend to get more education, so the correlation between `educ` and `ability` is positive (Corr > 0).
  3. Therefore, the bias is Positive. Your simple regression estimate will likely overstate the true effect of education on wages because it also picks up the effect of ability.

How Precise Are Our Estimates?

The variance of an OLS estimator tells us about its precision (a smaller variance is better). Under the Gauss-Markov assumptions, the variance of any slope coefficient β̂j is:

Var(β̂j) = σ² / [ SSTj (1 − Rj²) ]

σ² (Error Variance): More "noise" or unexplained variation in the model increases the variance.

SSTj (Total Sample Variation in xj): More variation in our explanatory variable decreases the variance. More data helps!

Rj² (Multicollinearity): This is the R-squared from regressing xj on all the other x's. If the other variables explain xj well (Rj² is high), the variance increases.
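
If you want to convince yourself the formula is right, here is a sketch (simulated data, invented numbers) checking it against the standard matrix expression σ²(X'X)⁻¹ for the OLS variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2 = 500, 1.0
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

# Variance matrix of the OLS estimator for this fixed design: sigma^2 (X'X)^{-1}
var_beta = sigma2 * np.linalg.inv(X.T @ X)

# Components of the formula for the coefficient on x1
sst1 = np.sum((x1 - x1.mean()) ** 2)
# R_1^2: R-squared from regressing x1 on the other regressors (constant and x2)
Z = np.column_stack([np.ones(n), x2])
fitted = Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
r2_1 = 1 - np.sum((x1 - fitted) ** 2) / sst1

print(var_beta[1, 1])                 # Var(beta1_hat) from the matrix form
print(sigma2 / (sst1 * (1 - r2_1)))   # the same number via the formula
```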

The Problem of Multicollinearity

Multicollinearity occurs when two or more independent variables in a regression are highly (but not perfectly) correlated. It doesn't violate any core assumption, but it can be a practical problem.

The term 1 / (1 − Rj²) is called the Variance Inflation Factor (VIF). As Rj² (the share of xj's variation explained by the other regressors) gets closer to 1, the VIF shoots toward infinity, inflating the variance of β̂j and making the estimate very imprecise.
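
Plugging in a few values shows how fast the inflation kicks in:

```python
# VIF = 1 / (1 - Rj^2): the blow-up as Rj^2 approaches 1
for r2 in [0.0, 0.5, 0.9, 0.99, 0.999]:
    print(r2, 1 / (1 - r2))
# 0.0 -> 1, 0.5 -> 2, 0.9 -> 10, 0.99 -> 100, 0.999 -> 1000
```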

Spot the Problem

You are trying to explain a person's weight and include the following variables in your regression:

weight = β0 + β1height_inches + β2height_cm + u

Why will your software likely give you an error? Which assumption is violated?

Answer:

This is a case of perfect multicollinearity. Since height in centimeters is just height in inches multiplied by a constant (2.54), one variable is an exact linear function of the other.

This violates Assumption MLR.3 (No Perfect Collinearity). The model cannot be estimated because it's impossible to distinguish the individual effects of `height_inches` and `height_cm` when they always move together perfectly.
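
You can see the failure directly: the design matrix loses a rank, so (X'X)⁻¹ does not exist (a sketch with simulated heights):

```python
import numpy as np

rng = np.random.default_rng(4)
height_inches = rng.normal(68, 3, size=200)
height_cm = 2.54 * height_inches             # exact linear function of the first column

X = np.column_stack([np.ones(200), height_inches, height_cm])
print(np.linalg.matrix_rank(X))              # 2, not 3: X'X is singular

# OLS needs to invert X'X, which is impossible here, so estimation fails.
```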

Why Do We Love OLS?

The Gauss-Markov Theorem

This famous theorem gives us a powerful reason to use OLS. It states that under the five Gauss-Markov assumptions (MLR.1 - MLR.5):

The OLS estimator is the Best Linear Unbiased Estimator (BLUE).

Best

It has the smallest variance among all linear unbiased estimators.

Linear

It is a linear function of the dependent variable, y.

Unbiased

Its expected value equals the true parameter: on average, it gives you the right answer.

In short: if the assumptions hold, you can't find a better linear unbiased estimator than OLS.
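
A small simulation (illustrative, not from the text) makes "Best" concrete: compare OLS to a rival linear unbiased estimator, such as the slope through the two endpoint observations. Both are unbiased, but OLS is far more precise:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 50, 10_000
x = np.linspace(0, 10, n)                    # fixed design

ols_slopes, endpoint_slopes = [], []
for _ in range(reps):
    y = 1.0 + 2.0 * x + rng.normal(size=n)   # true slope = 2
    ols_slopes.append(np.polyfit(x, y, 1)[0])
    # Rival linear unbiased estimator: slope through the first and last points
    endpoint_slopes.append((y[-1] - y[0]) / (x[-1] - x[0]))

# Both means are close to 2 (unbiased), but OLS has much smaller variance.
print(np.mean(ols_slopes), np.var(ols_slopes))
print(np.mean(endpoint_slopes), np.var(endpoint_slopes))
```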

Chapter 3 Summary

You've now mastered the fundamentals of multiple regression!

  • Multiple regression lets us estimate the effect of one variable while controlling for others.
  • Slope coefficients must be interpreted with the crucial phrase: "holding other factors fixed."
  • Omitted variable bias is a serious problem that occurs when we exclude a relevant, correlated variable.
  • The variance of OLS estimators depends on the error variance, sample variation in x, and multicollinearity.
  • Multicollinearity (high correlation between x's) can inflate estimator variance, making them imprecise.
  • Under the Gauss-Markov assumptions, OLS is the Best Linear Unbiased Estimator (BLUE).