Functional Form, Proxy Variables, and Measurement Error
Functional form misspecification happens when we use the wrong functional relationship between our variables. For example, the true relationship is quadratic, but we include only a linear term.
True Model: wage = β0 + β1exper + β2exper² + u
Our Model: wage = β0 + β1exper + u
It's a type of omitted variable bias! The term `exper`² is now in the error term, and since `exper`² is correlated with `exper`, our estimator for β1 will be biased and inconsistent.
How can we detect if our functional form is wrong? Ramsey's RESET (Regression Specification Error Test) is a general test: re-estimate the model with powers of the fitted values (ŷ² and ŷ³) added as extra regressors, then use an F test of whether those added terms are jointly significant.
Limitation: RESET tells you if your model might be wrong, but it doesn't tell you how to fix it.
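A minimal sketch of running RESET with statsmodels is below; the simulated wage/experience data and the coefficient values in it are illustrative assumptions, not a real dataset.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

rng = np.random.default_rng(0)
n = 500
exper = rng.uniform(0, 30, n)
# True relationship is quadratic in experience (illustrative values)
wage = 5 + 0.8 * exper - 0.02 * exper**2 + rng.normal(0, 1, n)

# Fit the misspecified model that is linear in exper
res = sm.OLS(wage, sm.add_constant(exper)).fit()

# RESET: add powers of the fitted values (yhat^2, yhat^3) and test them jointly
reset = linear_reset(res, power=3, use_f=True)
print(reset)  # a small p-value is evidence of functional form misspecification
```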
Unobserved explanatory variables are a more difficult problem. What if a key variable is unobservable (like innate 'ability')? Leaving it out causes omitted variable bias.
A proxy variable is an observed variable that is related to the unobserved variable. We can't measure ability, but maybe we can measure `IQ`.
We "plug in" the proxy variable for the unobserved variable and run the regression. This can reduce or even eliminate the bias.
log(wage) = β0 + β1educ + β2exper + β3ability + u
↓
log(wage) = β0 + β1educ + β2exper + α3IQ + e
For the proxy variable solution to work (i.e., give consistent estimates), two key assumptions must hold:
1. The proxy (`IQ`) must have no partial effect on `y` once the unobserved variable (`ability`) and the other x's are controlled for. In other words, the proxy only affects `y` through its relationship with the unobserved variable.
2. The error in the relationship between the proxy and the unobserved variable must be uncorrelated with the other x's. This means that once we control for the proxy, the other x's are not related to the unobserved variable. Formally:
E(ability | educ, exper, IQ) = E(ability | IQ)
This second assumption is the crucial and most difficult one to satisfy.
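To make the plug-in idea concrete, here is a minimal simulation; the data-generating process is an assumption, constructed so that assumption 2 holds exactly (`ability` depends on the other x's only through `IQ`).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000
IQ = rng.normal(100, 15, n)                          # observed proxy
# ability depends on the other x's only through IQ, so assumption 2 holds here
ability = 0.05 * (IQ - 100) + rng.normal(0, 0.5, n)  # unobserved
educ = 12 + 0.1 * (IQ - 100) + rng.normal(0, 2, n)
exper = rng.uniform(0, 20, n)
log_wage = 1.0 + 0.08 * educ + 0.02 * exper + 0.10 * ability + rng.normal(0, 0.3, n)

df = pd.DataFrame(dict(log_wage=log_wage, educ=educ, exper=exper, IQ=IQ))

# Omitting ability: the educ coefficient is biased (it picks up ability's effect)
print(smf.ols("log_wage ~ educ + exper", data=df).fit().params["educ"])

# Plugging in the IQ proxy: the educ coefficient is close to the true 0.08
print(smf.ols("log_wage ~ educ + exper + IQ", data=df).fit().params["educ"])
```

In the first regression the coefficient on `educ` is inflated because it picks up part of `ability`'s effect; plugging in `IQ` brings it back near the true value of 0.08.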
Measurement error happens when we observe a variable imprecisely. It is different from the proxy variable problem: here the variable we want is well defined, but our recorded measure of it contains error.
Measurement error in the dependent variable (y): if the measurement error is uncorrelated with the x's, OLS is still unbiased and consistent. The only consequence is a larger error variance (less precise estimates).
Measurement error in an explanatory variable (x) is much more serious. Under the Classical Errors-in-Variables (CEV) assumption (the measurement error is uncorrelated with the true, unobserved value of the variable), OLS is inconsistent and suffers from attenuation bias (the coefficient is biased toward zero).
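A short simulation makes the attenuation result concrete; the variances below are assumptions chosen so the attenuation factor is easy to verify by hand.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5000
x_true = rng.normal(0, 2, n)                   # Var(x*) = 4
y = 1.0 + 0.5 * x_true + rng.normal(0, 1, n)   # true slope 0.5

# CEV: observed x = true x + error, error uncorrelated with x_true
x_obs = x_true + rng.normal(0, 1, n)           # Var(e) = 1

res = sm.OLS(y, sm.add_constant(x_obs)).fit()
# plim(slope) = 0.5 * Var(x*) / (Var(x*) + Var(e)) = 0.5 * 4/5 = 0.4
print(res.params[1])                           # close to 0.4, not 0.5
```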
You want to estimate the effect of education on savings. You have self-reported data on annual income, which you know is measured with error.
If the true effect of income on savings is positive, what will happen to your estimate of the effect of education if income and education are positively correlated?
This is tricky! The measurement error in income causes its own coefficient to be biased toward zero (attenuation bias). But because income is measured with error, the observed income variable is correlated with the error term; since income is also correlated with education, the coefficient on education will be biased as well. The direction of that bias is hard to determine without more information, but the estimator will generally not be unbiased.
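A simulation of this scenario (the data-generating process here is an assumption, chosen so that income and education are positively correlated and both raise savings):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5000
educ = rng.normal(13, 2, n)
income = 10 + 3 * educ + rng.normal(0, 5, n)   # true income rises with educ
savings = 2 + 0.5 * educ + 0.2 * income + rng.normal(0, 2, n)

income_obs = income + rng.normal(0, 5, n)      # self-reported income (CEV error)

X = sm.add_constant(np.column_stack([educ, income_obs]))
res = sm.OLS(savings, X).fit()
print(res.params)  # income slope attenuated below 0.2; educ slope pushed off 0.5
```

In this particular setup the education coefficient ends up biased upward, because education absorbs part of the effect that attenuation strips from income; under a different data-generating process the direction could differ.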
If data are "missing completely at random," using only observations with complete data is fine. If the missingness is related to the x's or y, we can have a nonrandom sample problem, which can lead to bias.
Some observations can have a disproportionately large effect on the OLS estimates, especially in small samples. It's good practice to check if your results are sensitive to one or two "weird" data points.
Least absolute deviations (LAD) is an alternative to OLS that is less sensitive to outliers because it minimizes the sum of absolute residuals rather than squared residuals. It estimates the conditional median rather than the conditional mean.
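A minimal sketch comparing the two estimators on data with a single high-leverage outlier; the data and the outlier's placement are illustrative assumptions. statsmodels' quantile regression at q = 0.5 gives the LAD fit:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 1.0 * x + rng.normal(0, 1, n)  # true slope 1.0
x[0], y[0] = 10.0, 500.0               # one high-leverage outlier

df = pd.DataFrame(dict(x=x, y=y))
ols_slope = smf.ols("y ~ x", data=df).fit().params["x"]
lad_slope = smf.quantreg("y ~ x", data=df).fit(q=0.5).params["x"]  # LAD = median regression
print(f"OLS slope: {ols_slope:.2f}, LAD slope: {lad_slope:.2f}")
```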
This chapter covered some of the most serious problems in regression that cause our estimators to be biased and inconsistent, along with some potential solutions.