Functional Form, Proxy Variables, and Measurement Error
Functional form misspecification happens when we use the wrong functional relationship between our variables. For example, the true relationship is quadratic, but we include only a linear term.
True Model: wage = β0 + β1exper + β2exper² + u
Our Model: wage = β0 + β1exper + u
It's a type of omitted variable bias! The term `exper`² is now in the error term, and since `exper`² is correlated with `exper`, our estimator for β1 will be biased and inconsistent.
How can we detect if our functional form is wrong? Ramsey's RESET (Regression Specification Error Test) is a general test: re-estimate the model with powers of the fitted values (ŷ² and ŷ³) added as extra regressors, then use an F test of whether those added terms are jointly significant.
Limitation: RESET tells you if your model might be wrong, but it doesn't tell you how to fix it.
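A minimal sketch of running RESET with statsmodels is below; the simulated wage/experience data and the coefficient values in it are illustrative assumptions, not a real dataset.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

rng = np.random.default_rng(0)
n = 500
exper = rng.uniform(0, 30, n)
# True relationship is quadratic in experience (illustrative values)
wage = 5 + 0.8 * exper - 0.02 * exper**2 + rng.normal(0, 1, n)

# Fit the misspecified model that is linear in exper
res = sm.OLS(wage, sm.add_constant(exper)).fit()

# RESET: add powers of the fitted values (yhat^2, yhat^3) and test them jointly
reset = linear_reset(res, power=3, use_f=True)
print(reset)  # a small p-value is evidence of functional form misspecification
```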
Unobserved explanatory variables are a more difficult problem. What if a key variable is unobservable (like innate 'ability')? Leaving it out causes omitted variable bias.
A proxy variable is an observed variable that is related to the unobserved variable. We can't measure ability, but maybe we can measure `IQ`.
We "plug in" the proxy variable for the unobserved variable and run the regression. This can reduce or even eliminate the bias.
log(wage) = β0 + β1educ + β2exper + β3ability + u
↓
log(wage) = β0 + β1educ + β2exper + α3IQ + e
For the proxy variable solution to work (i.e., give consistent estimates), two key assumptions must hold:
1. The proxy (`IQ`) must have no partial effect on `y` once the unobserved variable (`ability`) and the other x's are controlled for. In other words, the proxy only affects `y` through its relationship with the unobserved variable.
2. The error in the relationship between the proxy and the unobserved variable must be uncorrelated with the other x's. This means that once we control for the proxy, the other x's are not related to the unobserved variable. Formally:
E(ability | educ, exper, IQ) = E(ability | IQ)
This second assumption is the crucial and most difficult one to satisfy.
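To make the plug-in idea concrete, here is a minimal simulation; the data-generating process is an assumption, constructed so that assumption 2 holds exactly (`ability` depends on the other x's only through `IQ`).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000
IQ = rng.normal(100, 15, n)                          # observed proxy
# ability depends on the other x's only through IQ, so assumption 2 holds here
ability = 0.05 * (IQ - 100) + rng.normal(0, 0.5, n)  # unobserved
educ = 12 + 0.1 * (IQ - 100) + rng.normal(0, 2, n)
exper = rng.uniform(0, 20, n)
log_wage = 1.0 + 0.08 * educ + 0.02 * exper + 0.10 * ability + rng.normal(0, 0.3, n)

df = pd.DataFrame(dict(log_wage=log_wage, educ=educ, exper=exper, IQ=IQ))

# Omitting ability: the educ coefficient is biased (it picks up ability's effect)
print(smf.ols("log_wage ~ educ + exper", data=df).fit().params["educ"])

# Plugging in the IQ proxy: the educ coefficient is close to the true 0.08
print(smf.ols("log_wage ~ educ + exper + IQ", data=df).fit().params["educ"])
```

In the first regression the coefficient on `educ` is inflated because it picks up part of `ability`'s effect; plugging in `IQ` brings it back near the true value of 0.08.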
Measurement error happens when we observe a variable imprecisely. It is different from the proxy variable problem: here the variable we want is well defined, but our recorded measure of it contains error.
Measurement error in the dependent variable (y): if the measurement error is uncorrelated with the x's, OLS is still unbiased and consistent. The only consequence is a larger error variance (less precise estimates).
Measurement error in an explanatory variable (x) is much more serious. Under the Classical Errors-in-Variables (CEV) assumption (the measurement error is uncorrelated with the true, unobserved value of the variable), OLS is inconsistent and suffers from attenuation bias (the coefficient is biased toward zero).
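A short simulation makes the attenuation result concrete; the variances below are assumptions chosen so the attenuation factor is easy to verify by hand.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5000
x_true = rng.normal(0, 2, n)                   # Var(x*) = 4
y = 1.0 + 0.5 * x_true + rng.normal(0, 1, n)   # true slope 0.5

# CEV: observed x = true x + error, error uncorrelated with x_true
x_obs = x_true + rng.normal(0, 1, n)           # Var(e) = 1

res = sm.OLS(y, sm.add_constant(x_obs)).fit()
# plim(slope) = 0.5 * Var(x*) / (Var(x*) + Var(e)) = 0.5 * 4/5 = 0.4
print(res.params[1])                           # close to 0.4, not 0.5
```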
You want to estimate the effect of education on savings. You have self-reported data on annual income, which you know is measured with error.
If the true effect of income on savings is positive, what will happen to your estimate of the effect of education if income and education are positively correlated?
This is tricky! The measurement error in income causes its own coefficient to be biased toward zero (attenuation bias). But because income is measured with error, the observed income variable is correlated with the error term; since income is also correlated with education, the coefficient on education will be biased as well. The direction of that bias is hard to determine without more information, but the estimator will generally not be unbiased.
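A simulation of this scenario (the data-generating process here is an assumption, chosen so that income and education are positively correlated and both raise savings):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5000
educ = rng.normal(13, 2, n)
income = 10 + 3 * educ + rng.normal(0, 5, n)   # true income rises with educ
savings = 2 + 0.5 * educ + 0.2 * income + rng.normal(0, 2, n)

income_obs = income + rng.normal(0, 5, n)      # self-reported income (CEV error)

X = sm.add_constant(np.column_stack([educ, income_obs]))
res = sm.OLS(savings, X).fit()
print(res.params)  # income slope attenuated below 0.2; educ slope pushed off 0.5
```

In this particular setup the education coefficient ends up biased upward, because education absorbs part of the effect that attenuation strips from income; under a different data-generating process the direction could differ.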
If data are "missing completely at random," using only observations with complete data is fine. If the missingness is related to the x's or y, we can have a nonrandom sample problem, which can lead to bias.
Some observations can have a disproportionately large effect on the OLS estimates, especially in small samples. It's good practice to check if your results are sensitive to one or two "weird" data points.
Least absolute deviations (LAD) is an alternative to OLS that is less sensitive to outliers because it minimizes the sum of absolute residuals rather than squared residuals. It estimates the conditional median rather than the conditional mean.
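A minimal sketch comparing the two estimators on data with a single high-leverage outlier; the data and the outlier's placement are illustrative assumptions. statsmodels' quantile regression at q = 0.5 gives the LAD fit:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 1.0 * x + rng.normal(0, 1, n)  # true slope 1.0
x[0], y[0] = 10.0, 500.0               # one high-leverage outlier

df = pd.DataFrame(dict(x=x, y=y))
ols_slope = smf.ols("y ~ x", data=df).fit().params["x"]
lad_slope = smf.quantreg("y ~ x", data=df).fit(q=0.5).params["x"]  # LAD = median regression
print(f"OLS slope: {ols_slope:.2f}, LAD slope: {lad_slope:.2f}")
```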
This chapter covered some of the most serious problems in regression that cause our estimators to be biased and inconsistent, along with some potential solutions.