What Happens When We Have Lots of Data?
The properties we learned before (unbiasedness, BLUE, normal sampling distributions) are finite sample properties. They hold for any sample size, *if* all the CLM assumptions are true.
Consistency is a large sample (asymptotic) property. It means that as the sample size grows toward infinity, the estimator converges in probability to the true population parameter: the probability that β̂1 lies far from β1 shrinks to zero.
Good News: Under the first four Gauss-Markov assumptions (MLR.1 - MLR.4), OLS estimators are consistent.
The graph shows the sampling distribution of β̂1 for different sample sizes (n1 < n2 < n3). As n grows, the distribution becomes more tightly packed around the true value, β1.
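You can reproduce this picture with a small Monte Carlo sketch. The model below (y = 1 + 0.5·x + u) and the sample sizes are made up for illustration; the point is that the spread of the OLS slope estimates shrinks around the true β1 as n grows.

```python
# Monte Carlo sketch of consistency: the spread of the OLS slope estimate
# shrinks around the true beta1 as the sample size grows.
# (Illustrative only: the model y = 1 + 0.5*x + u and the sample sizes are invented.)
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 1.0, 0.5
n_sims = 1000

for n in (25, 400, 10_000):              # n1 < n2 < n3
    slopes = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.normal(size=n)
        u = rng.normal(size=n)           # MLR.4 holds: u is independent of x
        y = beta0 + beta1 * x + u
        # OLS slope in simple regression: sample Cov(x, y) / sample Var(x)
        slopes[s] = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    print(f"n={n:6d}  mean={slopes.mean():.3f}  sd={slopes.std():.3f}")
```

The standard deviation of the simulated slopes falls toward zero as n increases, which is exactly the "tightening" shown in the graph.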
Just like bias, inconsistency is caused by a violation of Assumption MLR.4 (Zero Conditional Mean). If the error term `u` is correlated with any of the x's, OLS is inconsistent.
In the simple regression case, the inconsistency (or asymptotic bias) is:
plim(β̂1) - β1 = Cov(x1, u) / Var(x1)
The Scary Part: Inconsistency does not go away as you add more data. With a larger sample, your estimator just concentrates more and more tightly around the wrong value!
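Here is a sketch of that formula in action, with invented numbers: x1 and u are built to have Cov(x1, u) = 1 and Var(x1) = 2, so the OLS slope should converge to β1 + 1/2 rather than to β1.

```python
# Sketch of inconsistency: when x1 and u are correlated, the OLS slope
# converges to beta1 + Cov(x1, u)/Var(x1), not to beta1.
# (Illustrative numbers, not from any real dataset.)
import numpy as np

rng = np.random.default_rng(1)
beta1 = 0.5

for n in (100, 10_000, 1_000_000):
    z = rng.normal(size=n)
    x1 = z + rng.normal(size=n)        # Var(x1) = 2
    u = z + rng.normal(size=n)         # Cov(x1, u) = 1, so asymptotic bias = 1/2
    y = beta1 * x1 + u
    slope = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)
    print(f"n={n:9d}  slope_hat={slope:.3f}  (true beta1 = {beta1}, plim = {beta1 + 0.5})")
```

As n grows, the estimate settles ever closer to 1.0, not 0.5 — more data only makes you more confident about the wrong number.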
You're studying the effect of police force size on the city crime rate. You run a simple regression of `crime_rate` on `police_per_capita`. You worry that you have omitted a key variable: the underlying economic health of the city.
If you collect data from thousands of cities, will this solve the omitted variable problem? Why or why not?
No, it will not. The omitted variable (economic health) is likely correlated with both the crime rate and the police force size, causing a violation of MLR.4. This means the OLS estimator is inconsistent.
Adding more data will just make your estimate converge to the wrong, biased value. The only solution is to find a way to control for economic health (i.e., add it to the regression!).
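A sketch of this with synthetic city data (every coefficient below is invented): even with a huge sample, the short regression that omits economic health misses the true police effect, while the regression that controls for it recovers it.

```python
# Sketch of the crime-rate example with synthetic data (all coefficients invented):
# omitting the confounder leaves the police coefficient off target no matter how
# large n is; adding the control recovers the true effect of -0.5.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100_000                                    # "thousands of cities" -- lots of data

econ_health = rng.normal(size=n)               # unobserved economic health
police = 0.8 * econ_health + rng.normal(size=n)
crime = 2.0 - 0.5 * police - 1.0 * econ_health + rng.normal(size=n)

short = sm.OLS(crime, sm.add_constant(police)).fit()
long = sm.OLS(crime, sm.add_constant(np.column_stack([police, econ_health]))).fit()

print("omitting econ_health:", short.params[1].round(3))   # far from -0.5, even with huge n
print("controlling for it:  ", long.params[1].round(3))    # close to -0.5
```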
This is the most important result of the chapter. Thanks to the Central Limit Theorem, we can say the following:
Under the Gauss-Markov assumptions (MLR.1 to MLR.5), as the sample size 'n' gets large, the distribution of the OLS estimators gets closer and closer to a normal distribution.
The Punchline: We don't need the normality assumption (MLR.6)! Even if the error term `u` is not normally distributed, in large samples our t-statistics will have approximate t-distributions. This justifies using t-tests, F-tests, and confidence intervals just like we did in Chapter 4.
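A small simulation sketch makes the punchline concrete (the model, sample size, and error distribution are made up): even with strongly skewed errors, the usual t-statistic rejects at roughly its nominal 5% rate.

```python
# Sketch: with skewed (shifted exponential) errors, the t-statistic for the
# slope still behaves approximately like a standard normal once n is large.
# (Model and sample size are illustrative.)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
beta0, beta1, n, n_sims = 1.0, 0.5, 500, 2000
tstats = np.empty(n_sims)

for s in range(n_sims):
    x = rng.normal(size=n)
    u = rng.exponential(scale=1.0, size=n) - 1.0   # skewed, mean-zero error: MLR.6 fails
    y = beta0 + beta1 * x + u
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    tstats[s] = (fit.params[1] - beta1) / fit.bse[1]  # t-stat centered at the true beta1

# If asymptotic normality is kicking in, about 5% of these should exceed 1.96.
print("rejection rate at the 5% level:", np.mean(np.abs(tstats) > 1.96))
```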
You want to explain the number of times a person was arrested in a year (`narr86`). Your dependent variable can only take on integer values (0, 1, 2, ...) and is zero for most people.
The distribution of `narr86` is clearly not normal. Can you still use a t-test to see if education affects `narr86` in a multiple regression, assuming you have a sample of 2,725 people?
Yes, absolutely! It is true that the dependent variable (and thus the error term) is not normally distributed, so the CLM assumption MLR.6 fails.
However, because the sample size (n=2,725) is very large, the OLS estimators will be approximately normally distributed. This means the standard t-statistic will have an approximate t-distribution, and our usual inference is justified.
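As a usage sketch with synthetic stand-in data (not the real arrest records; the education effect below is invented): generate a count outcome that is zero for most observations, run OLS, and read the usual t-statistic for `educ` from the output.

```python
# Sketch with synthetic stand-in data (not the real arrest data): a count
# outcome that is zero for most people, regressed on education by OLS.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 2725
educ = rng.integers(8, 17, size=n)
# Count outcome, mostly zeros, mildly decreasing in education (invented effect).
narr86 = rng.poisson(lam=np.exp(0.5 - 0.08 * educ))

df = pd.DataFrame({"narr86": narr86, "educ": educ})
fit = smf.ols("narr86 ~ educ", data=df).fit()

# With n this large, the usual t-statistic is approximately valid even though
# narr86 (and hence the error) is clearly non-normal.
print(fit.summary().tables[1])
```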
The Lagrange Multiplier (LM) or n-R-squared statistic is a different way to test for exclusion restrictions in large samples. It's an alternative to the F-test.
In large samples, the F-test and LM test usually give the same conclusion.
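The mechanics, shown as a sketch with simulated data: (i) estimate the restricted model, (ii) regress its residuals on all of the regressors, (iii) compute LM = n·R² from that auxiliary regression and compare it to a chi-square distribution with q degrees of freedom, where q is the number of exclusion restrictions.

```python
# LM (n * R-squared) test sketch for H0: beta2 = beta3 = 0, using simulated data.
# Steps: regress y on the restricted regressors, regress those residuals on ALL
# regressors, then LM = n * R^2 is approximately chi-square(q) under H0.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(5)
n = 1000
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x1 + rng.normal(size=n)        # H0 is true in this simulated data

# (i) restricted model: y on x1 only
restricted = sm.OLS(y, sm.add_constant(x1)).fit()
u_tilde = restricted.resid

# (ii) auxiliary regression: restricted residuals on all regressors
X_full = sm.add_constant(np.column_stack([x1, x2, x3]))
aux = sm.OLS(u_tilde, X_full).fit()

# (iii) LM statistic and its chi-square(2) p-value
LM = n * aux.rsquared
p_value = stats.chi2.sf(LM, df=2)
print(f"LM = {LM:.2f}, p-value = {p_value:.3f}")
```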
Asymptotic properties are crucial because they justify the use of OLS in a huge number of real-world applications where the error term is not normally distributed.