Simple Linear Regression: An Interactive Learning Experience
Simple linear regression helps us study the relationship between two variables. We want to understand how one variable (the dependent variable, y) changes when another variable (the independent variable, x) changes.
The simple linear regression model is defined by the equation y = β0 + β1x + u, where:
y: Dependent Variable (what we want to explain, e.g., wages)
x: Independent Variable (what we use to explain y, e.g., education)
β0: The Intercept (value of y when x=0)
β1: The Slope (how much y changes for a one-unit change in x)
What is 'u'?
The term u is the error term or disturbance. It represents all the other unobserved factors that affect 'y' besides 'x' (like natural ability, experience, etc., in a wage equation).
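To make this concrete, here is a minimal Python sketch that simulates data from a population model of this form (all parameter values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population parameters (chosen for illustration only)
beta0, beta1 = 1.5, 0.5   # true intercept and slope
n = 500                   # sample size

educ = rng.integers(8, 21, size=n)   # years of education (8 through 20)
u = rng.normal(0, 2.0, size=n)       # unobserved factors: ability, experience, luck, ...
wage = beta0 + beta1 * educ + u      # the model: y = β0 + β1x + u
```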
The Population Regression Function (PRF) describes the average value of y for any given value of x. It's what we *want* to know.
Written out, the PRF is E(y|x) = β0 + β1x. This equation says "The expected value of y, given x, is a linear function of x."
This graph shows that for any value of x, the *average* value of y falls on the line. The bell curves show that actual y values are distributed around that average.
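Continuing the sketch above, we can check this directly: averaging the simulated wages within each education level should land close to the PRF line β0 + β1x.

```python
# Average wage at each value of x ≈ E(wage | educ) = β0 + β1*educ
for x in sorted(set(educ)):
    avg = wage[educ == x].mean()
    print(f"educ={x:2d}: mean wage = {avg:5.2f}, PRF value = {beta0 + beta1 * x:5.2f}")
```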
Consider the model: wage = β0 + β1educ + u, where 'wage' is hourly wage and 'educ' is years of education.
What is the correct interpretation of β1?
β1 is the change in hourly wage for one additional year of education, holding all other unobserved factors (in 'u') fixed. It's the "ceteris paribus" effect of education on wages.
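For example, with hypothetical estimates β̂0 = 1.5 and β̂1 = 0.54 (numbers invented for illustration), the predicted wage at 12 years of education is 1.5 + 0.54·12 = 7.98, and at 13 years it is 1.5 + 0.54·13 = 8.52, exactly β̂1 = 0.54 higher.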
We can't see the true population relationship. Instead, we use a sample of data to estimate it. The most common method is Ordinary Least Squares (OLS).
OLS finds the line (the intercept β̂0 and slope β̂1) that minimizes the sum of the squared residuals, Σ(yᵢ − β̂0 − β̂1xᵢ)².
In plain English: OLS draws a line through the data points that is as close as possible to all the points simultaneously.
OLS minimizes the sum of the squares of the vertical red lines (the residuals, û).
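The minimization has a well-known closed-form solution: β̂1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β̂0 = ȳ − β̂1x̄. Here is a minimal sketch, reusing the simulated data from above:

```python
x, y = educ, wage

# Closed-form OLS estimates
xbar, ybar = x.mean(), y.mean()
b1_hat = ((x - xbar) * (y - ybar)).sum() / ((x - xbar) ** 2).sum()  # slope
b0_hat = ybar - b1_hat * xbar                                       # intercept

print(f"estimated intercept = {b0_hat:.3f}, slope = {b1_hat:.3f}")  # near 1.5 and 0.5
```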
The OLS regression line is called the Sample Regression Function (SRF). The "hats" (^) on the betas indicate they are estimates.
ŷ ("y-hat"): The fitted value. It's our prediction for y given a certain value of x.
û = y - ŷ: The residual. It's the difference between the actual value of y and our predicted value, ŷ.
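Continuing the sketch, fitted values and residuals follow directly from the estimates. Two algebraic facts about OLS worth verifying: the residuals sum to zero, and they are uncorrelated with x in the sample.

```python
y_hat = b0_hat + b1_hat * x   # fitted values ŷ
u_hat = y - y_hat             # residuals û = y - ŷ

print(u_hat.sum())            # ≈ 0 by construction
print((x * u_hat).sum())      # ≈ 0 by construction
```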
What is the key difference between the error term (u) from the population model and the residual (û) from our sample estimation?
The error term (u) is the unobservable deviation of a data point from the true, unknown population regression line. We can never know it.
The residual (û) is the calculated deviation of a data point from our estimated sample regression line. We compute it from our data.
R-squared, the coefficient of determination, measures the proportion of the sample variation in the dependent variable (y) that is explained by the independent variable (x).
An R² near 0 means that x explains very little of the variation in y. The data points are widely scattered around the regression line.
An R² near 1 means that x explains a large portion of the variation in y. The data points are very close to the regression line.
Important Caveat: A low R² doesn't necessarily mean the model is "bad," especially in social sciences where outcomes are complex. It just means 'x' is one of many factors influencing 'y'.
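In sums of squares, R² = 1 − SSR/SST, where SSR = Σûᵢ² is the residual sum of squares and SST = Σ(yᵢ − ȳ)² is the total sum of squares. Continuing the sketch:

```python
SSR = (u_hat ** 2).sum()            # residual (unexplained) variation
SST = ((y - y.mean()) ** 2).sum()   # total variation in y
R2 = 1 - SSR / SST
print(f"R² = {R2:.3f}")
```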
Sometimes, a straight-line (level-level) relationship isn't appropriate. Using natural logs (log()) allows us to model percentage changes and elasticities.
| Model | Equation | Interpretation of β1 |
|---|---|---|
| Level-Level | y = β0 + β1x | Δy = β1 Δx |
| Log-Level | log(y) = β0 + β1x | %Δy ≈ (100β1)Δx |
| Log-Log | log(y) = β0 + β1log(x) | %Δy ≈ β1%Δx (elasticity) |
Example: In the log-level model log(wage) = β0 + β1educ, a β̂1 of 0.08 means one more year of education increases wages by about 8%.
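To see this interpretation in action, the sketch below simulates from a log-level model with a made-up β1 of 0.08 and recovers it with the same OLS formula, so the estimated slope reads as roughly an 8% wage increase per year of education:

```python
# Simulate log(wage) = 0.5 + 0.08*educ + u  (parameters invented for illustration)
log_wage = 0.5 + 0.08 * educ + rng.normal(0, 0.3, size=n)

d = educ - educ.mean()
b1 = (d * (log_wage - log_wage.mean())).sum() / (d ** 2).sum()
print(f"slope ≈ {b1:.3f}  →  about {100 * b1:.1f}% higher wage per extra year")
```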
For our OLS estimates to be unbiased (meaning, on average, they equal the true population values), assumptions SLR.1 through SLR.4 below must hold; together with SLR.5, these are often called the Gauss-Markov Assumptions.
SLR.1: Linear in Parameters - The model is y = β0 + β1x + u.
SLR.2: Random Sampling - Our data is a random sample from the population.
SLR.3: Sample Variation in x - The values of x in our sample are not all the same.
SLR.4: Zero Conditional Mean - E(u|x) = 0: the average of the unobserved factors is unrelated to x. (This is the most critical one!)
SLR.5: Homoskedasticity - The variance of the unobserved factors is constant for any value of x.
In English: The unobserved factors (u) must be unrelated, on average, to the explanatory variable (x). The average value of 'u' does not change as 'x' changes.
If this assumption is violated, our OLS estimates will be biased and misleading.
Classic Example of Violation:
In our model wage = β0 + β1educ + u, the error term 'u' contains innate ability. If people with higher ability tend to get more education, then 'educ' is correlated with 'u'. This violates SLR.4, and our estimate of β1 will likely be too high (it will capture the effect of education AND ability).
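This omitted-ability bias is easy to demonstrate by simulation. In the self-contained sketch below (every number invented for illustration), ability raises wages, sits in the error term, and is positively correlated with education; the simple regression slope lands well above the true β1 = 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

ability = rng.normal(0, 1, size=n)
educ = 12 + 2.0 * ability + rng.normal(0, 1, size=n)  # higher ability → more education
wage = 1.5 + 0.5 * educ + 1.0 * ability + rng.normal(0, 1, size=n)

# Simple regression of wage on educ; ability is hiding in the error term
b1_hat = np.cov(educ, wage)[0, 1] / np.var(educ, ddof=1)
print(f"estimated slope = {b1_hat:.2f} vs true β1 = 0.5")  # biased upward, ≈ 0.9
```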
You estimate the model: crime_rate = β0 + β1police_officers + u.
Why might the Zero Conditional Mean assumption (SLR.4) be violated here? What important factor might be in 'u' that is also correlated with the number of police officers?
The error term 'u' contains factors like the underlying economic conditions or poverty level of a city. Cities with higher poverty (a factor in 'u') might have higher crime rates, but they also tend to hire more police officers.
Because 'police_officers' is correlated with a key factor in 'u', SLR.4 is violated. A simple regression might misleadingly suggest that more police lead to more crime!
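A similar simulation (again, all numbers invented) shows how this plays out: even when the true effect of police on crime is negative, the poverty channel hiding in 'u' can flip the sign of the estimated slope.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

poverty = rng.normal(0, 1, size=n)
police = 5 + 2.0 * poverty + rng.normal(0, 1, size=n)  # poorer cities hire more police
crime = 50 - 1.0 * police + 8.0 * poverty + rng.normal(0, 1, size=n)  # true effect: -1.0

b1_hat = np.cov(police, crime)[0, 1] / np.var(police, ddof=1)
print(f"estimated slope = {b1_hat:.2f}")  # positive (≈ 2.2) despite the true -1.0
```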
Here are the key takeaways from the simple linear regression model:
- The model y = β0 + β1x + u splits y into a part explained by x and an error term u containing all other unobserved factors.
- OLS estimates the intercept and slope by minimizing the sum of squared residuals, giving the sample regression function ŷ = β̂0 + β̂1x.
- R² measures how much of the sample variation in y is explained by x, but a low R² does not by itself mean the model is bad.
- Logarithmic forms let us interpret β1 in terms of percentage changes and elasticities.
- Everything hinges on the zero conditional mean assumption (SLR.4): if x is correlated with factors in u, the OLS estimate of β1 is biased.