Chapter 2: The Simple Regression Model

An Interactive Learning Experience

What's the Big Idea?

Simple linear regression helps us study the relationship between two variables. We want to understand how one variable (the dependent variable, y) changes when another variable (the independent variable, x) changes.

Learning Objectives

  • Define the simple linear regression model.
  • Understand how we estimate its parameters using Ordinary Least Squares (OLS).
  • Interpret the key results, like coefficients and R-squared.
  • Learn the crucial assumptions that make our estimates reliable.

The Heart of the Model

The simple linear regression model is defined by this equation:

y = β0 + β1x + u

y: Dependent Variable (what we want to explain, e.g., wages)

x: Independent Variable (what we use to explain y, e.g., education)

β0: The Intercept (value of y when x=0)

β1: The Slope (how much y changes for a one-unit change in x)

What is 'u'?

The term u is the error term or disturbance. It represents all the other unobserved factors that affect 'y' besides 'x' (like natural ability, experience, etc., in a wage equation).

The Average Relationship: PRF

The Population Regression Function (PRF) describes the average value of y for any given value of x. It's what we *want* to know.

E(y | x) = β0 + β1x

This equation says "The expected value of y, given x, is a linear function of x."

Graphically, for any value of x the *average* value of y falls on the PRF line, while actual y values are distributed around that average (picture a bell curve of y values centered on the line at each x).
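A minimal simulation can make this concrete: if we fix x and draw many y values from the model, their average should land on the PRF line. The parameter values below (β0 = 1.0, β1 = 0.5, standard normal u) are made up for illustration.

```python
import random

random.seed(0)

# Hypothetical population parameters, chosen only for this sketch.
beta0, beta1 = 1.0, 0.5

# Fix one value of x and draw many y values from y = beta0 + beta1*x + u.
x = 4.0
ys = [beta0 + beta1 * x + random.gauss(0, 1) for _ in range(100_000)]
avg_y = sum(ys) / len(ys)

# The sample average sits near E(y|x) = 1.0 + 0.5*4 = 3.0.
print(avg_y)
```

Individual y values scatter widely because of u, but their mean tracks the PRF.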

Check Your Understanding

Consider the model: wage = β0 + β1educ + u, where 'wage' is hourly wage and 'educ' is years of education.

What is the correct interpretation of β1?

Answer:

β1 is the change in hourly wage for one additional year of education, holding all other unobserved factors (in 'u') fixed. It's the "ceteris paribus" effect of education on wages.

How Do We Estimate β0 and β1?

We can't see the true population relationship. Instead, we use a sample of data to estimate it. The most common method is Ordinary Least Squares (OLS).

The Goal of OLS:

OLS finds the line (the intercept β̂0 and slope β̂1) that minimizes the sum of the squared residuals.

In plain English: OLS draws a line through the data points that is as close as possible to all the points simultaneously.

Graphically, OLS minimizes the sum of the squared vertical distances between each data point and the fitted line; those vertical gaps are the residuals, û.
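The OLS solution has a closed form: β̂1 is the sample covariance of x and y divided by the sample variance of x, and β̂0 makes the line pass through the point of means. A minimal sketch (the toy data points lie exactly on y = 2 + 3x, so the estimates recover those values):

```python
def ols_simple(x, y):
    """OLS for one regressor: slope = sample cov(x, y) / sample var(x)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx           # requires sample variation in x (SLR.3)
    b0 = ybar - b1 * xbar    # the line passes through (xbar, ybar)
    return b0, b1

# Toy data on the exact line y = 2 + 3x.
x = [1, 2, 3, 4]
y = [5, 8, 11, 14]
b0, b1 = ols_simple(x, y)
print(b0, b1)  # → 2.0 3.0
```

Note the division by Σ(xi − x̄)²: this is why SLR.3 (sample variation in x) is needed at all.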

The Estimated Relationship: SRF

The OLS regression line is called the Sample Regression Function (SRF). The "hats" (^) on the betas indicate they are estimates.

ŷ = β̂0 + β̂1x

ŷ ("y-hat"): The fitted value. It's our prediction for y given a certain value of x.

û = y - ŷ: The residual. It's the difference between the actual value of y and our predicted value, ŷ.
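These definitions translate directly into code. The sketch below decomposes each observed y into a fitted value plus a residual, y = ŷ + û; the coefficients and data points are made up for illustration.

```python
# Hypothetical OLS estimates for this sketch.
b0_hat, b1_hat = 2.0, 3.0
x = [1, 2, 3, 4]
y = [4.8, 8.3, 10.9, 14.1]

y_hat = [b0_hat + b1_hat * xi for xi in x]        # fitted values, y-hat
u_hat = [yi - yhi for yi, yhi in zip(y, y_hat)]   # residuals, u-hat = y - y-hat

for yi, yhi, uhi in zip(y, y_hat, u_hat):
    print(yi, "=", yhi, "+", uhi)
```

By construction every observation splits exactly into a fitted part and a residual part.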

Error vs. Residual

What is the key difference between the error term (u) from the population model and the residual (û) from our sample estimation?

Answer:

The error term (u) is the unobservable deviation of a data point from the true, unknown population regression line. We can never know it.

The residual (û) is the calculated deviation of a data point from our estimated sample regression line. We compute it from our data.

How Good is the Fit? R-Squared (R²)

R-squared, the coefficient of determination, measures the proportion of the sample variation in the dependent variable (y) that is explained by the independent variable (x).

0 ≤ R² ≤ 1

An R² near 0 means that x explains very little of the variation in y. The data points are widely scattered around the regression line.

An R² near 1 means that x explains a large portion of the variation in y. The data points are very close to the regression line.

Important Caveat: A low R² doesn't necessarily mean the model is "bad," especially in social sciences where outcomes are complex. It just means 'x' is one of many factors influencing 'y'.
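R² is computed as 1 minus the ratio of residual variation to total variation in y. A minimal sketch (the data are made up; a perfect fit gives R² = 1):

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SSR/SST: share of sample variation in y explained by the fit."""
    ybar = sum(y) / len(y)
    sst = sum((yi - ybar) ** 2 for yi in y)                 # total variation
    ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # residual variation
    return 1 - ssr / sst

y = [5, 8, 11, 14]
y_hat = [5, 8, 11, 14]   # perfect fit: every residual is zero
print(r_squared(y, y_hat))  # → 1.0
```

With any imperfect fit, SSR > 0 and R² drops below 1; a fit no better than the mean of y gives R² = 0.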

Going Beyond Straight Lines: Logarithms

Sometimes, a straight-line (level-level) relationship isn't appropriate. Using natural logs (log()) allows us to model percentage changes and elasticities.

Model         Equation                   Interpretation of β1
Level-Level   y = β0 + β1x               Δy = β1Δx
Log-Level     log(y) = β0 + β1x          %Δy ≈ (100β1)Δx
Log-Log       log(y) = β0 + β1log(x)     %Δy ≈ β1(%Δx)  (elasticity)

Example: In the log-level model log(wage) = β0 + β1educ, a β̂1 of 0.08 means one more year of education increases wages by about 8%.
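The "about 8%" is an approximation: 100·β1 is accurate for small β1, while the exact percentage change implied by the log model is 100·(e^β1 − 1). A quick check:

```python
import math

b1 = 0.08  # estimate from the log-level wage example above

approx = 100 * b1                  # rule-of-thumb percent change: 8.0
exact = 100 * (math.exp(b1) - 1)   # exact percent change: about 8.33

print(approx, round(exact, 2))
```

For small coefficients the two are close; the gap grows as |β1| gets larger, which is why the table uses ≈.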

The Rules of the Game: OLS Assumptions

For our OLS estimates to be unbiased (meaning, on average, they equal the true population values), assumptions SLR.1 through SLR.4 below must hold. Adding SLR.5 completes the set often called the Gauss-Markov assumptions, under which OLS is the best linear unbiased estimator.

SLR.1: Linear in Parameters - The model is y = β0 + β1x + u.

SLR.2: Random Sampling - Our data is a random sample from the population.

SLR.3: Sample Variation in x - The values of x in our sample are not all the same.

SLR.4: Zero Conditional Mean - The average of the unobserved factors is not related to x. (This is the most critical one!)

SLR.5: Homoskedasticity - The variance of the unobserved factors is constant for any value of x.

The Crucial Assumption: SLR.4

Zero Conditional Mean: E(u | x) = 0

In English: The unobserved factors (u) must be unrelated, on average, to the explanatory variable (x). The average value of 'u' does not change as 'x' changes.

If this assumption is violated, our OLS estimates will be biased and misleading.

Classic Example of Violation:

In our model wage = β0 + β1educ + u, the error term 'u' contains innate ability. If people with higher ability tend to get more education, then 'educ' is correlated with 'u'. This violates SLR.4, and our estimate of β1 will likely be too high (it will capture the effect of education AND ability).
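A small simulation illustrates this bias. All numbers below are made up: the true return to education is set to 0.5, ability raises both education and wages, and a simple OLS regression of wage on education alone then overstates the return.

```python
import random

random.seed(1)

n = 50_000
beta0, beta1 = 1.0, 0.5   # true (hypothetical) intercept and return to education

ability = [random.gauss(0, 1) for _ in range(n)]
# Higher ability pushes education up, so educ is correlated with u.
educ = [12 + 2 * a + random.gauss(0, 1) for a in ability]
# Ability also raises wages directly; it sits in the error term.
wage = [beta0 + beta1 * e + 1.0 * a + random.gauss(0, 1)
        for e, a in zip(educ, ability)]

# Simple OLS slope of wage on educ, ignoring ability.
xbar = sum(educ) / n
ybar = sum(wage) / n
b1_hat = (sum((x - xbar) * (y - ybar) for x, y in zip(educ, wage))
          / sum((x - xbar) ** 2 for x in educ))
print(b1_hat)  # well above the true 0.5: upward bias from omitted ability
```

With these parameter choices the estimate converges to about 0.9, not 0.5, because the regression attributes ability's effect on wages to education.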

Test Your Intuition

You estimate the model: crime_rate = β0 + β1police_officers + u.

Why might the Zero Conditional Mean assumption (SLR.4) be violated here? What important factor might be in 'u' that is also correlated with the number of police officers?

Answer:

The error term 'u' contains factors like the underlying economic conditions or poverty level of a city. Cities with higher poverty (a factor in 'u') might have higher crime rates, but they also tend to hire more police officers.

Because 'police_officers' is correlated with a key factor in 'u', SLR.4 is violated. A simple regression might misleadingly suggest that more police lead to more crime!

Chapter 2 Summary

Here are the key takeaways from the simple linear regression model.

  • The model y = β0 + β1x + u lets us explain one variable using another.
  • Ordinary Least Squares (OLS) is the standard method for estimating β0 and β1 from a sample of data.
  • R-squared measures the goodness-of-fit of our regression line.
  • We can use logarithms to model relationships involving percentage changes.
  • OLS estimates are unbiased when assumptions SLR.1–SLR.4 hold, especially the Zero Conditional Mean assumption (E(u|x) = 0); adding SLR.5 completes the Gauss-Markov assumptions.