Simple Linear Regression: An Interactive Learning Experience
Simple linear regression helps us study the relationship between two variables. We want to understand how one variable (the dependent variable, y) changes when another variable (the independent variable, x) changes.
The simple linear regression model is defined by the equation y = β0 + β1x + u, where:
y: Dependent Variable (what we want to explain, e.g., wages)
x: Independent Variable (what we use to explain y, e.g., education)
β0: The Intercept (value of y when x=0)
β1: The Slope (how much y changes for a one-unit change in x)
What is 'u'?
The term u is the error term or disturbance. It represents all the other unobserved factors that affect 'y' besides 'x' (like natural ability, experience, etc., in a wage equation).
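To make this concrete, here is a minimal Python sketch that simulates data from a population model of this form (all parameter values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population parameters (chosen for illustration only)
beta0, beta1 = 1.5, 0.5   # true intercept and slope
n = 500                   # sample size

educ = rng.integers(8, 21, size=n)   # years of education (8 through 20)
u = rng.normal(0, 2.0, size=n)       # unobserved factors: ability, experience, luck, ...
wage = beta0 + beta1 * educ + u      # the model: y = β0 + β1x + u
```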
The Population Regression Function (PRF) describes the average value of y for any given value of x. It's what we *want* to know.
Written out, the PRF is E(y|x) = β0 + β1x. This equation says "The expected value of y, given x, is a linear function of x."
This graph shows that for any value of x, the *average* value of y falls on the line. The bell curves show that actual y values are distributed around that average.
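Continuing the sketch above, we can check this directly: averaging the simulated wages within each education level should land close to the PRF line β0 + β1x.

```python
# Average wage at each value of x ≈ E(wage | educ) = β0 + β1*educ
for x in sorted(set(educ)):
    avg = wage[educ == x].mean()
    print(f"educ={x:2d}: mean wage = {avg:5.2f}, PRF value = {beta0 + beta1 * x:5.2f}")
```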
Consider the model: wage = β0 + β1educ + u, where 'wage' is hourly wage and 'educ' is years of education.
What is the correct interpretation of β1?
β1 is the change in hourly wage for one additional year of education, holding all other unobserved factors (in 'u') fixed. It's the "ceteris paribus" effect of education on wages.
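For example, with hypothetical estimates β̂0 = 1.5 and β̂1 = 0.54 (numbers invented for illustration), the predicted wage at 12 years of education is 1.5 + 0.54·12 = 7.98, and at 13 years it is 1.5 + 0.54·13 = 8.52, exactly β̂1 = 0.54 higher.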
We can't see the true population relationship. Instead, we use a sample of data to estimate it. The most common method is Ordinary Least Squares (OLS).
OLS finds the line (the intercept β̂0 and slope β̂1) that minimizes the sum of the squared residuals, Σ(yᵢ − β̂0 − β̂1xᵢ)².
In plain English: OLS draws a line through the data points that is as close as possible to all the points simultaneously.
OLS minimizes the sum of the squares of the vertical red lines (the residuals, û).
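The minimization has a well-known closed-form solution: β̂1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β̂0 = ȳ − β̂1x̄. Here is a minimal sketch, reusing the simulated data from above:

```python
x, y = educ, wage

# Closed-form OLS estimates
xbar, ybar = x.mean(), y.mean()
b1_hat = ((x - xbar) * (y - ybar)).sum() / ((x - xbar) ** 2).sum()  # slope
b0_hat = ybar - b1_hat * xbar                                       # intercept

print(f"estimated intercept = {b0_hat:.3f}, slope = {b1_hat:.3f}")  # near 1.5 and 0.5
```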
The OLS regression line is called the Sample Regression Function (SRF). The "hats" (^) on the betas indicate they are estimates.
ŷ ("y-hat"): The fitted value. It's our prediction for y given a certain value of x.
û = y - ŷ: The residual. It's the difference between the actual value of y and our predicted value, ŷ.
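Continuing the sketch, fitted values and residuals follow directly from the estimates. Two algebraic facts about OLS worth verifying: the residuals sum to zero, and they are uncorrelated with x in the sample.

```python
y_hat = b0_hat + b1_hat * x   # fitted values ŷ
u_hat = y - y_hat             # residuals û = y - ŷ

print(u_hat.sum())            # ≈ 0 by construction
print((x * u_hat).sum())      # ≈ 0 by construction
```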
What is the key difference between the error term (u) from the population model and the residual (û) from our sample estimation?
The error term (u) is the unobservable deviation of a data point from the true, unknown population regression line. We can never know it.
The residual (û) is the calculated deviation of a data point from our estimated sample regression line. We compute it from our data.
R-squared, the coefficient of determination, measures the proportion of the sample variation in the dependent variable (y) that is explained by the independent variable (x).
An R² near 0 means that x explains very little of the variation in y. The data points are widely scattered around the regression line.
An R² near 1 means that x explains a large portion of the variation in y. The data points are very close to the regression line.
Important Caveat: A low R² doesn't necessarily mean the model is "bad," especially in social sciences where outcomes are complex. It just means 'x' is one of many factors influencing 'y'.
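In sums of squares, R² = 1 − SSR/SST, where SSR = Σûᵢ² is the residual sum of squares and SST = Σ(yᵢ − ȳ)² is the total sum of squares. Continuing the sketch:

```python
SSR = (u_hat ** 2).sum()            # residual (unexplained) variation
SST = ((y - y.mean()) ** 2).sum()   # total variation in y
R2 = 1 - SSR / SST
print(f"R² = {R2:.3f}")
```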
Sometimes, a straight-line (level-level) relationship isn't appropriate. Using natural logs (log()) allows us to model percentage changes and elasticities.
| Model | Equation | Interpretation of β1 |
|---|---|---|
| Level-Level | y = β0 + β1x | Δy = β1 Δx |
| Log-Level | log(y) = β0 + β1x | %Δy ≈ (100β1)Δx |
| Log-Log | log(y) = β0 + β1log(x) | %Δy ≈ β1%Δx (elasticity) |
Example: In the log-level model log(wage) = β0 + β1educ, a β̂1 of 0.08 means one more year of education increases wages by about 8%.
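To see this interpretation in action, the sketch below simulates from a log-level model with a made-up β1 of 0.08 and recovers it with the same OLS formula, so the estimated slope reads as roughly an 8% wage increase per year of education:

```python
# Simulate log(wage) = 0.5 + 0.08*educ + u  (parameters invented for illustration)
log_wage = 0.5 + 0.08 * educ + rng.normal(0, 0.3, size=n)

d = educ - educ.mean()
b1 = (d * (log_wage - log_wage.mean())).sum() / (d ** 2).sum()
print(f"slope ≈ {b1:.3f}  →  about {100 * b1:.1f}% higher wage per extra year")
```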
For our OLS estimates to be unbiased (meaning, on average, they equal the true population values), assumptions SLR.1 through SLR.4 below must hold; together with SLR.5, these are often called the Gauss-Markov Assumptions.
SLR.1: Linear in Parameters - The model is y = β0 + β1x + u.
SLR.2: Random Sampling - Our data is a random sample from the population.
SLR.3: Sample Variation in x - The values of x in our sample are not all the same.
SLR.4: Zero Conditional Mean - E(u|x) = 0: the average of the unobserved factors is unrelated to x. (This is the most critical one!)
SLR.5: Homoskedasticity - The variance of the unobserved factors is constant for any value of x.
In English: The unobserved factors (u) must be unrelated, on average, to the explanatory variable (x). The average value of 'u' does not change as 'x' changes.
If this assumption is violated, our OLS estimates will be biased and misleading.
Classic Example of Violation:
In our model wage = β0 + β1educ + u, the error term 'u' contains innate ability. If people with higher ability tend to get more education, then 'educ' is correlated with 'u'. This violates SLR.4, and our estimate of β1 will likely be too high (it will capture the effect of education AND ability).
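This omitted-ability bias is easy to demonstrate by simulation. In the self-contained sketch below (every number invented for illustration), ability raises wages, sits in the error term, and is positively correlated with education; the simple regression slope lands well above the true β1 = 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

ability = rng.normal(0, 1, size=n)
educ = 12 + 2.0 * ability + rng.normal(0, 1, size=n)  # higher ability → more education
wage = 1.5 + 0.5 * educ + 1.0 * ability + rng.normal(0, 1, size=n)

# Simple regression of wage on educ; ability is hiding in the error term
b1_hat = np.cov(educ, wage)[0, 1] / np.var(educ, ddof=1)
print(f"estimated slope = {b1_hat:.2f} vs true β1 = 0.5")  # biased upward, ≈ 0.9
```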
You estimate the model: crime_rate = β0 + β1police_officers + u.
Why might the Zero Conditional Mean assumption (SLR.4) be violated here? What important factor might be in 'u' that is also correlated with the number of police officers?
The error term 'u' contains factors like the underlying economic conditions or poverty level of a city. Cities with higher poverty (a factor in 'u') might have higher crime rates, but they also tend to hire more police officers.
Because 'police_officers' is correlated with a key factor in 'u', SLR.4 is violated. A simple regression might misleadingly suggest that more police lead to more crime!
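A similar simulation (again, all numbers invented) shows how this plays out: even when the true effect of police on crime is negative, the poverty channel hiding in 'u' can flip the sign of the estimated slope.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

poverty = rng.normal(0, 1, size=n)
police = 5 + 2.0 * poverty + rng.normal(0, 1, size=n)  # poorer cities hire more police
crime = 50 - 1.0 * police + 8.0 * poverty + rng.normal(0, 1, size=n)  # true effect: -1.0

b1_hat = np.cov(police, crime)[0, 1] / np.var(police, ddof=1)
print(f"estimated slope = {b1_hat:.2f}")  # positive (≈ 2.2) despite the true -1.0
```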
Here are the key takeaways from the simple linear regression model:
- The model y = β0 + β1x + u splits y into a part explained by x and an error term u containing all other unobserved factors.
- OLS estimates the intercept and slope by minimizing the sum of squared residuals, giving the sample regression function ŷ = β̂0 + β̂1x.
- R² measures how much of the sample variation in y is explained by x, but a low R² does not by itself mean the model is bad.
- Logarithmic forms let us interpret β1 in terms of percentage changes and elasticities.
- Everything hinges on the zero conditional mean assumption (SLR.4): if x is correlated with factors in u, the OLS estimate of β1 is biased.