Using Dummy Variables in Regression Analysis
So far, we've mostly used quantitative variables (income, years of education, and so on). But what about qualitative information: variables that represent categories such as gender or marital status?
We can include this information in regressions using dummy variables (or binary variables), which take a value of 1 or 0.
The simplest case is a variable with two categories, like gender. We create one dummy variable:
female = 1 if the person is female, 0 if male
wage = β0 + δ0female + β1educ + u
Interpretation: The coefficient δ0 measures the difference in the average wage between females and males, holding education constant. It's an intercept shift.
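Here is a minimal sketch of this model in Python using statsmodels, fit on made-up data (the variable names mirror the equation above; the simulated coefficients are purely illustrative):

```python
# A minimal sketch on synthetic data, assuming statsmodels and pandas are available.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "female": rng.integers(0, 2, n),   # dummy: 1 = female, 0 = male
    "educ": rng.integers(8, 21, n),    # years of education
})
# Simulate wages with an intercept shift of -2 for females (illustrative numbers)
df["wage"] = 5 + 1.2 * df["educ"] - 2 * df["female"] + rng.normal(0, 3, n)

# wage = b0 + d0*female + b1*educ + u
fit = smf.ols("wage ~ female + educ", data=df).fit()
print(fit.params)  # the 'female' coefficient estimates d0, the intercept shift
```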
In the example above, males are the base group (or benchmark group): the intercept β0 applies to them, since female = 0 for every male observation.
What would happen if we tried to include dummy variables for both categories?
wage = β0 + δ0female + γ0male + β1educ + u
This would cause perfect multicollinearity. For every observation, `female + male = 1`. Since the intercept is already a column of 1s, the regressors are perfectly linearly related.
This is the dummy variable trap: your software will either report an error or silently drop one of the dummies. The rule is: if you have `g` categories, include `g-1` dummy variables in a model with an intercept; the omitted category becomes the base group.
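As an illustrative sketch, pandas can build the `g-1` dummies for you; `drop_first=True` omits one category so the trap is avoided:

```python
# Sketch: constructing dummies with pandas (category labels here are made up).
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

dummies_all = pd.get_dummies(df["gender"])                   # both columns: female + male = 1 (the trap)
dummies_ok  = pd.get_dummies(df["gender"], drop_first=True)  # drops 'female', keeping g-1 = 1 column
print(dummies_all.sum(axis=1).unique())  # always 1 -> perfectly collinear with the intercept
```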
What if a variable has more than two categories, like marital status (single, married, divorced)?
wage = β0 + δ1married + δ2divorced + ... + u
Here, δ1 is the average wage difference between married and single people (the base group), and δ2 is the difference between divorced and single people.
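A sketch with a three-category variable: in a statsmodels formula, `C()` creates the `g-1` dummies, and the `Treatment(reference=...)` contrast lets you choose `single` as the base group (data here are synthetic):

```python
# Sketch of a regression with a multi-category dummy set, on made-up data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "marital": rng.choice(["single", "married", "divorced"], n),
    "educ": rng.integers(8, 21, n),
})
df["wage"] = 5 + 1.0 * df["educ"] + rng.normal(0, 3, n)

fit = smf.ols("wage ~ C(marital, Treatment(reference='single')) + educ", data=df).fit()
print(fit.params)  # coefficients on 'married' and 'divorced' are differences from 'single'
```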
So far, we've only allowed the intercept to change. What if the effect of education on wages is different for men and women? We can model this by interacting the dummy with the quantitative variable.
log(wage) = β0 + δ0female + β1educ + δ1(female × educ) + u
The return to education for men is β1. For women, it's β1 + δ1. The interaction term allows the two lines to have different slopes.
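A sketch of the interaction model on synthetic data: in a statsmodels formula, `female * educ` expands to `female + educ + female:educ`, so the slope on education is allowed to differ by gender.

```python
# Sketch of a dummy-quantitative interaction, on made-up data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({"female": rng.integers(0, 2, n), "educ": rng.integers(8, 21, n)})
df["lwage"] = (1.0 + 0.08 * df["educ"] - 0.2 * df["female"]
               - 0.01 * df["female"] * df["educ"] + rng.normal(0, 0.3, n))

fit = smf.ols("lwage ~ female * educ", data=df).fit()
b1 = fit.params["educ"]          # return to education for men
d1 = fit.params["female:educ"]   # extra slope for women
print("return for men:", b1, "return for women:", b1 + d1)
```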
What if the outcome we want to explain is itself a yes/no event? We can use a binary dependent variable.
y = β0 + β1x1 + ... + βkxk + u
New Interpretation: When y is 0/1, the conditional expectation E(y|x) equals P(y = 1|x), the probability that y = 1 given the x's. So the regression explains the probability of the event occurring. This is called the Linear Probability Model (LPM).
βj is the change in the probability of success when xj increases by one unit, holding the other factors fixed.
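A minimal LPM sketch on made-up data (the variable names, like `inlf` for labor-force participation, are only illustrative): OLS on a 0/1 outcome, so the fitted values are estimated probabilities.

```python
# Sketch of a Linear Probability Model on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({"educ": rng.integers(8, 21, n), "kids": rng.integers(0, 4, n)})
p = np.clip(0.1 + 0.03 * df["educ"] - 0.05 * df["kids"], 0, 1)
df["inlf"] = rng.binomial(1, p)   # 1 = in the labor force, 0 = not

lpm = smf.ols("inlf ~ educ + kids", data=df).fit()
print(lpm.params)                 # each slope is a change in P(inlf = 1) per unit change
```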
While simple to estimate and interpret, the LPM has some drawbacks.
The model can predict probabilities less than 0 or greater than 1, which is logically impossible.
The variance of the error term is not constant; it depends on the x's. This means our standard errors and t-stats are not strictly valid (though often okay in large samples).
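Continuing the sketch above (reusing `df` and `smf`), a common response to the heteroskedasticity is to report robust standard errors, and it is worth checking how many fitted "probabilities" fall outside [0, 1]:

```python
# Sketch, continuing the synthetic LPM example: robust (HC1) standard errors
# and a check of the fitted values.
lpm_robust = smf.ols("inlf ~ educ + kids", data=df).fit(cov_type="HC1")
print(lpm_robust.bse)  # heteroskedasticity-robust standard errors

fitted = lpm_robust.fittedvalues
print(((fitted < 0) | (fitted > 1)).sum(), "fitted 'probabilities' fall outside [0, 1]")
```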
Despite these issues, the LPM is often a good and simple starting point for binary outcomes, especially for estimating partial effects near the average of the data.
Qualitative information is incredibly useful, and dummy variables are the tool we use to incorporate it into our regressions.