Using Dummy Variables in Regression Analysis
So far, we've mostly used quantitative variables (income, years of education, and so on). But what about qualitative information: variables that represent categories such as gender or marital status?
We can include this information in regressions using dummy variables (or binary variables), which take a value of 1 or 0.
The simplest case is a variable with two categories, like gender. We create one dummy variable:
female = 1 if the person is female, 0 if male
wage = β0 + δ0female + β1educ + u
Interpretation: The coefficient δ0 measures the difference in the average wage between females and males, holding education constant. It's an intercept shift.
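Here is a minimal sketch of this model in Python using statsmodels, fit on made-up data (the variable names mirror the equation above; the simulated coefficients are purely illustrative):

```python
# A minimal sketch on synthetic data, assuming statsmodels and pandas are available.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "female": rng.integers(0, 2, n),   # dummy: 1 = female, 0 = male
    "educ": rng.integers(8, 21, n),    # years of education
})
# Simulate wages with an intercept shift of -2 for females (illustrative numbers)
df["wage"] = 5 + 1.2 * df["educ"] - 2 * df["female"] + rng.normal(0, 3, n)

# wage = b0 + d0*female + b1*educ + u
fit = smf.ols("wage ~ female + educ", data=df).fit()
print(fit.params)  # the 'female' coefficient estimates d0, the intercept shift
```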
In the example above, males are the base group (or benchmark group): the intercept β0 applies to them, since female = 0 for every male observation.
What would happen if we tried to include dummy variables for both categories?
wage = β0 + δ0female + γ0male + β1educ + u
This would cause perfect multicollinearity. For every observation, `female + male = 1`. Since the intercept is already a column of 1s, the regressors are perfectly linearly related.
This is the dummy variable trap: your software will either report an error or silently drop one of the dummies. The rule is: if you have `g` categories, include `g-1` dummy variables in a model with an intercept; the omitted category becomes the base group.
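As an illustrative sketch, pandas can build the `g-1` dummies for you; `drop_first=True` omits one category so the trap is avoided:

```python
# Sketch: constructing dummies with pandas (category labels here are made up).
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

dummies_all = pd.get_dummies(df["gender"])                   # both columns: female + male = 1 (the trap)
dummies_ok  = pd.get_dummies(df["gender"], drop_first=True)  # drops 'female', keeping g-1 = 1 column
print(dummies_all.sum(axis=1).unique())  # always 1 -> perfectly collinear with the intercept
```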
What if a variable has more than two categories, like marital status (single, married, divorced)?
wage = β0 + δ1married + δ2divorced + ... + u
Here, δ1 is the average wage difference between married and single people (the base group), and δ2 is the difference between divorced and single people.
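A sketch with a three-category variable: in a statsmodels formula, `C()` creates the `g-1` dummies, and the `Treatment(reference=...)` contrast lets you choose `single` as the base group (data here are synthetic):

```python
# Sketch of a regression with a multi-category dummy set, on made-up data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "marital": rng.choice(["single", "married", "divorced"], n),
    "educ": rng.integers(8, 21, n),
})
df["wage"] = 5 + 1.0 * df["educ"] + rng.normal(0, 3, n)

fit = smf.ols("wage ~ C(marital, Treatment(reference='single')) + educ", data=df).fit()
print(fit.params)  # coefficients on 'married' and 'divorced' are differences from 'single'
```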
So far, we've only allowed the intercept to change. What if the effect of education on wages is different for men and women? We can model this by interacting the dummy with the quantitative variable.
log(wage) = β0 + δ0female + β1educ + δ1(female × educ) + u
The return to education for men is β1. For women, it's β1 + δ1. The interaction term allows the two lines to have different slopes.
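A sketch of the interaction model on synthetic data: in a statsmodels formula, `female * educ` expands to `female + educ + female:educ`, so the slope on education is allowed to differ by gender.

```python
# Sketch of a dummy-quantitative interaction, on made-up data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({"female": rng.integers(0, 2, n), "educ": rng.integers(8, 21, n)})
df["lwage"] = (1.0 + 0.08 * df["educ"] - 0.2 * df["female"]
               - 0.01 * df["female"] * df["educ"] + rng.normal(0, 0.3, n))

fit = smf.ols("lwage ~ female * educ", data=df).fit()
b1 = fit.params["educ"]          # return to education for men
d1 = fit.params["female:educ"]   # extra slope for women
print("return for men:", b1, "return for women:", b1 + d1)
```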
What if the outcome we want to explain is itself a yes/no event? We can use a binary dependent variable.
y = β0 + β1x1 + ... + βkxk + u
New Interpretation: When y is 0/1, the conditional expectation E(y|x) equals P(y = 1|x), the probability that y = 1 given the x's. So the regression explains the probability of the event occurring. This is called the Linear Probability Model (LPM).
βj is the change in the probability of success when xj increases by one unit, holding the other factors fixed.
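A minimal LPM sketch on made-up data (the variable names, like `inlf` for labor-force participation, are only illustrative): OLS on a 0/1 outcome, so the fitted values are estimated probabilities.

```python
# Sketch of a Linear Probability Model on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({"educ": rng.integers(8, 21, n), "kids": rng.integers(0, 4, n)})
p = np.clip(0.1 + 0.03 * df["educ"] - 0.05 * df["kids"], 0, 1)
df["inlf"] = rng.binomial(1, p)   # 1 = in the labor force, 0 = not

lpm = smf.ols("inlf ~ educ + kids", data=df).fit()
print(lpm.params)                 # each slope is a change in P(inlf = 1) per unit change
```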
While simple to estimate and interpret, the LPM has some drawbacks.
The model can predict probabilities less than 0 or greater than 1, which is logically impossible.
The variance of the error term is not constant; it depends on the x's. This means our standard errors and t-stats are not strictly valid (though often okay in large samples).
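Continuing the sketch above (reusing `df` and `smf`), a common response to the heteroskedasticity is to report robust standard errors, and it is worth checking how many fitted "probabilities" fall outside [0, 1]:

```python
# Sketch, continuing the synthetic LPM example: robust (HC1) standard errors
# and a check of the fitted values.
lpm_robust = smf.ols("inlf ~ educ + kids", data=df).fit(cov_type="HC1")
print(lpm_robust.bse)  # heteroskedasticity-robust standard errors

fitted = lpm_robust.fittedvalues
print(((fitted < 0) | (fitted > 1)).sum(), "fitted 'probabilities' fall outside [0, 1]")
```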
Despite these issues, the LPM is often a good and simple starting point for binary outcomes, especially for estimating partial effects near the average of the data.
Qualitative information is incredibly useful, and dummy variables are the tool we use to incorporate it into our regressions.