# Categorical Variables and Interactions

In econometrics, explanatory variables are not always continuous. The neighborhood of a property, the work shift of an employee, the season in which a business operates: all of these are categories, not numbers. How do we incorporate this type of information into a multiple regression model?

This section presents three natural extensions of the linear model. First, the simplest case: a variable with exactly two categories, captured by a **dummy variable** (or indicator variable). Second, the generalization to any number of categories, where a specific pitfall arises that is worth understanding beforehand. Third, **interaction variables**, which allow the effect of a continuous variable to differ across groups. All three topics are built around the same running example — property prices in different neighborhoods — so that the progression is cumulative.

---

(dummy-variables-en)=
## 1. Categorical Variables: The Two-Group Case

### 1.1 The problem: groups in a continuous regression

Consider property prices in a city. Prices depend on the floor area in square meters, but also on the neighborhood. Two apartments of equal size can have very different prices depending on whether one is in a premium neighborhood and the other in a standard one.

If we plot price against $m^2$ and color the points by neighborhood, we see two clouds of points separated vertically: the cluster for neighborhood A above, neighborhood B below. A single regression line that ignores neighborhoods will pass between the two clouds, fitting neither well. More than a cosmetic problem, this is a specification problem: we are omitting a relevant variable — the neighborhood — that is likely correlated with floor area, since premium neighborhoods tend to have larger properties. The omission generates the omitted variable bias discussed in the {ref}`previous section <ovb>`.

The solution requires no new statistical method. We just need to represent the categorical variable in a form that can enter the model as an ordinary regressor.

### 1.2 The dummy variable

A **dummy variable** (also called an indicator or binary variable) takes the value 1 if the observation belongs to one category and 0 if it belongs to the other. For the neighborhood example:

$$
D_i = \begin{cases} 1 & \text{if property } i \text{ is in neighborhood A} \\ 0 & \text{if property } i \text{ is in neighborhood B} \end{cases}
$$

Once $D_i$ is defined, we include it in the model exactly like any other regressor:

$$
\text{Price}_i = \beta_0 + \beta_1 \cdot m^2_i + \beta_2 \cdot D_i + \varepsilon_i
$$

Estimation proceeds by OLS without any modification. The only difference from a continuous regressor is in the interpretation of the coefficient.
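A minimal numpy sketch makes this concrete. The data below are simulated (the true coefficients 50, 2, and 30 are chosen to match the interactive simulation later in this section; the sample size and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Simulated data: neighborhood A (D = 1) has larger properties on average,
# so D and m2 are correlated (illustrative DGP, not estimated from real data)
D = rng.integers(0, 2, size=n)                 # 1 = neighborhood A, 0 = B
m2 = rng.normal(70 + 20 * D, 10, size=n)       # floor area in square meters
price = 50 + 2 * m2 + 30 * D + rng.normal(0, 10, size=n)

# The dummy enters the design matrix like any other regressor
X = np.column_stack([np.ones(n), m2, D])
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)

print(beta_hat)  # approximately [50, 2, 30]
```

Nothing in the estimation step distinguishes `D` from `m2`; the difference is entirely in how the coefficient is read.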

### 1.3 Interpretation: two parallel lines

To understand what $\beta_2$ measures, write the conditional expected price for each neighborhood separately.

For **neighborhood B** ($D_i = 0$):

$$
E[\text{Price}_i \mid m^2_i,\, D_i = 0] = \beta_0 + \beta_1 \cdot m^2_i
$$

For **neighborhood A** ($D_i = 1$):

$$
E[\text{Price}_i \mid m^2_i,\, D_i = 1] = (\beta_0 + \beta_2) + \beta_1 \cdot m^2_i
$$

The model implies two **parallel** regression lines: the same slope $\beta_1$, but different intercepts separated by $\beta_2$.

**Formal result:** under the assumption $E[\varepsilon_i \mid m^2_i, D_i] = 0$, the OLS estimator of $\beta_2$ is unbiased, $E[\hat{\beta}_2] = \beta_2$. This is a direct application of the unbiasedness result for multiple regression — no additional derivation is needed.

**Interpretation of $\beta_2$:** the expected difference in price between a property in neighborhood A and a property in neighborhood B of **equal size**. It captures the average price premium associated with neighborhood A, holding floor area constant. If $\hat{\beta}_2 = 30$, a property in neighborhood A is worth on average 30 units more than an otherwise identical property in neighborhood B.

**Interpretation of $\beta_1$:** the effect of one additional square meter on price, holding the neighborhood constant. The model assumes this slope is the same in both neighborhoods — an assumption that the interactions section relaxes.

```{note}
Why not code the neighborhood as a single number, say 1 for A and 2 for B? With only two categories, any two distinct values produce the same fit, but 0/1 is the coding whose coefficient reads directly as a group difference: $\beta_2$ is the A-minus-B gap and the intercept belongs to the group coded 0. With three or more categories, a single numeric code is no longer harmless: it imposes an arbitrary metric between categories that have no natural ordering, a problem the next section examines in detail.
```

**Alternative example — Retail.** A retail chain wants to estimate the effect of daily customer counts on revenues. It suspects that customers spend more on weekends regardless of how many enter the store. Define:

$$
D_i = \begin{cases} 1 & \text{if day } i \text{ is a weekend} \\ 0 & \text{if day } i \text{ is a weekday} \end{cases}
$$

The model $\text{Sales}_i = \beta_0 + \beta_1 \cdot \text{Customers}_i + \beta_2 \cdot D_i + \varepsilon_i$ estimates how much each additional customer is worth ($\beta_1$) and how much sales differ on weekends relative to weekdays for the same footfall level ($\beta_2$). If $\hat{\beta}_2 > 0$, weekends generate additional revenue beyond what the customer count alone would predict — perhaps through a higher average ticket or a better conversion rate.

### Interactive Simulation

The following simulation generates data where the true price is:

$$
\text{Price}_i = 50 + 2 \cdot m^2_i + 30 \cdot D_i + \varepsilon_i
$$

Neighborhood A has a true price premium of 30 units. You can toggle between three models: none (data only), without dummy, and with dummy.

**What to look for:**
- With the **no dummy** model, the single line averages over both clouds. Note the $R^2$ and compare it to the model with the dummy.
- With the **dummy** model, two parallel lines appear. Verify that the vertical gap between them matches $\hat{\beta}_2$ in the results table.
- Does the coefficient on $m^2$ change between the two models? Here neighborhood A has larger properties on average: omitting the dummy biases the slope upward.

<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://simuecon.com/dummy_variables/?lang=en" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border: 0;" allowfullscreen></iframe>
</div>

---

(multiple-categories-en)=
## 2. Categorical Variables with Multiple Categories

### 2.1 From two groups to k groups

Suppose the city now has three neighborhoods — A, B, and C — with systematically different price levels. How do we extend the model?

One tempting idea is to assign numbers: A = 1, B = 2, C = 3. But this imposes an arbitrary ordering and assumes that the gap from A to B is the same as the gap from B to C, which is almost never true. A single coefficient on that numeric code would force the price effect of neighborhood B to be exactly twice that of neighborhood A, and the effect of C to be exactly three times it, for no reason at all.

The correct solution is to create a **dummy variable for each category**. With three neighborhoods:

$$
D_{A,i} = \mathbf{1}[\text{property } i \text{ in neighborhood A}], \qquad D_{B,i} = \mathbf{1}[\text{property } i \text{ in neighborhood B}], \qquad D_{C,i} = \mathbf{1}[\text{property } i \text{ in neighborhood C}]
$$

Each dummy is a membership indicator for that category. The next step — including all of them at once — reveals a problem worth understanding carefully.

### 2.2 The dummy variable trap

If we try to include all three dummies simultaneously:

$$
\text{Price}_i = \beta_0 + \beta_1 m^2_i + \beta_2 D_{A,i} + \beta_3 D_{B,i} + \beta_4 D_{C,i} + \varepsilon_i
$$

the model is not estimable. The reason is algebraic: for every observation $i$,

$$
D_{A,i} + D_{B,i} + D_{C,i} = 1
$$

because every property belongs to exactly one neighborhood. But a column of ones is already in the model: it is the constant column that multiplies $\beta_0$. We have introduced an exact linear dependence among the regressors — the most severe form of {ref}`multicollinearity <multicolinealidad>`. The matrix $X'X$ is singular and has no inverse; the OLS estimator does not exist.

This situation is known as the **dummy variable trap**. Statistical software typically detects it and automatically drops one of the columns, though not always with a clear warning message. It is important to understand *why* this happens — not just to avoid the mechanical error.
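The trap is easy to verify numerically. In the sketch below (simulated data, with three properties per neighborhood as an illustrative layout), including an intercept plus all three dummies costs the design matrix exactly one rank:

```python
import numpy as np

rng = np.random.default_rng(0)
nbhd = np.repeat(np.arange(3), 3)    # 0 = A, 1 = B, 2 = C; three each
n = len(nbhd)
m2 = rng.normal(80, 15, size=n)

# One dummy column per category: each row sums to 1, so
# D_A + D_B + D_C reproduces the intercept column exactly
D = (nbhd[:, None] == np.arange(3)).astype(float)   # n x 3 indicators

X_trap = np.column_stack([np.ones(n), m2, D])       # 5 columns
print(np.linalg.matrix_rank(X_trap))                # 4, not 5: X'X is singular

X_ok = np.column_stack([np.ones(n), m2, D[:, :2]])  # drop D_C (reference)
print(np.linalg.matrix_rank(X_ok))                  # 4: full column rank
```

Dropping any one of the four linearly dependent columns restores full rank, which is exactly what statistical software does silently.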

### 2.3 The reference category

The solution is to omit one of the dummies. The omitted category is called the **reference category** (or **base group**). With three neighborhoods and neighborhood C as the reference:

$$
\text{Price}_i = \beta_0 + \beta_1 m^2_i + \beta_2 D_{A,i} + \beta_3 D_{B,i} + \varepsilon_i
$$

The three implicit regression lines are:

| Neighborhood | Intercept | Slope |
|---|---|---|
| C (reference) | $\beta_0$ | $\beta_1$ |
| A | $\beta_0 + \beta_2$ | $\beta_1$ |
| B | $\beta_0 + \beta_3$ | $\beta_1$ |

**Interpretation of the coefficients:**

- $\beta_0$: expected price in neighborhood C at $m^2 = 0$. It is the intercept of the reference group's regression line.
- $\beta_2$: average price difference between neighborhood A and neighborhood C, controlling for floor area. If $\hat{\beta}_2 = 30$, neighborhood A commands on average 30 units more than neighborhood C for equal-sized properties.
- $\beta_3$: average price difference between neighborhood B and neighborhood C, controlling for floor area.

All dummy coefficients are **comparisons against the reference category**. To obtain the direct difference between A and B, compute $\hat{\beta}_2 - \hat{\beta}_3$ (with the appropriate standard error for inference).

**Does the choice of reference category matter?** The estimated coefficients change, but the quality of fit and the predictions do not. If we instead choose A as the reference, the coefficient values will differ — now $\beta_2$ measures B vs A and $\beta_3$ measures C vs A — but the $R^2$, MSE, and predicted price for each property remain identical. The choice of reference is about interpretation, not estimation. The most common convention is to omit the most frequent category or the one that serves as the natural baseline for the analysis.
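This invariance can be checked directly. The sketch below simulates data with the same true intercepts as the simulation further down ($\alpha_A = 80$, $\alpha_B = 50$, $\alpha_C = 20$; sample size and noise are illustrative assumptions) and fits the model twice with different reference categories:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
nbhd = rng.integers(0, 3, size=n)              # 0 = A, 1 = B, 2 = C
m2 = rng.normal(80, 15, size=n)
alpha = np.array([80.0, 50.0, 20.0])[nbhd]     # true intercept by neighborhood
price = alpha + 2 * m2 + rng.normal(0, 10, size=n)

def fit(ref):
    """OLS with k - 1 dummies, omitting category `ref`."""
    dummies = [(nbhd == c).astype(float) for c in range(3) if c != ref]
    X = np.column_stack([np.ones(n), m2] + dummies)
    beta, *_ = np.linalg.lstsq(X, price, rcond=None)
    return X @ beta, beta

pred_refC, beta_refC = fit(ref=2)   # dummies for A and B, C is the base
pred_refA, beta_refA = fit(ref=0)   # dummies for B and C, A is the base

# Dummy coefficients differ: ~[60, 30] (A-C, B-C) vs ~[-30, -60] (B-A, C-A)
print(beta_refC[2:], beta_refA[2:])
# ...but the fitted values are identical to numerical precision
print(np.allclose(pred_refC, pred_refA))   # True
```

Both parameterizations span the same column space, so the predictions, residuals, and $R^2$ coincide exactly; only the comparisons the coefficients answer have changed.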

```{important}
**General rule:** for a categorical variable with $k$ categories, include exactly $k - 1$ dummies. Including $k$ dummies triggers the trap. Including fewer than $k - 1$ implicitly groups categories together, which may be a deliberate decision (if the grouped categories have the same effect) or simply an error.
```

**Alternative example — Hotels.** A hotel wants to model the average daily rate as a function of the occupancy rate and the season. It defines three seasons: high (A), shoulder (B), and low (C). With low season as the reference:

$$
\text{Rate}_i = \beta_0 + \beta_1 \cdot \text{Occupancy}_i + \beta_2 D_{A,i} + \beta_3 D_{B,i} + \varepsilon_i
$$

The coefficient $\hat{\beta}_2$ captures the price premium of high season over low season, **controlling for occupancy**. This matters: if the hotel is also fuller during high season, a regression without the occupancy control would mix two distinct effects — the pure seasonal effect on the rate and the effect of higher demand. By including occupancy as a regressor, $\hat{\beta}_2$ isolates the price difference attributable to the season itself.

### Interactive Simulation

The following simulation generates data with three neighborhoods where the true price satisfies:

$$
\text{Price}_i = \alpha_{\text{nbhd}} + 2 \cdot m^2_i + \varepsilon_i, \qquad \alpha_A = 80,\; \alpha_B = 50,\; \alpha_C = 20
$$

You can choose between models with one, two, or all three dummies.

**What to look for:**
- Models with **one dummy** (e.g., only $D_A$): neighborhoods B and C are grouped on a single line. How poorly does that line fit each group separately?
- Models with **two dummies**: three parallel lines appear. Change the reference category and observe how the coefficient values shift while the lines stay in the same positions.
- Model with **all three dummies**: observe the rank-deficiency warning. The OLS estimator does not exist.

<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://simuecon.com/multiple_categories/?lang=en" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border: 0;" allowfullscreen></iframe>
</div>

---

(interaction-variables-en)=
## 3. Interaction Variables

### 3.1 The parallel-lines assumption and its limits

The models in the two previous sections share an implicit assumption: **the slope on $m^2$ is the same in every neighborhood**. The dummy shifts the intercept but does not affect the slope. In the graph, the lines for each group are parallel.

Is this assumption reasonable? It depends on the context. In a premium neighborhood, buyers may value additional square meters more than in a standard neighborhood — perhaps because large properties in exclusive areas access a luxury segment with non-linear pricing. If that is the case, the slope for neighborhood A should be steeper than for neighborhood B, and the lines should not be parallel.

To capture this, we need an **interaction variable**.

### 3.2 The model with an interaction

An **interaction variable** (or interaction term) is the product of a dummy and a continuous variable:

$$
\text{Price}_i = \beta_0 + \beta_1 m^2_i + \beta_2 D_i + \beta_3 (D_i \times m^2_i) + \varepsilon_i
$$

The term $D_i \times m^2_i$ equals $m^2_i$ when the property is in neighborhood A ($D_i = 1$) and zero when it is in neighborhood B ($D_i = 0$). Writing the conditional expectations by group:

For **neighborhood B** ($D_i = 0$):

$$
E[\text{Price}_i \mid m^2_i,\, D_i = 0] = \beta_0 + \beta_1 \, m^2_i
$$

For **neighborhood A** ($D_i = 1$):

$$
E[\text{Price}_i \mid m^2_i,\, D_i = 1] = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\, m^2_i
$$

The model now generates **two lines with different slopes**. The slope for neighborhood B is $\beta_1$; the slope for neighborhood A is $\beta_1 + \beta_3$.

**Interpretation of the coefficients:**

- $\beta_1$: return to one additional square meter in neighborhood B (the reference group, $D = 0$).
- $\beta_3$: difference in slopes between A and B. If $\hat{\beta}_3 > 0$, each square meter is worth more in neighborhood A than in neighborhood B.
- $\beta_2$: difference in intercepts between A and B, evaluated at $m^2 = 0$ — a point that is typically outside the range of the data and therefore difficult to interpret directly.
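A short numpy sketch illustrates the fit. The simulated data use true slopes of 1 in B and 3 in A, the same values as the simulation at the end of this section (the remaining DGP choices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
D = rng.integers(0, 2, size=n)        # 1 = neighborhood A, 0 = B
m2 = rng.normal(80, 15, size=n)
# True model: slope 1 in B, slope 3 in A (beta_3 = 2), intercept gap 30
price = 50 + (1 + 2 * D) * m2 + 30 * D + rng.normal(0, 10, size=n)

# The interaction column is just the elementwise product D * m2
X = np.column_stack([np.ones(n), m2, D, D * m2])
b0, b1, b2, b3 = np.linalg.lstsq(X, price, rcond=None)[0]

print(f"slope in B: {b1:.2f}")        # ~1  (beta_1)
print(f"slope in A: {b1 + b3:.2f}")   # ~3  (beta_1 + beta_3)
```

Constructing the regressor as a product requires no special estimator; OLS treats `D * m2` as one more column.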

### 3.3 The out-of-sample intercept and centering

The difficulty with $\beta_2$ in the interaction model is that it measures the gap between groups when $m^2 = 0$, a value that makes no sense for real properties. This does not affect the validity of the model or the predictions, but it makes the dummy coefficient hard to communicate.

The standard solution is to **center** the continuous variable. Instead of $m^2_i$, use $\tilde{m}^2_i = m^2_i - \overline{m^2}$, where $\overline{m^2}$ is the sample mean of floor areas. The model becomes:

$$
\text{Price}_i = \beta_0 + \beta_1 \tilde{m}^2_i + \beta_2 D_i + \beta_3 (D_i \times \tilde{m}^2_i) + \varepsilon_i
$$

With this reparametrization, $\beta_2$ measures the price gap between A and B for a property of **average size** — a concrete and interpretable quantity, comparable to the $\beta_2$ from the model without an interaction. The slopes and all other inference remain unchanged.
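Because centering is a pure reparametrization, the claim can be confirmed numerically. In this sketch (simulated data, illustrative DGP), the slopes are untouched and the centered dummy coefficient equals $\hat{\beta}_2 + \hat{\beta}_3 \cdot \overline{m^2}$ from the uncentered fit:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
D = rng.integers(0, 2, size=n)
m2 = rng.normal(80 + 10 * D, 15, size=n)
price = 50 + (1 + 2 * D) * m2 + 30 * D + rng.normal(0, 10, size=n)

def fit(x):
    """Interaction model using `x` as the continuous regressor."""
    X = np.column_stack([np.ones(n), x, D, D * x])
    return np.linalg.lstsq(X, price, rcond=None)[0]

b_raw = fit(m2)
b_centered = fit(m2 - m2.mean())        # centered floor area

# The slopes (positions 1 and 3) are identical in both fits...
print(b_raw[[1, 3]], b_centered[[1, 3]])
# ...while the dummy coefficient now measures the A-B gap at the average
# size, i.e. beta_2 + beta_3 * mean(m2) from the uncentered model
print(b_centered[2], b_raw[2] + b_raw[3] * m2.mean())
```

The two design matrices span the same column space, so this is an exact algebraic identity, not an approximation.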

### 3.4 When to use the interaction model

The interaction model adds a parameter and is more flexible, but also more demanding in terms of precision. The empirical question is whether the difference in slopes is statistically significant. The $t$-statistic on $\hat{\beta}_3$ answers this directly: if $H_0: \beta_3 = 0$ cannot be rejected, the parallel-lines model is sufficient.

An alternative way to think about the interaction model: it produces exactly the same coefficient estimates as running separate OLS regressions for each group. The standard errors differ, however, because the global model imposes **homoskedasticity across groups** (the same error variance in A and B), while separate regressions estimate each group's variance on its own. If that assumption seems questionable, use robust standard errors in the global model or compare the standard errors of the two separate regressions.
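The numerical equivalence of the coefficients is exact, as the following sketch (simulated data, illustrative DGP) confirms by comparing the global interaction fit with two group-by-group regressions:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
D = rng.integers(0, 2, size=n)
m2 = rng.normal(80, 15, size=n)
price = 50 + (1 + 2 * D) * m2 + 30 * D + rng.normal(0, 10, size=n)

# Global model: intercept, m2, dummy, and interaction
X = np.column_stack([np.ones(n), m2, D, D * m2])
b0, b1, b2, b3 = np.linalg.lstsq(X, price, rcond=None)[0]

def simple_ols(x, y):
    """Simple regression of y on a constant and x."""
    A = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(A, y, rcond=None)[0]

# Separate regressions, one per group
a_B, s_B = simple_ols(m2[D == 0], price[D == 0])   # intercept, slope in B
a_A, s_A = simple_ols(m2[D == 1], price[D == 1])   # intercept, slope in A

print(np.allclose([a_B, s_B], [b0, b1]))            # True
print(np.allclose([a_A, s_A], [b0 + b2, b1 + b3]))  # True
```

The normal equations of the global model decouple into one block per group, which is exactly the decomposition derived in appendix A.3.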

**Alternative example — Retail.** A retail chain has two store formats: flagships (large, in central locations) and standard stores. Management wants to know whether each additional customer generates the same revenue across formats.

$$
\text{Sales}_i = \beta_0 + \beta_1 \cdot \text{Customers}_i + \beta_2 D_{\text{flagship},i} + \beta_3 (D_{\text{flagship},i} \times \text{Customers}_i) + \varepsilon_i
$$

If $\hat{\beta}_3 > 0$, the average ticket is higher at flagships: each customer entering a flagship generates more revenue than one entering a standard store. The model without an interaction would have imposed the same revenue-per-customer in both formats, potentially distorting revenue projections and investment decisions.

### Interactive Simulation

The following simulation generates data where the true process has different slopes by neighborhood ($\beta_A = 3$, $\beta_B = 1$). You can toggle between four models: none, pooled (no dummy), dummy only, and the full model (dummy + interaction).

**What to look for:**
- With the **pooled** model, a single gray line passes through both clouds without fitting either well.
- With **dummy only**, the lines are parallel but do not follow each cloud correctly: the model imposes the same slope even though the true slopes differ.
- With the **full model**, the lines become non-parallel and fit each group. Verify that $\hat{\beta}_3$ in the table matches the slope difference between neighborhoods.
- Notice how $R^2$ improves progressively: Pooled → Dummy only → Full model.

<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://simuecon.com/interaction_terms/?lang=en" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border: 0;" allowfullscreen></iframe>
</div>

---

## Appendix: Formal Derivations

(apendice-categoricas-en)=

### A.1 The dummy as a special case of the Frisch-Waugh-Lovell Theorem

By the Frisch-Waugh-Lovell Theorem (derived in the {ref}`appendix of the previous section <prueba-varianza-multiple>`), the estimator $\hat{\beta}_2$ in the model with a dummy is numerically equal to the coefficient from the simple regression of $y$ on $\tilde{D}$, where $\tilde{D}$ are the residuals from projecting $D_i$ onto the other regressors (here, $m^2$ and the constant).

When the continuous regressor and the dummy are orthogonal in sample — that is, when $\overline{m^2_A} = \overline{m^2_B}$ (same average floor area in both neighborhoods) — the residuals $\tilde{D}$ coincide with the deviations of $D_i$ from its mean, and $\hat{\beta}_2$ reduces exactly to the difference in group means of $y$:

$$
\hat{\beta}_2 = \bar{y}_A - \bar{y}_B \qquad \text{(only when } \overline{m^2_A} = \overline{m^2_B}\text{)}
$$

In the general case with correlation between $D$ and $m^2$, this simplification does not hold: $\hat{\beta}_2$ captures the mean difference **after removing the effect of floor area** — which is precisely the advantage of the regression approach over a simple comparison of means.
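This is easy to verify numerically. The sketch below (simulated data in which $D$ and $m^2$ are deliberately correlated) reproduces $\hat{\beta}_2$ via the FWL residual regression and contrasts it with the raw difference in group means:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200
D = rng.integers(0, 2, size=n).astype(float)
m2 = rng.normal(70 + 20 * D, 10, size=n)     # D and m2 are correlated
price = 50 + 2 * m2 + 30 * D + rng.normal(0, 10, size=n)

# Full regression: price on constant, m2, D
X = np.column_stack([np.ones(n), m2, D])
beta_full = np.linalg.lstsq(X, price, rcond=None)[0]

# FWL: residualize D on the remaining regressors, then a simple
# regression of price on that residual (no constant needed, since
# the residual is orthogonal to the column of ones)
Z = np.column_stack([np.ones(n), m2])
D_tilde = D - Z @ np.linalg.lstsq(Z, D, rcond=None)[0]
beta2_fwl = (D_tilde @ price) / (D_tilde @ D_tilde)

print(np.isclose(beta_full[2], beta2_fwl))   # True: identical by FWL

# The raw difference in group means mixes in the floor-area effect,
# landing well above the true premium of 30
print(price[D == 1].mean() - price[D == 0].mean())
```

Because the groups differ in average size, the naive mean comparison overstates the neighborhood premium, while the regression (equivalently, the FWL residual calculation) removes the floor-area component first.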

### A.2 Rank deficiency with k dummies for k categories

Let $\mathbf{1}_n$ be the vector of ones of length $n$ (the intercept column in the matrix $X$), and let $\mathbf{d}_A, \mathbf{d}_B, \mathbf{d}_C \in \{0,1\}^n$ be the three dummy vectors. By construction, each observation belongs to exactly one neighborhood, so:

$$
\mathbf{d}_A + \mathbf{d}_B + \mathbf{d}_C = \mathbf{1}_n
$$

This means the intercept column is an exact linear combination of the three dummy columns. The data matrix $X$ — which includes all four columns — has rank less than 4:

$$
\text{rank}(X) \leq k = 3 < k + 1 = 4
$$

and $X'X$ has no inverse. The solution is to drop any one of the four linearly dependent columns. In practice, one of the dummies is omitted, producing the reference category.

### A.3 Equivalence between the interaction model and separate regressions by group

Consider the interaction model:

$$
y_i = \beta_0 + \beta_1 x_i + \beta_2 D_i + \beta_3 (D_i x_i) + \varepsilon_i
$$

**Step 1 — First-order conditions for each subgroup.**

For observations with $D_i = 0$ (group B), the interaction term vanishes and the model reduces to $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. The OLS conditions restricted to this subset determine $\hat{\beta}_0$ and $\hat{\beta}_1$ exactly as a simple regression on group B.

For observations with $D_i = 1$ (group A), the model is $y_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_i + \varepsilon_i$. The OLS conditions restricted to group A determine $\hat{\beta}_0 + \hat{\beta}_2$ and $\hat{\beta}_1 + \hat{\beta}_3$ as a simple regression on group A.

**Step 2 — Conclusion.**

The four first-order conditions of the global model decompose into two independent two-equation systems. The solutions are identical to those obtained from estimating two separate regressions, one per group. The difference is that the global model imposes $\text{Var}(\varepsilon_i \mid X) = \sigma^2$ (homoskedasticity across groups), while separate regressions allow different variances. If cross-group homoskedasticity seems doubtful, use robust standard errors in the global model or compare standard errors across the two separate regressions.
