Multiple Regression: Interpretation, Properties, and Specification#
So far we have studied simple regression: a model where a single variable explains the behavior of another. But economic reality rarely works that way. A firm’s sales do not depend only on its advertising budget; profitability does not respond solely to revenues; business performance cannot be explained by a single variable. In all these cases, trying to measure the effect of one variable while ignoring the others exposes us to a fundamental problem: are we measuring what we think we are measuring, or are we confounding effects?
Multiple regression is the answer to this question. Its central logic is simple but powerful: by including several explanatory variables simultaneously, the model attempts to isolate the effect of each one holding the others constant. This section presents the model, its statistical properties, and the problems that arise when the specification does not correctly reflect the data-generating process.
1. The Model and the Interpretation of Coefficients#
The multiple regression model extends simple regression to the case of \(k\) explanatory variables:

\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon
\]
where \(x_1, x_2, \ldots, x_k\) are the regressors, \(\beta_0, \beta_1, \ldots, \beta_k\) are the population parameters, and \(\varepsilon\) is the error term that captures everything affecting \(y\) not included in the model.
The key difference from simple regression lies in the interpretation of the coefficients. To understand it, consider a concrete example. Suppose a firm wants to explain its monthly sales (\(sales\)) as a function of advertising spend (\(advertising\), in thousands) and the number of salespeople on its commercial team (\(salespeople\)):

\[
sales = \beta_0 + \beta_1\, advertising + \beta_2\, salespeople + \varepsilon
\]
In the simple regression \(sales = \beta_0 + \beta_1\, advertising + \varepsilon\), the coefficient \(\beta_1\) captures the raw association between advertising and sales. But that association may be contaminated: if in the data firms with higher advertising spend also tend to have larger sales teams, then \(\beta_1\) from the simple model will conflate the effect of advertising with the effect of sales team size.
By adding \(salespeople\) to the model, the coefficient \(\beta_1\) changes its meaning. It now measures the effect of increasing the advertising budget by one thousand once the size of the sales team is already accounted for. In econometric jargon, \(\beta_1\) measures the effect of advertising controlling for the number of salespeople or holding the sales team constant. The comparison is no longer between any two firms with more or less advertising, but between firms with the same team size that differ in their advertising spend.
This shift is substantial. Wooldridge summarizes it elegantly:
“The power of multiple regression analysis is that it allows us to do in nonexperimental environments what natural scientists are able to do in a controlled laboratory setting: keep other factors fixed.”
In a laboratory, a scientist can fix conditions and vary one factor at a time. In economics and social sciences we rarely have that luxury. Multiple regression is our way of approximating, with observational data, that ability to “keep everything else constant.”
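To make this concrete, here is a minimal simulated sketch of the sales example, assuming numpy is available. The data-generating values (a true advertising effect of 1, a team-size effect of 3, and the strength of the link between the two regressors) are illustrative assumptions, not numbers from the text.

```python
# Illustrative simulation: the simple regression conflates two effects,
# the multiple regression separates them (all values are assumptions).
import numpy as np

rng = np.random.default_rng(42)
n = 1000
salespeople = rng.poisson(10, size=n).astype(float)
advertising = 2.0 + 0.5 * salespeople + rng.normal(size=n)       # correlated with team size
sales = 5.0 + 1.0 * advertising + 3.0 * salespeople + rng.normal(size=n)

# Simple regression: sales on advertising only
X_simple = np.column_stack([np.ones(n), advertising])
b_simple = np.linalg.lstsq(X_simple, sales, rcond=None)[0]

# Multiple regression: sales on advertising and salespeople
X_mult = np.column_stack([np.ones(n), advertising, salespeople])
b_mult = np.linalg.lstsq(X_mult, sales, rcond=None)[0]

print(b_simple[1])   # well above 1: mixes advertising with team size
print(b_mult[1])     # close to the true effect of 1, "holding the team constant"
```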
Note
This idea of keeping other factors fixed is not just a convenient statistical property: it is one of the concepts that will prove important in the study of causality in econometrics. When we later introduce causal models, one of the central questions will be precisely whether the model controls correctly for the relevant variables. The language we are building now — conditional coefficients, partial effects, omitted variables — will be one of the tools with which we will analyze that question.
2. The OLS Estimator#
In simple regression we derived the estimator by two equivalent routes: the Method of Moments (MoM) and minimization of the sum of squares (OLS). Both extend directly to the multiple case.
From the MoM perspective, the assumption \(E[\varepsilon \mid x_1, \ldots, x_k] = 0\) generates \(k+1\) moment conditions — one per parameter: \(E[\varepsilon] = 0\) and \(E[\varepsilon\, x_j] = 0\) for \(j = 1, \ldots, k\). Replacing population expectations with sample averages gives a system of \(k+1\) equations in \(k+1\) unknowns, exactly identified.
From the OLS perspective, the criterion is the same as always: choose \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k\) to minimize the sum of squared residuals:

\[
\sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}\right)^2
\]
Both routes lead to the same estimator. The solution exists and is unique as long as no regressor is an exact linear combination of the others (more on this in section 6). Unlike simple regression, the first-order conditions of the minimization problem do not have a simple scalar solution in the general case: they give rise to a system of linear equations — the normal equations — whose compact solution requires matrix algebra. The details are in the appendix.
3. Statistical Properties via Simulation#
Just as in the Simple Regression section, we will explore the statistical properties of the OLS estimator through simulation before presenting the formal results. The logic is the same: we assume we know the true population model, generate many independent samples, run OLS on each one, and study the distribution of the resulting estimates.
The model we will simulate is:

\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon
\]
with known population values for \(\beta_0\), \(\beta_1\), and \(\beta_2\). Unlike simple regression, the model now has two regressors and we can control four parameters: the standard deviation of the error (\(\sigma_\varepsilon\)), the standard deviations of the regressors (\(\sigma_{X_1}\) and \(\sigma_{X_2}\)), and — a key new dimension — the correlation between \(x_1\) and \(x_2\). This last dimension did not exist in the simple model and, as we will see, has a striking effect on the precision of the estimators.
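A minimal Monte Carlo sketch of this experiment, assuming numpy is available, is shown below; all parameter values are illustrative defaults that you can vary to reproduce the comparisons discussed next.

```python
# Monte Carlo sketch: many samples from a known two-regressor model,
# OLS on each, then the sampling distribution of the estimates.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 5000                        # sample size and number of simulated samples
b0, b1, b2 = 1.0, 2.0, -1.0                # true population coefficients (assumptions)
sigma_eps = 1.0                            # std. dev. of the error
sigma_x1, sigma_x2, rho = 1.0, 1.0, 0.5    # regressor spreads and their correlation

# Covariance matrix of (x1, x2) implied by the chosen spreads and correlation
cov = [[sigma_x1**2, rho * sigma_x1 * sigma_x2],
       [rho * sigma_x1 * sigma_x2, sigma_x2**2]]

estimates = np.empty((reps, 3))
for r in range(reps):
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    eps = rng.normal(0.0, sigma_eps, size=n)
    y = b0 + b1 * x[:, 0] + b2 * x[:, 1] + eps
    X = np.column_stack([np.ones(n), x])              # design matrix with intercept
    estimates[r], *_ = np.linalg.lstsq(X, y, rcond=None)

print("mean of estimates:    ", estimates.mean(axis=0))   # close to (b0, b1, b2)
print("std. dev. of estimates:", estimates.std(axis=0))
```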
As you explore the simulation, pay attention to:
Are the histograms of \(\hat{\beta}_1\) and \(\hat{\beta}_2\) centered on the true values? Does that change when you vary the parameters?
What happens to the spread of the distributions as you increase \(\sigma_\varepsilon\)?
What happens when you increase \(\sigma_{X_1}\) or \(\sigma_{X_2}\)? Does more variation in the regressors improve or worsen precision?
Vary the correlation between \(x_1\) and \(x_2\) from low values up to 0.9 or higher. What effect does it have on the distribution of the estimators?
4. What Do We Observe?#
Unbiasedness. Across all parameter combinations, the histograms of \(\hat{\beta}_1\) and \(\hat{\beta}_2\) are centered on the true values. This holds regardless of the noise level, the spread of the regressors, or the correlation between them. Individual estimates deviate from the true value, but not systematically in any direction.
Model noise. Increasing \(\sigma_\varepsilon\) noticeably widens both distributions. A noisier model produces more dispersed estimates: although they remain unbiased on average, each individual estimate can stray quite far from the true value. This behavior is identical to what we observed in simple regression.
Regressor spread. Greater \(\sigma_{X_1}\) narrows the distribution of \(\hat{\beta}_1\), and greater \(\sigma_{X_2}\) narrows that of \(\hat{\beta}_2\). More variation in a regressor provides more information to identify its slope, reducing uncertainty about that coefficient.
Correlation between regressors. This is the most striking result of the multiple model, with no analog in simple regression. As the correlation between \(x_1\) and \(x_2\) grows, the distributions of both estimators widen dramatically. With high correlation, estimates can spread enormously even though the model remains unbiased. This sensitivity to the correlation between regressors is the core of the multicollinearity problem, which we develop in section 6.
These patterns are not coincidences. Statistical theory predicts them precisely, as we see next.
5. Formal Results#
Unbiasedness#
The simulation showed that the distributions of \(\hat{\beta}_1\) and \(\hat{\beta}_2\) are centered on the true values, regardless of the parameters chosen. The formal result confirms this is not a coincidence: under the assumption \(E[\varepsilon \mid x_1, \ldots, x_k] = 0\), the OLS estimators are exactly unbiased.
Formal result:

\[
E[\hat{\beta}_j] = \beta_j, \qquad j = 0, 1, \ldots, k
\]
The proof is in the appendix.
Variance#
The simulation also showed that greater \(\sigma_\varepsilon\) widens the distributions, greater \(\sigma_{X_j}\) narrows them, and — crucially — greater correlation between \(x_1\) and \(x_2\) widens them considerably. The variance formula captures exactly these three effects. For the \(j\)-th coefficient:
Formal result:

\[
\text{Var}(\hat{\beta}_j) = \frac{\sigma^2}{SST_j\,(1 - R_j^2)}, \qquad j = 1, \ldots, k
\]
where \(\sigma^2\) is the error variance, \(SST_j = \sum_{i=1}^n (x_{ij} - \bar{x}_j)^2\) is the total variation in the \(j\)-th regressor, and \(R_j^2\) is the \(R^2\) from the auxiliary regression of \(x_j\) on all other explanatory variables in the model. The formal derivation is in the appendix.
The first two factors are identical to those from simple regression: greater \(\sigma^2\) increases the variance of the estimator; greater \(SST_j\) reduces it. The third factor, \((1 - R_j^2)\), is new relative to simple regression — it has no analog there because with a single regressor there is no auxiliary regression. It measures how much own variation — not shared with the other regressors — \(x_j\) has. When \(x_j\) is highly correlated with the other regressors, \(R_j^2\) approaches 1, the denominator collapses toward zero, and the variance of the estimator explodes. This is exactly what you observed in the simulation when increasing the correlation between \(x_1\) and \(x_2\).
To complete the picture: in simple regression, \(\text{Var}(\hat{\beta}_1) = \sigma^2 / SST_x\). The multiple case is a direct extension with the additional factor \((1 - R_j^2)\) that penalizes collinearity between regressors.
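A quick numerical check of this formula is sketched below, under an assumed setup (sample size, correlation, and coefficients are illustrative choices): the empirical variance of \(\hat{\beta}_1\) across simulated samples should match \(\sigma^2 / (SST_1 (1 - R_1^2))\) on average.

```python
# Numerical check of the variance formula (illustrative simulated setup).
import numpy as np

rng = np.random.default_rng(1)
n, reps, sigma, rho = 200, 4000, 1.0, 0.9
cov = [[1.0, rho], [rho, 1.0]]

b1_hats = np.empty(reps)
formula = np.empty(reps)
for r in range(reps):
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = 1.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(0.0, sigma, size=n)
    X = np.column_stack([np.ones(n), x])
    b1_hats[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]

    # Auxiliary regression of x1 on x2 gives R_1^2; SST_1 is the total variation in x1
    x1, x2 = x[:, 0], x[:, 1]
    slope = np.cov(x1, x2)[0, 1] / x2.var(ddof=1)
    resid = (x1 - x1.mean()) - slope * (x2 - x2.mean())
    sst1 = ((x1 - x1.mean()) ** 2).sum()
    r2_1 = 1.0 - (resid ** 2).sum() / sst1
    formula[r] = sigma**2 / (sst1 * (1.0 - r2_1))

print("empirical Var(beta1_hat):", b1_hats.var())
print("average of the formula:  ", formula.mean())
```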
6. The Multicollinearity Problem#
Multicollinearity describes the situation in which the regressors are highly correlated with each other. It is not a specification error in the strict sense, but a feature of the data that amplifies the variance problems described in the previous section.
As the correlation between \(x_1\) and \(x_2\) grows, the \(R_j^2\) of the auxiliary regression also grows, reducing the factor \((1 - R_j^2)\) in the denominator of the variance formula presented above. In the extreme limit, when \(R_j^2 = 1\) — that is, when one regressor is an exact linear combination of the others — the denominator is exactly zero: the OLS estimator is not defined. This is called perfect multicollinearity.
An example of perfect multicollinearity: including in the same model both years of experience and months of experience for each person. Since months are exactly 12 times years, the model cannot distinguish the effect of one variable from the other.
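In a small sketch with made-up data, this shows up as a rank-deficient \(X'X\): the normal equations have no unique solution.

```python
# Perfect multicollinearity in miniature: months of experience is exactly 12x years,
# so X'X is singular and OLS is not defined (illustrative data).
import numpy as np

rng = np.random.default_rng(5)
years = rng.integers(0, 20, size=50).astype(float)
months = 12.0 * years
X = np.column_stack([np.ones(50), years, months])

print(np.linalg.matrix_rank(X.T @ X))   # 2, not 3: the matrix is rank deficient
# np.linalg.solve(X.T @ X, X.T @ rng.normal(size=50))  # would raise a LinAlgError
```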
With imperfect multicollinearity — high but not perfect correlation — the estimator exists, but standard errors can be very large. This produces small \(t\)-statistics and coefficients that appear statistically insignificant, even when the variables are jointly important. A classic symptom: the model’s \(F\)-test rejects the joint null hypothesis, but no individual coefficient appears significant.
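That symptom can be reproduced with simulated data. The sketch below assumes statsmodels is installed; the correlation of 0.98 and the coefficients of 0.5 are illustrative choices.

```python
# Classic multicollinearity symptom: jointly significant, individually not.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, rho = 100, 0.98
x = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
y = 1.0 + 0.5 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(x)).fit()
print(res.f_pvalue)      # joint F-test: typically rejects decisively
print(res.pvalues[1:])   # individual t-tests: often fail to reject for either slope
```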
Two important practical points:
Multicollinearity does not introduce bias — the coefficients remain unbiased. The problem is one of precision, not accuracy.
It cannot be solved simply by collecting more observations, because it is a feature of the relationship between regressors, not of sample size. If the regressors are intrinsically correlated in the population, more data will not change that.
To see this effect in action, return to the simulation in section 3 and push the correlation between \(x_1\) and \(x_2\) to a value like 0.95, close to but below 1. Observe how the distributions of the estimators spread dramatically. This is the practical limit of multicollinearity: not an error, but a reflection that the data do not contain enough independent variation in each regressor to identify their separate effects with precision.
7. Specification Problems#
We call a specification problem any discrepancy between the model we estimate and the true data-generating process. This discrepancy can arise from several sources: relevant variables we omit, irrelevant variables we include, incorrect functional forms, or — as we will see later — the inclusion of variables that introduce new biases despite appearing to be reasonable controls.
The results of the previous sections are valid under the assumption that the model is correctly specified. In practice, two errors are especially common and have asymmetric consequences: omitting relevant variables creates bias; including irrelevant variables reduces precision without introducing bias — although, as we will see, this second case should not be generalized carelessly.
7.a Omitted Variable Bias#
Suppose the true population model is:

\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon
\]

but, due to ignorance or lack of data, we estimate the model without \(x_2\):

\[
y = \tilde{\beta}_0 + \tilde{\beta}_1 x_1 + u
\]
where we use the tilde to indicate that we are estimating a model different from the true one. What can we say about \(\tilde{\beta}_1\)?
It can be shown that the expected value of the estimator omitting \(x_2\) is:

\[
E[\tilde{\beta}_1] = \beta_1 + \beta_2 \delta_1
\]
where \(\delta_1\) is the slope of the auxiliary regression of \(x_2\) on \(x_1\), that is, the coefficient on \(x_1\) in the regression \(x_2 = \delta_0 + \delta_1 x_1 + v\). The formal derivation is in the appendix.
The term \(\beta_2 \delta_1\) is the omitted variable bias. The estimator \(\tilde{\beta}_1\) does not converge to the true \(\beta_1\), but to \(\beta_1\) plus that bias. Two conditions would make the bias vanish:
\(\beta_2 = 0\): the omitted variable has no effect on \(y\), so it was not relevant.
\(\delta_1 = 0\): the omitted variable is uncorrelated with \(x_1\).
In both cases there is no problem. The bias arises precisely when the omitted variable is relevant and correlated with the included regressors.
The direction of the bias is predictable from the signs:
| Sign of \(\beta_2\) | Correlation \(x_2\)-\(x_1\) | Bias in \(\tilde{\beta}_1\) |
|---|---|---|
| Positive | Positive | Upward (overestimates) |
| Positive | Negative | Downward (underestimates) |
| Negative | Positive | Downward (underestimates) |
| Negative | Negative | Upward (overestimates) |
In the sales, advertising, and salespeople example: if firms with higher advertising spend also tend to have more salespeople (\(\delta_1 > 0\)) and sales team size has a positive effect on sales (\(\beta_2 > 0\)), then the simple regression that omits \(salespeople\) overestimates the effect of advertising.
The following simulation allows you to explore these effects: you can vary the magnitude of the omitted variable’s effect (\(\beta_2\)) and its correlation with the included regressor (\(\delta_1\)), observing how the bias in the estimates changes.
What to look for in the simulation:
Adjust the correlation between \(x_1\) and \(x_2\): how does the bias in the estimated coefficient on \(x_1\) change?
Vary the effect of the omitted variable: when \(\beta_2 = 0\), does the bias disappear?
Compare the distributions of the estimator under the correct model vs. the misspecified model.
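A minimal sketch of such an experiment, assuming numpy, is shown below; the coefficient values and the strength of the auxiliary relation between \(x_1\) and \(x_2\) are illustrative assumptions.

```python
# Omitted variable bias in action: the short regression converges to b1 + b2*d1.
import numpy as np

rng = np.random.default_rng(10)
n, reps = 200, 3000
b0, b1, b2 = 1.0, 2.0, 1.5          # true model: y = b0 + b1*x1 + b2*x2 + eps
d0, d1 = 0.0, 0.8                   # auxiliary relation: x2 = d0 + d1*x1 + v

short_b1 = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = d0 + d1 * x1 + rng.normal(size=n)
    y = b0 + b1 * x1 + b2 * x2 + rng.normal(size=n)

    # Short regression: omit x2
    X_short = np.column_stack([np.ones(n), x1])
    short_b1[r] = np.linalg.lstsq(X_short, y, rcond=None)[0][1]

print("mean of short-regression beta1:", short_b1.mean())   # about b1 + b2*d1 = 3.2
print("theoretical value b1 + b2*d1:  ", b1 + b2 * d1)
```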
7.b Inclusion of Irrelevant Variables#
The opposite case: what happens if we include in the model a variable that in fact does not affect \(y\)?
Suppose the true model is \(y = \beta_0 + \beta_1 x_1 + \varepsilon\), but we estimate:

\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon
\]
Since the true coefficient on \(x_2\) is zero, the estimated model is perfectly compatible with the true one: we are simply estimating a model where \(\beta_2 = 0\). The unbiasedness results still apply — we expect \(\hat{\beta}_2 \approx 0\) and \(\hat{\beta}_1\) to remain unbiased.
However, there is a cost. Adding \(x_2\) means the auxiliary regression defining \(R_1^2\) (the regression of \(x_1\) on \(x_2\)) may yield an \(R_1^2\) greater than zero. This reduces the factor \((1 - R_1^2)\) in the variance denominator and therefore increases the variance of \(\hat{\beta}_1\). Greater variance implies larger standard errors, smaller \(t\)-statistics, and lower power to reject hypotheses.
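A small Monte Carlo sketch illustrates both facts, assuming numpy; the correlation of 0.9 and the true effect of 2 are illustrative choices.

```python
# Adding a genuinely irrelevant but correlated regressor: no bias, lower precision.
import numpy as np

rng = np.random.default_rng(11)
n, reps, rho = 100, 3000, 0.9       # x2 is irrelevant but highly correlated with x1

b1_correct, b1_overfit = np.empty(reps), np.empty(reps)
for r in range(reps):
    x = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    y = 1.0 + 2.0 * x[:, 0] + rng.normal(size=n)     # true model uses x1 only

    X1 = np.column_stack([np.ones(n), x[:, 0]])      # correct model
    X2 = np.column_stack([np.ones(n), x])            # adds the irrelevant x2
    b1_correct[r] = np.linalg.lstsq(X1, y, rcond=None)[0][1]
    b1_overfit[r] = np.linalg.lstsq(X2, y, rcond=None)[0][1]

print("means:    ", b1_correct.mean(), b1_overfit.mean())   # both close to 2: no bias
print("std devs: ", b1_correct.std(), b1_overfit.std())     # larger with the extra regressor
```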
Important
This result — that including an irrelevant variable does not create bias — is valid in the particular case just described: when the added variable is genuinely irrelevant and the simple model was already correct. It should not be interpreted as meaning that adding variables is, in general, free of consequences.
As we will see later in the book, in more complex causal settings the inclusion of certain variables can introduce bias, even when the model appears more complete. Two important examples:
Collider: a variable that is caused simultaneously by \(x_1\) and \(y\) (or by factors correlated with them). Controlling for a collider “opens” a spurious association between \(x_1\) and \(y\) that did not exist before. For instance, if both ability and luck affect whether someone is hired at an elite firm, conditioning on “being an elite employee” can create an artificial negative correlation between ability and luck, even if they are independent in the population (a simulation sketch of this example appears after this list).
Mediator: a variable on the causal path between \(x_1\) and \(y\). If part of the effect of \(x_1\) on \(y\) operates through the mediator, controlling for it blocks that channel and underestimates the total effect.
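Here is a minimal sketch of the collider example above, assuming numpy; the hiring rule and the sample size are assumptions made purely for illustration.

```python
# Collider bias in miniature: conditioning on a variable caused by both
# ability and luck induces a spurious negative correlation between them.
import numpy as np

rng = np.random.default_rng(12)
n = 100_000
ability = rng.normal(size=n)
luck = rng.normal(size=n)                       # independent of ability by construction
hired = (ability + luck) > 1.5                  # the collider: caused by both

print(np.corrcoef(ability, luck)[0, 1])                 # about 0 in the full population
print(np.corrcoef(ability[hired], luck[hired])[0, 1])   # clearly negative among the hired
```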
The practical conclusion is that the decision of which variables to include cannot be made on statistical grounds alone. It requires thinking about the causal structure of the problem. While one sometimes speaks of a bias-variance trade-off to describe the tension between omitting and adding variables, that framing oversimplifies: in practice, we will need to think carefully about which variables to add to the model to avoid biases caused by the inclusion itself. These ideas will be developed in detail when we address causal inference.
Appendix: Formal Derivations#
A.0 Normal Equations and the OLS Estimator#
Minimizing the RSS with respect to each parameter generates \(k+1\) first-order conditions. For \(\beta_0\):

\[
\sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}\right) = 0
\]

and for each \(\beta_j\), \(j = 1, \ldots, k\):

\[
\sum_{i=1}^{n} x_{ij}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}\right) = 0
\]
These conditions are exactly the sample analogs of the MoM moment conditions: they require the residuals to be orthogonal to each regressor (and to the intercept). Together, they form the normal equations, a linear system of \(k+1\) equations in \(k+1\) unknowns.
In simple regression (\(k=1\)), the system has a closed-form scalar solution. In the general case, the solution is written compactly in matrix notation:

\[
\hat{\boldsymbol{\beta}} = (X'X)^{-1} X'y
\]
where \(X\) is the \(n \times (k+1)\) data matrix (with a column of ones for the intercept) and \(y\) is the vector of the dependent variable. The matrix \(X'X\) is invertible — and the solution is unique — as long as the regressors are not linearly dependent, i.e., when there is no perfect multicollinearity. The full matrix derivation will be developed in a later section dedicated to the algebra of the linear model.
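A minimal numerical sketch of this formula, with simulated and purely illustrative data (note that \(X\) must have full column rank for \(X'X\) to be invertible):

```python
# Solving the normal equations directly with matrix algebra (illustrative data).
import numpy as np

rng = np.random.default_rng(7)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y via a linear solve
print(beta_hat)                                 # close to beta_true
```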
A.1 Unbiasedness of the OLS Estimators#
The result extends directly from the simple case. By the Frisch-Waugh-Lovell Theorem, \(\hat{\beta}_j\) can be written as:

\[
\hat{\beta}_j = \frac{\sum_{i=1}^{n} \tilde{x}_{ij}\, y_i}{\sum_{i=1}^{n} \tilde{x}_{ij}^2} = \beta_j + \frac{\sum_{i=1}^{n} \tilde{x}_{ij}\, \varepsilon_i}{\sum_{i=1}^{n} \tilde{x}_{ij}^2}
\]

where \(\tilde{x}_{ij}\) are the residuals from regressing \(x_j\) on the other regressors, and the second equality follows from substituting the true model (as in A.3, Step 2). Taking the conditional expectation given \(X\) and using \(E[\varepsilon_i \mid X] = 0\):

\[
E[\hat{\beta}_j \mid X] = \beta_j + \frac{\sum_{i=1}^{n} \tilde{x}_{ij}\, E[\varepsilon_i \mid X]}{\sum_{i=1}^{n} \tilde{x}_{ij}^2} = \beta_j
\]
Since this result holds for any realization of \(X\), we conclude \(E[\hat{\beta}_j] = \beta_j\).
A.2 Omitted Variable Bias#
The true model is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\), but we estimate the short regression \(y = \tilde{\beta}_0 + \tilde{\beta}_1 x_1 + u\). We want to find \(E[\tilde{\beta}_1]\).
Step 1 — Expression for the OLS estimator in the short regression.

\[
\tilde{\beta}_1 = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)\, y_i}{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2} = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)\, y_i}{SST_1}
\]
Step 2 — Substitute the true model.
We replace \(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i\):

\[
\tilde{\beta}_1 = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)\left(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i\right)}{SST_1}
\]

Using \(\sum_{i=1}^n (x_{1i} - \bar{x}_1) = 0\), the term in \(\beta_0\) vanishes and the term in \(\beta_1\) simplifies to \(\beta_1 SST_1\):

\[
\tilde{\beta}_1 = \beta_1 + \beta_2\, \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)\, x_{2i}}{SST_1} + \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)\, \varepsilon_i}{SST_1}
\]
Step 3 — Identify \(\delta_1\).
The coefficient \(\delta_1\) of the auxiliary regression \(x_2 = \delta_0 + \delta_1 x_1 + v\) is exactly:

\[
\delta_1 = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)\, x_{2i}}{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2} = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)\, x_{2i}}{SST_1}
\]

Therefore:

\[
\tilde{\beta}_1 = \beta_1 + \beta_2 \delta_1 + \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)\, \varepsilon_i}{SST_1}
\]
Step 4 — Take the expectation.
Under the assumption \(E[\varepsilon_i \mid x_1, x_2] = 0\), the last term has expectation zero, and we obtain:

\[
E[\tilde{\beta}_1] = \beta_1 + \beta_2 \delta_1
\]
The bias is \(\beta_2 \delta_1\): the product of the true effect of the omitted variable on \(y\) and the slope from the linear auxiliary regression of the omitted variable on the included regressor.
A.3 Variance of \(\hat{\beta}_j\) in Multiple Regression#
The key result for deriving the variance formula is the Frisch-Waugh-Lovell Theorem, which states that the coefficient \(\hat{\beta}_j\) in the multiple regression equals the coefficient from the simple regression of \(y\) on \(\tilde{x}_j\), where \(\tilde{x}_j\) are the residuals from regressing \(x_j\) on all other explanatory variables in the model.
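A quick numerical check of the theorem with simulated, illustrative data: the coefficient on \(x_2\) from the full regression coincides with the slope from regressing \(y\) on the residualized \(x_2\).

```python
# Frisch-Waugh-Lovell check: multiple-regression coefficient = coefficient on residualized regressor.
import numpy as np

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)              # x2 correlated with x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Full multiple regression of y on (1, x1, x2)
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Residualize x2 on (1, x1), then regress y on those residuals
X_aux = np.column_stack([np.ones(n), x1])
x2_tilde = x2 - X_aux @ np.linalg.lstsq(X_aux, x2, rcond=None)[0]
beta_fwl = (x2_tilde @ y) / (x2_tilde @ x2_tilde)

print(beta_full[2], beta_fwl)   # the two coefficients coincide up to rounding
```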
Step 1 — Residuals from the auxiliary regression.
We define \(\tilde{x}_{ij}\) as the residual from regressing \(x_j\) on the other regressors:

\[
x_{ij} = \hat{x}_{ij} + \tilde{x}_{ij}
\]

where \(\hat{x}_{ij}\) is the part explained by the other regressors and \(\tilde{x}_{ij}\) is the unexplained part (the “net variation” in \(x_j\)). The \(R^2\) of this auxiliary regression is \(R_j^2\), and the sum of squares of the residuals is:

\[
\sum_{i=1}^{n} \tilde{x}_{ij}^2 = SST_j\,(1 - R_j^2)
\]
since \(R_j^2 = 1 - \sum \tilde{x}_{ij}^2 / SST_j\).
Step 2 — Expression for the estimator.
By the Frisch-Waugh-Lovell Theorem, \(\hat{\beta}_j\) can be written as:

\[
\hat{\beta}_j = \frac{\sum_{i=1}^{n} \tilde{x}_{ij}\, y_i}{\sum_{i=1}^{n} \tilde{x}_{ij}^2}
\]

Substituting the true model \(y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i\) and using the fact that the auxiliary residuals are orthogonal to all other regressors (by OLS construction), the terms in \(\beta_l\) for \(l \neq j\) vanish and we obtain:

\[
\hat{\beta}_j = \beta_j + \frac{\sum_{i=1}^{n} \tilde{x}_{ij}\, \varepsilon_i}{\sum_{i=1}^{n} \tilde{x}_{ij}^2}
\]
Step 3 — Conditional variance given \(X\).
We take the variance conditional on all explanatory variables \(X\). Under homoskedasticity, \(\text{Var}(\varepsilon_i \mid X) = \sigma^2\), and errors are independent across observations:

\[
\text{Var}(\hat{\beta}_j \mid X) = \frac{\sigma^2 \sum_{i=1}^{n} \tilde{x}_{ij}^2}{\left(\sum_{i=1}^{n} \tilde{x}_{ij}^2\right)^2} = \frac{\sigma^2}{\sum_{i=1}^{n} \tilde{x}_{ij}^2}
\]
Step 4 — Substitution.
Replacing \(\sum_{i=1}^n \tilde{x}_{ij}^2 = SST_j(1 - R_j^2)\) from Step 1:

\[
\text{Var}(\hat{\beta}_j \mid X) = \frac{\sigma^2}{SST_j\,(1 - R_j^2)}
\]
This result directly generalizes the simple regression formula: when \(k = 1\) there is no auxiliary regression, \(R_j^2 = 0\), and the expression reduces to \(\sigma^2 / SST_x\), which is exactly the simple regression formula.