Simple Linear Regression#
The Linear Regression Model#
Up to this point, we have proposed characterizing the relationship between our explanatory variable of interest \(X\) and the outcome \(Y\) through an unknown function \(F\):

$$Y = F(X) + \varepsilon$$
A simplifying assumption that allows us to estimate the form of \(F\) is to suppose that it is linear. We write the resulting model as follows:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

The simple linear regression equation is nothing more than a compact way of writing that story.
Far from being an abstract expression, each term serves a concrete and intuitive role.
Y — The Dependent Variable#
The letter \(Y\), the dependent variable, represents the quantity we wish to understand, describe, or anticipate.
It is the variable whose behavior we observe and seek to summarize. For example:
a person’s wage
a household’s electricity consumption
a student’s exam score
In the model, \(Y\) is not a fixed number but a variable that changes across observations.
X — The Explanatory Variable#
The letter \(X\), the explanatory variable, represents the variable with which we associate the behavior of \(Y\). It may be:
years of education (associated with wages)
temperature (associated with electricity consumption)
hours of study (associated with exam scores)
We call a regression simple when it is limited to a single explanatory variable. In the next chapter, we will extend the model to include multiple variables. For now, this reduced model will allow us to understand the logic of the framework step by step.
In accordance with what was developed in the introductory chapter, \(X\) is a random variable. Each observation \((X_i, Y_i)\) is a realization from a joint distribution, and the model describes how the distribution of \(Y\) changes as a function of the value taken by \(X\).
\(\beta_0\) — The Intercept#
The parameter \(\beta_0\), the intercept, indicates the average value of \(Y\) when \(X\) equals zero.
It can be thought of as a starting point. In some contexts it carries a clear interpretation, such as the fixed cost of a service. In others, it is simply a technical element required for the line to be correctly positioned. What matters is not always its literal meaning, but its function: anchoring the line in the plane.
\(\beta_1\) — The Slope#
The parameter \(\beta_1\) measures how \(Y\) changes, on average, when \(X\) increases by one unit. It is the formalization of the question that interests us: What typically happens to \(Y\) if \(X\) increases slightly?
If \(\beta_1\) is positive, \(Y\) tends to increase when \(X\) increases. If it is negative, the opposite holds. If it is close to zero, the association is weak or absent. Under certain additional conditions that will become clearer when we incorporate a causal model, we will be able to interpret \(\beta_1\) as measuring not merely an association but also the causal effect of interest.
\(\varepsilon\) — The Error Term#
If the line described all observations exactly, we would not need econometrics. Reality, however, is more complex.
The term \(\varepsilon\) captures everything that affects \(Y\) and is not included in \(X\):
individual differences
unobserved factors
measurement error
pure randomness
Rather than viewing the error as a failure of the model, it is better understood as an explicit acknowledgment that the world is not deterministic.
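The role of \(\varepsilon\) can be made concrete with a small simulation. The sketch below is illustrative only: the parameter values, sample size, and error distribution are assumptions chosen for the example, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# True parameters (illustrative values, not from the text)
beta0, beta1 = 2.0, 0.5

n = 200
X = rng.uniform(0, 10, size=n)   # explanatory variable
eps = rng.normal(0, 1, size=n)   # error term: everything affecting Y that is not in X
Y = beta0 + beta1 * X + eps      # the simple linear regression model

# Y is not a deterministic function of X: two observations with similar
# values of X can have different Y because of epsilon.
```

Each draw of `eps` shifts the observation off the line \(\beta_0 + \beta_1 X\), which is exactly the non-deterministic component the model acknowledges.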
Assumptions on the Error Term#
In order to estimate the parameters \(\beta_0\) and \(\beta_1\) and interpret their results, we need to impose conditions on \(\varepsilon\). The central assumption is:

$$E[\varepsilon \mid X] = 0$$
This means that, on average, the error does not depend on the value taken by \(X\): regardless of the level of the explanatory variable, the expected error is zero. Intuitively, the omitted factors collected in \(\varepsilon\) must not be systematically related to \(X\).
From this assumption, two moment conditions follow by the law of iterated expectations:
Condition 1: \(E[\varepsilon] = 0\)
On average, the model neither systematically overestimates nor underestimates \(Y\).
Condition 2: \(E[\varepsilon X] = 0\)
The error and the explanatory variable are uncorrelated.
These two conditions are what we will use to derive the estimator in the following section.
A note on causality. Estimating the linear model alone does not guarantee a causal interpretation of \(\hat{\beta}_1\). If there exists a relevant variable that affects \(Y\) and is correlated with \(X\), that variable is absorbed into \(\varepsilon\), violating the condition \(E[\varepsilon X] = 0\). In that case, the estimator captures not only the effect of \(X\) but also the indirect influence of the omitted variable. The conditions under which \(\hat{\beta}_1\) can be given a causal interpretation will be studied in the chapter on omitted variable bias.
Estimation#
The Method of Moments#
We have \(n\) observations \((X_1, Y_1), \ldots, (X_n, Y_n)\). The idea behind the Method of Moments (MoM) is to replace the population expectations with their sample analogs: the sample averages. Starting from the two moment conditions derived from the assumption \(E[\varepsilon \mid X] = 0\), we obtain a system of two equations in two unknowns.
Condition 1: \(E[\varepsilon] = 0 \Rightarrow E[Y - \beta_0 - \beta_1 X] = 0\)
Condition 2: \(E[\varepsilon X] = 0 \Rightarrow E[(Y - \beta_0 - \beta_1 X)X] = 0\)
Subtracting \(E[X] \times (1)\) from \((2)\):

$$\operatorname{Cov}(X, Y) - \beta_1 \operatorname{Var}(X) = 0 \quad\Rightarrow\quad \beta_1 = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}$$
From \((1)\): \(\beta_0 = E[Y] - \beta_1 E[X]\)
Replacing the population expectations with their sample analogs, we obtain the MoM estimators (which coincide with the OLS estimators):

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
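The sample-analog formulas translate directly into code. The following sketch uses simulated data with illustrative true values (\(\beta_0 = 2\), \(\beta_1 = 0.5\), chosen here for the example) and cross-checks the result against NumPy's least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated sample; the true parameters are illustrative assumptions
n = 1_000
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=n)

# MoM / OLS estimators via the sample analogs of the moment conditions
beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()

# Cross-check against NumPy's built-in least-squares fit (slope first)
b1_np, b0_np = np.polyfit(X, Y, deg=1)
assert np.isclose(beta1_hat, b1_np) and np.isclose(beta0_hat, b0_np)
```

With a reasonably large sample, the estimates land close to the true values used in the simulation, as the moment conditions suggest they should.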
A Second Intuition: Ordinary Least Squares (OLS)#
The same estimators can be obtained from a completely different perspective: by minimizing the residual sum of squares (RSS).

For each candidate pair \((\beta_0, \beta_1)\), the line generates predictions \(\hat{Y}_i = \beta_0 + \beta_1 X_i\) and residuals \(\hat{\varepsilon}_i = Y_i - \hat{Y}_i\). The OLS criterion selects the pair that minimizes:

$$RSS(\beta_0, \beta_1) = \sum_{i=1}^n \left(Y_i - \beta_0 - \beta_1 X_i\right)^2$$
Squaring has two intuitive effects: it eliminates signs (positive and negative errors do not cancel) and penalizes large errors more than small ones.
Minimizing the RSS yields exactly the same formulas derived by MoM. This is not a coincidence: the first-order conditions of the OLS minimization are precisely the sample analogs of the two moment conditions, so both approaches lead to the same estimator.
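One way to see numerically that the closed-form pair minimizes the RSS (a sketch on simulated data; all values are illustrative assumptions) is to check that nearby candidate pairs always produce a larger RSS:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=500)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=500)  # illustrative true values

def rss(b0, b1):
    """Residual sum of squares for a candidate line."""
    return np.sum((Y - b0 - b1 * X) ** 2)

# Closed-form OLS / MoM solution
b1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0_hat = Y.mean() - b1_hat * X.mean()

# Perturbing the OLS pair in any direction increases the RSS
for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert rss(b0_hat + db0, b1_hat + db1) > rss(b0_hat, b1_hat)
```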
Interpretation of the Estimators#
The expression for \(\hat{\beta}_1\) can be written as:

$$\hat{\beta}_1 = \frac{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2} = \frac{\widehat{\operatorname{Cov}}(X, Y)}{\widehat{\operatorname{Var}}(X)}$$
That is, the estimated slope compares how much \(X\) and \(Y\) move together with how much \(X\) varies on its own. If \(X\) and \(Y\) tend to move in the same direction, the numerator is positive and so is the slope. If there is no systematic association, the numerator is close to zero.
The estimated slope is therefore a summary measure of how \(X\) and \(Y\) move together.
Once the slope is determined, the intercept is chosen so that the line passes through the mean of the data: when \(X\) takes its mean value, the prediction coincides with the mean value of \(Y\), i.e. \(\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}\).
Goodness of Fit#
Up to this point, we have learned how to draw the “best” possible linear relationship given the available sample of data. We have minimized the sum of squared residuals to obtain the line that, by that criterion, fits the sample best. However, in econometrics, being “the best” does not always mean being “sufficient” to explain a phenomenon. This brings us to a fundamental question:
How much of the behavior of \(Y\) can our model explain?
We define the goodness of fit of the model as a measure of how much of the behavior of the dependent variable the model captures.
Evaluating fit does not consist of verifying whether the line is “correct” — in the social sciences, no line ever is, perfectly — but rather in quantifying how informative it turns out to be. In technical terms, we are decomposing the phenomenon into two parts:
The explained part: The movement of the data that our theory successfully predicts.
The residual: The mystery, the randomness, or all the factors we did not include in our model.
Note that, in essence, a measure of goodness of fit must tell us how much of what there is to explain about the phenomenon we actually manage to explain with the model. Our strategy will be to create a measure of how much there is to explain based on the variability of \(Y\) and of the explained component.
Total Sum of Squares (TSS)#
Before introducing the regression line, our best prediction for any observation of \(Y\) is its mean \(\bar{Y}\). The dispersion of the data around that mean represents everything that “remains to be explained”: it is the total variability of \(Y\).
We measure it with the Total Sum of Squares (TSS):

$$TSS = \sum_{i=1}^n \left(Y_i - \bar{Y}\right)^2$$
A large TSS indicates that \(Y\) is highly variable and that the mean is a poor description of the data. A small TSS indicates the opposite. The goal of the model is to reduce that initial uncertainty by using the information in \(X\).
Decomposition of the Sum of Squares#
For each observation, upon introducing the estimated line \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\), we can write the algebraic identity:

$$Y_i - \bar{Y} = \left(\hat{Y}_i - \bar{Y}\right) + \left(Y_i - \hat{Y}_i\right)$$
The total deviation of each point from the mean decomposes into two parts: what the line explains and what it does not.
Squaring and summing over all observations yields three quantities:
Total Sum of Squares (TSS):

$$TSS = \sum_{i=1}^n \left(Y_i - \bar{Y}\right)^2$$

The total variation in \(Y\); the starting point.

Explained Sum of Squares (ESS):

$$ESS = \sum_{i=1}^n \left(\hat{Y}_i - \bar{Y}\right)^2$$

The portion of the variation that the model manages to capture.

Residual Sum of Squares (RSS):

$$RSS = \sum_{i=1}^n \left(Y_i - \hat{Y}_i\right)^2$$

The portion that the model was unable to explain.
The fundamental identity is that these three quantities are related exactly as follows:

$$TSS = ESS + RSS$$
This equality holds whenever the model includes an intercept. It can be derived algebraically from the properties of the OLS estimator (see Appendix: Proof of TSS = ESS + RSS).
The Coefficient of Determination (\(R^2\))#
From the fundamental identity arises the most widely used measure of fit in econometrics: the \(R^2\) (R-squared). It measures what proportion of the total variation in \(Y\) is explained by the model:

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$
The value of \(R^2\) always lies between 0 and 1:
\(R^2 = 0\): The model has no explanatory power. The line is horizontal and the information in \(X\) contributes nothing.
\(R^2 = 1\): Perfect fit; all points fall exactly on the line. In the social sciences this is practically impossible and, if it occurs, it usually indicates a specification error.
For example, an \(R^2 = 0.60\) means that the model explains 60% of the sample variation in \(Y\); the remaining 40% is left in the residual.
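The decomposition and the \(R^2\) can be verified numerically. The sketch below runs on simulated data (the parameter values and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=500)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=500)  # illustrative true values

# OLS fit via the sample-analog formulas
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

# Sums of squares
tss = np.sum((Y - Y.mean()) ** 2)       # total variation
ess = np.sum((Y_hat - Y.mean()) ** 2)   # explained variation
rss = np.sum((Y - Y_hat) ** 2)          # unexplained variation

assert np.isclose(tss, ess + rss)       # the fundamental identity
r2 = ess / tss                          # share of variation explained
```

Here `r2` is the fraction of the sample variation in `Y` captured by the fitted line; its exact value depends on the simulated error variance.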
Caveats on the \(R^2\)#
Fit is not causality. A high \(R^2\) indicates that \(X\) and \(Y\) move together in a predictable way, not that one causes the other. Ice cream sales and forest fires may have a high \(R^2\) because both increase in summer; that does not imply causality.
A low \(R^2\) does not invalidate a model. In the social sciences, an \(R^2\) of 0.20 or 0.30 can be a solid result if the interest lies in the marginal effect \(\hat{\beta}_1\) and it is statistically significant. The \(R^2\) measures fit, not the relevance of the coefficients.
\(R^2\) increases mechanically with the range of \(X\). If we expand the range of the independent variable, \(R^2\) tends to rise even if the underlying relationship has not changed.
Appendix: Proof of \(TSS = ESS + RSS\)#
Properties of the OLS Estimator#
The OLS estimator is obtained by minimizing \(RSS = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2\). The first-order conditions yield two algebraic properties that are key to the proof.
Property 1 (P1): \(\displaystyle\sum_{i=1}^n \hat{\varepsilon}_i = 0\)
Differentiating the RSS with respect to \(\beta_0\) and setting the derivative equal to zero:

$$\frac{\partial RSS}{\partial \beta_0} = -2 \sum_{i=1}^n \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0 \quad\Rightarrow\quad \sum_{i=1}^n \hat{\varepsilon}_i = 0$$
On average, the residuals are exactly zero. The line neither systematically overestimates nor underestimates.
Property 2 (P2): \(\displaystyle\sum_{i=1}^n X_i\,\hat{\varepsilon}_i = 0\)
Differentiating the RSS with respect to \(\beta_1\) and setting the derivative equal to zero:

$$\frac{\partial RSS}{\partial \beta_1} = -2 \sum_{i=1}^n X_i \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0 \quad\Rightarrow\quad \sum_{i=1}^n X_i\,\hat{\varepsilon}_i = 0$$
The residuals are orthogonal to the explanatory variable.
Property 3 (P3, derived from P1 and P2): \(\displaystyle\sum_{i=1}^n \hat{Y}_i\,\hat{\varepsilon}_i = 0\)
Since \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\) is a linear combination of the two quantities already known to be orthogonal to \(\hat{\varepsilon}_i\):

$$\sum_{i=1}^n \hat{Y}_i\,\hat{\varepsilon}_i = \hat{\beta}_0 \sum_{i=1}^n \hat{\varepsilon}_i + \hat{\beta}_1 \sum_{i=1}^n X_i\,\hat{\varepsilon}_i = 0 + 0 = 0$$
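The three properties can be checked numerically on simulated data (all values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.uniform(0, 10, size=300)
Y = 1.0 + 0.8 * X + rng.normal(0, 1, size=300)  # illustrative true values

# OLS fit and residuals
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
resid = Y - (b0 + b1 * X)

assert np.isclose(resid.sum(), 0)                     # P1: residuals sum to zero
assert np.isclose((X * resid).sum(), 0)               # P2: orthogonal to X
assert np.isclose(((b0 + b1 * X) * resid).sum(), 0)   # P3: orthogonal to fitted values
```

Up to floating-point precision, all three sums are exactly zero, as the first-order conditions require.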
Proof#
Starting point. By definition of the residual, \(\hat{\varepsilon}_i = Y_i - \hat{Y}_i\), so:

$$Y_i - \bar{Y} = \left(\hat{Y}_i - \bar{Y}\right) + \hat{\varepsilon}_i$$
Step 1: Square and sum. Expanding the binomial square:

$$\sum_{i=1}^n \left(Y_i - \bar{Y}\right)^2 = \sum_{i=1}^n \left(\hat{Y}_i - \bar{Y}\right)^2 + 2 \sum_{i=1}^n \left(\hat{Y}_i - \bar{Y}\right)\hat{\varepsilon}_i + \sum_{i=1}^n \hat{\varepsilon}_i^2$$
Step 2: Show that the cross term is zero. By P3 and P1:

$$\sum_{i=1}^n \left(\hat{Y}_i - \bar{Y}\right)\hat{\varepsilon}_i = \sum_{i=1}^n \hat{Y}_i\,\hat{\varepsilon}_i - \bar{Y} \sum_{i=1}^n \hat{\varepsilon}_i = 0 - 0 = 0$$
Step 3: Conclude.

$$\sum_{i=1}^n \left(Y_i - \bar{Y}\right)^2 = \sum_{i=1}^n \left(\hat{Y}_i - \bar{Y}\right)^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2,$$

that is, \(TSS = ESS + RSS\).
The equality \(TSS = ESS + RSS\) rests entirely on the algebraic properties of the OLS estimator: the fact that the residuals sum to zero (P1) and are orthogonal to the fitted values (P3). Both properties are direct consequences of minimizing the sum of squares and including an intercept in the model.