Simple Linear Regression#
A technology company wants to know whether investing more in digital advertising is worth it. It has data from 140 past campaigns: how much was spent on advertising (in thousands of dollars) and how much revenue was generated (also in thousands of dollars). The question is simple: does a higher budget lead to higher revenue? And if so, by how much?
Simple linear regression is the statistical tool that allows us to answer that question rigorously. In this section we will build the model step by step, starting from visual intuition and working our way up to the formulas.
The Linear Regression Model#
Up to this point, we have proposed characterizing the relationship between our explanatory variable of interest \(X\) and the outcome \(Y\) through an unknown function \(F\):
The simple linear regression model equation is nothing more than a compact way of writing that story.
A simplifying assumption that allows us to estimate the form of \(F\) is to suppose that its form is linear.
We write it as follows:
\[
Y = \beta_0 + \beta_1 X + \varepsilon
\]
Far from being an abstract expression, each term serves a concrete and intuitive role.
Y — The Dependent Variable#
The letter \(Y\), the dependent variable, represents the quantity we wish to understand, describe, or anticipate.
It is the variable whose behavior we observe and seek to summarize. In our example, \(Y\) is the revenue generated by an advertising campaign (in thousands of USD). More generally, \(Y\) could be:
a person’s wage in the technology industry
the revenue generated by a digital advertising campaign
the user retention rate of an application
In the model, \(Y\) is not a fixed number but a variable that changes across observations.
X — The Explanatory Variable#
The letter \(X\), the explanatory variable, represents the variable with which we associate the behavior of \(Y\). In our example, \(X\) is the advertising budget (in thousands of USD). More generally, \(X\) could be:
years of industry experience (associated with wages)
advertising budget (associated with revenue)
weekly hours of app usage (associated with retention)
We call a regression simple when it is limited to a single explanatory variable. In the next chapter, we will extend the model to include multiple variables. For now, this reduced model will allow us to understand the logic of the framework step by step.
In accordance with what was developed in the introductory chapter, \(X\) is a random variable. Each observation \((X_i, Y_i)\) is a realization from a joint distribution, and the model describes how the distribution of \(Y\) changes as a function of the value taken by \(X\).
\(\beta_0\) — The Intercept#
The parameter \(\beta_0\), the intercept, indicates the average value of \(Y\) when \(X\) equals zero.
It can be thought of as a starting point. In some contexts it carries a clear interpretation, such as the fixed cost of a service. In others, it is simply a technical element required for the line to be correctly positioned. What matters is not always its literal meaning, but its function: anchoring the line in the plane.
\(\beta_1\) — The Slope#
The parameter \(\beta_1\) measures how \(Y\) changes, on average, when \(X\) increases by one unit. It is the formalization of the question that interests us: What typically happens to \(Y\) if \(X\) increases slightly?
If \(\beta_1\) is positive, \(Y\) tends to increase when \(X\) increases. If it is negative, the opposite holds. If it is close to zero, the association is weak or absent. Under certain additional conditions that will become clearer when we incorporate a causal model, we will be able to interpret \(\beta_1\) as measuring not merely an association but also the causal effect of interest.
\(\varepsilon\) — The Error Term#
If the line described all observations exactly, we would not need econometrics. Reality, however, is more complex.
The term \(\varepsilon\) captures everything that affects \(Y\) and is not included in \(X\):
individual differences
unobserved factors
measurement error
pure randomness
Rather than viewing the error as a failure of the model, it is better understood as an explicit acknowledgment that the world is not deterministic.
The Need for an Estimation Criterion#
The model is specified. We know its structure: \(Y = \beta_0 + \beta_1 X + \varepsilon\). But \(\beta_0\) and \(\beta_1\) are unknown. To estimate them we need data and a criterion that tells us which line is the “best” one given the sample.
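The most natural idea is to require that the line make the smallest possible errors across the set of observations. For each candidate pair \((\beta_0, \beta_1)\), the line generates predictions \(\hat{Y}_i = \beta_0 + \beta_1 X_i\) and sample residuals:
\[
\hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - \beta_0 - \beta_1 X_i
\]
A natural measure of total error is the residual sum of squares (RSS):
\[
RSS(\beta_0, \beta_1) = \sum_{i=1}^n \hat{\varepsilon}_i^2 = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2
\]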
Why square the residuals? For two practical reasons: first, positive and negative errors do not cancel — without squaring, a line that overestimates as much as it underestimates would appear perfect; second, squaring penalizes large errors disproportionately.
The RSS is the aggregated cost of the errors the line commits over the sample. The criterion we will adopt is to choose the parameters \((\beta_0, \beta_1)\) that minimize it. Before seeing the analytical solution, it is worth experiencing the problem interactively.
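The criterion can be made concrete with a few lines of Python. This is a minimal sketch on synthetic data: the 140 simulated campaigns, the values 50, 3.0 and 25 used to generate them, and the helper name `rss` are assumptions made for illustration, not the chapter's actual data set.

```python
import numpy as np

# Synthetic stand-in for the 140 campaigns (assumed values, not the real data).
rng = np.random.default_rng(0)
x = rng.uniform(10, 100, size=140)                # budget, thousands of USD
y = 50 + 3.0 * x + rng.normal(0, 25, size=140)    # revenue, thousands of USD

def rss(b0, b1, x, y):
    """Residual sum of squares of the candidate line b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))

# A flat line at the sample mean of y: the raw residuals cancel out,
# but the squared residuals do not.
sum_raw_flat = float(np.sum(y - y.mean()))
rss_flat = rss(y.mean(), 0.0, x, y)

# A line close to the one used to simulate the data has a much smaller RSS.
rss_good = rss(50.0, 3.0, x, y)

print(f"sum of raw residuals, flat line: {sum_raw_flat:.6f}")
print(f"RSS, flat line:                  {rss_flat:.1f}")
print(f"RSS, line (50, 3):               {rss_good:.1f}")
```

The flat line looks perfect if errors are simply added up, which is the reason for squaring them before summing.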
Interactive Exploration: Find the Best Line#
The dashboard below shows the 140 advertising campaigns: each point is an observation \((X_i, Y_i)\). You control the two parameters of the line, \(\beta_0\) and \(\beta_1\), and can see how the RSS changes as the line moves.
Task: try to find the values of \(\beta_0\) and \(\beta_1\) that minimize the RSS. Then use the “Show OLS Solution” button to see how close you got.
What Do We Observe?#
The intercept and slope act differently. Moving \(\beta_0\) shifts the line vertically without changing its tilt; moving \(\beta_1\) rotates it around a point. Both affect the RSS, but in different ways.
The RSS measures the total “cost” of the errors. A line that misses badly at a few points can have a higher RSS than one that misses slightly at many points, because squaring penalizes large errors disproportionately.
The OLS solution does not “pass through all the points”. The optimal line is the one that minimizes total squared distance, not the one that touches the most points.
The RSS Surface (optional)#
The previous dashboard allowed us to explore the RSS for one pair \((\beta_0, \beta_1)\) at a time. But it leaves an important question unanswered: what does the RSS look like as we vary all possible pairs systematically?
Minimizing the RSS has valuable properties: as long as \(X\) takes at least two distinct values in the sample, there is a unique global minimum (no two lines tie), and that minimum can be identified analytically without needing to explore the surface point by point. The following dashboard visualizes exactly that: the RSS for all possible pairs \((\beta_0, \beta_1)\) simultaneously, as a heat map in parameter space.
Each point on the map corresponds to a different line. The color indicates the RSS value: darker = more error. The gold star marks the global minimum, which is exactly the OLS solution.
Key insight: The bowl shape of the surface confirms visually what the analytical derivation guarantees: there is a unique point where the RSS is minimized, and OLS finds it directly.
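The same surface can be computed by brute force. The sketch below evaluates the RSS on a grid of candidate pairs and compares the best grid point with the standard closed-form OLS solution; the synthetic data and the grid limits are assumptions chosen by hand for this illustration.

```python
import numpy as np

# Synthetic stand-in for the campaign data (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(10, 100, size=140)
y = 50 + 3.0 * x + rng.normal(0, 25, size=140)

# Candidate parameter values (grid limits are an arbitrary choice).
b0_grid = np.linspace(0, 100, 101)   # step of 1
b1_grid = np.linspace(2, 4, 101)     # step of 0.02

# surface[i, j] = RSS of the line b0_grid[i] + b1_grid[j] * x
B0, B1 = np.meshgrid(b0_grid, b1_grid, indexing="ij")
fitted = B0[..., None] + B1[..., None] * x        # shape (101, 101, 140)
surface = np.sum((y - fitted) ** 2, axis=-1)

# Best pair found on the grid.
i, j = np.unravel_index(np.argmin(surface), surface.shape)
b0_grid_min, b1_grid_min = b0_grid[i], b1_grid[j]

# Closed-form OLS solution for comparison.
b1_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_ols = y.mean() - b1_ols * x.mean()
rss_ols = np.sum((y - b0_ols - b1_ols * x) ** 2)

print(f"grid minimum: b0={b0_grid_min:.2f}, b1={b1_grid_min:.2f}, RSS={surface.min():.1f}")
print(f"OLS solution: b0={b0_ols:.2f}, b1={b1_ols:.2f}, RSS={rss_ols:.1f}")
```

No grid point can beat the OLS pair; a finer grid only brings the grid minimum closer to it.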
Estimation#
Ordinary Least Squares (OLS)#
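We have \(n\) observations \((X_1, Y_1), \ldots, (X_n, Y_n)\). The OLS criterion selects the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared residuals:
\[
(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2
\]
Differentiating the RSS with respect to each parameter and setting the derivatives equal to zero yields the closed-form solution, where \(\bar{X}\) and \(\bar{Y}\) denote the sample means:
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
\]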
The full derivation — including the first-order conditions and verification that this is a minimum — is presented in the Appendix: Proof of TSS = ESS + RSS.
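As a quick check, the closed-form OLS estimators (slope equal to the sum of cross-deviations over the sum of squared deviations of \(X\), intercept equal to \(\bar{Y} - \hat{\beta}_1 \bar{X}\)) can be computed by hand and compared with NumPy's least-squares polynomial fit. The data are synthetic and the variable names are assumptions made for this sketch.

```python
import numpy as np

# Synthetic stand-in for the campaign data (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(10, 100, size=140)
y = 50 + 3.0 * x + rng.normal(0, 25, size=140)

# Closed-form OLS estimators.
x_bar, y_bar = x.mean(), y.mean()
b1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0_hat = y_bar - b1_hat * x_bar

# np.polyfit with deg=1 fits a line by least squares and returns
# the coefficients from highest degree to lowest: [slope, intercept].
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(f"by hand:    b0={b0_hat:.4f}, b1={b1_hat:.4f}")
print(f"np.polyfit: b0={intercept_np:.4f}, b1={slope_np:.4f}")
```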
A Second Motivation: The Method of Moments#
The same estimators can be derived from a completely different perspective, one that starts from the conditions the model imposes on the error term in the population.
Central assumption: \(E[\varepsilon \mid X] = 0\)
This means that, on average, the error does not depend on the value taken by \(X\). Intuitively, the factors omitted from \(\varepsilon\) must not be systematically related to \(X\).
In our example: the advertising budget should not be correlated with unobserved factors that also affect revenue — such as seasonality or product quality. If it were, the assumption would be violated.
From this assumption, two moment conditions follow:
Condition 1: \(E[\varepsilon] = 0\) — on average, the model neither systematically overestimates nor underestimates \(Y\).
Condition 2: \(E[\varepsilon X] = 0\) — the error and the explanatory variable are uncorrelated.
A note on causality. Estimating the linear model alone does not guarantee a causal interpretation of \(\hat{\beta}_1\). If a relevant variable affects \(Y\) and is correlated with \(X\), it is absorbed into \(\varepsilon\), violating Condition 2. The conditions under which \(\hat{\beta}_1\) can be given a causal interpretation will be studied in the chapter on omitted variable bias.
The idea behind the Method of Moments (MoM) is to replace the population expectations with their sample analogs. Starting from the two conditions above, we obtain a system of two equations in two unknowns.
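Condition 1: \(E[\varepsilon] = 0 \Rightarrow E[Y - \beta_0 - \beta_1 X] = 0\), that is,
\[
E[Y] - \beta_0 - \beta_1 E[X] = 0 \qquad (1)
\]
Condition 2: \(E[\varepsilon X] = 0 \Rightarrow E[(Y - \beta_0 - \beta_1 X)X] = 0\), that is,
\[
E[XY] - \beta_0 E[X] - \beta_1 E[X^2] = 0 \qquad (2)
\]
Subtracting \(E[X] \times (1)\) from \((2)\) eliminates \(\beta_0\):
\[
E[XY] - E[X]E[Y] - \beta_1 \left(E[X^2] - E[X]^2\right) = 0 \quad \Rightarrow \quad \beta_1 = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}
\]
From \((1)\): \(\beta_0 = E[Y] - \beta_1 E[X]\)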
Replacing the population expectations with their sample analogs yields the MoM estimators, which turn out to be identical to the OLS estimators. This is not a coincidence: the first-order conditions of the RSS minimization are exactly the sample versions of Conditions 1 and 2, so both approaches solve the same two equations.
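The Method of Moments recipe can be sketched numerically: replace each expectation in the two conditions with a sample mean and solve the resulting 2×2 linear system. The data are synthetic and the variable names (`m_x`, `m_xy`, and so on) are assumptions made for this illustration.

```python
import numpy as np

# Synthetic stand-in for the campaign data (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(10, 100, size=140)
y = 50 + 3.0 * x + rng.normal(0, 25, size=140)

# Sample analogs of the population moments E[X], E[Y], E[X^2], E[XY].
m_x, m_y = x.mean(), y.mean()
m_xx, m_xy = np.mean(x * x), np.mean(x * y)

# Sample moment conditions as a linear system A @ [b0, b1] = m:
#   m_y  = b0       + b1 * m_x     (sample version of E[eps] = 0)
#   m_xy = b0 * m_x + b1 * m_xx    (sample version of E[eps * X] = 0)
A = np.array([[1.0, m_x],
              [m_x, m_xx]])
m = np.array([m_y, m_xy])
b0_mom, b1_mom = np.linalg.solve(A, m)

# Closed-form OLS estimators for comparison.
b1_ols = np.sum((x - m_x) * (y - m_y)) / np.sum((x - m_x) ** 2)
b0_ols = m_y - b1_ols * m_x

print(f"MoM: b0={b0_mom:.4f}, b1={b1_mom:.4f}")
print(f"OLS: b0={b0_ols:.4f}, b1={b1_ols:.4f}")
```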
Interpretation of the Estimators#
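The expression for \(\hat{\beta}_1\) can be written as the ratio of the sample covariance between \(X\) and \(Y\) to the sample variance of \(X\):
\[
\hat{\beta}_1 = \frac{\widehat{\mathrm{Cov}}(X, Y)}{\widehat{\mathrm{Var}}(X)} = \frac{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2}
\]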
That is, the estimated slope compares how much \(X\) and \(Y\) move together with how much \(X\) varies on its own. If \(X\) and \(Y\) tend to move in the same direction, the numerator is positive and so is the slope. If there is no systematic association, the numerator is close to zero.
The estimated slope is therefore a summary measure of how \(X\) and \(Y\) move together.
Once the slope is determined, the intercept is chosen so that the line passes through the mean of the data: when \(X\) takes its mean value, the prediction coincides with the mean value of \(Y\).
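To see why the line passes through the point of means, substitute the OLS intercept \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\) into the fitted line and evaluate it at \(X = \bar{X}\):
\[
\hat{\beta}_0 + \hat{\beta}_1 \bar{X} = (\bar{Y} - \hat{\beta}_1 \bar{X}) + \hat{\beta}_1 \bar{X} = \bar{Y}
\]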
Goodness of Fit#
Up to this point, we have learned how to draw the “best” possible linear relationship given the available sample of data. We have minimized the sum of squared residuals to obtain the line that fits the sample best in that sense. However, in econometrics, being “the best” does not always mean being “sufficient” to explain a phenomenon. This brings us to a fundamental question:
How much of the behavior of \(Y\) can our model explain?
We will call the goodness of fit of the model a measure of how much of the behavior of the dependent variable is captured by the model.
Evaluating fit does not consist of verifying whether the line is “correct” — in the social sciences, no line ever is, perfectly — but rather in quantifying how informative it turns out to be. In technical terms, we are decomposing the phenomenon into two parts:
The explained part: The movement of the data that our theory successfully predicts.
The residual: The mystery, the randomness, or all the factors we did not include in our model.
Total Sum of Squares (TSS)#
Before introducing the regression line, our best prediction for any observation of \(Y\) is its mean \(\bar{Y}\). The dispersion of the data around that mean represents everything that “remains to be explained”: it is the total variability of \(Y\).
We measure it with the Total Sum of Squares (TSS):
\[
TSS = \sum_{i=1}^n (Y_i - \bar{Y})^2
\]
A large TSS indicates that \(Y\) is highly variable and that the mean is a poor description of the data. A small TSS indicates the opposite. The goal of the model is to reduce that initial uncertainty by using the information in \(X\).
Decomposition of the Sum of Squares#
For each observation, upon introducing the estimated line \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\), we can write the algebraic identity:
\[
Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)
\]
The total deviation of each point from the mean decomposes into two parts: what the line explains and what it does not.
Squaring and summing over all observations yields three quantities:
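Total Sum of Squares (TSS):
\[
TSS = \sum_{i=1}^n (Y_i - \bar{Y})^2
\]
The total variation in \(Y\); the starting point.
Explained Sum of Squares (ESS):
\[
ESS = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2
\]
The portion of the variation that the model manages to capture.
Residual Sum of Squares (RSS):
\[
RSS = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n \hat{\varepsilon}_i^2
\]
The portion that the model was unable to explain.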
The fundamental identity is that these three quantities are related exactly as follows:
\[
TSS = ESS + RSS
\]
This equality holds whenever the model includes an intercept. It can be derived algebraically from the properties of the OLS estimator (see Appendix: Proof of TSS = ESS + RSS).
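The role of the intercept can be checked numerically. The sketch below computes the three sums of squares for an OLS line with an intercept and for a line forced through the origin; the synthetic data and the helper name `sums_of_squares` are assumptions made for this illustration.

```python
import numpy as np

# Synthetic stand-in for the campaign data (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(10, 100, size=140)
y = 50 + 3.0 * x + rng.normal(0, 25, size=140)

def sums_of_squares(y, y_hat):
    """Return (TSS, ESS, RSS) for observed y and fitted values y_hat."""
    tss = np.sum((y - y.mean()) ** 2)
    ess = np.sum((y_hat - y.mean()) ** 2)
    rss = np.sum((y - y_hat) ** 2)
    return tss, ess, rss

# OLS with an intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
tss, ess, rss = sums_of_squares(y, b0 + b1 * x)
gap_with_intercept = tss - (ess + rss)

# Least squares without an intercept (line through the origin):
# the slope that minimizes the RSS is sum(x * y) / sum(x ** 2).
b_origin = np.sum(x * y) / np.sum(x ** 2)
tss_o, ess_o, rss_o = sums_of_squares(y, b_origin * x)
gap_without_intercept = tss_o - (ess_o + rss_o)

print(f"with intercept:    TSS - (ESS + RSS) = {gap_with_intercept:.6f}")
print(f"without intercept: TSS - (ESS + RSS) = {gap_without_intercept:.1f}")
```

With the intercept the gap is zero up to floating-point rounding; without it, the residuals no longer sum to zero and the decomposition breaks down.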
The Coefficient of Determination (\(R^2\))#
From the fundamental identity arises the most widely used measure of fit in econometrics: the \(R^2\) (R-squared). It measures what proportion of the total variation in \(Y\) is explained by the model:
\[
R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}
\]
When the model includes an intercept, the value of \(R^2\) always lies between 0 and 1:
\(R^2 = 0\): The model has no explanatory power. The line is horizontal and the information in \(X\) contributes nothing.
\(R^2 = 1\): Perfect fit; all points fall exactly on the line. In the social sciences this is practically impossible and, if it occurs, it usually indicates a specification error.
For example, an \(R^2 = 0.60\) means that the model explains 60% of the sample variation in \(Y\); the remaining 40% is left in the residual.
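A short sketch of the computation, on synthetic data with assumed parameter values. It evaluates \(R^2\) both as \(ESS/TSS\) and as \(1 - RSS/TSS\), and also compares it with the squared sample correlation between \(X\) and \(Y\), a standard result for simple regression with an intercept that is not derived in this section.

```python
import numpy as np

# Synthetic stand-in for the campaign data (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(10, 100, size=140)
y = 50 + 3.0 * x + rng.normal(0, 25, size=140)

# OLS fit with an intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)
ess = np.sum((y_hat - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)

r2_from_ess = ess / tss
r2_from_rss = 1 - rss / tss
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2   # squared sample correlation

print(f"ESS / TSS     = {r2_from_ess:.6f}")
print(f"1 - RSS / TSS = {r2_from_rss:.6f}")
print(f"corr(X, Y)^2  = {r2_from_corr:.6f}")
```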
Caveats on the \(R^2\)#
Fit is not causality. A high \(R^2\) indicates that \(X\) and \(Y\) move together in a predictable way, not that one causes the other. Ice cream sales and forest fires may have a high \(R^2\) because both increase in summer; that does not imply causality.
A low \(R^2\) does not invalidate a model. In the social sciences, an \(R^2\) of 0.20 or 0.30 can be a solid result if the interest lies in the marginal effect \(\hat{\beta}_1\) and it is statistically significant. The \(R^2\) measures fit, not the relevance of the coefficients.
\(R^2\) increases mechanically with the range of \(X\). If we expand the range of the independent variable, \(R^2\) tends to rise even if the underlying relationship has not changed.
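The last caveat can be illustrated with a small simulation. In this sketch the intercept, slope and noise draws are identical in both samples; only the range of \(X\) differs. All numbers and the helper name `ols_r2` are assumptions made for illustration.

```python
import numpy as np

def ols_r2(x, y):
    """R-squared of the OLS line of y on x (with an intercept)."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    rss = np.sum((y - b0 - b1 * x) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

rng = np.random.default_rng(1)
n = 2000
noise = rng.normal(0, 25, size=n)          # same noise in both samples

x_narrow = rng.uniform(40, 60, size=n)     # narrow range of budgets
x_wide = rng.uniform(10, 100, size=n)      # wide range of budgets

# Same underlying relationship: Y = 50 + 3 X + noise.
y_narrow = 50 + 3.0 * x_narrow + noise
y_wide = 50 + 3.0 * x_wide + noise

r2_narrow = ols_r2(x_narrow, y_narrow)
r2_wide = ols_r2(x_wide, y_wide)

print(f"R^2 with X in [40, 60]:  {r2_narrow:.3f}")
print(f"R^2 with X in [10, 100]: {r2_wide:.3f}")
```

The slope being estimated is the same in both cases; the wider sample simply has more variation in \(Y\) that \(X\) can account for.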
With the formal definitions in place, the following dashboard invites you to explore how \(R^2\) changes when you modify the noise level (\(\sigma\)), the effect size (\(\beta_1\)), and the sample size (\(n\)) in the data-generating model. You can define different true population models and observe what \(R^2\) an OLS estimator would yield in each case.
What Do We Observe?#
Noise dominates \(R^2\). With high \(\sigma\) and low \(\beta_1\), the \(R^2\) collapses toward zero even when the effect is real. The linear model is still the best available, but the signal is lost in the noise.
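A high \(R^2\) does not imply a large effect. With a small but nonzero \(\beta_1\) and very low \(\sigma\) (little noise), the \(R^2\) can be high because almost all of the small variation in \(Y\) comes from \(X\) and the line “follows” the data closely. That is evidence of a tight fit, not of a large effect. With \(\beta_1\) exactly zero, by contrast, \(R^2\) stays near zero whatever the noise level.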
Sample size barely affects \(R^2\). Unlike the precision of the estimators, \(R^2\) measures model fit and does not depend much on how many observations we have.
TSS = ESS + RSS always holds. The stacked bar chart illustrates this visually: what the model explains (ESS, in blue) plus what it cannot explain (RSS, in orange) always sums to the total variation (TSS).
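These patterns can be reproduced without the dashboard. Under the linear model with an error independent of \(X\), the population counterpart of \(R^2\) is \(\beta_1^2 \mathrm{Var}(X) / (\beta_1^2 \mathrm{Var}(X) + \sigma^2)\), which depends on the effect size and the noise level but not on \(n\). The sketch below simulates a few configurations; the helper `simulated_r2`, the intercept of 50 and the uniform distribution of \(X\) are assumptions made for illustration.

```python
import numpy as np

def simulated_r2(beta1, sigma, n, seed=0):
    """Simulate Y = 50 + beta1 * X + noise and return the OLS R-squared."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(10, 100, size=n)
    y = 50 + beta1 * x + rng.normal(0, sigma, size=n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    rss = np.sum((y - b0 - b1 * x) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

# Noise level: same effect, more noise, lower R^2.
r2_baseline = simulated_r2(beta1=3.0, sigma=25.0, n=2000)
r2_noisy = simulated_r2(beta1=3.0, sigma=200.0, n=2000)

# Small effect with very little noise: R^2 is still high.
r2_small_effect = simulated_r2(beta1=0.1, sigma=0.5, n=2000)

# Sample size: R^2 barely moves between n = 100 and n = 10000.
r2_small_n = simulated_r2(beta1=3.0, sigma=25.0, n=100)
r2_large_n = simulated_r2(beta1=3.0, sigma=25.0, n=10000)

print(f"baseline (beta1=3, sigma=25):       {r2_baseline:.3f}")
print(f"more noise (beta1=3, sigma=200):    {r2_noisy:.3f}")
print(f"small effect (beta1=0.1, sigma=0.5): {r2_small_effect:.3f}")
print(f"n=100 vs n=10000:                   {r2_small_n:.3f} vs {r2_large_n:.3f}")
```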
Appendix: Proof of \(TSS = ESS + RSS\)#
Properties of the OLS Estimator#
The OLS estimator is obtained by minimizing \(RSS = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2\). The first-order conditions yield two algebraic properties that are key to the proof.
Property 1 (P1): \(\displaystyle\sum_{i=1}^n \hat{\varepsilon}_i = 0\)
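Differentiating the RSS with respect to \(\beta_0\) and setting the derivative equal to zero at the OLS estimates:
\[
\frac{\partial RSS}{\partial \beta_0} = -2 \sum_{i=1}^n (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0 \quad \Rightarrow \quad \sum_{i=1}^n \hat{\varepsilon}_i = 0
\]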
On average, the residuals are exactly zero. The line neither systematically overestimates nor underestimates.
Property 2 (P2): \(\displaystyle\sum_{i=1}^n X_i\,\hat{\varepsilon}_i = 0\)
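Differentiating the RSS with respect to \(\beta_1\) and setting the derivative equal to zero at the OLS estimates:
\[
\frac{\partial RSS}{\partial \beta_1} = -2 \sum_{i=1}^n X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0 \quad \Rightarrow \quad \sum_{i=1}^n X_i\,\hat{\varepsilon}_i = 0
\]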
The residuals are orthogonal to the explanatory variable.
Property 3 (P3, derived from P1 and P2): \(\displaystyle\sum_{i=1}^n \hat{Y}_i\,\hat{\varepsilon}_i = 0\)
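Since \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\) is a linear combination of the two quantities already known to be orthogonal to \(\hat{\varepsilon}_i\):
\[
\sum_{i=1}^n \hat{Y}_i\,\hat{\varepsilon}_i = \hat{\beta}_0 \sum_{i=1}^n \hat{\varepsilon}_i + \hat{\beta}_1 \sum_{i=1}^n X_i\,\hat{\varepsilon}_i = 0
\]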
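The three properties are easy to verify numerically. This is a minimal sketch on synthetic data (assumed values); the sums come out as zero only up to floating-point rounding.

```python
import numpy as np

# Synthetic stand-in for the campaign data (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(10, 100, size=140)
y = 50 + 3.0 * x + rng.normal(0, 25, size=140)

# OLS fit with an intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
resid = y - y_hat

p1 = np.sum(resid)            # P1: residuals sum to zero
p2 = np.sum(x * resid)        # P2: residuals are orthogonal to X
p3 = np.sum(y_hat * resid)    # P3: residuals are orthogonal to fitted values

print(f"P1: sum of residuals           = {p1:.2e}")
print(f"P2: sum of X_i * residual_i    = {p2:.2e}")
print(f"P3: sum of Yhat_i * residual_i = {p3:.2e}")
```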
Proof#
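Starting point. By definition of the residual, \(\hat{\varepsilon}_i = Y_i - \hat{Y}_i\), so:
\[
Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + \hat{\varepsilon}_i
\]
Step 1: Square and sum.
\[
\sum_{i=1}^n (Y_i - \bar{Y})^2 = \sum_{i=1}^n \left[(\hat{Y}_i - \bar{Y}) + \hat{\varepsilon}_i\right]^2
\]
Expanding the binomial square:
\[
\sum_{i=1}^n (Y_i - \bar{Y})^2 = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2 + 2 \sum_{i=1}^n (\hat{Y}_i - \bar{Y})\,\hat{\varepsilon}_i + \sum_{i=1}^n \hat{\varepsilon}_i^2
\]
Step 2: Show that the cross term is zero. Using P3 for the first sum and P1 for the second:
\[
\sum_{i=1}^n (\hat{Y}_i - \bar{Y})\,\hat{\varepsilon}_i = \sum_{i=1}^n \hat{Y}_i\,\hat{\varepsilon}_i - \bar{Y} \sum_{i=1}^n \hat{\varepsilon}_i = 0 - \bar{Y} \cdot 0 = 0
\]
Step 3: Conclude.
\[
\underbrace{\sum_{i=1}^n (Y_i - \bar{Y})^2}_{TSS} = \underbrace{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}_{ESS} + \underbrace{\sum_{i=1}^n \hat{\varepsilon}_i^2}_{RSS}
\]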
The equality \(TSS = ESS + RSS\) rests entirely on the algebraic properties of the OLS estimator: the fact that the residuals sum to zero (P1) and are orthogonal to the fitted values (P3). Both properties are direct consequences of minimizing the sum of squares and including an intercept in the model.