Hypothesis Testing in Simple Regression#
With OLS we obtain an estimate \(\hat{\beta}_1\) from our sample. But every sample is different — a different draw from the population would yield a different estimate. How can we tell whether the value we observed reflects a genuine population effect, or whether it could plausibly have arisen by chance even if the true effect were zero?
This is the central question of hypothesis testing. In this section we develop the tools to answer it: the distribution of \(\hat{\beta}_1\), the T-statistic, the logic of the test, and the p-value.
1. Why We Need More Than Mean and Variance#
From the previous section, we know two things about the OLS estimator:
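\[
E[\hat{\beta}_1] = \beta_1
\qquad\text{and}\qquad
\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{SST_x},
\qquad\text{where } SST_x = \sum_{i=1}^n (x_i - \bar{x})^2 .
\]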
These results tell us that our estimator is unbiased and give us a measure of its spread. But to make probabilistic statements — for example, “how likely is it to observe \(\hat{\beta}_1 = 0.08\) if the true \(\beta_1\) were zero?” — we need to know the full distribution of \(\hat{\beta}_1\), not just its first two moments.
In econometric practice, there are three main routes to that distribution:
Normality assumption on \(\varepsilon\): If the errors are normally distributed, then \(\hat{\beta}_1\) is exactly normal for any sample size \(n\). This is the approach we develop in Sections 2–3.
Asymptotic theory: As \(n \to \infty\), the Central Limit Theorem guarantees approximate normality of \(\hat{\beta}_1\) regardless of the error distribution. We return to this in Section 6.
Bootstrap: Resampling the data repeatedly builds an empirical sampling distribution of \(\hat{\beta}_1\) without any distributional assumption. This is a topic for a later chapter.
2. The Normality Assumption and the Distribution of \(\hat{\beta}_1\)#
The first approach assumes that the population errors are normally distributed:
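\[
\varepsilon_i \mid X \;\overset{iid}{\sim}\; N(0, \sigma^2).
\]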
Because \(\hat{\beta}_1\) is a linear combination of the \(\varepsilon_i\) (recall from the unbiasedness proof that \(\hat{\beta}_1 = \beta_1 + \sum c_i \varepsilon_i\) where \(c_i = (x_i - \bar{x})/SST_x\)), and any linear combination of independent normal random variables is normal, it follows that:
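\[
\hat{\beta}_1 \;\sim\; N\!\left(\beta_1,\; \frac{\sigma^2}{SST_x}\right).
\]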
This is an exact result — it holds for every sample size, not just large ones. However, it depends entirely on the normality of \(\varepsilon\).
3. The T-Distribution and the Test Statistic#
From a Z to a T#
If we knew \(\sigma^2\), we could standardize \(\hat{\beta}_1\) and use the standard normal:
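\[
Z = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2 / SST_x}} \;\sim\; N(0, 1).
\]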
But \(\sigma^2\) is an unknown population parameter. In practice we replace it with the estimate \(\hat{\sigma}^2\) derived from the residuals (see the variance estimation section):
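\[
\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^n \hat{\varepsilon}_i^2,
\qquad
SE(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{SST_x}}.
\]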
Substituting the estimated standard error, the standardized statistic is no longer normal — it follows a Student’s \(t\)-distribution:
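\[
T = \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \;\sim\; t_{n-2}.
\]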
The formal derivation of this result is in the appendix.
Why \(n - 2\) degrees of freedom?#
OLS estimates two parameters — \(\hat{\beta}_0\) and \(\hat{\beta}_1\) — which impose two linear constraints on the residuals. The residual vector therefore lies in an \((n-2)\)-dimensional subspace, leaving \(n-2\) free pieces of information to estimate \(\sigma^2\). This is why we divide by \(n-2\) in \(\hat{\sigma}^2\), and why the \(t\)-distribution has \(n-2\) degrees of freedom.
The T-distribution vs. the Normal#
The \(t\)-distribution is symmetric and bell-shaped like the standard normal, but with heavier tails. This extra tail weight reflects the additional uncertainty from having to estimate \(\sigma^2\) rather than knowing it. As the degrees of freedom grow (i.e., as \(n\) increases), the tails become thinner and \(t_{n-2} \to N(0,1)\).
4. The Logic of the Test and the P-Value#
Setting up the test#
A hypothesis test starts with a null hypothesis — a specific claim about the population parameter that we are willing to entertain. The most common null in regression is:
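\[
H_0:\ \beta_1 = 0
\qquad\text{vs.}\qquad
H_1:\ \beta_1 \neq 0.
\]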
If \(H_0\) is true then \(X\) has no linear effect on \(Y\). We can also test other values, for instance \(H_0: \beta_1 = 1\) when the model is log-log and we want to test for unit elasticity.
Under \(H_0: \beta_1 = 0\), the T-statistic simplifies to:
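\[
\hat{T} = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}.
\]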
The rejection region#
The logic is: if \(H_0\) is true, large values of \(|\hat{T}|\) are unlikely. We reject \(H_0\) when \(|\hat{T}|\) exceeds a threshold determined by the significance level \(\alpha\) — the probability we are willing to tolerate of rejecting \(H_0\) when it is actually true.
For a two-sided test at \(\alpha = 0.05\), the rejection region is:
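\[
\text{Reject } H_0 \quad\text{if}\quad |\hat{T}| > t_{n-2,\,0.025}.
\]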
For large \(n\), the critical value \(t_{n-2,\, 0.025} \approx 1.96\).
The P-value#
Rather than just reporting whether we cross a fixed threshold, the p-value summarizes how much evidence the data provide against \(H_0\):
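\[
p = P\!\left(|T| \geq |\hat{T}| \;\middle|\; H_0\right),
\]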
where the probability is computed using the \(t_{n-2}\) distribution. A small p-value means that a \(|T|\) as large as the one we observed would be rare if \(H_0\) were true — we take this as evidence against \(H_0\). By convention, a result with a p-value below 0.05 is called statistically significant at the 5% level.
What the p-value is not: it is not the probability that \(H_0\) is true. The p-value is a statement about the data given \(H_0\), not about \(H_0\) given the data.
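As a concrete numerical illustration, the critical value and the two-sided p-value can be read directly off the \(t_{n-2}\) distribution. The snippet below is a minimal sketch rather than code from this page; it assumes `scipy` is available, and the sample size and observed statistic are made-up values.

```python
from scipy import stats

n = 50        # sample size (illustrative)
df = n - 2    # degrees of freedom in simple regression
t_hat = 2.3   # observed T-statistic (illustrative)

# Two-sided critical value at alpha = 0.05: leaves 2.5% in each tail
t_crit = stats.t.ppf(0.975, df)

# Two-sided p-value: probability of |T| >= |t_hat| under t_{n-2}
p_value = 2 * stats.t.sf(abs(t_hat), df)

print(f"critical value: {t_crit:.3f}")   # roughly 2.01 for df = 48
print(f"p-value:        {p_value:.4f}")
```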
5. Statistical Significance as Signal-to-Noise#
The T-statistic has a natural interpretation as a signal-to-noise ratio:
Signal = \(\hat{\beta}_1\): the estimated magnitude of the effect.
Noise = \(SE(\hat{\beta}_1) = \hat{\sigma}/\sqrt{SST_x}\): the sampling uncertainty around that estimate.
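Putting the two together,

\[
\hat{T} = \frac{\text{signal}}{\text{noise}}
= \frac{\hat{\beta}_1}{\hat{\sigma}/\sqrt{SST_x}}
= \frac{\hat{\beta}_1 \sqrt{SST_x}}{\hat{\sigma}}.
\]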
A result is statistically significant when the signal is large relative to the noise. There are three ways this can happen:
The true effect \(\beta_1\) is large — a stronger signal is easier to detect.
The residual variance \(\hat{\sigma}^2\) is small — clean data means less noise.
\(SST_x\) is large — a large sample or a wide range of \(X\) values provides more information to pin down the slope.
This decomposition has an important practical implication: statistical insignificance does not mean \(\beta_1 = 0\). It may simply mean the sample is too small, or the data too noisy, to distinguish \(\beta_1\) from zero with confidence. The question of how large a sample we need to detect a given effect with a given probability is the subject of power analysis, covered in the next section.
6. Asymptotic Normality: Large Samples Free Us from the Normality Assumption#
The result \(T \sim t_{n-2}\) is exact when errors are normal. What if the normality assumption fails?
The Central Limit Theorem provides an answer. Recall that \(\hat{\beta}_1 - \beta_1 = \sum c_i \varepsilon_i\) with \(c_i = (x_i - \bar{x})/SST_x\), a weighted sum of the errors \(\varepsilon_i\). Under mild regularity conditions — finite variance and no single observation dominating the sum — the CLT applies and:
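\[
\frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \;\xrightarrow{\;d\;}\; N(0, 1)
\qquad\text{as } n \to \infty.
\]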
This has two practical consequences:
For moderate-to-large samples, we can rely on the approximate normality of \(\hat{T}\) even when errors are non-normal. The \(t_{n-2}\) critical values converge to those of the standard normal (e.g., 1.96 at 5%), which is what most software reports by default.
The normality assumption on \(\varepsilon\) is most consequential in small samples. When \(n\) is small and errors are visibly non-normal (e.g., heavily skewed or fat-tailed), the t-distribution approximation may be unreliable, and bootstrap methods become more attractive.
A common informal rule of thumb is \(n \geq 30\) as a minimum for the normal approximation to the t-test to be reasonable, though this depends on how non-normal the errors actually are.
7. Interactive Simulation#
The simulation below lets you explore how the T-test behaves under different conditions. You can vary the error distribution (normal, skewed, heavy-tailed), the sample size \(n\), and the true value of \(\beta_1\). In particular, observe how:
Under normality, the empirical distribution of \(\hat{T}\) closely follows the theoretical \(t_{n-2}\).
As \(n\) grows, the approximation holds even when errors are non-normal — the distribution of \(\hat{T}\) converges to a standard normal regardless of the error shape.
A larger true \(\beta_1\) or a smaller \(\sigma^2\) shifts the distribution of \(\hat{T}\) away from zero, making it easier to reject \(H_0\).
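For readers without access to the interactive widget, the same experiment can be reproduced offline. The sketch below is an illustrative stand-in, not the page's actual simulation code: it assumes `numpy` is available, and the function name, error distributions, and parameter values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_t_stats(n=30, beta1=0.0, n_sims=5000, errors="normal"):
    """Draw repeated samples from Y = beta0 + beta1*X + eps and return the
    T-statistic for H0: beta1 = 0 computed from each simulated sample."""
    beta0 = 1.0
    t_stats = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.uniform(0, 10, size=n)
        if errors == "normal":
            eps = rng.normal(0, 1, size=n)
        elif errors == "skewed":
            eps = rng.exponential(1.0, size=n) - 1.0   # mean-zero, right-skewed
        else:
            eps = rng.standard_t(df=3, size=n)         # heavy-tailed
        y = beta0 + beta1 * x + eps

        # OLS slope, residual variance, and standard error
        x_dev = x - x.mean()
        sst_x = np.sum(x_dev ** 2)
        b1 = np.sum(x_dev * y) / sst_x
        b0 = y.mean() - b1 * x.mean()
        resid = y - b0 - b1 * x
        sigma2_hat = np.sum(resid ** 2) / (n - 2)
        se_b1 = np.sqrt(sigma2_hat / sst_x)

        t_stats[s] = b1 / se_b1   # T-statistic under H0: beta1 = 0
    return t_stats

# Under H0 with normal errors, about 5% of |T| should exceed the 5% critical value
t_stats = simulate_t_stats(n=30, beta1=0.0, errors="normal")
print("rejection rate at |T| > 2.048:", np.mean(np.abs(t_stats) > 2.048))
```

Running it under \(H_0\) with normal errors, the rejection rate should land close to the nominal 5%; varying the error shape, \(n\), and the true \(\beta_1\) reproduces the patterns described in the bullets above.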
Appendix: Formal Proofs#
A.1 Unbiasedness of \(\hat{\sigma}^2\)#
We show that \(E[\hat{\sigma}^2] = \sigma^2\), where \(\hat{\sigma}^2 = \sum_{i=1}^n \hat{\varepsilon}_i^2 / (n-2)\). Equivalently, we show that \(E\!\left[\sum_{i=1}^n \hat{\varepsilon}_i^2\right] = (n-2)\sigma^2\).
Step 1 — Express residuals in terms of population errors.
The OLS residuals are \(\hat{\varepsilon}_i = Y_i - \hat{Y}_i = \varepsilon_i - (\hat{\beta}_0 - \beta_0) - (\hat{\beta}_1 - \beta_1)x_i\). OLS imposes two exact linear constraints on the residuals:
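\[
\sum_{i=1}^n \hat{\varepsilon}_i = 0
\qquad\text{and}\qquad
\sum_{i=1}^n x_i \hat{\varepsilon}_i = 0.
\]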
These two constraints mean that the residual vector \(\hat{\boldsymbol{\varepsilon}}\) lies in an \((n-2)\)-dimensional subspace of \(\mathbb{R}^n\) — two dimensions have been “used up” projecting out the fitted values.
Step 2 — Take the expectation of \(RSS\).
Using the hat matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) (with \(\mathbf{X}\) the \(n \times 2\) design matrix), we can write \(\hat{\boldsymbol{\varepsilon}} = (\mathbf{I} - \mathbf{H})\boldsymbol{\varepsilon}\). Since \(\mathbf{I} - \mathbf{H}\) is an idempotent matrix of rank \(n-2\):
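\[
E\!\left[\sum_{i=1}^n \hat{\varepsilon}_i^2\right]
= E\!\left[\boldsymbol{\varepsilon}'(\mathbf{I} - \mathbf{H})\boldsymbol{\varepsilon}\right]
= \sigma^2 \,\mathrm{tr}(\mathbf{I} - \mathbf{H})
= \sigma^2 (n - 2).
\]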
Conclusion:
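\[
E[\hat{\sigma}^2]
= \frac{E\!\left[\sum_{i=1}^n \hat{\varepsilon}_i^2\right]}{n-2}
= \frac{(n-2)\sigma^2}{n-2}
= \sigma^2.
\]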
A.2 The T-Statistic Follows \(t_{n-2}\) Under Normality#
Under the normality assumption \(\varepsilon_i \mid X \overset{iid}{\sim} N(0,\sigma^2)\), we establish three facts and combine them.
Fact 1. Since \(\hat{\beta}_1 = \beta_1 + \sum c_i \varepsilon_i\) is a linear combination of independent normals:
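\[
\hat{\beta}_1 \;\sim\; N\!\left(\beta_1, \frac{\sigma^2}{SST_x}\right),
\qquad\text{so}\qquad
Z \equiv \frac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{SST_x}} \;\sim\; N(0, 1).
\]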
Fact 2. Under normality, \((n-2)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-2}\). This follows from the fact that \(\hat{\boldsymbol{\varepsilon}} = (\mathbf{I}-\mathbf{H})\boldsymbol{\varepsilon}\) and \((\mathbf{I}-\mathbf{H})\) is an idempotent projection of rank \(n-2\); a result from normal distribution theory (Cochran’s theorem) gives the chi-squared distribution.
Fact 3. \(\hat{\sigma}^2\) and \(\hat{\beta}_1\) are independent under normality. This is because \(\hat{\beta}_1\) depends on \(\mathbf{H}\boldsymbol{\varepsilon}\) and \(\hat{\sigma}^2\) depends on \((\mathbf{I}-\mathbf{H})\boldsymbol{\varepsilon}\); these are orthogonal projections of a normal vector, hence independent.
Combining. By the definition of the \(t\)-distribution (a standard normal divided by the square root of an independent \(\chi^2/\nu\)):
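\[
T
= \frac{Z}{\sqrt{\bigl[(n-2)\hat{\sigma}^2/\sigma^2\bigr]/(n-2)}}
= \frac{(\hat{\beta}_1 - \beta_1)\big/\bigl(\sigma/\sqrt{SST_x}\bigr)}{\hat{\sigma}/\sigma}
= \frac{\hat{\beta}_1 - \beta_1}{\hat{\sigma}/\sqrt{SST_x}}
= \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)}
\;\sim\; t_{n-2}.
\]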
Setting \(\beta_1 = 0\) under \(H_0\) gives \(\hat{T} = \hat{\beta}_1 / SE(\hat{\beta}_1) \sim t_{n-2}\).