# Hypothesis Testing in Simple Regression

With OLS we obtain an estimate $\hat{\beta}_1$ from our sample. But every sample is different — a different draw from the population would yield a different estimate. How can we tell whether the value we observed reflects a genuine population effect, or whether it could plausibly have arisen by chance even if the true effect were zero?

This is the central question of **hypothesis testing**. In this section we develop the tools to answer it: the distribution of $\hat{\beta}_1$, the T-statistic, the logic of the test, and the p-value.

---

## 1. Why We Need More Than Mean and Variance

From the [previous section](3_statistical_properties.md), we know two things about the OLS estimator:

$$E[\hat{\beta}_1] = \beta_1, \qquad \text{Var}(\hat{\beta}_1 \mid X) = \frac{\sigma^2}{SST_x}$$

These results tell us that our estimator is unbiased and give us a measure of its spread. But to make probabilistic statements — for example, "how likely is it to observe $\hat{\beta}_1 = 0.08$ if the true $\beta_1$ were zero?" — we need to know the *full distribution* of $\hat{\beta}_1$, not just its first two moments.

In econometric practice, there are three main routes to that distribution:

1. **Normality assumption on $\varepsilon$:** If the errors are normally distributed, then $\hat{\beta}_1$ is exactly normal for any sample size $n$. This is the approach we develop in Sections 2–3.
2. **Asymptotic theory:** As $n \to \infty$, the Central Limit Theorem guarantees approximate normality of $\hat{\beta}_1$ regardless of the error distribution. We return to this in Section 6.
3. **Bootstrap:** One resamples the data repeatedly to build an empirical sampling distribution of $\hat{\beta}_1$ without any distributional assumption. This is a topic for a later chapter.

---

## 2. The Normality Assumption and the Distribution of $\hat{\beta}_1$

The first approach assumes that the population errors are normally distributed:

$$\varepsilon_i \mid X \sim N(0, \sigma^2)$$

Because $\hat{\beta}_1$ is a linear combination of the $\varepsilon_i$ (recall from the {ref}`unbiasedness proof <proof-unbiasedness>` that $\hat{\beta}_1 = \beta_1 + \sum c_i \varepsilon_i$ where $c_i = (x_i - \bar{x})/SST_x$), and any linear combination of independent normal random variables is normal, it follows that:

$$\hat{\beta}_1 \sim N\!\left(\beta_1,\; \frac{\sigma^2}{SST_x}\right)$$

This is an exact result — it holds for every sample size, not just large ones. However, it depends entirely on the normality of $\varepsilon$.
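
A quick way to see this result is to simulate it. The sketch below uses illustrative values for $\beta_0$, $\beta_1$, $\sigma$ and a fixed set of $x$ values (none of them from the text), draws many samples with normal errors, and compares the spread of the resulting $\hat{\beta}_1$ estimates with the theoretical $\sqrt{\sigma^2 / SST_x}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population values, not taken from the text
beta0, beta1, sigma, n = 1.0, 0.5, 2.0, 40
x = rng.uniform(0, 10, size=n)            # regressor values held fixed across replications
sst_x = np.sum((x - x.mean()) ** 2)

def ols_slope(y, x):
    """OLS slope: sum((x_i - xbar)(y_i - ybar)) / SST_x."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Re-estimate the slope on many samples drawn with normal errors
b1_draws = np.array([
    ols_slope(beta0 + beta1 * x + rng.normal(0, sigma, size=n), x)
    for _ in range(20_000)
])

print("mean of beta1_hat:", b1_draws.mean())           # close to beta1 = 0.5
print("sd of beta1_hat:  ", b1_draws.std())            # close to sqrt(sigma^2 / SST_x)
print("theoretical sd:   ", np.sqrt(sigma**2 / sst_x))
```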

---

## 3. The T-Distribution and the Test Statistic

### From a Z to a T

If we knew $\sigma^2$, we could standardize $\hat{\beta}_1$ and use the standard normal:

$$Z = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2 / SST_x}} \sim N(0, 1)$$

But $\sigma^2$ is an unknown population parameter. In practice we replace it with the estimate $\hat{\sigma}^2$ derived from the residuals (see the {ref}`variance estimation section <variance-estimation>`):

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^n \hat{\varepsilon}_i^2}{n-2}, \qquad SE(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{SST_x}}$$

Substituting the estimated standard error, the standardized statistic is no longer normal — it follows a **Student's $t$-distribution**:

$$T = \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim t_{n-2}$$

The formal derivation of this result is in the {ref}`appendix <proof-t-distribution>`.
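
To make the formulas concrete, here is a minimal sketch that computes $\hat{\sigma}^2$, $SE(\hat{\beta}_1)$, and the T-statistic by hand on one simulated sample. The data-generating values are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# One simulated sample; the data-generating values are illustrative assumptions
n, beta0, beta1, sigma = 50, 1.0, 0.5, 2.0
x = rng.uniform(0, 10, size=n)
y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)

# OLS estimates of the slope and intercept
sst_x = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sst_x
b0 = y.mean() - b1 * x.mean()

# Residuals, sigma^2 estimated with n - 2 degrees of freedom, and the standard error
resid = y - (b0 + b1 * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)
se_b1 = np.sqrt(sigma2_hat / sst_x)

# T-statistic for H0: beta1 = 0
t_stat = b1 / se_b1
print(f"beta1_hat = {b1:.3f}, SE = {se_b1:.3f}, T = {t_stat:.2f}")
```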

### Why $n - 2$ degrees of freedom?

OLS estimates two parameters — $\hat{\beta}_0$ and $\hat{\beta}_1$ — which impose two linear constraints on the residuals. The residual vector therefore lies in an $(n-2)$-dimensional subspace, leaving $n-2$ free pieces of information to estimate $\sigma^2$. This is why we divide by $n-2$ in $\hat{\sigma}^2$, and why the $t$-distribution has $n-2$ degrees of freedom.

### The T-distribution vs. the Normal

The $t$-distribution is symmetric and bell-shaped like the standard normal, but with **heavier tails**. This extra tail weight reflects the additional uncertainty from having to estimate $\sigma^2$ rather than knowing it. As the degrees of freedom grow (i.e., as $n$ increases), the tails become thinner and $t_{n-2} \to N(0,1)$.
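
The convergence is easy to check numerically. The snippet below uses `scipy.stats` to compare the two-sided 5% critical value of $t_{n-2}$ with the standard normal value for a few sample sizes.

```python
from scipy import stats

# Two-sided 5% critical values t_{n-2, 0.025} (the upper 2.5% quantile) vs. the normal's 1.96
for n in [5, 10, 20, 30, 60, 120, 1000]:
    print(f"n = {n:5d}: t critical = {stats.t.ppf(0.975, df=n - 2):.3f}")
print(f"normal critical = {stats.norm.ppf(0.975):.3f}")
```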

---

## 4. The Logic of the Test and the P-Value

### Setting up the test

A hypothesis test starts with a **null hypothesis** — a specific claim about the population parameter that we are willing to entertain. The most common null in regression is:

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0$$

If $H_0$ is true then $X$ has no linear effect on $Y$. We can also test other values, for instance $H_0: \beta_1 = 1$ when the model is log-log and we want to test for unit elasticity.

Under $H_0: \beta_1 = 0$, the T-statistic simplifies to:

$$\hat{T} = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \sim t_{n-2} \quad \text{(under } H_0\text{)}$$

### The rejection region

The logic is: if $H_0$ is true, large values of $|\hat{T}|$ are unlikely. We reject $H_0$ when $|\hat{T}|$ exceeds a threshold determined by the **significance level** $\alpha$ — the probability we are willing to tolerate of rejecting $H_0$ when it is actually true.

For a two-sided test at $\alpha = 0.05$, the rejection region is:

$$|\hat{T}| > t_{n-2,\, 0.025}$$

For large $n$, the critical value $t_{n-2,\, 0.025} \approx 1.96$.

### The P-value

Rather than just reporting whether we cross a fixed threshold, the **p-value** summarizes how much evidence the data provide against $H_0$:

$$p\text{-value} = P\!\left(|T_{n-2}| > |\hat{T}|\right)$$

where the probability is computed using the $t_{n-2}$ distribution. A small p-value means that a $|T|$ as large as the one we observed would be rare if $H_0$ were true — we take this as evidence against $H_0$. By convention, a p-value below 0.05 is called statistically significant at the 5% level.

**What the p-value is not:** it is not the probability that $H_0$ is true. The p-value is a statement about the data given $H_0$, not about $H_0$ given the data.
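
As a small worked example, the two-sided p-value can be computed from the survival function of the $t_{n-2}$ distribution. The observed statistic and sample size below are placeholders, not values from the text.

```python
from scipy import stats

# Two-sided p-value for an observed T-statistic; t_obs and n are placeholder values
t_obs, n = 2.31, 50
p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)    # P(|T_{n-2}| > |t_obs|)
print(f"p-value = {p_value:.4f}")

# Large-sample comparison: the standard normal approximation
p_normal = 2 * stats.norm.sf(abs(t_obs))
print(f"normal-approximation p-value = {p_normal:.4f}")
```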

---

## 5. Statistical Significance as Signal-to-Noise

The T-statistic has a natural interpretation as a **signal-to-noise ratio**:

$$\hat{T} = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{\text{signal}}{\text{noise}}$$

- **Signal** = $\hat{\beta}_1$: the estimated magnitude of the effect.
- **Noise** = $SE(\hat{\beta}_1) = \hat{\sigma}/\sqrt{SST_x}$: the sampling uncertainty around that estimate.

A result is statistically significant when the signal is large relative to the noise. There are three ways this can happen:

1. **The true effect $\beta_1$ is large** — a stronger signal is easier to detect.
2. **The residual variance $\hat{\sigma}^2$ is small** — clean data means less noise.
3. **$SST_x$ is large** — a large sample or a wide range of $X$ values provides more information to pin down the slope.

This decomposition has an important practical implication: **statistical insignificance does not mean $\beta_1 = 0$**. It may simply mean the sample is too small, or the data too noisy, to distinguish $\beta_1$ from zero with confidence. The question of how large a sample we need to detect a given effect with a given probability is the subject of *power analysis*, covered in the [next section](5_power_calculation.md).
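
The point can be illustrated with a small Monte Carlo sketch: with the same nonzero $\beta_1$, a small noisy sample rarely rejects $H_0$, while a larger sample usually does. All parameter values here are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def rejection_rate(n, beta1, sigma, reps=5_000, alpha=0.05):
    """Share of simulated samples in which H0: beta1 = 0 is rejected at level alpha."""
    crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    rejections = 0
    for _ in range(reps):
        x = rng.uniform(0, 10, size=n)
        y = 1.0 + beta1 * x + rng.normal(0, sigma, size=n)
        sst_x = np.sum((x - x.mean()) ** 2)
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / sst_x
        resid = y - (y.mean() - b1 * x.mean()) - b1 * x
        se_b1 = np.sqrt(np.sum(resid ** 2) / (n - 2) / sst_x)
        rejections += abs(b1 / se_b1) > crit
    return rejections / reps

# Same nonzero slope: rarely detected in a small noisy sample, usually detected in a large one
print(rejection_rate(n=10, beta1=0.2, sigma=3.0))
print(rejection_rate(n=200, beta1=0.2, sigma=3.0))
```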

---

## 6. Asymptotic Normality: Large Samples Free Us from the Normality Assumption

The result $T \sim t_{n-2}$ is **exact** when errors are normal. What if the normality assumption fails?

The Central Limit Theorem provides an answer. Recall that $\hat{\beta}_1 - \beta_1 = \sum c_i \varepsilon_i$ with $c_i = (x_i - \bar{x})/SST_x$, a weighted sum of the errors $\varepsilon_i$. Under mild regularity conditions — finite variance and no single observation dominating the sum — the CLT applies and:

$$\frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \xrightarrow{d} N(0,1) \quad \text{as } n \to \infty$$

This has two practical consequences:

1. For moderate-to-large samples, we can rely on the approximate normality of $\hat{T}$ even when errors are non-normal. The $t_{n-2}$ critical values converge to those of the standard normal (e.g., 1.96 at 5%), which is what most software reports by default.
2. The normality assumption on $\varepsilon$ is most consequential in **small samples**. When $n$ is small and errors are visibly non-normal (e.g., heavily skewed or fat-tailed), the t-distribution approximation may be unreliable, and bootstrap methods become more attractive.

A common informal rule of thumb is $n \geq 30$ as a minimum for the normal approximation to the t-test to be reasonable, though this depends on how non-normal the errors actually are.
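
A rough way to check this in practice is to simulate the test under $H_0$ with deliberately non-normal errors and see how close the empirical rejection rate comes to the nominal 5%. The sketch below uses mean-zero exponential (skewed) errors purely as an illustrative choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def t_stats_under_h0(n, reps=10_000):
    """T-statistics for H0: beta1 = 0 when beta1 really is 0 but the errors are skewed."""
    out = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 10, size=n)
        eps = rng.exponential(1.0, size=n) - 1.0      # skewed, mean-zero errors
        y = 1.0 + eps                                  # true slope is zero
        sst_x = np.sum((x - x.mean()) ** 2)
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / sst_x
        resid = y - (y.mean() - b1 * x.mean()) - b1 * x
        se_b1 = np.sqrt(np.sum(resid ** 2) / (n - 2) / sst_x)
        out[r] = b1 / se_b1
    return out

# Empirical rejection rate of a nominal 5% test; it should approach 0.05 as n grows
for n in [10, 30, 200]:
    t = t_stats_under_h0(n)
    rate = np.mean(np.abs(t) > stats.t.ppf(0.975, df=n - 2))
    print(f"n = {n:4d}: empirical size = {rate:.3f}")
```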

---

## 7. Interactive Simulation

The simulation below lets you explore how the T-test behaves under different conditions. You can vary the error distribution (normal, skewed, heavy-tailed), the sample size $n$, and the true value of $\beta_1$. In particular, observe how:

- Under normality, the empirical distribution of $\hat{T}$ closely follows the theoretical $t_{n-2}$.
- As $n$ grows, the approximation holds even when errors are non-normal — the distribution of $\hat{T}$ converges to a standard normal regardless of the error shape.
- A larger true $\beta_1$ or a smaller $\sigma^2$ shifts the distribution of $\hat{T}$ away from zero, making it easier to reject $H_0$.

<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://simuecon.com/ttest/?lang=en" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border: 0;" allowfullscreen></iframe>
</div>

---

## Appendix: Formal Proofs

(proof-sigma-hat)=
### A.1 Unbiasedness of $\hat{\sigma}^2$

We show that $E[\hat{\sigma}^2] = \sigma^2$, where $\hat{\sigma}^2 = \sum_{i=1}^n \hat{\varepsilon}_i^2 / (n-2)$. Equivalently, we show that $E\!\left[\sum \hat{\varepsilon}_i^2\right] = (n-2)\sigma^2$.

**Step 1 — Express residuals in terms of population errors.**

The OLS residuals are $\hat{\varepsilon}_i = Y_i - \hat{Y}_i = \varepsilon_i - (\hat{\beta}_0 - \beta_0) - (\hat{\beta}_1 - \beta_1)x_i$. OLS imposes two exact linear constraints on the residuals:

$$\sum_{i=1}^n \hat{\varepsilon}_i = 0, \qquad \sum_{i=1}^n x_i \hat{\varepsilon}_i = 0$$

These two constraints mean that the residual vector $\hat{\boldsymbol{\varepsilon}}$ lies in an $(n-2)$-dimensional subspace of $\mathbb{R}^n$ — two dimensions have been "used up" projecting out the fitted values.

**Step 2 — Take the expectation of $RSS$.**

Using the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ (with $\mathbf{X}$ the $n \times 2$ design matrix), we can write $\hat{\boldsymbol{\varepsilon}} = (\mathbf{I} - \mathbf{H})\boldsymbol{\varepsilon}$. Since $\mathbf{I} - \mathbf{H}$ is an idempotent matrix of rank $n-2$:

$$E\!\left[\sum_{i=1}^n \hat{\varepsilon}_i^2\right] = E[\boldsymbol{\varepsilon}'(\mathbf{I}-\mathbf{H})\boldsymbol{\varepsilon}] = \sigma^2 \operatorname{tr}(\mathbf{I}-\mathbf{H}) = \sigma^2(n-2)$$

**Conclusion:**

$$E[\hat{\sigma}^2] = \frac{E\!\left[\sum \hat{\varepsilon}_i^2\right]}{n-2} = \sigma^2$$
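
Both steps can be verified numerically for a particular design. The sketch below builds the hat matrix for one illustrative set of $x$ values, checks that $\operatorname{tr}(\mathbf{I}-\mathbf{H}) = n-2$, and approximates $E\!\left[\sum \hat{\varepsilon}_i^2\right]$ by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(4)

n, sigma = 30, 2.0
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])                  # n x 2 design matrix

# Hat matrix H = X (X'X)^{-1} X'; I - H is symmetric, idempotent, with trace n - 2
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H
print("trace(I - H) =", np.trace(M))                  # equals n - 2 = 28 up to rounding

# Monte Carlo check that E[RSS] = (n - 2) * sigma^2 for this fixed design
rss = np.array([np.sum((M @ rng.normal(0, sigma, size=n)) ** 2) for _ in range(20_000)])
print("average RSS   =", rss.mean())
print("(n-2)*sigma^2 =", (n - 2) * sigma**2)
```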

(proof-t-distribution)=
### A.2 The T-Statistic Follows $t_{n-2}$ Under Normality

Under the normality assumption $\varepsilon_i \mid X \overset{iid}{\sim} N(0,\sigma^2)$, we establish three facts and combine them.

**Fact 1.** Since $\hat{\beta}_1 = \beta_1 + \sum c_i \varepsilon_i$ is a linear combination of independent normals:

$$\hat{\beta}_1 \sim N\!\left(\beta_1,\, \frac{\sigma^2}{SST_x}\right) \implies Z \equiv \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2/SST_x}} \sim N(0,1)$$

**Fact 2.** Under normality, $(n-2)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-2}$. This follows from the fact that $\hat{\boldsymbol{\varepsilon}} = (\mathbf{I}-\mathbf{H})\boldsymbol{\varepsilon}$ and $(\mathbf{I}-\mathbf{H})$ is an idempotent projection of rank $n-2$; a result from normal distribution theory (Cochran's theorem) gives the chi-squared distribution.

**Fact 3.** $\hat{\sigma}^2$ and $\hat{\beta}_1$ are **independent** under normality. This is because $\hat{\beta}_1$ depends on $\mathbf{H}\boldsymbol{\varepsilon}$ and $\hat{\sigma}^2$ depends on $(\mathbf{I}-\mathbf{H})\boldsymbol{\varepsilon}$; these are orthogonal projections of a normal vector, hence independent.

**Combining.** By the definition of the $t$-distribution (a standard normal divided by the square root of an independent $\chi^2/\nu$):

$$T = \frac{Z}{\sqrt{(n-2)\hat{\sigma}^2/(\sigma^2(n-2))}} = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2/SST_x}} = \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim t_{n-2}$$

Setting $\beta_1 = 0$ under $H_0$ gives $\hat{T} = \hat{\beta}_1 / SE(\hat{\beta}_1) \sim t_{n-2}$.
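
The three facts can also be checked by simulation. The sketch below fixes an illustrative design, draws many samples with normal errors, and inspects the mean of $(n-2)\hat{\sigma}^2/\sigma^2$, the correlation between $\hat{\beta}_1$ and $\hat{\sigma}^2$ (which should be near zero, consistent with independence), and the tail probability of the T-statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

n, beta0, beta1, sigma, reps = 15, 1.0, 0.5, 2.0, 20_000
x = rng.uniform(0, 10, size=n)                 # design held fixed across replications
sst_x = np.sum((x - x.mean()) ** 2)

b1_draws = np.empty(reps)
s2_draws = np.empty(reps)
t_draws = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sst_x
    resid = y - (y.mean() - b1 * x.mean()) - b1 * x
    s2 = np.sum(resid ** 2) / (n - 2)
    b1_draws[r], s2_draws[r] = b1, s2
    t_draws[r] = (b1 - beta1) / np.sqrt(s2 / sst_x)

# Fact 2: (n-2) * sigma2_hat / sigma^2 should behave like chi-squared with n-2 df (mean n-2)
print("mean of (n-2)*s2/sigma^2:", ((n - 2) * s2_draws / sigma**2).mean())
# Fact 3: correlation between beta1_hat and sigma2_hat should be near zero
print("corr(b1_hat, s2_hat):    ", np.corrcoef(b1_draws, s2_draws)[0, 1])
# Combined: the tail of T should match t_{n-2}, about 5% beyond the 0.975 quantile
crit = stats.t.ppf(0.975, df=n - 2)
print("P(|T| > t_crit):         ", np.mean(np.abs(t_draws) > crit))
```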
