Statistical Power and Sample Size

Suppose you have a theory: a new educational intervention improves student test scores by two points. You design a study, collect data, run the regression — and the result is not statistically significant. Do you conclude the intervention doesn’t work? Not necessarily. The problem may be that the study was too small to detect that effect. This is not a failure of the theory; it is a failure of design.

This section introduces power analysis: the tool for answering the design question before collecting any data. How large does the sample need to be to detect the effect you care about, with a reasonable probability of success?

1. The Connection to the T-Test

In the previous section we saw that the T-statistic has a natural interpretation as a signal-to-noise ratio:

\[\hat{T} = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\hat{\sigma}/\sqrt{SST_x}}\]

We reject $H_0: \beta_1 = 0$ when $|\hat{T}|$ exceeds the critical value $t_{n-2,\, \alpha/2}$. The power of the test is the probability that this happens when the true effect is not zero:

\[\text{Power} = P\!\left(|\hat{T}| > t_{n-2,\, \alpha/2} \;\Big|\; \beta_1 \neq 0\right)\]

When $\beta_1 \neq 0$, $\hat{T}$ no longer follows the t-distribution centered at zero. It follows a non-central t-distribution with non-centrality parameter $\lambda = \beta_1 / (\sigma/\sqrt{SST_x})$, the true effect divided by the true standard deviation of $\hat{\beta}_1$. The larger $|\lambda|$, the more the distribution is displaced and the higher the probability of rejecting $H_0$.

Power depends on four ingredients that trade off against each other:

Ingredient	Direction	Intuition
Larger true effect $|\beta_1|$	↑ power	Stronger signal relative to the noise
Smaller error std. dev. $\sigma$	↑ power	Cleaner data, less noise
Larger sample size $n$	↑ power	More information, $SE(\hat{\beta}_1)$ falls as $1/\sqrt{n}$
Larger significance level $\alpha$	↑ power	Lower bar, but more Type I errors
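
The four directions in the table can be checked numerically with the large-sample normal approximation, in which $\hat{T}$ is treated as $N(\lambda, 1)$ with $\lambda = \beta_1 \sqrt{n}\,\sigma_X / \sigma$. The sketch below is an illustration under that assumption, not the exact non-central t calculation. The function name `approx_power` and the baseline parameter values are hypothetical choices for this example.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def approx_power(beta1, sigma, sigma_x, n, alpha=0.05):
    """Approximate two-sided power, treating T as N(lambda, 1).

    Assumption: large-sample normal approximation, so the result is
    slightly optimistic for small n.
    """
    lam = beta1 * sqrt(n) * sigma_x / sigma        # non-centrality parameter
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)     # about 1.96 for alpha = 0.05
    upper = 1 - STD_NORMAL.cdf(z_crit - lam)       # P(T > z_crit)
    lower = STD_NORMAL.cdf(-z_crit - lam)          # P(T < -z_crit)
    return upper + lower


base = approx_power(beta1=1, sigma=20, sigma_x=10, n=32)
print(f"baseline:        {base:.3f}")
print(f"larger effect:   {approx_power(2, 20, 10, 32):.3f}")
print(f"smaller sigma:   {approx_power(1, 10, 10, 32):.3f}")
print(f"larger n:        {approx_power(1, 20, 10, 64):.3f}")
print(f"larger alpha:    {approx_power(1, 20, 10, 32, alpha=0.10):.3f}")
```

Each of the four variations should print a value above the baseline, and setting `beta1=0` returns `alpha` itself.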

2. Interactive Simulation

The simulation generates thousands of samples from the model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, runs a regression on each one, and records the resulting T-statistic. The histogram shows the empirical distribution of those statistics; the black curve is the theoretical t-distribution. The table reports the fraction of simulations in which $H_0$ was rejected — that is the empirical power.

Before exploring, make a prediction: if you double the sample size, how do you expect the distribution of T to change? And the power?

Concrete things to observe:

Effect $\beta_1 = 0$: the T-statistics follow the zero-centered theoretical distribution almost perfectly. The rejection rate at the 5% level hovers around 5%, which is exactly the Type I error rate.
Increase $\beta_1$: the distribution shifts rightward. The red bars (rejection region) capture a larger fraction — power rises.
Increase sample size $n$: the distribution shifts further from zero, while its spread stays close to 1 because $\hat{T}$ is standardized. With $n = 500$ and $\beta_1 = 1$, power can exceed 90%.
Increase error std. dev. $\sigma$: the distribution moves back toward zero. You need a larger effect or a bigger sample to compensate.
Increase the spread of $X$ ($\sigma_X$): this increases $SST_x$, reduces $SE(\hat{\beta}_1)$, and raises power — just like increasing $n$.
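
For readers without access to the dashboard, the same experiment can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the dashboard's code. It assumes $\beta_0 = 0$, normally distributed $X$ and errors, and uses the normal critical value 1.96 in place of the exact $t_{n-2,\,\alpha/2}$. The function name `simulate_rejection_rate` and its defaults are hypothetical.

```python
import random
from math import sqrt


def simulate_rejection_rate(beta1, sigma, sigma_x, n, n_sims=2000,
                            crit=1.96, seed=0):
    """Fraction of simulated samples in which |T| exceeds `crit`.

    Assumptions: beta0 = 0, X and errors are normal, and the normal
    critical value 1.96 stands in for the exact t_{n-2} critical value.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        x = [rng.gauss(0.0, sigma_x) for _ in range(n)]
        y = [beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        x_bar = sum(x) / n
        y_bar = sum(y) / n
        sst_x = sum((xi - x_bar) ** 2 for xi in x)
        s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        b1 = s_xy / sst_x                      # OLS slope
        b0 = y_bar - b1 * x_bar                # OLS intercept
        ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
        sigma_hat = sqrt(ssr / (n - 2))        # residual standard error
        t_stat = b1 / (sigma_hat / sqrt(sst_x))
        if abs(t_stat) > crit:
            rejections += 1
    return rejections / n_sims


# No true effect: the rejection rate should sit near alpha.
print(simulate_rejection_rate(beta1=0, sigma=20, sigma_x=10, n=32))
# True effect of 1: the rejection rate is the empirical power.
print(simulate_rejection_rate(beta1=1, sigma=20, sigma_x=10, n=32))
```

Because 1.96 is slightly smaller than the t critical value for small $n$, the first number tends to land a little above 0.05. Re-running with a larger `n` or `beta1` shows the rejection rate climbing.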

3. The Sample Size Formula

Exploring the simulation builds intuition. Now we want a formula that allows us to calculate the required sample size before collecting any data.

The non-centrality parameter of the T-test is:

\[\lambda = \frac{\beta_1}{\sigma / \sqrt{n \cdot \sigma_X^2}} = \frac{\beta_1 \sqrt{n} \, \sigma_X}{\sigma}\]

For a target power of $1 - \phi$ (e.g., 0.80) and significance level $\alpha$ (e.g., 0.05), we need the non-central t-distribution with parameter $\lambda$ to place at least probability $1 - \phi$ beyond the critical value $t_{n-2,\,\alpha/2}$. Approximating the t-distributions with normal distributions, so that the critical value becomes $z_{\alpha/2}$, the condition reduces to:

\[\lambda \geq z_{\alpha/2} + z_{\phi}\]

where $z_p$ is the upper-$p$ critical value of the standard normal, defined by $P(Z > z_p) = p$ (that is, the $(1-p)$ quantile). Solving for $n$:

\[n \geq \left(\frac{(z_{\alpha/2} + z_{\phi}) \cdot \sigma}{\beta_1 \cdot \sigma_X}\right)^2\]

Numerical example. We want 80% power ($z_{0.20} \approx 0.84$) at the 5% level ($z_{0.025} \approx 1.96$), with $\beta_1 = 1$, $\sigma = 20$, and $\sigma_X = 10$:

\[n \geq \left(\frac{(1.96 + 0.84) \times 20}{1 \times 10}\right)^2 = \left(\frac{56}{10}\right)^2 = 31.36 \approx 32\]

With $n = 32$ and these parameters, power is approximately 80%. You can verify this in the simulation by setting those values.
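
The formula is easy to wrap in a small helper. The sketch below uses Python's standard-library `statistics.NormalDist` for the normal quantiles. The function name `required_n` and its defaults are hypothetical, and the result inherits the normal approximation behind the formula.

```python
from math import ceil
from statistics import NormalDist


def required_n(beta1, sigma, sigma_x, power=0.80, alpha=0.05):
    """Smallest n satisfying the normal-approximation sample size formula."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # upper alpha/2 critical value, about 1.96
    z_phi = z.inv_cdf(power)             # upper phi critical value, about 0.84
    return ceil(((z_alpha + z_phi) * sigma / (beta1 * sigma_x)) ** 2)


print(required_n(beta1=1, sigma=20, sigma_x=10))  # 32
```

Halving the effect to `beta1=0.5` roughly quadruples the answer, because $n$ scales with $1/\beta_1^2$.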

The cost of higher power

The formula reveals an important asymmetry: going from 80% to 90% power (raising $z_\phi$ from 0.84 to 1.28) increases the required sample size by roughly 34%. Going from 90% to 95% (raising $z_\phi$ to 1.645) adds about another 24%. Each additional percentage point of power is more expensive than the last:

\[n_{90\%} \approx 1.34 \times n_{80\%}, \qquad n_{95\%} \approx 1.66 \times n_{80\%}\]

This is one reason 80% is a common convention in social science: beyond that point, the marginal cost of additional power rises quickly.
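
Since $\sigma$, $\beta_1$, and $\sigma_X$ cancel in the ratio of two sample sizes, the relative cost of extra power depends only on the normal quantiles. The sketch below recomputes those ratios. The function name `n_ratio` is a hypothetical choice for this example.

```python
from statistics import NormalDist


def n_ratio(power, baseline_power=0.80, alpha=0.05):
    """Required n at `power` relative to required n at `baseline_power`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    numerator = (z_alpha + z.inv_cdf(power)) ** 2
    denominator = (z_alpha + z.inv_cdf(baseline_power)) ** 2
    return numerator / denominator


for target in (0.90, 0.95, 0.99):
    print(f"{target:.0%} power: {n_ratio(target):.2f} x the n needed for 80%")
```

With unrounded quantiles, the ratios come out near 1.34 for 90% power and 1.66 for 95% power.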

4. Type I and Type II Errors

Every hypothesis test can make two kinds of errors. Power is the probability of avoiding the second.

	$H_0$ true	$H_0$ false
Fail to reject $H_0$	Correct decision	Type II error (prob. $\phi$)
Reject $H_0$	Type I error (prob. $\alpha$)	Correct decision (power = $1-\phi$)

The Type I error (rejecting when $H_0$ is true) is controlled by $\alpha$. Setting $\alpha = 0.05$ guarantees that, if $\beta_1 = 0$ and the model assumptions hold, the probability of rejecting is exactly 5%.
The Type II error (failing to reject when $H_0$ is false) has probability $\phi = 1 - \text{power}$. Unlike $\alpha$, it is not fixed by the choice of test; it depends on the study design.

A frequent trap is concluding that a non-significant result implies no effect. A study with 30% power and a non-significant result provides little evidence that $\beta_1 = 0$; the design was simply not sensitive enough to detect the effect.

Appendix: Sample Size Derivation

A.1 Non-Central T-Distribution and Sample Size

When the true value is $\beta_1 \neq 0$, the test statistic for $H_0: \beta_1 = 0$ follows a non-central t-distribution:

\[\hat{T} = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \sim t_{n-2}(\lambda), \qquad \lambda = \frac{\beta_1}{\sigma/\sqrt{SST_x}} = \frac{\beta_1 \sqrt{SST_x}}{\sigma}\]

Power is:

\[1 - \phi = P\!\left(|\hat{T}| > t_{n-2,\, \alpha/2} \;\Big|\; \lambda\right) = 1 - F_{t_{n-2}(\lambda)}\!\left(t_{n-2,\,\alpha/2}\right) + F_{t_{n-2}(\lambda)}\!\left(-t_{n-2,\,\alpha/2}\right)\]

where $F_{t_\nu(\lambda)}$ is the CDF of the non-central t-distribution with $\nu$ degrees of freedom and non-centrality parameter $\lambda$.
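
The non-central t CDF has no simple closed form, but the power expression can be approximated by simulation from the defining representation $t_\nu(\lambda) = (Z + \lambda)/\sqrt{V/\nu}$, with $Z \sim N(0,1)$ and $V \sim \chi^2_\nu$ independent. The sketch below is a Monte Carlo illustration under that representation. The function names, the number of draws, and the simulated critical value used in place of an exact $t_{n-2,\,\alpha/2}$ are choices made for this example only.

```python
import random
from math import sqrt


def draw_nct(rng, df, lam=0.0):
    """One draw from t_df(lam) = (Z + lam) / sqrt(V / df)."""
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(df / 2, 2.0)    # chi-square(df) = Gamma(shape=df/2, scale=2)
    return (z + lam) / sqrt(v / df)


def mc_power(lam, df, alpha=0.05, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(|T| > t_{df, alpha/2}) for T ~ t_df(lam)."""
    rng = random.Random(seed)
    # Simulated critical value: the (1 - alpha) quantile of |T| when lam = 0.
    central = sorted(abs(draw_nct(rng, df)) for _ in range(n_draws))
    crit = central[round((1 - alpha) * n_draws) - 1]
    # Share of non-central draws that land in either rejection tail.
    hits = sum(abs(draw_nct(rng, df, lam)) > crit for _ in range(n_draws))
    return hits / n_draws


print(mc_power(lam=2.8, df=30))   # power for lambda = 2.8 with n - 2 = 30
print(mc_power(lam=0.0, df=30))   # should be close to alpha
```

The estimate for $\lambda = 2.8$ typically falls a little below 0.80, which reflects the heavier tails of the t-distribution relative to the normal approximation.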

Normal approximation. For large samples, $t_{n-2} \approx N(0,1)$ and $t_{n-2}(\lambda) \approx N(\lambda, 1)$. The power condition $1-\phi$ simplifies to:

\[P\!\left(|Z + \lambda| > z_{\alpha/2}\right) \approx 1 - \phi\]

For $\lambda > 0$ and small $\phi$, the left-tail term is negligible and the condition reduces to:

\[P\!\left(Z > z_{\alpha/2} - \lambda\right) = 1 - \phi \implies z_{\alpha/2} - \lambda = -z_\phi \implies \lambda = z_{\alpha/2} + z_\phi\]
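
To see how small the dropped left-tail term is, both tails can be evaluated at $\lambda = z_{\alpha/2} + z_\phi$ under the normal approximation. The snippet below is a quick numerical check with illustrative values $\alpha = 0.05$ and $\phi = 0.20$; the variable names are arbitrary.

```python
from statistics import NormalDist

z = NormalDist()
alpha, phi = 0.05, 0.20
z_alpha = z.inv_cdf(1 - alpha / 2)          # upper alpha/2 critical value
z_phi = z.inv_cdf(1 - phi)                  # upper phi critical value
lam = z_alpha + z_phi                       # lambda from the condition above

right_tail = 1 - z.cdf(z_alpha - lam)       # P(Z + lambda > z_alpha)
left_tail = z.cdf(-z_alpha - lam)           # P(Z + lambda < -z_alpha)

print(f"right tail: {right_tail:.6f}")
print(f"left tail:  {left_tail:.2e}")
```

The right tail equals the target power of 0.80, while the left tail is on the order of one in a million, so ignoring it has no practical effect on the sample size.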

Step 1 — Express $\lambda$ as a function of $n$.

Assuming the $X_i$ are $n$ independent draws from $N(\mu_X, \sigma_X^2)$, the expected value of $SST_x$ is $(n-1)\sigma_X^2 \approx n\sigma_X^2$, so:

\[\lambda = \frac{\beta_1 \sqrt{n \sigma_X^2}}{\sigma} = \frac{\beta_1 \sigma_X \sqrt{n}}{\sigma}\]
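
The approximation $SST_x \approx n\sigma_X^2$ can be checked by simulation. The sketch below averages $SST_x$ over repeated normal samples and compares it with $(n-1)\sigma_X^2$ and $n\sigma_X^2$. The function name `mean_sst_x`, the choice $\mu_X = 0$, and the values $n = 32$, $\sigma_X = 10$ are illustrative assumptions.

```python
import random


def mean_sst_x(n, sigma_x, n_sims=2000, seed=0):
    """Average of SST_x = sum((x_i - x_bar)^2) over simulated normal samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        x = [rng.gauss(0.0, sigma_x) for _ in range(n)]
        x_bar = sum(x) / n
        total += sum((xi - x_bar) ** 2 for xi in x)
    return total / n_sims


n, sigma_x = 32, 10
print(mean_sst_x(n, sigma_x))      # simulated average of SST_x
print((n - 1) * sigma_x ** 2)      # exact expectation: 3100
print(n * sigma_x ** 2)            # approximation used in the formula: 3200
```

The gap between the two reference values is a factor of $(n-1)/n$, which matters little once $n$ is in the dozens.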

Step 2 — Solve for $n$.

Imposing $\lambda = z_{\alpha/2} + z_\phi$:

\[\frac{\beta_1 \sigma_X \sqrt{n}}{\sigma} = z_{\alpha/2} + z_\phi \implies n = \left(\frac{(z_{\alpha/2} + z_\phi)\,\sigma}{\beta_1\,\sigma_X}\right)^2\]

This is the minimum sample size to achieve power $1-\phi$ at level $\alpha$ under the normal approximation; in practice, round up to the next integer.

References

The interactive dashboard in this section also draws on Gelman et al. [2021], in addition to Wooldridge [2020].

[GHV21]

Andrew Gelman, Jennifer Hill, and Aki Vehtari. Regression and other stories. Cambridge University Press, 2021.

[Woo20]

Jeffrey M. Wooldridge. Introductory Econometrics: A Modern Approach. 7th ed. Cengage Learning, 2020.
