Statistical Properties of Estimated Coefficients#

The goal of this section is to explore the statistical properties of a simple regression model. In particular, we will focus on a situation where the relationship between a dependent variable \(Y\) and an explanatory variable \(X\) is assumed to be given by:

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]

where \(\beta_0\) and \(\beta_1\) are the population parameters of interest, and \(\varepsilon_i\) is a random error term. We will consider a sample of \(n\) observations, indexed by \(i = 1, \ldots, n\).

We estimate the model on the sample data by Ordinary Least Squares (OLS), obtaining the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\). The statistical question we are concerned with is:

What can we say about the precision of the estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\)? How close (or far) do we expect them to be from the true values \(\beta_0\) and \(\beta_1\)?


1. Objectives#

By the end of this section, the reader will understand three fundamental properties of OLS estimators:

  • Whether the estimators are centered on the true population values (unbiasedness), or whether they systematically overestimate or underestimate on average.

  • What factors determine how spread out the estimates are around the true value (variance of the estimator).

  • How to estimate that spread from the data — an essential tool for statistical inference.


2. Exploring via Simulation#

The most direct way to understand these properties is to put ourselves in the position of someone who knows the truth: we assume we know the true population model and observe what happens when we estimate it on many different samples. This is not the situation you face in practice, where you typically have a single sample from which to estimate the model's parameters. The simulation is nevertheless useful because it reveals the properties of the estimators: the value they take on average, how much they vary, and even the shape of their distribution.

Suppose the true relationship is \(Y = 10 + 3X + \varepsilon\), with \(\beta_0 = 10\) and \(\beta_1 = 3\) known to us. We generate 500 independent samples of 250 observations each, run OLS on every sample, and record the 500 pairs \((\hat{\beta}_0, \hat{\beta}_1)\). We then study the distribution of those estimates: where do they concentrate? How much do they vary?

The simulation below reproduces this experiment and allows you to change the population parameters. As you explore it, pay attention to:

  • Where is the histogram of \(\hat{\beta}_1\) centered? Does it fall near the true value \(\beta_1 = 3\)?

  • What happens to the spread of the distributions when you increase the error standard deviation \(\sigma_\varepsilon\)?

  • What happens when you increase the standard deviation of \(X\) (\(\sigma_X\))? Does more variation in \(X\) make estimates more or less precise?

  • Do \(\hat{\beta}_0\) and \(\hat{\beta}_1\) behave similarly, or does one vary more than the other?
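For readers who prefer to run the experiment themselves, the following is a minimal sketch of the same Monte Carlo design (500 samples of 250 observations, \(\beta_0 = 10\), \(\beta_1 = 3\)). The text does not specify how \(X\) and \(\varepsilon\) are drawn, so the normal distributions, the mean of \(X\), the default values of `sigma_eps` and `sigma_x`, and the seed are all assumptions made for illustration. The helper names `ols` and `simulate` are hypothetical, and only the Python standard library is used.

```python
import random
import statistics


def ols(x, y):
    """Return (b0_hat, b1_hat) from a simple OLS regression of y on x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sst_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sst_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


def simulate(beta0=10.0, beta1=3.0, sigma_eps=5.0, mu_x=5.0, sigma_x=2.0,
             n_obs=250, n_samples=500, seed=0):
    """Draw n_samples independent samples and return the OLS estimates."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    b0_hats, b1_hats = [], []
    for _ in range(n_samples):
        # Assumed data-generating process: normal X and normal errors.
        x = [rng.gauss(mu_x, sigma_x) for _ in range(n_obs)]
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma_eps) for xi in x]
        b0, b1 = ols(x, y)
        b0_hats.append(b0)
        b1_hats.append(b1)
    return b0_hats, b1_hats


# Baseline run, then one run with noisier errors and one with more spread in X.
b0_hats, b1_hats = simulate()
_, b1_noisy = simulate(sigma_eps=10.0)
_, b1_wide_x = simulate(sigma_x=4.0)

print("mean of b0_hat:", statistics.fmean(b0_hats))
print("mean of b1_hat:", statistics.fmean(b1_hats))
print("sd of b1_hat (baseline):    ", statistics.stdev(b1_hats))
print("sd of b1_hat (sigma_eps x2):", statistics.stdev(b1_noisy))
print("sd of b1_hat (sigma_x x2):  ", statistics.stdev(b1_wide_x))
```

Comparing the three standard deviations gives a rough numerical answer to the questions in the list above; plotting histograms of `b0_hats` and `b1_hats` would show the same thing visually.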


3. What Do We Observe?#

Unbiasedness. Across all parameter combinations, the histograms of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are centered on the true values (10 and 3, respectively). The statistics table confirms that the mean of the 500 estimates is virtually identical to the population value, regardless of the noise level or the spread of \(X\). Individual estimates stray from the true value, but not systematically in any direction.

Variance and model noise. Increasing \(\sigma_\varepsilon\) noticeably widens both distributions. A noisier model produces more dispersed estimates: they remain unbiased on average, but any single estimate can be far from the true value.

Variance and the spread of \(X\). The effect of \(\sigma_X\) works in the opposite direction: greater dispersion in \(X\) narrows the distribution of \(\hat{\beta}_1\). Intuitively, a wider range of \(X\) values provides more information to pin down the slope, reducing uncertainty about \(\beta_1\).

These patterns are not a coincidence or an artifact of this particular simulation. Statistical theory predicts them exactly, as we now see.


4. Formal Results#

Unbiasedness#

The simulation showed that the distributions of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are centered on the true values. The formal result confirms this is no coincidence: under the assumption \(E[\varepsilon_i \mid X] = 0\), OLS estimators are exactly unbiased.

Formal result:

\[E(\hat{\beta}_0) = \beta_0, \qquad E(\hat{\beta}_1) = \beta_1\]

These equations say that, in expectation, the estimates equal the true values. In other words: the bias is exactly zero. Over repeated samples, OLS neither overestimates nor underestimates the true parameter. The formal proof is in the appendix.

Variance of the Estimators#

The simulation also showed that larger \(\sigma_\varepsilon\) widens the distributions while larger \(\sigma_X\) narrows them. The variance formula captures this relationship exactly:

Formal result:

\[\text{Var}(\hat{\beta}_1 \mid X) = \frac{\sigma^2}{SST_x}, \qquad SST_x = \sum_{i=1}^{n}(x_i - \bar{x})^2\]

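where \(\sigma^2 = \text{Var}(\varepsilon_i \mid X)\) is the error variance, assumed to be the same for every observation (homoskedasticity), with errors that are independent across observations. A larger error variance (\(\sigma^2\)) increases the variance of the estimator; greater variability in \(X\) (\(SST_x\)) reduces it. Because \(SST_x = (n-1)\,s_x^2\), where \(s_x^2\) is the sample variance of \(X\), both a more dispersed \(X\) and a larger sample raise \(SST_x\). The formal derivation is in the appendix.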
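The formula can be checked numerically. Because it is conditional on \(X\), the sketch below draws \(X\) once, holds it fixed, and redraws only the errors in each replication. The sampling variance of \(\hat{\beta}_1\) across replications should then be close to \(\sigma^2 / SST_x\). The distributions, the value `sigma = 5`, the number of replications, and the seed are illustrative assumptions, not values taken from the text.

```python
import random
import statistics

rng = random.Random(1)
beta0, beta1, sigma = 10.0, 3.0, 5.0   # sigma is an assumed error sd
n_obs, n_samples = 250, 2000

# Draw X once and keep it fixed: the variance formula is conditional on X.
x = [rng.gauss(5.0, 2.0) for _ in range(n_obs)]
x_bar = statistics.fmean(x)
sst_x = sum((xi - x_bar) ** 2 for xi in x)

b1_hats = []
for _ in range(n_samples):
    # Only the errors change from one replication to the next.
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    b1_hats.append(sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sst_x)

theoretical_var = sigma ** 2 / sst_x
empirical_var = statistics.variance(b1_hats)

print("sigma^2 / SST_x:             ", theoretical_var)
print("variance across replications:", empirical_var)
```

The two numbers will not match exactly, since the empirical variance is itself estimated from a finite number of replications, but they should agree to within a few percent.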

In practice, \(\sigma^2\) is unknown and must be estimated before we can apply this formula. This is done in the next section.


5. Estimating the Variance of \(\hat{\beta}_1\)#

The formula \(\text{Var}(\hat{\beta}_1 \mid X) = \sigma^2 / SST_x\) is exact but not directly usable, because \(\sigma^2 = \text{Var}(\varepsilon_i \mid X)\) is an unknown population parameter. To conduct inference we must estimate it from the data.

Estimating \(\sigma^2\): Once the model is estimated, we observe the OLS residuals \(\hat{\varepsilon}_i = Y_i - \hat{Y}_i\). A natural estimator of the error variance is the average squared residual, corrected for the number of estimated parameters:

\[\hat{\sigma}^2 = \frac{\sum_{i=1}^n \hat{\varepsilon}_i^2}{n-2} = \frac{RSS}{n-2}\]

The divisor is \(n-2\) rather than \(n\) because OLS estimates two parameters — \(\hat{\beta}_0\) and \(\hat{\beta}_1\) — which impose two linear constraints on the residuals (they must sum to zero and be orthogonal to \(X\)). Dividing by \(n-2\) corrects for this and makes \(\hat{\sigma}^2\) an unbiased estimator of \(\sigma^2\): \(E[\hat{\sigma}^2] = \sigma^2\). The formal proof is in the appendix of the next section.
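The two constraints on the residuals are easy to verify on a simulated sample. The sketch below fits the model once, checks that the residuals sum to zero and are orthogonal to \(X\) (up to floating-point error), and then computes \(\hat{\sigma}^2 = RSS/(n-2)\). The data-generating values (normal \(X\), error standard deviation of 5, seed) are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(2)
n = 250
x = [rng.gauss(5.0, 2.0) for _ in range(n)]
y = [10.0 + 3.0 * xi + rng.gauss(0.0, 5.0) for xi in x]  # true sigma^2 = 25

# OLS estimates for the simple regression.
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
sst_x = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sst_x
b0 = y_bar - b1 * x_bar

# Residuals and the two linear constraints they satisfy.
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sum_resid = sum(resid)                                   # ~ 0
sum_x_resid = sum(xi * ei for xi, ei in zip(x, resid))   # ~ 0

# Degrees-of-freedom-corrected estimator of the error variance.
rss = sum(ei ** 2 for ei in resid)
sigma2_hat = rss / (n - 2)

print("sum of residuals:      ", sum_resid)
print("sum of x_i * residual: ", sum_x_resid)
print("sigma^2 hat:           ", sigma2_hat)
```

Because two linear combinations of the residuals are pinned to zero, only \(n-2\) of them are free to vary, which is the intuition behind the divisor.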

The estimated variance and standard error: Replacing \(\sigma^2\) with \(\hat{\sigma}^2\) gives the estimated variance of \(\hat{\beta}_1\):

\[\widehat{\text{Var}}(\hat{\beta}_1) = \frac{\hat{\sigma}^2}{SST_x}, \qquad SE(\hat{\beta}_1) = \sqrt{\widehat{\text{Var}}(\hat{\beta}_1)} = \frac{\hat{\sigma}}{\sqrt{SST_x}}\]

The quantity \(SE(\hat{\beta}_1)\) — the standard error of the estimator — plays a central role in inference. Despite the word “error,” it does not refer to the model residuals; it measures the estimated standard deviation of \(\hat{\beta}_1\) across repeated samples. With the standard error in hand, we can assess not just where our estimate falls, but how far it is from any hypothesized value relative to its own precision — which is exactly the logic of hypothesis testing, developed in the next section.
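To see that the standard error really does approximate the spread of \(\hat{\beta}_1\) across repeated samples, one can compute \(SE(\hat{\beta}_1)\) in every simulated sample and compare its typical value with the standard deviation of the estimates themselves. This is a sketch under assumed values (normal \(X\) and errors, error standard deviation of 5, 1000 replications); the function name `slope_and_se` is hypothetical.

```python
import math
import random
import statistics


def slope_and_se(x, y):
    """Return (b1_hat, SE(b1_hat)) for a simple OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sst_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sst_x
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)          # estimated error variance
    return b1, math.sqrt(sigma2_hat / sst_x)


rng = random.Random(3)
b1_hats, ses = [], []
for _ in range(1000):
    x = [rng.gauss(5.0, 2.0) for _ in range(250)]
    y = [10.0 + 3.0 * xi + rng.gauss(0.0, 5.0) for xi in x]
    b1, se = slope_and_se(x, y)
    b1_hats.append(b1)
    ses.append(se)

sd_across_samples = statistics.stdev(b1_hats)  # only observable in a simulation
mean_se = statistics.fmean(ses)                # each SE uses a single sample

print("sd of b1_hat across samples:", sd_across_samples)
print("average standard error:     ", mean_se)
```

In practice only one of the 1000 standard errors would be available, but the comparison shows that a single-sample standard error is a reasonable stand-in for the repeated-sampling spread that cannot be observed directly.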


Appendix: Formal Proofs#

A.1 Unbiasedness of \(\hat{\beta}_1\)#

Step 1 — Rewrite \(\hat{\beta}_1\) in terms of the population error.

We start from the OLS estimator:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})\,Y_i}{SST_x}\]

Substituting \(Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\):

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(\beta_0 + \beta_1 x_i + \varepsilon_i)}{SST_x}\]

Using \(\sum_{i=1}^n (x_i - \bar{x}) = 0\) and \(\sum_{i=1}^n (x_i - \bar{x})\,x_i = SST_x\), the term in \(\beta_0\) vanishes and the term in \(\beta_1\) simplifies:

\[\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^n (x_i - \bar{x})\,\varepsilon_i}{SST_x}\]
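The decomposition in Step 1 is an algebraic identity, so it holds in every sample, not only on average. In a simulation the errors \(\varepsilon_i\) are known, which makes it possible to check this directly. The sketch below uses assumed values for the distributions and seed, and the variable names are illustrative.

```python
import random
import statistics

rng = random.Random(4)
beta0, beta1 = 10.0, 3.0
x = [rng.gauss(5.0, 2.0) for _ in range(250)]
eps = [rng.gauss(0.0, 5.0) for _ in x]          # known only in a simulation
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, eps)]

x_bar = statistics.fmean(x)
sst_x = sum((xi - x_bar) ** 2 for xi in x)

# The two facts used in the derivation.
sum_dev = sum(xi - x_bar for xi in x)             # equals 0
cross = sum((xi - x_bar) * xi for xi in x)        # equals SST_x

# OLS slope computed from the data, and the same slope written as
# the true slope plus a weighted sum of the errors.
b1_hat = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sst_x
b1_decomposed = beta1 + sum((xi - x_bar) * ei for xi, ei in zip(x, eps)) / sst_x

print("b1_hat:              ", b1_hat)
print("beta1 + error term:  ", b1_decomposed)
```

The two printed values agree up to floating-point rounding, which confirms that all the randomness in \(\hat{\beta}_1\) comes from the weighted sum of errors.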

Step 2 — Take the expectation conditional on \(X\).

\[E[\hat{\beta}_1 \mid X] = \beta_1 + \frac{1}{SST_x}\sum_{i=1}^n (x_i - \bar{x})\underbrace{E[\varepsilon_i \mid X]}_{=\;0} = \beta_1\]

The last equality follows from the assumption \(E[\varepsilon_i \mid X] = 0\). Since the conditional expectation equals \(\beta_1\) for every realization of \(X\), the law of iterated expectations gives \(E[\hat{\beta}_1] = E\big[E[\hat{\beta}_1 \mid X]\big] = \beta_1\).

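For \(\hat{\beta}_0\), the result follows analogously. Averaging the model over the sample gives \(\bar{Y} = \beta_0 + \beta_1\bar{x} + \bar{\varepsilon}\), so \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{x} = \beta_0 + (\beta_1 - \hat{\beta}_1)\,\bar{x} + \bar{\varepsilon}\). Taking expectations conditional on \(X\) and using \(E[\hat{\beta}_1 \mid X] = \beta_1\) and \(E[\varepsilon_i \mid X] = 0\) gives \(E[\hat{\beta}_0 \mid X] = \beta_0\), and hence \(E[\hat{\beta}_0] = \beta_0\).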

A.2 Variance of \(\hat{\beta}_1\)#

From Step 1 above we have:

\[\hat{\beta}_1 - \beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})\,\varepsilon_i}{SST_x}\]

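We take the variance conditional on \(X\), so \(SST_x\) and the weights \((x_i - \bar{x})\) are treated as constants. Under the homoskedasticity assumption, \(\text{Var}(\varepsilon_i \mid X) = \sigma^2\) for all \(i\). Because the errors are mutually independent, all covariance terms vanish and the variance of the sum is the sum of the variances: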

\[\text{Var}(\hat{\beta}_1 \mid X) = \frac{1}{SST_x^2}\,\sum_{i=1}^n (x_i - \bar{x})^2\,\sigma^2 = \frac{\sigma^2 \cdot SST_x}{SST_x^2} = \frac{\sigma^2}{SST_x}\]

The expression confirms that estimator precision improves (lower variance) when the error variance \(\sigma^2\) is small or when the \(X\) data are more spread out (large \(SST_x\)).