Statistical Properties of Estimated Coefficients

The goal of this section is to explore the statistical properties of the coefficient estimates in a simple regression model. In particular, we will focus on a situation where the relationship between a dependent variable \(Y\) and an explanatory variable \(X\) is assumed to be given by:

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]

where \(\beta_0\) and \(\beta_1\) are the population parameters of interest, and \(\varepsilon_i\) is a random error term. We will consider a sample of \(n\) observations, indexed by \(i = 1, \ldots, n\), drawn from the target population.

We will also assume that, using the sample data, we estimate the model via the Ordinary Least Squares (OLS) method, obtaining estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\). The statistical question we are concerned with is:

What can we say about the precision of the estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\)? How close (or far) do we expect them to be from the true values \(\beta_0\) and \(\beta_1\)?

Fortunately, statistical theory provides two important results that help answer this question: unbiasedness and the determinants of variance. In this section, we will first introduce these concepts theoretically, and then use an interactive simulation to demonstrate them numerically.

The two concepts are as follows:

1. Unbiasedness: This property states that, on average, the estimated coefficients equal the true population coefficients. It holds under the standard assumption that the error term has mean zero conditional on \(X\). The intuition is that although any individual estimate may differ from the true value because of the randomness of sampling, if we could repeat the estimation over many samples, the average of those estimates would converge to the true value. This property ensures that our estimates are not systematically too high or too low. It does not, by itself, tell us how far any single estimate may fall from the true value.

2. Variance: Even if our estimates are unbiased, they will vary from one sample to another. The variance quantifies this uncertainty: it measures how widely the estimated coefficients are spread around their average across repeated samples, which for an unbiased estimator is the true population coefficient. Higher variance indicates less precise estimation, meaning the true value may be farther from the estimated one. In contrast, lower variance suggests a more reliable estimate, where the true value is likely closer to the estimated one. The variance of our estimates is influenced by two key factors:

Model error: Greater unexplained variation in the dependent variable (\(Y\)) increases the variance of our estimates. This error can be due to omitted variables or the inherent randomness in the data.
Variability of the independent variable: Greater spread in the values of our independent variable (\(X\)) leads to lower variance in the coefficient estimates. This is because a wider range of \(X\) values provides more information for estimating the relationship with \(Y\).
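
For reference, both results can be written compactly for the slope. The expressions below are a sketch under textbook assumptions that are not stated above: the errors have mean zero given \(X\), are uncorrelated across observations, and share a common variance, here denoted \(\sigma^2\) (a symbol introduced only for this illustration). Conditional on the observed values of \(X\), with \(\bar{X}\) the sample mean of \(X\):

\[ E[\hat{\beta}_1 \mid X] = \beta_1, \qquad \operatorname{Var}(\hat{\beta}_1 \mid X) = \frac{\sigma^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \]

The numerator corresponds to the model error and the denominator to the variability of \(X\). Under these assumptions, doubling the standard deviation of the errors doubles the standard deviation of \(\hat{\beta}_1\), while doubling the spread of \(X\) halves it.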

To illustrate these results numerically, the simulation presented below allows us to explore the following question: If we draw many samples of data from a known population model, how close are the estimates (\(\hat{\beta}_0\) and \(\hat{\beta}_1\)) to the true values (\(\beta_0\) and \(\beta_1\))? The simulation lets you experiment with different assumptions about the population model and visualize the results of the repeated estimations through graphs.
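
The same experiment can also be sketched without the interactive interface. The code below is a minimal illustration, not the code behind the simulation on this page. It assumes NumPy is available, and the function name `simulate_ols`, the parameter values, and the choice of normally distributed \(X\) and errors are all hypothetical choices made for this example.

```python
import numpy as np


def simulate_ols(beta0=1.0, beta1=2.0, sigma=1.0, x_sd=1.0,
                 n=50, n_samples=2000, seed=0):
    """Draw repeated samples from Y = beta0 + beta1*X + eps and fit OLS to each.

    All default values are illustrative assumptions, not taken from the text.
    Returns an array of shape (n_samples, 2) with columns (b0_hat, b1_hat).
    """
    rng = np.random.default_rng(seed)
    estimates = np.empty((n_samples, 2))
    for s in range(n_samples):
        # A fresh sample of size n from the known population model.
        x = rng.normal(0.0, x_sd, size=n)
        eps = rng.normal(0.0, sigma, size=n)
        y = beta0 + beta1 * x + eps

        # Closed-form OLS estimates for simple regression.
        x_dev = x - x.mean()
        b1 = (x_dev @ (y - y.mean())) / (x_dev @ x_dev)
        b0 = y.mean() - b1 * x.mean()
        estimates[s] = (b0, b1)
    return estimates


baseline = simulate_ols()
noisy = simulate_ols(sigma=3.0)   # larger model error
wide_x = simulate_ols(x_sd=3.0)   # more spread in X

for label, est in [("baseline", baseline),
                   ("larger error sd", noisy),
                   ("wider X", wide_x)]:
    slopes = est[:, 1]
    print(f"{label:16s} mean(b1) = {slopes.mean():.3f}  "
          f"sd(b1) = {slopes.std(ddof=1):.3f}")
```

In each scenario the average slope estimate should sit close to the true value used to generate the data (unbiasedness), while the spread of the estimates should grow with the error standard deviation and shrink as \(X\) becomes more dispersed.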