Selection Bias and Randomized Controlled Experiments

A digital mobility platform (such as Uber) runs a benefits program for drivers who complete trips in relatively low-supply zones or conditions (for example, airport rides or rainy days). The benefits include a series of discounts on vehicle maintenance and tire replacement. After implementing this program, the company analyzes the drivers who participated and finds they earn, on average, 40 dollars more per week than those who did not. The immediate conclusion: the program succeeded in stimulating trips. But a skeptical analyst raises her hand: drivers who chose to participate were probably already more active and motivated before the program began. Could that pre-existing difference explain the entire gap?

This question (when does a difference between groups reflect a genuine causal effect, and when does it mislead us?) is the central problem of causal inference. We begin this section by analyzing a simulation that shows the risks of using a naive comparison as an estimator of a program’s effect. We then introduce the potential outcomes framework, a theoretical architecture that clarifies under what assumptions a comparison identifies a causal effect. Later in this section we introduce the first method that achieves identification: random assignment (or experimentation). We close the section by discussing elements of experimental design: how many drivers would a platform like Uber need in its A/B test to detect an effect with sufficient confidence?


1. The Naive Estimator

We have \(n\) drivers, some who chose to participate in the program and others who did not. Following the literature, we use \(T_i\) (for Treatment) to indicate participation: \(T_i = 1\) for drivers who participated and \(T_i = 0\) for those who did not. The most direct comparison is the difference in average earnings between the two groups:

\[\hat{\Delta}_{naive} = \bar{Y}_{T=1} - \bar{Y}_{T=0}\]

We call this the naive estimator — naive because it takes the observed difference at face value, without asking whether the groups were comparable before treatment. What does this difference actually capture? Before the formalization, we will explore the problem with simulated data.

Exploration: The Risks of a Naive Comparison and Selection Bias

We will explore the situation with a simulation. We make the following assumptions: each driver has a latent motivation intensity. Higher work motivation implies more time behind the wheel and therefore higher earnings. We can think of this intensity as normally distributed across drivers. Additionally, more motivated drivers are also more likely to drive to low-supply areas, so we expect them to participate in the program with higher probability. Motivation therefore drives both earnings and participation. We call the strength of this link the selection intensity, denoted \(\gamma\), and the simulation lets us adjust it.

Finally, the simulation also lets us assume a causal effect of participating in the program on a driver’s earnings — precisely the true effect the company would want to know.

Before using the simulation, form a prediction: if the true program effect is zero but motivated drivers earn the bonus more frequently and have higher baseline earnings (\(\gamma > 0\)), what should the naive comparison show? Does that answer change if you increase the sample size to 2000?

Specific things to explore:

  • Set the program effect to 0 and selection intensity to 0. Where does the naive estimator land? Does it match the true effect?

  • Raise selection intensity to 2 or 3, keeping the program effect at 0. What happens to the naive estimator?

  • Increase \(n\) to 2000 — does the bias disappear with more data?

What do we observe?

The exploration reveals something fundamental: the naive comparison does not, in general, recover the program effect. It returns a value that is biased relative to the true value, and the magnitude of that bias is related to the selection intensity.

Additionally, we can see that the bias does not decrease with sample size. For example, with selection intensity \(\gamma = 3\) and a zero program effect, the naive estimator points systematically to around 72 dollars per week, both with 100 drivers and with 2000. More data produce a more precise estimate of the wrong number.

The decomposition bar anticipates the algebra ahead: the naive estimator conflates two things. One is the program effect (what we want to measure). The other is the pre-existing difference in baseline earnings between those who earned the bonus and those who did not. When there is no selection there is no pre-existing difference and the naive comparison works; when there is positive selection the groups are not comparable and the comparison is misleading.

Finally, note that this bias can be thought of as an omitted variable bias of the kind we studied in the multiple regression section. Here the naive comparison omits motivation intensity, and since this is correlated with program participation, it produces bias.
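
The interactive exploration can be approximated offline with a short script. The sketch below is a minimal stand-in, not the simulation behind the widget: it assumes motivation is standard normal, that baseline earnings rise by 15·γ dollars per unit of motivation, that drivers participate exactly when their motivation is above average, and that earnings noise has a standard deviation of 20 dollars. These numbers and the function name `simulate_naive_gap` are illustrative assumptions.

```python
import random
from statistics import fmean


def simulate_naive_gap(n, gamma, delta, seed=0):
    """Return the naive difference in mean earnings for one simulated sample.

    Assumed (hypothetical) data-generating process:
      motivation        m  ~ N(0, 1)
      baseline earnings Y0 = 300 + 15 * gamma * m + noise, noise ~ N(0, 20)
      participation     T  = 1 if m > 0 (self-selection on motivation)
      observed earnings Y  = Y0 + delta * T
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        m = rng.gauss(0, 1)
        y0 = 300 + 15 * gamma * m + rng.gauss(0, 20)
        if m > 0:
            treated.append(y0 + delta)  # participant: we observe Y(1)
        else:
            control.append(y0)          # non-participant: we observe Y(0)
    return fmean(treated) - fmean(control)


# True program effect is zero, yet the naive gap does not shrink with n.
for n in (100, 2_000, 200_000):
    gap = simulate_naive_gap(n, gamma=3, delta=0)
    print(f"n = {n:>7}: naive gap = {gap:6.1f}")
```

Under these assumed parameters the naive gap settles near 72 dollars for γ = 3 at every sample size; only the run-to-run variability shrinks as n grows. Setting `gamma=0` removes the link between motivation and earnings, and the gap then fluctuates around the true effect.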


2. The Potential Outcomes Framework

We now introduce a conceptual framework for analyzing causal questions. More precisely, the framework helps us understand what estimators such as the naive one actually recover in causal terms. We use the Potential Outcomes Framework (Neyman, Rubin).

For each driver \(i\), we define two potential outcomes:

  • \(Y_i(1)\): driver \(i\)’s weekly earnings if they participate in the program

  • \(Y_i(0)\): driver \(i\)’s weekly earnings if they do not participate

We define the individual causal effect of the bonus for driver \(i\) as the difference between these two worlds:

\[\tau_i = Y_i(1) - Y_i(0)\]

Note that we never observe both outcomes for the same person at the same time. This is the fundamental problem of causal inference. If driver \(i\) receives the bonus, we observe \(Y_i(1)\) and \(Y_i(0)\) is counterfactual — the outcome they would have had in the parallel world where they did not receive it. Causality requires comparing two states of the world for the same unit, but only one is ever realized!

We can also rewrite this idea as

\[\begin{split}Y = \begin{cases} Y(1) & \text{if } T = 1 \\ Y(0) & \text{if } T = 0 \end{cases}\end{split}\]

3. Estimands of Interest

Although we generally cannot observe individual causal effects, the methods we will study seek to approximate certain quantities of interest. Let us define:

Average Treatment Effect (ATE): the average effect across the entire population,

\[ATE = E[\tau_i] = E[Y_i(1) - Y_i(0)]\]

Average Treatment Effect on the Treated (ATT): the average effect among those who actually received treatment,

\[ATT = E[\tau_i \mid T_i = 1] = E[Y_i(1) - Y_i(0) \mid T_i = 1]\]

The distinction matters when effects vary across individuals. The bonus might work better for drivers with more time flexibility. In that case the ATT (the effect for those who earned the bonus) can differ from the ATE (the average effect if we gave it to everyone). When effects are homogeneous (\(\tau_i = \delta\) for all \(i\)), \(ATE = ATT = \delta\).
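
A simulation is the one setting where both potential outcomes are visible, so the two estimands can be computed directly. The sketch below assumes a hypothetical `flexibility` trait, uniform on [0, 1], that both raises a driver's individual effect and makes participation more likely. None of these modeling choices come from the text; they only illustrate how ATT can exceed ATE when effects are heterogeneous.

```python
import random
from statistics import fmean

rng = random.Random(1)

drivers = []
for _ in range(50_000):
    flexibility = rng.random()                   # assumed trait, uniform on [0, 1]
    y0 = 300 + rng.gauss(0, 40)                  # earnings without the program
    y1 = y0 + 40 * flexibility                   # individual effect grows with flexibility
    t = 1 if rng.random() < flexibility else 0   # flexible drivers join more often
    drivers.append((y0, y1, t))

# Both potential outcomes are visible only because this is a simulation.
ate = fmean(y1 - y0 for y0, y1, _ in drivers)
att = fmean(y1 - y0 for y0, y1, t in drivers if t == 1)

# In real data we would see only the outcome selected by T.
observed = [t * y1 + (1 - t) * y0 for y0, y1, t in drivers]

print(f"ATE = {ate:.1f}, ATT = {att:.1f}")
```

With this assumed process the ATE is close to 20 dollars while the ATT is close to 27, because the drivers who join are disproportionately those with large individual effects.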

4. Formal Decomposition of the Naive Estimator

The potential outcomes framework lets us see the naive estimator’s problem precisely.

It helps to first express what the estimator approximates. Since the estimator is defined as

\[\hat{\Delta}_{naive} = \bar{Y}_{T=1} - \bar{Y}_{T=0},\]

in population terms the comparison proposes:

\[E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0]\]

Applying the potential outcomes framework, when a driver participates we observe \(Y_i(1)\), and when they do not we observe \(Y_i(0)\), so

\[ E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0]=E[Y_i(1) \mid T_i = 1] - E[Y_i(0) \mid T_i = 0]\]

Adding and subtracting \(E[Y_i(0) \mid T_i = 1]\):

\[E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0] = \underbrace{E[Y_i(1) - Y_i(0) \mid T_i = 1]}_{\text{ATT}} + \underbrace{E[Y_i(0) \mid T_i = 1] - E[Y_i(0) \mid T_i = 0]}_{\text{Selection bias}}\]

This equation is the central result of this section. Each term has a precise interpretation.

The ATT is exactly what we want: the causal effect of the bonus averaged over the drivers who received it — how much more they earn because of the bonus, compared to what they would have earned without it.

Selection bias is the difference in baseline earnings between the two groups: how much more the drivers who earned the bonus would have made on average had they not received it, compared to those who did not earn it. Under voluntary participation, motivated drivers seek out low-supply zones and have higher baseline earnings, so \(E[Y_i(0) \mid T_i = 1] > E[Y_i(0) \mid T_i = 0]\) and the bias is positive. When \(\gamma = 0\) in the simulation there is no link between motivation and baseline earnings, selection bias is zero, and the naive comparison works.

One implication deserves emphasis: selection bias involves the potential outcome \(Y_i(0)\) of the treated — a quantity we never directly observe. The bias is not an artifact of the data; it is a property of the process by which units end up in treatment.
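
The decomposition can be checked numerically in a simulated sample, where the counterfactual \(Y_i(0)\) of the treated is available. The sketch below uses an assumed data-generating process (threshold self-selection on motivation, and an effect that varies with motivation); the coefficients are illustrative only. The identity holds exactly in any sample, not just on average.

```python
import random
from statistics import fmean

rng = random.Random(2)

rows = []
for _ in range(5_000):
    m = rng.gauss(0, 1)                    # latent motivation
    y0 = 300 + 45 * m + rng.gauss(0, 20)   # baseline earnings (assumed model)
    y1 = y0 + 10 + 5 * m                   # heterogeneous effect (assumed model)
    t = 1 if m > 0 else 0                  # self-selection on motivation
    rows.append((y0, y1, t))

treated = [(y0, y1) for y0, y1, t in rows if t == 1]
control = [(y0, y1) for y0, y1, t in rows if t == 0]

# What an analyst can compute from observed data.
naive = fmean(y1 for _, y1 in treated) - fmean(y0 for y0, _ in control)

# What only the simulation reveals.
att = fmean(y1 - y0 for y0, y1 in treated)
selection_bias = fmean(y0 for y0, _ in treated) - fmean(y0 for y0, _ in control)

print(f"naive = {naive:.2f}")
print(f"ATT + selection bias = {att:.2f} + {selection_bias:.2f} = {att + selection_bias:.2f}")
```

Most of the naive gap in this example is selection bias rather than treatment effect, which is the pattern the decomposition warns about.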


5. A First Solution: Random Assignment

What do we need for selection bias to disappear? According to the model, we need the two comparison groups (those who would participate and those who would not) to have equal average outcomes in the absence of the program.

In this example, under voluntary participation, this rarely holds: those who choose to participate tend to differ from those who do not, precisely in the characteristics that also affect the outcome. Random assignment guarantees comparability by design. What does random assignment mean? That program participation is determined by a lottery.

In our example, we could imagine the company randomly selecting which drivers are offered the opportunity to complete certain trips as part of the benefits program. That design, however, requires some additional technical details that we will revisit when studying instrumental variables. For now it is convenient to change the example slightly:

Suppose the mobility company (Uber) considers an alternative program: it will grant maintenance discount benefits — on vehicle upkeep and tire replacement — to all drivers unconditionally (without requiring any particular type of trip), since it believes that helping drivers keep their vehicles in good condition will translate into more completed rides. Before launching the program, and to understand the expected effect of maintenance on driver earnings, the company might conduct a survey to gather information on vehicle condition. It could then exploit the resulting data by comparing how many trips are completed by better-maintained vehicles versus poorly-maintained ones. However — you have probably already guessed — this would also be a naive comparison, since more motivated drivers likely also maintain their vehicles better.

Now we can present the solution: what would happen if the company assigned the benefits randomly?

Exploration: Observational vs. Randomized

The following simulation uses a data-generating process similar to the previous one — same drivers, same motivation. The only thing that changes is the assignment rule: in Observational mode, drivers voluntarily choose to maintain their vehicle well if their motivation exceeds the threshold; in Randomized mode, the company randomly assigns the maintenance discounts to 50% of drivers. To simplify the model, we assume that all drivers who receive the benefits perform good vehicle maintenance.

The main graph does not show a single experiment: it shows the distribution of the estimator \(\hat{\Delta}_{naive}\) across 500 simulated experiments. This lets us see the systematic behavior of the estimator — if it is biased, we will see a persistent shift away from the true \(\delta\).

Before exploring, make a prediction: if we randomize the benefits, should the selection intensity \(\gamma\) still matter?

Specific things to explore:

  • Set selection intensity to 3 and a null maintenance effect \(\delta = 0\). In Observational mode, where is the distribution of estimators centered? Now switch to Randomized: what happens to the bias?

  • Raise selection intensity to different values in Randomized mode: does the bias change? Why not, even though selection strongly affects baseline outcomes?

  • With a high effect and few observations \(n = 100\) in Randomized mode, observe the width of the distribution. The estimator is centered on the true value, but uncertainty is large. This motivates the power analysis in the next section.

What do we observe?

The message is clear: randomization eliminates bias, regardless of selection intensity.

Note what randomization does not do: it does not eliminate heterogeneity in motivation or make all drivers identical. Motivated drivers still earn more. What it eliminates is the correlation between motivation and treatment assignment. When assignment is random, the treated group and the control group have, on average, the same distribution of motivation — and therefore the same distribution of baseline earnings. Selection bias disappears because the two groups are valid counterfactuals for each other.

The balance plot makes this mechanism visible. This plot compares motivation between the group that participates in an intervention and the comparison group. Under voluntary participation, drivers in the well-maintained vehicle group have higher motivation because that is precisely what led them to maintain their vehicles in the first place. Under randomization, the motivation distribution is indistinguishable across groups. The balance test — verifying that pre-existing covariates are similar between treated and control — is precisely the empirical check that randomization worked. Finally, a clarification: motivation may not be an observable variable in practice. Under randomization we expect it to be balanced even if we cannot measure it.
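
The contrast between the two assignment rules can be reproduced with a short script that repeats the experiment many times and summarizes the distribution of the estimator. This is a sketch under assumed parameters (baseline earnings of 300 + 15·γ·motivation plus noise, a motivation threshold of zero in the observational rule, a fair coin in the randomized rule), not the code behind the interactive graph.

```python
import random
from statistics import fmean, stdev


def one_estimate(n, gamma, delta, randomized, rng):
    """Difference in mean earnings for a single simulated experiment."""
    treated, control = [], []
    for _ in range(n):
        m = rng.gauss(0, 1)                            # latent motivation
        y0 = 300 + 15 * gamma * m + rng.gauss(0, 20)   # assumed baseline model
        if randomized:
            t = rng.random() < 0.5                     # lottery: ignores motivation
        else:
            t = m > 0                                  # self-selection on motivation
        if t:
            treated.append(y0 + delta)
        else:
            control.append(y0)
    return fmean(treated) - fmean(control)


def sampling_distribution(randomized, n=500, gamma=3, delta=0, reps=500, seed=0):
    """Estimates from `reps` independent simulated experiments."""
    rng = random.Random(seed)
    return [one_estimate(n, gamma, delta, randomized, rng) for _ in range(reps)]


for randomized in (False, True):
    draws = sampling_distribution(randomized)
    label = "randomized" if randomized else "observational"
    print(f"{label:13s} mean = {fmean(draws):6.1f}  sd = {stdev(draws):5.1f}")
```

With a true effect of zero, the observational estimates cluster far from zero while the randomized estimates are centered on it. The randomized estimates are also more dispersed here, because each group now contains the full range of motivation.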


6. Formal Result: Why Randomization Identifies the ATE

The simulation’s intuition has an exact algebraic counterpart.

Independence assumption: A randomized experiment guarantees that assignment is independent of potential outcomes:

\[T_i \perp (Y_i(0),\, Y_i(1))\]

Under this condition, the distribution of \(Y_i(0)\) is the same for treated and control — knowing \(T_i\) carries no information about a driver’s baseline outcome:

\[E[Y_i(0) \mid T_i = 1] = E[Y_i(0) \mid T_i = 0] = E[Y_i(0)]\]

Note that this implies the selection bias term in the decomposition above is zero, so the naive comparison recovers a causal effect, the ATT:

\[E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0] = \underbrace{E[Y_i(1) - Y_i(0) \mid T_i = 1]}_{\text{ATT}}\]

An additional consequence: under randomization, \(ATE = ATT\). Because assignment is independent of potential outcomes, the distribution of \(\tau_i\) among the treated equals the full population distribution. The experiment therefore tells us both the effect for those who received the bonus and the effect we would expect if we gave it to the entire population.
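
The equality between ATT and ATE can be written out in one line, using only linearity of expectations and the independence condition \(T_i \perp (Y_i(0),\, Y_i(1))\):

\[ATT = E[Y_i(1) - Y_i(0) \mid T_i = 1] = E[Y_i(1) \mid T_i = 1] - E[Y_i(0) \mid T_i = 1] = E[Y_i(1)] - E[Y_i(0)] = ATE\]

The third equality is where independence enters: conditioning on \(T_i = 1\) does not change the expectation of either potential outcome.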


7. Practical Design of a Randomized Controlled Experiment

Showing that randomization works in theory is the beginning, not the end. In practice, four design elements determine whether a real experiment identifies the ATE precisely.

7.1 SUTVA: The No-Interference Assumption

The potential outcomes framework assumes that each unit’s outcome depends only on its own assignment. Formally, the Stable Unit Treatment Value Assumption (SUTVA) requires:

  1. No interference: \(Y_i(1)\) and \(Y_i(0)\) do not depend on \(T_j\) for \(j \neq i\). The fact that another driver receives the bonus does not affect driver \(i\)’s earnings.

  2. Single version of treatment: all treated drivers receive the same bonus — there are no high- and low-intensity versions mixed together in the experiment.

The first condition is problematic in markets where agents compete. If the bonus leads enrolled drivers to spend more time online in low-supply zones, they may capture rides that would otherwise have gone to non-enrolled drivers, reducing the control group’s earnings because of other drivers’ treatment rather than their own assignment. This displacement effect (a spillover) violates SUTVA and biases the ATE estimator upward.

In two-sided platforms like Uber, Airbnb, or Instacart, spillovers are the rule rather than the exception. Practical solutions include randomizing at the level of geographic market (rather than individual) or using switchback designs that rotate treatment over time. We return to this problem when we study causal inference with panel data.
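
Randomizing at the market level means the lottery is run over markets, and every driver inherits the assignment of their market. The sketch below shows one way to do this; the function name `assign_by_market` and the example mapping are hypothetical.

```python
import random


def assign_by_market(driver_market, share_treated=0.5, seed=0):
    """Randomize treatment across markets rather than across drivers.

    driver_market maps a driver id to the market the driver operates in.
    All drivers in the same market receive the same assignment, so that
    treated and control drivers do not compete for the same rides.
    """
    rng = random.Random(seed)
    markets = sorted(set(driver_market.values()))
    rng.shuffle(markets)
    n_treated = round(len(markets) * share_treated)
    treated_markets = set(markets[:n_treated])
    return {driver: int(market in treated_markets)
            for driver, market in driver_market.items()}


# Hypothetical example: five drivers spread over four markets.
example = {"d1": "north", "d2": "north", "d3": "south", "d4": "east", "d5": "west"}
print(assign_by_market(example))
```

The effective sample size in such a design is driven by the number of markets rather than the number of drivers, so inference should treat the market as the unit of randomization (for example, with standard errors clustered by market).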

7.2 Balance Verification

After randomizing, it is standard practice to verify that the assignment created comparable groups on pre-treatment characteristics. A balance table reports means of pre-existing covariates for the treated and control groups, together with p-values from tests of mean differences. Under randomization, the differences should be small, and statistically significant ones should be no more frequent than chance alone would produce.

If some are significant, this is not necessarily evidence that something went wrong — with enough covariates tested, some false positives are expected by chance. Standard practice is to report the balance table and control for any imbalanced covariates in the regression, which brings us to the next point.
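
A balance table is simple to build by hand. The sketch below compares group means for two hypothetical pre-treatment covariates and attaches a two-sided p-value. It relies on a normal approximation to the unequal-variance (Welch) statistic, which is an assumption suited to large groups; the covariate names and distributions are invented for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance


def balance_row(x_treated, x_control):
    """Difference in means, its standard error, and a two-sided p-value.

    Uses a normal approximation to the Welch statistic, which is an
    assumption that is reasonable only for large groups.
    """
    diff = fmean(x_treated) - fmean(x_control)
    se = sqrt(variance(x_treated) / len(x_treated)
              + variance(x_control) / len(x_control))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return diff, se, p_value


rng = random.Random(3)
n = 2_000
# Hypothetical pre-treatment covariates for each driver.
covariates = {
    "prior_week_earnings": [rng.gauss(300, 60) for _ in range(n)],
    "months_on_platform": [rng.expovariate(1 / 18) for _ in range(n)],
}
assignment = [rng.random() < 0.5 for _ in range(n)]  # coin-flip assignment

for name, values in covariates.items():
    treated = [v for v, t in zip(values, assignment) if t]
    control = [v for v, t in zip(values, assignment) if not t]
    diff, se, p = balance_row(treated, control)
    print(f"{name:22s} diff = {diff:8.2f}  se = {se:6.2f}  p = {p:.3f}")
```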

7.3 Estimating the ATE via Regression

The potential outcomes framework connects directly to the regression tools we already know. When assignment is random, \(T_i\) is independent of everything else — of the drivers’ pre-existing characteristics and, crucially, of their potential outcomes. For this reason, the simplest regression of \(Y_i\) on \(T_i\):

\[Y_i = \alpha + \delta T_i + \varepsilon_i\]

produces an unbiased estimator of the causal effect. The coefficient \(\hat{\delta}\) identifies the ATE — and under randomization, ATE = ATT.

The key is that, under randomization, \(T_i \perp \varepsilon_i\): treatment assignment is orthogonal to everything left in the residual, including drivers’ unobserved characteristics. The exogeneity assumption that required careful argumentation in multiple regression — and that was precisely the one violated by the naive estimator under voluntary participation — is satisfied here by design.

This also means that interpreting \(\hat{\delta}\) as a causal effect does not require the model to be “correctly specified” in the sense of including all determinants of \(Y_i\): including or excluding additional covariates does not change the expected value of \(\hat{\delta}\). It does, however, affect its variance — and that is the point of the next section.
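
With a binary regressor, the OLS slope in the simple regression is numerically identical to the difference in group means. The sketch below verifies this on simulated data from a randomized assignment; the outcome model (a 20-dollar effect with noise of 60 dollars) is an assumption chosen for illustration.

```python
import random
from statistics import fmean


def ols_slope(x, y):
    """Slope from an OLS regression of y on a constant and x."""
    xbar, ybar = fmean(x), fmean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(4)
t = [1 if rng.random() < 0.5 else 0 for _ in range(1_000)]   # random assignment
# Assumed outcome model: true effect of 20 dollars plus noise.
y = [300 + 20 * ti + rng.gauss(0, 60) for ti in t]

delta_hat = ols_slope(t, y)
diff_in_means = (fmean(yi for yi, ti in zip(y, t) if ti == 1)
                 - fmean(yi for yi, ti in zip(y, t) if ti == 0))

print(f"OLS slope = {delta_hat:.4f}, difference in means = {diff_in_means:.4f}")
```

The two numbers agree up to floating-point error, so running the regression adds nothing to the point estimate. What it adds is a convenient route to standard errors and to the covariate adjustment discussed next.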

7.4 Covariate Adjustment for Precision

In an RCT, regressing \(Y_i\) on \(T_i\) produces an unbiased estimator of the ATE even without any covariates. However, adding a pre-existing control \(X_i\):

\[Y_i = \alpha + \delta T_i + \beta X_i + \varepsilon_i\]

does not change the expected value of \(\hat{\delta}\) (which remains the ATE, by the independence between \(T_i\) and \(X_i\)), but reduces its variance. The intuition follows from the Frisch-Waugh-Lovell theorem we saw in multiple regression: adding \(X_i\) absorbs part of the variability in \(Y_i\) that would otherwise remain in the residual, reducing \(\hat{\sigma}^2\) and with it the standard error of \(\hat{\delta}\).

The gain can be substantial. If \(X_i\) explains 30% of the variance of \(Y_i\) — for example, the previous week’s earnings — adding that single covariate is roughly equivalent to increasing the sample size by 43%.[1] In practice, data teams at companies like Uber, Netflix, or Meta almost always use some version of this adjustment — called CUPED (Controlled-experiment Using Pre-Experiment Data) or simply ANCOVA — precisely because the precision gains reduce the required sample size and speed up experimentation cycles.
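
The variance reduction can be seen by repeating a simulated experiment many times and comparing the spread of the two estimators. The sketch below assumes a standardized covariate that accounts for about 30% of the outcome variance and computes the adjusted coefficient by partialling the covariate out of both outcome and treatment, as in the Frisch-Waugh-Lovell theorem. It illustrates plain regression adjustment under these assumptions, not any particular company's CUPED implementation.

```python
import random
from statistics import fmean, variance


def slope(x, y):
    """Slope from an OLS regression of y on a constant and x."""
    xbar, ybar = fmean(x), fmean(y)
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))


def residualize(v, x):
    """Residuals of v after regressing it on a constant and x."""
    b, vbar, xbar = slope(x, v), fmean(v), fmean(x)
    return [(vi - vbar) - b * (xi - xbar) for vi, xi in zip(v, x)]


def one_experiment(rng, n=200, delta=20, sigma=60, r2=0.3):
    """Return (unadjusted, adjusted) estimates from one simulated RCT."""
    x = [rng.gauss(0, 1) for _ in range(n)]                  # standardized covariate
    t = [1 if rng.random() < 0.5 else 0 for _ in range(n)]   # random assignment
    y = [300 + delta * ti
         + sigma * (r2 ** 0.5 * xi + (1 - r2) ** 0.5 * rng.gauss(0, 1))
         for ti, xi in zip(t, x)]
    unadjusted = slope(t, y)
    # Frisch-Waugh-Lovell: partial x out of both t and y, then regress.
    adjusted = slope(residualize(t, x), residualize(y, x))
    return unadjusted, adjusted


rng = random.Random(5)
results = [one_experiment(rng) for _ in range(400)]
var_unadjusted = variance([u for u, _ in results])
var_adjusted = variance([a for _, a in results])

print(f"mean unadjusted = {fmean(u for u, _ in results):.1f}")
print(f"mean adjusted   = {fmean(a for _, a in results):.1f}")
print(f"variance ratio (adjusted / unadjusted) = {var_adjusted / var_unadjusted:.2f}")
```

Both estimators are centered on the assumed effect of 20 dollars. The variance ratio should come out in the neighborhood of 0.7, that is, \(1 - R^2\), which corresponds to the sample-size equivalence of \(1/0.7 \approx 1.43\) mentioned above.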


8. Statistical Power: How Many Drivers Does the Experiment Need?

Randomization ensures the experiment is unbiased. The remaining question is whether it will be informative enough. A small experiment can be perfectly unbiased and still lack the statistical power to distinguish the bonus effect from random noise.

Suppose Uber’s team wants to be able to detect an effect of at least 20 dollars per week (the minimum detectable effect, \(\delta = 20\)), and weekly earnings have a standard deviation of \(\sigma = 60\) dollars. They want to detect that effect with 80% power at the 5% significance level.

In a 50/50 randomized experiment, treatment \(T_i\) is a Bernoulli random variable with standard deviation \(\sigma_T = 0.5\). The formula from the power calculation section applies directly:

\[n \geq \left(\frac{(z_{0.025} + z_{0.20})\,\sigma}{\delta\,\sigma_T}\right)^2 = \left(\frac{(1.96 + 0.84) \times 60}{20 \times 0.5}\right)^2 = \left(\frac{168}{10}\right)^2 \approx 282\]

Uber needs about 282 drivers in the experiment (roughly 141 per group) to detect a 20-dollar effect with 80% probability.
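
The same calculation can be scripted so that the inputs are easy to vary. The sketch below implements the formula above with quantiles taken from the standard library's `NormalDist`; the function name `required_n` and its default arguments are illustrative choices.

```python
from math import ceil
from statistics import NormalDist


def required_n(delta, sigma, sigma_t=0.5, alpha=0.05, power=0.80):
    """Total sample size from the normal-approximation power formula.

    delta   : effect the experiment should be able to detect
    sigma   : standard deviation of the outcome
    sigma_t : standard deviation of the treatment indicator (0.5 for a 50/50 split)
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = z.inv_cdf(power)            # quantile matching the target power
    return ((z_alpha + z_beta) * sigma / (delta * sigma_t)) ** 2


n_base = required_n(delta=20, sigma=60)
print(f"delta = 20, power 80%: n = {n_base:.1f} (round up to {ceil(n_base)})")
print(f"delta = 10, power 80%: n = {required_n(delta=10, sigma=60):.1f}")
print(f"delta = 20, power 90%: n = {required_n(delta=20, sigma=60, power=0.90):.1f}")
```

Because the script uses unrounded quantiles rather than 1.96 and 0.84, its answer is slightly above 282 and rounds up to 283. The formula also makes the scaling explicit: halving the detectable effect quadruples the required sample.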

The simulation below lets you explore how this requirement changes under different assumptions. Before exploring freely, set the example parameters (\(\delta = 20\), \(\sigma = 60\), \(\sigma_T = 0.5\), \(n = 300\)) and verify that the empirical power is around 80%. Then:

  • How many drivers are needed to detect an effect of only 10 dollars per week?

  • What happens to the required sample size if the team insists on 90% power instead of 80%?
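
Empirical power is the share of simulated experiments in which the test rejects the null of no effect. The sketch below estimates it for the example parameters, assuming normally distributed earnings, a coin-flip assignment, and a two-sided z-test on the difference in means; it is a stand-in for the interactive simulation, not its actual code.

```python
import random
from math import sqrt
from statistics import fmean, variance


def empirical_power(n, delta, sigma, reps=1_000, z_crit=1.96, seed=0):
    """Share of simulated experiments that reject H0: no effect."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        treated, control = [], []
        for _ in range(n):
            if rng.random() < 0.5:                              # 50/50 assignment
                treated.append(300 + delta + rng.gauss(0, sigma))
            else:
                control.append(300 + rng.gauss(0, sigma))
        diff = fmean(treated) - fmean(control)
        se = sqrt(variance(treated) / len(treated)
                  + variance(control) / len(control))
        if abs(diff / se) > z_crit:                             # two-sided z-test
            rejections += 1
    return rejections / reps


print(f"power at n = 300: {empirical_power(n=300, delta=20, sigma=60):.2f}")
print(f"power at n = 100: {empirical_power(n=100, delta=20, sigma=60):.2f}")
```

At n = 300 the rejection rate should land a little above 80%, since 300 slightly exceeds the requirement computed above. Setting `delta=0` gives the false-positive rate instead, which should stay close to the 5% significance level.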

In practice, power analysis is done before running the experiment — it is a design decision, not a post-hoc justification. An experiment with 20% power that fails to detect an effect says nothing: the bonus might not work, or the experiment might have been too small to see it. Only with a prior power analysis can we distinguish between “the effect does not exist” and “the experiment lacked the capacity to detect it.”


9. Summary: The Experiment Identifies the ATE

The randomized controlled experiment identifies the ATE under a single condition verifiable by the researcher: that assignment is truly random. It does not require assuming that groups are similar in unobserved characteristics, nor that the regression model is correctly specified. Randomization guarantees comparability in expectation — both in observables and unobservables.

This explains why RCTs occupy the top of the evidence hierarchy in economics, medicine, and the social sciences. Not because they are perfect — SUTVA can be violated, implementation can produce imbalances, compliance can be imperfect — but because the required assumptions are transparent and largely under the researcher’s control.

When randomization is not feasible — for ethical, logistical, or cost reasons — economists turn to quasi-experimental methods that seek to recreate the RCT’s comparability under stronger assumptions: controlling for observable variables via regression and matching, exploiting institutional discontinuities, using variation generated by exogenous instruments, and tracking groups before and after an event. Each of those methods is the topic of a chapter in the remainder of this book, and in each case we will use the potential outcomes framework to clarify exactly what we are identifying and under what assumptions.