# Categorical Dependent Variables: Logit and Probit

Consider the following situation: a product team at a SaaS company wants to predict which users will churn in the next 30 days. The outcome of interest is categorical: the user either "churned" or "stayed." Based on what we have seen so far, any econometrician's natural reflex would be to turn this outcome into a binary variable (1 if the user "churned," 0 if the user "stayed") and run a regression. But can OLS predict probabilities? What happens when the dependent variable only takes two values?

This section answers those questions. We present the **logistic regression model** (logit) and the **probit model** as alternatives to OLS for binary dependent variables, explore coefficient interpretation through **marginal effects**, and extend the framework to the multi-category case with the **multinomial logit**.

---

(binary-objectives-en)=
## 1. Objectives

By the end of this section, you will be able to:

- Identify why OLS yields problematic estimates when the dependent variable is binary, both in terms of predictions outside the $[0,1]$ range and structural heteroskedasticity in the residuals.
- Understand how logit and probit solve these problems by mapping a linear index into the $[0,1]$ interval through a link function.
- Correctly interpret estimated logit/probit coefficients, which are not marginal effects but effects on the log-odds or latent index.
- Compute and interpret marginal effects: the effect at a point $X_0$, the effect at the mean, and the average marginal effect (AME).
- Apply multinomial logit when the dependent variable takes more than two unordered values.

---

(binary-ols-logit-en)=
## 2. The Problem with OLS and the Logit/Probit Solution

OLS can estimate a linear relationship between $X$ and $Y$, but when $Y$ only takes values 0 and 1, that linearity creates two unavoidable problems: predictions outside the $[0,1]$ range and structural heteroskedasticity in residuals. The simulation below compares OLS, Logit, and Probit on a sample of 500 SaaS users where $Y_i = 1$ if the user churned and $X_i$ is average weekly logins.

What should you look for?

- **With OLS:** focus on the red shaded regions - these are areas where the model predicts probabilities below 0 or above 1. These are not estimation errors; they are structural failures of the linear model.
- **In the OLS residuals vs fitted panel:** observe the two-band pattern. Residuals are not random - they have structure. This reflects intrinsic heteroskedasticity in a linear probability model.
- **Logit and Probit:** residuals are more diffuse, without the same band structure. The table shows that predictions at $X = 2, 5, 10$ are always valid probabilities.
- **When do Logit and Probit look similar?** When baseline probability is moderate (around 50%), all three models give similar results in the interior. The largest differences appear when the true probability approaches 0 or 1.

<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://simuecon.com/binary_outcome/?lang=en" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border: 0;" allowfullscreen></iframe>
</div>

### What do we observe?

**OLS fails in the tails, not in the center.** When true probability is moderate and the sample is large, the linear probability model (OLS) produces slope estimates reasonably close to logit and probit. The problem appears in prediction: as soon as the support of $X$ is wide enough, OLS inevitably yields predictions outside $[0,1]$. In addition, residuals show a two-band structure that reflects intrinsic heteroskedasticity - errors cannot have constant variance when $Y$ only takes 0 and 1.
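
To make this concrete, here is a minimal sketch along the lines of the simulation, using `numpy` and `statsmodels`. The data-generating process (true churn probability $\Lambda(2 - 0.6X)$, $n = 500$) is an illustrative assumption, not the simulation's actual configuration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500

# Simulated SaaS users: X = average weekly logins, Y = 1 if churned
x = rng.uniform(0, 12, n)
true_p = 1 / (1 + np.exp(-(2.0 - 0.6 * x)))   # churn falls with activity
y = rng.binomial(1, true_p)

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
logit = sm.Logit(y, X).fit(disp=0)

# OLS can predict outside [0, 1]; logit cannot
x_grid = sm.add_constant(np.array([0.0, 2.0, 5.0, 10.0, 12.0]))
print("OLS:  ", ols.predict(x_grid).round(3))
print("Logit:", logit.predict(x_grid).round(3))
print("OLS predictions outside [0,1]:",
      int(((ols.fittedvalues < 0) | (ols.fittedvalues > 1)).sum()))
```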

### Formal Result

#### The Problem with OLS

When the dependent variable is binary, $Y_i \in \{0, 1\}$, the conditional expectation is a probability:

$$E[Y_i | X_i] = P(Y_i = 1 | X_i) = p_i$$

The **linear probability model** (LPM) directly specifies $p_i = \beta_0 + \beta_1 X_i$. OLS estimates this model consistently under the usual assumptions, but it has two unavoidable structural failures.

**Failure 1 - Out-of-range predictions.** Nothing prevents $\hat{\beta}_0 + \hat{\beta}_1 X_i$ from being negative or larger than 1. Negative probabilities or probabilities above 1 are meaningless.

**Failure 2 - Structural heteroskedasticity.** Since $Y_i \sim \text{Bernoulli}(p_i)$, conditional variance is $\text{Var}(Y_i | X_i) = p_i(1-p_i)$. This variance depends on $X_i$ by construction - heteroskedasticity cannot be removed with any transformation of the linear model.

#### The Logit Model

The logit model specifies conditional probability through the **logistic function**:

$$P(Y_i = 1 | X_i) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_i)}} = \Lambda(\beta_0 + \beta_1 X_i)$$

where $\Lambda(\cdot)$ denotes the logistic cumulative distribution function. This function maps any real value into $(0, 1)$, guaranteeing valid predictions by construction.

The equivalent **log-odds** parameterization makes the underlying linearity explicit:

$$\ln\left(\frac{P(Y_i=1|X_i)}{1 - P(Y_i=1|X_i)}\right) = \beta_0 + \beta_1 X_i$$

**Interpretation of coefficient $\beta_1$:** a one-unit increase in $X_i$ changes log-odds by $\beta_1$ units. Equivalently, it multiplies odds by $e^{\beta_1}$.
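
A quick numeric illustration of the odds interpretation, using a hypothetical coefficient value rather than an estimate from any model above:

```python
import numpy as np

beta_1 = -0.45            # hypothetical logit coefficient on weekly logins
print(np.exp(beta_1))     # ~0.638
# Each extra weekly login multiplies the churn odds by ~0.64,
# i.e., the odds fall by about 36% per additional login.
```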

Parameters are estimated by **maximum likelihood** (ML) — see [Appendix A.2](#binary-mle-en) for an introduction to the method. The log-likelihood for a sample of $n$ independent observations is:

$$\ell(\beta) = \sum_{i=1}^n \left[ Y_i \ln \Lambda(X_i'\beta) + (1 - Y_i) \ln(1 - \Lambda(X_i'\beta)) \right]$$

There is no closed-form analytical solution; it is maximized numerically (typically with Newton-Raphson). Under standard regularity conditions, the ML estimator is consistent and asymptotically normal.
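
To see what the numerical maximization does, here is a minimal Newton-Raphson sketch for the logit log-likelihood, using the analytical score $\sum_i X_i(Y_i - \Lambda(X_i'\beta))$ and Hessian $-\sum_i \lambda(X_i'\beta) X_i X_i'$. This is a teaching sketch; in practice one relies on a fitted routine such as `sm.Logit`:

```python
import numpy as np

def logit_mle_newton(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson for the logit log-likelihood (minimal sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1 / (1 + np.exp(-X @ beta))       # Lambda(X'beta)
        grad = X.T @ (y - p)                  # score vector
        w = p * (1 - p)                       # logistic density at the index
        hess = -(X * w[:, None]).T @ X        # Hessian: -X' diag(w) X
        step = np.linalg.solve(hess, -grad)   # Newton step: -H^{-1} grad
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# With X, y from the earlier sketch, logit_mle_newton(X, y) reproduces
# sm.Logit(y, X).fit().params to machine precision.
```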

#### The Probit Model

The probit model specifies probability through the standard normal cumulative distribution function $\Phi(\cdot)$:

$$P(Y_i = 1 | X_i) = \Phi(\beta_0 + \beta_1 X_i)$$

The formal motivation is a **latent-variable model**: there exists an unobserved continuous variable $Y_i^* = \beta_0 + \beta_1 X_i + \varepsilon_i$ with $\varepsilon_i \sim N(0,1)$, and we observe $Y_i = 1$ if and only if $Y_i^* > 0$. Then:

$$P(Y_i = 1 | X_i) = P(Y_i^* > 0 | X_i) = P(\varepsilon_i > -(\beta_0 + \beta_1 X_i)) = \Phi(\beta_0 + \beta_1 X_i)$$

**Logit vs. Probit in practice:** the two link functions are very similar in the interior of the support and produce nearly identical marginal-effect estimates. The main difference is that the logistic distribution has slightly heavier tails than the normal, which makes logit and probit diverge more when true probability is very close to 0 or 1. In practice, choosing between them rarely changes substantive conclusions.
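
The following sketch makes the comparison concrete on simulated data (the data-generating process is again an illustrative assumption). Raw coefficients differ by a scale factor - a common rule of thumb is that logit slopes are roughly 1.6 times probit slopes - but average marginal effects nearly coincide:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 12, 500)
y = rng.binomial(1, 1 / (1 + np.exp(-(2.0 - 0.6 * x))))
X = sm.add_constant(x)

logit = sm.Logit(y, X).fit(disp=0)
probit = sm.Probit(y, X).fit(disp=0)

print("slope ratio (logit/probit):", logit.params[1] / probit.params[1])
print("logit AME: ", logit.get_margeff().margeff)
print("probit AME:", probit.get_margeff().margeff)
```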

---

(binary-marginal-en)=
## 3. Marginal Effects: The Effect Is Not Constant

Once we estimate a logit or probit, the obvious question arises: by how much does churn probability decrease if a user adds one more weekly login? The answer, which may be surprising, is that **it depends on how many logins that user already makes**.

The simulation works analytically with the probability function. It lets you see how the marginal effect of $X$ on $P(Y=1)$ varies over the support and compares three ways to summarize it:

- **Where is the marginal effect largest?** Drag the evaluation point $X_0$ along the axis. The effect is maximum at the inflection point of the sigmoid curve (where probability is 0.5) and approaches zero in both tails.
- **The tangent at $X_0$:** the red dashed line is the local linear approximation of the effect - its slope is exactly the marginal effect at that point.
- **The bottom panel:** shows the full marginal-effect curve as a function of $X$. Horizontal lines indicate the marginal effect at the mean (ME at mean) and the average marginal effect (AME, computed by integrating over the distribution of $X$).
- **Switch between Logit and Probit:** do marginal effects differ much? Where along the support is the gap largest?

<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://simuecon.com/marginal_effects/?lang=en" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border: 0;" allowfullscreen></iframe>
</div>

### What do we observe?

**The marginal effect is heterogeneous.** One additional login reduces churn probability much more for a moderately active user (4-6 logins/week, where the curve is steepest) than for a very inactive user (1 login/week, near the left tail where baseline probability is high) or a very active user (10+ logins/week, where probability is already very low). This heterogeneity is intrinsic to the model - not a limitation but a feature that reflects nonlinearity in probabilities.

### Formal Result

Logit and probit coefficients **are not marginal effects**. The effect of $X_i$ on probability depends on the level of $X_i$. For the logit model:

$$\frac{\partial P(Y_i=1|X_i)}{\partial X_i} = \lambda(\beta_0 + \beta_1 X_i) \cdot \beta_1$$

where $\lambda(\cdot) = \Lambda(\cdot)[1 - \Lambda(\cdot)]$ is the logistic density. For the probit model, the analogous expression replaces $\lambda$ with $\phi$ (the standard normal density).

Because this derivative depends on $X_i$, three summary measures are typically reported:

**Marginal effect at point $X_0$:** evaluates the derivative at a specific value of interest.

$$\text{ME}(X_0) = f(\hat{\beta}_0 + \hat{\beta}_1 X_0) \cdot \hat{\beta}_1$$

**Marginal effect at the mean (MEM):** evaluates the derivative at $\bar{X}$.

$$\text{MEM} = f(\hat{\beta}_0 + \hat{\beta}_1 \bar{X}) \cdot \hat{\beta}_1$$

**Average marginal effect (AME):** averages the derivative over all individuals in the sample. This is the most commonly used measure because it has a clear policy interpretation - the average effect in the observed population.

$$\text{AME} = \frac{1}{n} \sum_{i=1}^n f(\hat{\beta}_0 + \hat{\beta}_1 X_i) \cdot \hat{\beta}_1$$

**Formal result:** the AME is consistent under standard assumptions of correct model specification. It is the nonlinear counterpart of the OLS coefficient in the linear probability model: both estimate the average population effect of a marginal change in $X$.
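
The three measures are straightforward to compute by hand from a fitted logit and to cross-check against `statsmodels`, whose `get_margeff()` reports the AME by default. The simulated data below are an illustrative assumption:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 12, 500)
y = rng.binomial(1, 1 / (1 + np.exp(-(2.0 - 0.6 * x))))
X = sm.add_constant(x)
fit = sm.Logit(y, X).fit(disp=0)
b0, b1 = fit.params

def lam(z):
    """Logistic density: Lambda(z) * (1 - Lambda(z))."""
    L = 1 / (1 + np.exp(-z))
    return L * (1 - L)

me_x0 = lam(b0 + b1 * 5.0) * b1          # ME at the point X0 = 5
mem   = lam(b0 + b1 * x.mean()) * b1     # ME at the mean
ame   = (lam(b0 + b1 * x) * b1).mean()   # average marginal effect

print(f"ME(X0=5) = {me_x0:.4f}, MEM = {mem:.4f}, AME = {ame:.4f}")
print("statsmodels AME:", fit.get_margeff().margeff)   # matches ame
```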

---

(binary-sigmoid-en)=
## 4. Extra: The Sigmoid Function and the Log-Odds Space

The core of the logit model is the **logistic function** (also called the sigmoid): an S-shaped curve that takes any real number and maps it into the open interval $(0, 1)$. This transformation guarantees that predictions are always valid probabilities.

The simulation lets you explore the logistic function and its relationship to the **log-odds space**. The key result is the duality between both panels: probability $P(Y=1 | X)$ is S-shaped (nonlinear), but if we plot log-odds $\ln[p/(1-p)]$ as a function of $X$, we get a perfectly straight line.

- **What does $\beta_0$ control?** Move it and observe the horizontal shift of the curve. $\beta_0$ determines baseline probability when $X = 0$.
- **What does $\beta_1$ control?** Increase it and observe the curve becoming steeper. A large $\beta_1$ turns the sigmoid into something close to a step function.
- **Right panel (log-odds vs. $X$):** regardless of parameter values, the line is always straight. The slope of that line is exactly $\beta_1$.
- **Switch between Logit and Probit:** curves look very similar but not identical. Where along the support do they differ the most?

<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://simuecon.com/log_odds_sigmoid/?lang=en" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border: 0;" allowfullscreen></iframe>
</div>

### What do we observe?

**The sigmoid function has an elegant duality.** In probability space, the model is nonlinear (S-curve). But if we transform probability into log-odds space via $\ln[p/(1-p)]$, the model is perfectly linear in $X$. That linearity in log-odds is what defines logit: coefficient $\beta_1$ is the change in log-odds per unit of $X$, not the change in probability.

### Formal Result

**Result:** if $P(Y=1|X) = \Lambda(\beta_0 + \beta_1 X)$ where $\Lambda(z) = 1/(1+e^{-z})$, then log-odds is linear in $X$.

**Step 1 -** Compute odds:

$$\frac{P(Y=1|X)}{P(Y=0|X)} = \frac{\Lambda(z)}{1 - \Lambda(z)} = \frac{1/(1+e^{-z})}{e^{-z}/(1+e^{-z})} = e^z$$

**Step 2 -** Take logarithm:

$$\ln\left(\frac{P(Y=1|X)}{P(Y=0|X)}\right) = z = \beta_0 + \beta_1 X$$

Log-odds is a linear function of $X$ with slope $\beta_1$. $\blacksquare$
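
A few lines of code verify the duality numerically, with illustrative parameter values:

```python
import numpy as np

beta_0, beta_1 = -1.0, 0.8                       # illustrative parameters
x = np.linspace(-5, 5, 11)
p = 1 / (1 + np.exp(-(beta_0 + beta_1 * x)))     # sigmoid: nonlinear in x

log_odds = np.log(p / (1 - p))
print(np.allclose(log_odds, beta_0 + beta_1 * x))   # True: exactly linear
print(np.diff(log_odds) / np.diff(x))               # constant slope beta_1 = 0.8
```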

---

(binary-multinomial-en)=
## 5. Multiple Categories: Multinomial Logit

What happens when the dependent variable has more than two unordered categories? Think of the subscription plan in a SaaS company: **Free**, **Basic**, or **Pro**. There is no natural ordering among them - they are qualitative categories.

**Multinomial logit** generalizes binary logit to $J$ categories. A reference category is chosen (typically the most common one; here, Free), and one coefficient vector is estimated for each of the remaining $J-1$ categories against the reference. For three categories:

$$P(\text{Basic} | X) = \frac{e^{\beta_{\text{Basic}} \cdot X}}{1 + e^{\beta_{\text{Basic}} \cdot X} + e^{\beta_{\text{Pro}} \cdot X}}$$

$$P(\text{Pro} | X) = \frac{e^{\beta_{\text{Pro}} \cdot X}}{1 + e^{\beta_{\text{Basic}} \cdot X} + e^{\beta_{\text{Pro}} \cdot X}}$$

$$P(\text{Free} | X) = \frac{1}{1 + e^{\beta_{\text{Basic}} \cdot X} + e^{\beta_{\text{Pro}} \cdot X}}$$

where $\beta_j \cdot X = \beta_{0j} + \beta_{1j} X_1 + \beta_{2j} X_2$ with $X_1$ = company size and $X_2$ = weekly logins.
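
A minimal sketch of these formulas, with made-up coefficient values chosen purely for illustration, shows that the three probabilities always sum to one:

```python
import numpy as np

def mnl_probs(x, beta_basic, beta_pro):
    """Three-category multinomial logit with Free as the reference."""
    u_basic = np.exp(beta_basic @ x)
    u_pro = np.exp(beta_pro @ x)
    denom = 1 + u_basic + u_pro
    return {"Free": 1 / denom, "Basic": u_basic / denom, "Pro": u_pro / denom}

# Illustrative coefficients: [intercept, company size, weekly logins]
beta_basic = np.array([-1.0, 0.03, 0.15])
beta_pro = np.array([-3.0, 0.08, 0.25])

x = np.array([1.0, 50, 8])      # a firm with 50 employees, 8 logins/week
probs = mnl_probs(x, beta_basic, beta_pro)
print(probs)
print("sum =", sum(probs.values()))   # exactly 1 by construction
```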

The simulation below shows how coefficients shift probability mass across the three categories. The **stacked area chart** makes the unit-sum restriction visible: when $P(\text{Pro})$ rises, some other probability must fall.

What should you explore?

- **Increase coefficients for Pro vs. Free:** observe how $P(\text{Pro})$ rises at the expense of the other categories. Which category does it "steal" more from?
- **Match Basic and Pro coefficients:** when both coefficient vectors are identical, the model cannot distinguish between the two plans - their predicted probabilities coincide exactly.
- **The individual prediction panel:** change the firm profile (size and logins) and observe in real time which plan the model predicts as most likely.
- **The odds-ratio table:** $e^{\beta_{1j}}$ gives the odds ratio of category $j$ vs. Free for each additional unit of $X_1$. When is this interpretation more intuitive than direct probability?

<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
  <iframe src="https://simuecon.com/multinomial_logit/?lang=en" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border: 0;" allowfullscreen></iframe>
</div>

**IIA assumption.** Multinomial logit imposes the **independence of irrelevant alternatives** (IIA) assumption: the probability ratio between any two categories does not depend on what other categories exist in the choice set. In this example, $P(\text{Pro})/P(\text{Basic})$ is the same whether or not the Free plan exists. This can be unrealistic in some settings: if a very similar "Pro Lite" plan were introduced, IIA would predict that it draws market share from Free, Basic, and Pro in proportion to their current shares, rather than mostly cannibalizing Pro - often not a credible pattern. Nested logit or multinomial probit can relax this assumption when relevant.
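
The IIA property is easy to see in code: continuing with the illustrative coefficients from the previous sketch, the ratio $P(\text{Pro} \mid X)/P(\text{Basic} \mid X)$ collapses to $e^{(\beta_{\text{Pro}} - \beta_{\text{Basic}}) \cdot X}$, with no reference to the Free plan at all:

```python
import numpy as np

beta_basic = np.array([-1.0, 0.03, 0.15])
beta_pro = np.array([-3.0, 0.08, 0.25])
x = np.array([1.0, 50, 8])

# The denominators cancel: the ratio does not involve any other category
ratio = np.exp((beta_pro - beta_basic) @ x)
print(ratio)   # equals P(Pro|X) / P(Basic|X) whether or not Free exists
```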

---

(binary-appendix-en)=
## Appendix: Formal Derivations

### A.1 Structural Heteroskedasticity in the LPM

**Result:** in the linear probability model, $\text{Var}(\varepsilon_i | X_i) = p_i(1-p_i)$, which depends on $X_i$.

**Step 1 -** Write $Y_i = p_i + \varepsilon_i$ where $p_i = E[Y_i|X_i]$.

**Step 2 -** Since $Y_i \in \{0,1\}$, we have $Y_i^2 = Y_i$ and therefore $E[Y_i^2|X_i] = p_i$, so:

$$\text{Var}(Y_i | X_i) = E[Y_i^2 | X_i] - (E[Y_i|X_i])^2 = p_i - p_i^2 = p_i(1-p_i)$$

**Step 3 -** Since $p_i = \beta_0 + \beta_1 X_i$, variance depends on $X_i$ whenever $\beta_1 \neq 0$. OLS remains consistent but is inefficient, and usual standard errors are incorrect. $\blacksquare$
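
A quick simulation check of this result, with an assumed LPM specification chosen so that $p_i$ stays inside $(0,1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# LPM with p_i = 0.1 + 0.05 X_i and X in [0, 10], so p_i is in (0.1, 0.6)
x = rng.uniform(0, 10, n)
p = 0.1 + 0.05 * x
eps = rng.binomial(1, p) - p          # LPM error: Y_i - p_i

# Residual variance tracks p(1 - p) across the support of X
for lo in (0, 4, 8):
    mask = (x >= lo) & (x < lo + 2)
    p_mid = 0.1 + 0.05 * (lo + 1)
    print(f"X in [{lo},{lo+2}): Var(eps) = {eps[mask].var():.4f}, "
          f"p(1-p) at midpoint = {p_mid * (1 - p_mid):.4f}")
```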

(binary-mle-en)=
### A.2 Introduction to Maximum Likelihood Estimation

**Maximum likelihood** (ML) is one of the most important estimation principles in statistics. The core idea is simple: given the observed data, choose the parameter values under which those data are most likely.

**The likelihood function.** For a sample of $n$ independent observations with distribution $f(y_i; \theta)$, the likelihood function is:

$$L(\theta) = \prod_{i=1}^n f(y_i; \theta)$$

**The log-likelihood.** For numerical convenience we work with the log, which turns the product into a sum:

$$\ell(\theta) = \ln L(\theta) = \sum_{i=1}^n \ln f(y_i; \theta)$$

**The ML estimator** is the value $\hat{\theta}$ that maximizes $\ell(\theta)$:

$$\hat{\theta}^{\text{ML}} = \arg\max_\theta \, \ell(\theta)$$

**Properties.** Under standard regularity conditions, the ML estimator is consistent ($\hat{\theta}^{\text{ML}} \xrightarrow{p} \theta^*$) and asymptotically normal:

$$\sqrt{n}(\hat{\theta}^{\text{ML}} - \theta^*) \xrightarrow{d} N\!\left(0,\, \mathcal{I}(\theta^*)^{-1}\right)$$

where $\mathcal{I}(\theta^*) = -E[\partial^2 \ell / \partial\theta\,\partial\theta']$ is the **Fisher information matrix**. The estimator is also asymptotically efficient: no consistent estimator has lower asymptotic variance.

**In logit and probit**, $Y_i | X_i \sim \text{Bernoulli}(F(X_i'\beta))$ with $F = \Lambda$ (logistic) or $F = \Phi$ (standard normal). The resulting log-likelihood has no closed-form solution and is maximized numerically, typically with Newton-Raphson. $\blacksquare$
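
As a minimal self-contained illustration of the ML principle, the Bernoulli case has a known analytical solution ($\hat{p}^{\text{ML}} = \bar{Y}$) against which a numeric maximizer can be checked:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=1000)   # Bernoulli sample, true p = 0.3

def neg_loglik(p):
    """Negative Bernoulli log-likelihood (minimized instead of maximized)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(f"numeric MLE: {res.x:.4f}")
print(f"analytical MLE (sample mean): {y.mean():.4f}")   # they coincide
```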
