3 Logistic Regression
Previously, we discussed cases where the independent variables are categorical. Now suppose the dependent variable $y$ is binary, i.e., $y = 0$ or $y = 1$. In this scenario, the normality assumption on the residuals cannot hold (a variable taking only two values is not normally distributed), and the standard inference for linear regression is not applicable.
Instead, we use a different approach: logistic regression.
3.1 Sigmoid Function
Consider the following function:
\[y = \frac{e^x}{1 + e^{x}} = \frac{1}{1 + e^{-x}}.\]
We observe that:
- $0 < y < 1$,
- $\lim_{x \to -\infty} y = 0$,
- $\lim_{x \to \infty} y = 1$,
- $y$ is a continuous and differentiable function,
- $y(0) = 0.5$.
This function is called the sigmoid function. It is commonly used to model the probability of a binary outcome.
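These properties are easy to verify numerically. The snippet below is a minimal sketch in Python (NumPy only; the helper name `sigmoid` is our own choice):

```python
import numpy as np

def sigmoid(x):
    """The sigmoid function y = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))                           # 0.5, as stated above
print(sigmoid(-50.0), sigmoid(50.0))          # ~0 and ~1: the two limits
x = np.linspace(-10.0, 10.0, 1001)
print(bool(np.all(np.diff(sigmoid(x)) > 0)))  # strictly increasing on this grid
```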
By inverting the sigmoid function, we obtain:
\[x = \ln \left( \frac{y}{1 - y} \right).\]
This inverse is known as the logit, or log-odds, function.
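A quick numerical check that the logit indeed inverts the sigmoid (again a sketch; `sigmoid` and `logit` are our own helper names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(y):
    # inverse of the sigmoid: x = ln(y / (1 - y))
    return np.log(y / (1.0 - y))

print(logit(sigmoid(1.7)))  # round trip recovers 1.7
```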
3.2 Regression Model

Given a binary outcome $Y$:
\[Y = \begin{cases} 1 & \text{if some condition is met}, \\ 0 & \text{otherwise}. \end{cases}\]
We model the probability that $Y = 1$ as:
\[p = P[Y = 1 \mid X],\]
where $X$ represents the independent variables.
Let
\[p = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}.\]
This implies:
\[\ln \left( \frac{p}{1 - p} \right) = \beta_0 + \beta_1 X.\]
That is, the log-odds of $p$ is linear in $X$.
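For concreteness, here is a small numerical illustration (the coefficient values $\beta_0 = -1$ and $\beta_1 = 0.8$ are arbitrary, chosen only for the example):

```python
import numpy as np

beta0, beta1 = -1.0, 0.8          # hypothetical coefficients, for illustration
X = np.array([0.0, 1.0, 2.5])

p = np.exp(beta0 + beta1 * X) / (1.0 + np.exp(beta0 + beta1 * X))
print(np.log(p / (1.0 - p)))      # recovers beta0 + beta1 * X: [-1.0, -0.2, 1.0]
```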
3.3 Maximum Likelihood Estimation

To estimate the parameters $\beta_0$ and $\beta_1$, recall the probability mass function of the Bernoulli distribution:
\[P[Y = y] = p^y (1 - p)^{1 - y}, \quad y \in \{0, 1\}.\]
The likelihood function for $n$ independent observations is:
\[L(\beta_0, \beta_1) = P[Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n] = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i},\]
where:
- $p_i = p(x_i) = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}$,
- $Y_i$ is a random binary variable,
- $y_i$ is the observed value of $Y_i$.
The log-likelihood function is:
\[\ln L(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \right].\]
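In code, this formula translates directly; the sketch below handles a single covariate (the clipping guard against $\ln 0$ is our own addition):

```python
import numpy as np

def log_likelihood(beta0, beta1, x, y):
    """ln L(beta0, beta1) for binary responses y and one covariate x."""
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))   # p_i = p(x_i)
    p = np.clip(p, 1e-12, 1.0 - 1e-12)               # guard against ln(0)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```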
Maximum likelihood estimation seeks the values of $\beta_0$ and $\beta_1$ that maximize the log-likelihood. This is done by setting the partial derivatives of the log-likelihood with respect to $\beta_0$ and $\beta_1$ to zero:
\[\begin{cases} \frac{\partial \ln L}{\partial \beta_0} = 0, \\ \frac{\partial \ln L}{\partial \beta_1} = 0. \end{cases}\]
Since $\frac{\partial p_i}{\partial \beta_0} = p_i (1 - p_i)$ and $\frac{\partial p_i}{\partial \beta_1} = p_i (1 - p_i)\, x_i$, these conditions reduce to:
\[\begin{cases} \sum_{i=1}^{n} (y_i - p_i) = 0, \\ \sum_{i=1}^{n} (y_i - p_i)\, x_i = 0, \end{cases}\]
or equivalently:
\[\begin{cases} \bar{y} = \bar{p}, \\ \overline{xy} = \overline{xp}. \end{cases}\]
Since $p_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}$, this is a nonlinear system of two equations in two unknowns. It is typically solved using numerical methods, such as the Newton-Raphson method.
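The sketch below implements the Newton-Raphson iteration. It is written for an arbitrary number of covariates, so it also covers the generalization that follows; with a single covariate it solves exactly the two-equation system above. This is a minimal illustration, not production code (no step-size control or checks for separated data):

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Solve the score equations sum_i (y_i - p_i) = 0 and
    sum_i (y_i - p_i) x_i = 0 by Newton-Raphson.
    X: (n, k) covariate matrix, y: (n,) array of 0/1 responses."""
    Xd = np.hstack([np.ones((len(y), 1)), X])    # prepend intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))     # p_i under the current beta
        grad = Xd.T @ (y - p)                    # gradient of ln L
        W = p * (1.0 - p)                        # Var[Y_i] = p_i (1 - p_i)
        H = Xd.T @ (Xd * W[:, None])             # negative Hessian of ln L
        step = np.linalg.solve(H, grad)
        beta = beta + step                       # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# quick check on synthetic data with known coefficients (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
p_true = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x[:, 0])))
y = (rng.uniform(size=500) < p_true).astype(float)
print(fit_logistic_newton(x, y))                 # roughly [-1.0, 2.0]
```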
For $k$ independent variables, the model generalizes to:
\[p_i = P[Y_i = 1 \mid X_i] = \frac{e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}}}.\]
The resulting system is:
\[\begin{cases} \bar{y} = \bar{p}, \\ \overline{x_j y} = \overline{x_j p}, \quad j = 1, 2, \ldots, k. \end{cases}\]
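In practice one rarely codes the solver by hand. For instance, the statsmodels package fits this model directly by maximum likelihood; the snippet below is a sketch on synthetic data (assuming statsmodels is installed, and with arbitrary true coefficients):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))                  # k = 3 covariates
eta = 0.5 - 1.0 * X[:, 0] + 2.0 * X[:, 2]      # hypothetical linear predictor
y = (rng.uniform(size=300) < 1.0 / (1.0 + np.exp(-eta))).astype(int)

model = sm.Logit(y, sm.add_constant(X))        # logit link, fitted by MLE
result = model.fit()
print(result.params)                           # estimates of beta_0, ..., beta_3
```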
3.4 Multiclass Case: $M$ Categories of $Y$

When $Y$ has $M$ categories and $k$ independent variables, the model becomes:
\[P[Y_i = j \mid X_i] = \frac{e^{\beta_0^j + \beta_1^j x_{i1} + \beta_2^j x_{i2} + \cdots + \beta_k^j x_{ik}}}{1 + e^{\beta_0^j + \beta_1^j x_{i1} + \beta_2^j x_{i2} + \cdots + \beta_k^j x_{ik}}}, \quad j = 1, 2, \ldots, M-1,\]
\[P[Y_i = M \mid X_i] = 1 - \sum_{j=1}^{M-1} P[Y_i = j \mid X_i].\]
In matrix notation:
\[P[Y_i = j \mid X_i] = \frac{1}{1 + e^{-\beta_j^T X_i}}, \quad j = 1, 2, \ldots, M-1,\]
\[P[Y_i = M \mid X_i] = 1 - \sum_{j=1}^{M-1} \frac{1}{1 + e^{-\beta_j^T X_i}},\]
where:
- $\beta_j = (\beta_0^j, \beta_1^j, \ldots, \beta_k^j)$,
- $X_i = (1, x_{i1}, x_{i2}, \ldots, x_{ik})$.
We also have, for each $j = 1, 2, \ldots, M-1$:
\[\beta_j^T X_i = \ln \left( \frac{P[Y_i = j \mid X_i]}{1 - P[Y_i = j \mid X_i]} \right).\]
The likelihood function is:
\[L(\beta_1, \beta_2, \ldots, \beta_{M-1}) = \prod_{i=1}^{n} \prod_{j=1}^{M} P[Y_i = j \mid X_i]^{y_{ij}},\]
where:
- $\beta_j = (\beta_0^j, \beta_1^j, \ldots, \beta_k^j)$,
- $y_{ij} = 1$ if the observed value of $Y_i$ is $j$, and $y_{ij} = 0$ otherwise.
Setting the partial derivatives of the log-likelihood to zero gives a system of $(k+1)(M-1)$ nonlinear equations; solving it, again by numerical methods, yields the maximum likelihood estimates of the parameters.
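A sketch of the multiclass fit under the parameterization above, using a generic numerical optimizer (SciPy's BFGS) instead of hand-written Newton equations. Note that with per-class sigmoids the remainder probability $P[Y_i = M \mid X_i]$ is not guaranteed to stay positive during the search, so we clip before taking logarithms; the clipping and all names here are our own choices:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(theta, Xd, Y, M):
    """-ln L for the model above: per-class sigmoids for j = 1..M-1,
    with class M as the remainder. theta stacks the (M-1) vectors beta_j."""
    B = theta.reshape(M - 1, Xd.shape[1])
    P = 1.0 / (1.0 + np.exp(-Xd @ B.T))           # n x (M-1) class probabilities
    P_M = 1.0 - P.sum(axis=1, keepdims=True)      # remainder class M
    probs = np.clip(np.hstack([P, P_M]), 1e-12, 1.0)
    return -np.sum(Y * np.log(probs))

# illustrative synthetic data: n = 200, k = 2 covariates, M = 3 categories
rng = np.random.default_rng(2)
n, k, M = 200, 2, 3
X = rng.normal(size=(n, k))
Xd = np.hstack([np.ones((n, 1)), X])              # X_i = (1, x_i1, ..., x_ik)
Y = np.eye(M)[rng.integers(0, M, size=n)]         # one-hot rows y_ij

res = minimize(neg_log_lik, np.zeros((M - 1) * (k + 1)),
               args=(Xd, Y, M), method="BFGS")
print(res.x.reshape(M - 1, k + 1))                # estimated beta_j, one row per class
```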