Thursday, 13 November 2025

Maximum Likelihood Estimation of Regression with Binary Response

by W. B. Meitei, PhD


In the logit or probit model, the outcome variable \(y_i\) is a Bernoulli random variable, i.e.,

\[y_{i} = \left\{ \begin{matrix} 1 & with\ probability\ p_{i} \\ 0 & with\ probability\ 1 - p_{i} \end{matrix} \right.\ \]

Thus, if n observations are available, then the likelihood function is,

\[L = \prod_{i = 1}^{n}{p_{i}^{y_{i}}\left( 1 - p_{i} \right)^{1 - y_{i}}}\]

The logit or probit model arises when \(p_i\) is specified as the logistic or standard normal cumulative distribution function evaluated at \(X_i'\beta\). Letting \(F(X_i'\beta)\) denote this cumulative distribution function, the likelihood function of both models is,

\[L = \prod_{i = 1}^{n}{{F(X_{i}'\beta)}^{y_{i}}\left\{ 1 - F(X_{i}'\beta) \right\}^{1 - y_{i}}}\]

Thus, the log-likelihood function is,

\[\ln(L) = l = \sum_{i = 1}^{n}\left\lbrack y_{i}\ln\left\{ F(X_{i}'\beta) \right\} + \left( 1 - y_{i} \right)\ln\left\{ 1 - F(X_{i}'\beta) \right\} \right\rbrack \qquad (1)\]
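As a concrete illustration, the log-likelihood (1) is straightforward to evaluate numerically for any candidate β. Here is a minimal sketch with the logistic CDF playing the role of F; the data and β are made up for illustration:

```python
import numpy as np

def log_likelihood(beta, X, y, F):
    """Evaluate the log-likelihood (1) for a candidate beta,
    with the CDF F applied to the linear index X @ beta."""
    p = F(X @ beta)  # p_i = F(X_i' beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data: an intercept plus one regressor, four observations
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])
logistic_cdf = lambda z: 1.0 / (1.0 + np.exp(-z))

# At beta = 0 every p_i = 0.5, so l = 4 ln(0.5) ≈ -2.7726
print(log_likelihood(np.zeros(2), X, y, logistic_cdf))
```

Maximising this function over β is exactly the problem the numerical methods below solve.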

Now, the first-order conditions of ln(L) are nonlinear in β and admit no closed-form solution. Therefore, the maximum likelihood estimates must be obtained using numerical optimisation methods.

Using a numerical optimisation method such as the Newton-Raphson method, we can obtain the maximum likelihood estimate of β recursively as,

\[{\widehat{\beta}}_{n + 1} = {\widehat{\beta}}_{n} - \left\lbrack \frac{d^{2}l}{d\beta^{2}} \right\rbrack_{\beta = {\widehat{\beta}}_{n}}^{- 1}\left\lbrack \frac{dl}{d\beta} \right\rbrack_{\beta = {\widehat{\beta}}_{n}} \qquad (2)\]

Here, \({\widehat{\beta}}_{n}\) is the nth-round estimate, and the Hessian \(\left( \frac{d^{2}l}{d\beta^{2}} \right)\) and the score vector \(\left( \frac{dl}{d\beta} \right)\) are evaluated at this estimate.

We know from the maximum likelihood theorem that

\[\sqrt{n}\left( {\widehat{\beta}}_{ML} - \beta \right)\overset{asy}{\rightarrow}N\left\lbrack 0, - E\left( \frac{d^{2}l}{d\beta^{2}} \right)^{- 1} \right\rbrack\]

or

\[\sqrt{n}\left( {\widehat{\beta}}_{ML} - \beta \right)\overset{asy}{\rightarrow}N\left\lbrack 0,I^{- 1}(\beta) \right\rbrack\]

where \(I(\beta_{0}) = - E\left\lbrack \left. \frac{d^{2}l}{d\beta^{2}} \right|_{\beta = \beta_{0}} \right\rbrack\) is the Fisher information and \(\beta_{0}\), the true parameter value, is an interior point of the parameter space.

Now, in a finite sample, the distribution of \({\widehat{\beta}}_{ML}\) can be approximated by,

\[N\left\lbrack \beta, - \left( \frac{d^{2}l}{d\beta^{2}} \right)_{\beta = {\widehat{\beta}}_{ML}}^{- 1} \right\rbrack\]
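In practice, this means that once \({\widehat{\beta}}_{ML}\) is found, an approximate covariance matrix for it is the inverse of the negative Hessian evaluated there, and standard errors are the square roots of its diagonal. A toy numeric sketch (the Hessian values are made up):

```python
import numpy as np

# Hypothetical Hessian of the log-likelihood at the ML estimate
H = np.array([[-40.0,  5.0],
              [  5.0, -25.0]])

cov = np.linalg.inv(-H)     # approximate Var(beta_hat)
se = np.sqrt(np.diag(cov))  # approximate standard errors
print(se)
```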


1. The case of the Logit Model

Here, the cumulative distribution function is,

\[F(x) = \frac{1}{1 + e^{- x}}\]

And the probability density function is,

\[f(x) = \frac{e^{- x}}{\left( 1 + e^{- x} \right)^{2}}\]

Then,

\[1 - F(x) = 1 - \frac{1}{1 + e^{- x}} = \frac{e^{- x}}{1 + e^{- x}} = F( - x)\]

\[\frac{f(x)}{F(x)} = \frac{e^{- x}}{\left( 1 + e^{- x} \right)^{2}}\left( 1 + e^{- x} \right) = \frac{e^{- x}}{1 + e^{- x}} = F( - x)\]
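These two logistic identities are easy to verify numerically; the following is a quick sanity check, not part of the derivation:

```python
import numpy as np

F = lambda x: 1.0 / (1.0 + np.exp(-x))            # logistic CDF
f = lambda x: np.exp(-x) / (1.0 + np.exp(-x))**2  # logistic PDF

x = np.linspace(-5, 5, 11)
assert np.allclose(1 - F(x), F(-x))     # 1 - F(x) = F(-x)
assert np.allclose(f(x) / F(x), F(-x))  # f(x)/F(x) = F(-x)
print("both identities hold")
```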

Now,

\begin{align} \frac{d}{dx}f(x) &= \frac{d}{dx}\frac{e^{-x}}{(1 + e^{-x})^2} \\[0.5em] &= \frac{(1 + e^{-x})^2 \frac{d}{dx}e^{-x} - e^{-x} \frac{d}{dx}(1 + e^{-x})^2}{\{(1 + e^{-x})^2\}^2} \\[0.5em] &= \frac{(1 + e^{-x})^2(-e^{-x}) - e^{-x}\left\{2(1 + e^{-x}) \frac{d}{dx}e^{-x}\right\}}{(1 + e^{-x})^4} \\[0.5em] &= \frac{-e^{-x}(1 + e^{-x})^2 + 2e^{-2x}(1 + e^{-x})}{(1 + e^{-x})^4} \\[0.5em] &= \frac{e^{-x}(1 + e^{-x})(-1 - e^{-x} + 2e^{-x})}{(1 + e^{-x})^4} \\[0.5em] &= -\frac{e^{-x}(1 - e^{-x})}{(1 + e^{-x})^3} \\[0.5em] &= -\frac{e^{-x}}{(1 + e^{-x})^2} \frac{1}{(1 + e^{-x})}(1 - e^{-x}) \\[0.5em] &= -f(x)F(x)(1 - e^{-x}) \end{align}

Thus, the score vector for the logit model using (1) is,

\begin{align*} \frac{d l}{d \beta} &= \frac{d}{d \beta} \sum_{i=1}^{n} \left[ y_i \ln \{ F(X_i' \beta) \} + (1-y_i) \ln \{ 1 - F(X_i' \beta) \} \right] \\[0.5em] &= \frac{d}{d \beta} \sum_{i=1}^{n} \left[y_i \ln \{ F(X_i' \beta) \} + (1-y_i) \ln \{ F(-X_i' \beta) \}\right] \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{d}{d \beta} \ln \{ F(X_i' \beta) \} + \sum_{i=1}^{n} (1-y_i) \frac{d}{d \beta} \ln \{ F(-X_i' \beta) \} \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{d}{d \beta} \ln \left( \frac{1}{1+e^{-X_i' \beta}} \right) + \sum_{i=1}^{n} (1-y_i) \frac{d}{d \beta} \ln \left( \frac{1}{1+e^{X_i' \beta}} \right) \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{d}{d \beta} \left[ \ln(1) - \ln(1+e^{-X_i' \beta}) \right] + \sum_{i=1}^{n} (1-y_i) \frac{d}{d \beta} \left[ \ln(1) - \ln(1+e^{X_i' \beta}) \right] \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{d}{d \beta} \left[ -\ln(1+e^{-X_i' \beta}) \right] + \sum_{i=1}^{n} (1-y_i) \frac{d}{d \beta} \left[ -\ln(1+e^{X_i' \beta}) \right] \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{-1}{1+e^{-X_i' \beta}} \frac{d}{d \beta} (1+e^{-X_i' \beta}) + \sum_{i=1}^{n} (1-y_i) \frac{-1}{1+e^{X_i' \beta}} \frac{d}{d \beta} (1+e^{X_i' \beta}) \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{X_i e^{-X_i' \beta}}{1+e^{-X_i' \beta}} - \sum_{i=1}^{n} (1-y_i) \frac{X_i e^{X_i' \beta}}{1+e^{X_i' \beta}} \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{X_i}{1+e^{X_i' \beta}} - \sum_{i=1}^{n} (1-y_i) \frac{X_i}{1+e^{-X_i' \beta}} \\[0.5em] &= \sum_{i=1}^{n} y_i F(-X_i' \beta) X_i - \sum_{i=1}^{n} (1-y_i) F(X_i' \beta) X_i \\[0.5em] &= \sum_{i=1}^{n} y_i \{ 1 - F(X_i' \beta) \} X_i - \sum_{i=1}^{n} (1-y_i) F(X_i' \beta) X_i \\[0.5em] &= \sum_{i=1}^{n} [ y_i - y_i F(X_i' \beta) - F(X_i' \beta) + y_i F(X_i' \beta) ] X_i \\[0.5em] &= \sum_{i=1}^{n} [ y_i - F(X_i' \beta) ] X_i \end{align*}

And the Hessian for the logit model using (1) is,

\begin{align*} \frac{d^2 l}{d \beta^2} &= \frac{d}{d \beta} \frac{d l}{d \beta} \\[0.5em] &= \frac{d}{d \beta} \sum_{i=1}^n [y_i - F(X_i' \beta)] X_i \\[0.5em] &= 0 - \sum_{i=1}^n \frac{d}{d \beta} F(X_i' \beta) X_i \\[0.5em] &= -\sum_{i=1}^n \frac{e^{-X_i' \beta}}{(1 + e^{-X_i' \beta})^2} X_i X_i' \\[0.5em] &= -\sum_{i=1}^n f(X_i' \beta) X_i X_i' \end{align*}

Using the score vector and the Hessian in (2), and iterating until \(\left| {\widehat{\beta}}_{n + 1} - {\widehat{\beta}}_{n} \right| < \epsilon\) for some small tolerance \(\epsilon\), we obtain the required \({\widehat{\beta}}_{ML}\).
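The full recursion for the logit model fits in a few lines, since the score \(\sum_i [y_i - F(X_i'\beta)]X_i\) and Hessian \(-\sum_i f(X_i'\beta)X_iX_i'\) are exactly the expressions derived above (and for the logistic, \(f = F(1-F)\)). A minimal sketch; the simulated data and the tolerance are arbitrary choices for illustration:

```python
import numpy as np

def logit_mle(X, y, tol=1e-8, max_iter=100):
    """Newton-Raphson for the logit model using the score and Hessian above."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # F(X_i' beta)
        score = X.T @ (y - p)                  # sum [y_i - F] X_i
        w = p * (1 - p)                        # f(X_i' beta) = F(1 - F)
        hessian = -(X * w[:, None]).T @ X      # -sum f X_i X_i'
        step = np.linalg.solve(hessian, score)
        beta = beta - step                     # beta_{n+1} = beta_n - H^{-1} s
        if np.max(np.abs(step)) < tol:
            break
    # approximate standard errors from the inverse negative Hessian
    se = np.sqrt(np.diag(np.linalg.inv(-hessian)))
    return beta, se

# Simulated example (hypothetical data): intercept plus one regressor
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
true_beta = np.array([0.5, -1.0])
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

beta_hat, se = logit_mle(X, y)
print(beta_hat, se)  # beta_hat should be near [0.5, -1.0]
```

At convergence the score is essentially zero, which is the first-order condition being solved numerically.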


2. The case of the Probit Model

Here, we assume,

\[f(x) = \frac{1}{\sqrt{2\pi}}e^{- \frac{1}{2}x^{2}}, \qquad -\infty < x < \infty\]

\[F(x) = \int_{- \infty}^{x}{f(t)\, dt}\]

Following the same procedure as for the logit model, but with these normal densities (for which \(f'(x) = -xf(x)\), so the score and Hessian take different forms), we can get the required \({\widehat{\beta}}_{ML}\) for the probit model.
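Equivalently, one can maximise the log-likelihood (1) with the standard normal CDF directly using a general-purpose optimiser. A sketch using scipy; the simulated data are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_negloglik(beta, X, y):
    """Negative of the log-likelihood (1) with F = standard normal CDF."""
    p = norm.cdf(X @ beta)
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Simulated example (hypothetical data): intercept plus one regressor
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
true_beta = np.array([0.3, 0.8])
y = (rng.uniform(size=500) < norm.cdf(X @ true_beta)).astype(float)

res = minimize(probit_negloglik, np.zeros(2), args=(X, y), method="BFGS")
print(res.x)  # should be near [0.3, 0.8]
```

BFGS approximates the Hessian internally, so only the objective is needed; a hand-coded Newton-Raphson with the probit score and Hessian would follow the same pattern as the logit code above.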



Suggested Readings:

  1. Fomby, T. B. (2010). Maximum Likelihood Estimation of Logit and Probit Models. SMU Department of Economics.
  2. Taboga, M. (2021). Logistic classification model (logit or logistic regression). Fundamentals of Statistics.

Suggested Citation: Meitei, W. B. (2025). Maximum Likelihood Estimation of Regression with Binary Response. WBM STATS.
