
Thursday, 13 November 2025

Maximum Likelihood Estimation of Regression with Binary Response

by W. B. Meitei, PhD


In the logit or probit model, the outcome variable \(y_{i}\) is a Bernoulli random variable, i.e.,

\[y_{i} = \begin{cases} 1 & \text{with probability } p_{i} \\ 0 & \text{with probability } 1 - p_{i} \end{cases}\]

Thus, if n observations are available, then the likelihood function is,

\[L = \prod_{i = 1}^{n}{p_{i}^{y_{i}}\left( 1 - p_{i} \right)^{1 - y_{i}}}\]

The logit or probit model arises when \(p_{i}\) is specified to be the logistic or standard normal cumulative distribution function evaluated at \(X_{i}'\beta\). Letting \(F(X_{i}'\beta)\) denote this cumulative distribution function, the likelihood function of both models is,

\[L = \prod_{i = 1}^{n}{{F(X_{i}'\beta)}^{y_{i}}\left\{ 1 - F(X_{i}'\beta) \right\}^{1 - y_{i}}}\]

Thus, the log-likelihood function is,

\[\ln(L) = l = \sum_{i = 1}^{n}\left\lbrack y_{i}\ln\left( F(X_{i}'\beta) \right) + \left( 1 - y_{i} \right)\ln\left\{ 1 - F(X_{i}'\beta) \right\} \right\rbrack\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)\]

Now, the first-order conditions for maximizing \(\ln(L)\) are nonlinear in β and have no closed-form solution. Therefore, the maximum likelihood estimates have to be obtained using numerical optimization methods.

Using a numerical optimization method such as the Newton-Raphson method, we can obtain the maximum likelihood estimate of β recursively as,

\[{\widehat{\beta}}_{n + 1} = {\widehat{\beta}}_{n} - \left\lbrack \frac{d^{2}l}{d\beta^{2}} \right\rbrack_{\beta = {\widehat{\beta}}_{n}}^{- 1}\left\lbrack \frac{dl}{d\beta} \right\rbrack_{\beta = {\widehat{\beta}}_{n}}\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)\]

Here, \({\widehat{\beta}}_{n}\) is the nth-round estimate, and the Hessian \(\left( \frac{d^{2}l}{d\beta^{2}} \right)\) and the score vector \(\left( \frac{dl}{d\beta} \right)\) are evaluated at this estimate.

We know from the maximum likelihood theorem that

\[\sqrt{n}\left( {\widehat{\beta}}_{ML} - \beta \right)\overset{asy}{\rightarrow}N\left\lbrack 0, - E\left( \frac{d^{2}l}{d\beta^{2}} \right)^{- 1} \right\rbrack\]

or

\[\sqrt{n}\left( {\widehat{\beta}}_{ML} - \beta \right)\overset{asy}{\rightarrow}N\left\lbrack 0,I^{- 1}(\beta) \right\rbrack\]

where \(I(\beta_{0}) = - E\left\lbrack \frac{d^{2}l}{d\beta^{2}} \right\rbrack_{\beta = \beta_{0}}\) is the Fisher information, and \(\beta_{0}\) is an interior point of the parameter space.

Now, in a finite sample, the distribution of \({\widehat{\beta}}_{ML}\) can be approximated by,

\[N\left\lbrack \beta, - \left( \frac{d^{2}l}{d\beta^{2}} \right)_{\beta = {\widehat{\beta}}_{ML}}^{- 1} \right\rbrack\]


1. The case of the Logit Model

Here, the cumulative distribution function is,

\[F(x) = \frac{1}{1 + e^{- x}}\]

And the probability density function is,

\[f(x) = \frac{e^{- x}}{\left( 1 + e^{- x} \right)^{2}}\]

Then,

\[1 - F(x) = 1 - \frac{1}{1 + e^{- x}} = \frac{e^{- x}}{1 + e^{- x}} = F( - x)\]

\[\frac{f(x)}{F(x)} = \frac{e^{- x}}{\left( 1 + e^{- x} \right)^{2}}\left( 1 + e^{- x} \right) = \frac{e^{- x}}{1 + e^{- x}} = F( - x)\]

Now,

\begin{align*} \frac{d}{dx}f(x) &= \frac{d}{dx}\frac{e^{-x}}{(1 + e^{-x})^2} \\[0.5em] &= \frac{(1 + e^{-x})^2 \frac{d}{dx}e^{-x} - e^{-x} \frac{d}{dx}(1 + e^{-x})^2}{\{(1 + e^{-x})^2\}^2} \\[0.5em] &= \frac{(1 + e^{-x})^2(-e^{-x}) - e^{-x}\left\{2(1 + e^{-x}) \frac{d}{dx}e^{-x}\right\}}{(1 + e^{-x})^4} \\[0.5em] &= \frac{-e^{-x}(1 + e^{-x})^2 + 2e^{-2x}(1 + e^{-x})}{(1 + e^{-x})^4} \\[0.5em] &= \frac{e^{-x}(1 + e^{-x})(-1 - e^{-x} + 2e^{-x})}{(1 + e^{-x})^4} \\[0.5em] &= -\frac{e^{-x}(1 - e^{-x})}{(1 + e^{-x})^3} \\[0.5em] &= -\frac{e^{-x}}{(1 + e^{-x})^2} \frac{1}{(1 + e^{-x})}(1 - e^{-x}) \\[0.5em] &= -f(x)F(x)(1 - e^{-x}) \end{align*}

Thus, the score vector for the logit model using (1) is,

\begin{align*} \frac{d l}{d \beta} &= \frac{d}{d \beta} \sum_{i=1}^{n} \left[ y_i \ln \{ F(X_i' \beta) \} + (1-y_i) \ln \{ 1 - F(X_i' \beta) \} \right] \\[0.5em] &= \frac{d}{d \beta} \sum_{i=1}^{n} \left[y_i \ln \{ F(X_i' \beta) \} + (1-y_i) \ln \{ F(-X_i' \beta) \}\right] \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{d}{d \beta} \ln \{ F(X_i' \beta) \} + \sum_{i=1}^{n} (1-y_i) \frac{d}{d \beta} \ln \{ F(-X_i' \beta) \} \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{d}{d \beta} \ln \left( \frac{1}{1+e^{-X_i' \beta}} \right) + \sum_{i=1}^{n} (1-y_i) \frac{d}{d \beta} \ln \left( \frac{1}{1+e^{X_i' \beta}} \right) \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{d}{d \beta} \left[ \ln(1) - \ln(1+e^{-X_i' \beta}) \right] + \sum_{i=1}^{n} (1-y_i) \frac{d}{d \beta} \left[ \ln(1) - \ln(1+e^{X_i' \beta}) \right] \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{d}{d \beta} \left[ -\ln(1+e^{-X_i' \beta}) \right] + \sum_{i=1}^{n} (1-y_i) \frac{d}{d \beta} \left[ -\ln(1+e^{X_i' \beta}) \right] \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{-1}{1+e^{-X_i' \beta}} \frac{d}{d \beta} (1+e^{-X_i' \beta}) + \sum_{i=1}^{n} (1-y_i) \frac{-1}{1+e^{X_i' \beta}} \frac{d}{d \beta} (1+e^{X_i' \beta}) \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{X_i e^{-X_i' \beta}}{1+e^{-X_i' \beta}} - \sum_{i=1}^{n} (1-y_i) \frac{X_i e^{X_i' \beta}}{1+e^{X_i' \beta}} \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{X_i}{1+e^{X_i' \beta}} - \sum_{i=1}^{n} (1-y_i) \frac{X_i}{1+e^{-X_i' \beta}} \\[0.5em] &= \sum_{i=1}^{n} y_i F(-X_i' \beta) X_i - \sum_{i=1}^{n} (1-y_i) F(X_i' \beta) X_i \\[0.5em] &= \sum_{i=1}^{n} y_i \{ 1 - F(X_i' \beta) \} X_i - \sum_{i=1}^{n} (1-y_i) F(X_i' \beta) X_i \\[0.5em] &= \sum_{i=1}^{n} [ y_i - y_i F(X_i' \beta) - F(X_i' \beta) + y_i F(X_i' \beta) ] X_i \\[0.5em] &= \sum_{i=1}^{n} [ y_i - F(X_i' \beta) ] X_i \end{align*}

And the Hessian for the logit model using (1) is,

\begin{align*} \frac{d^2 l}{d \beta^2} &= \frac{d}{d \beta} \frac{d l}{d \beta} \\[0.5em] &= \frac{d}{d \beta} \sum_{i=1}^n [y_i - F(X_i' \beta)] X_i \\[0.5em] &= 0 - \sum_{i=1}^n \frac{d}{d \beta} F(X_i' \beta) X_i \\[0.5em] &= -\sum_{i=1}^n \frac{e^{-X_i' \beta}}{(1 + e^{-X_i' \beta})^2} X_i X_i' \\[0.5em] &= -\sum_{i=1}^n f(X_i' \beta) X_i X_i' \end{align*}

Using the score vector and Hessian in (2) and iterating until \(\left| {\widehat{\beta}}_{n + 1} - {\widehat{\beta}}_{n} \right| < \epsilon\), we obtain the required \({\widehat{\beta}}_{ML}\).
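As an illustration, the score vector and Hessian derived above can be plugged directly into the Newton-Raphson recursion (2). The sketch below (in Python with NumPy) is a minimal implementation on simulated data; the function name `logit_mle` and all simulation settings are illustrative, not part of the derivation.

```python
import numpy as np

def logit_mle(X, y, tol=1e-8, max_iter=100):
    """Newton-Raphson recursion (2) for the logit log-likelihood (1).

    X : (n, k) design matrix (include a column of ones for the intercept)
    y : (n,) vector of 0/1 responses
    Returns the MLE of beta and its approximate standard errors.
    """
    beta = np.zeros(X.shape[1])                  # starting value
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))    # F(X_i' beta)
        score = X.T @ (y - p)                    # sum_i [y_i - F(X_i' beta)] X_i
        w = p * (1.0 - p)                        # f(X_i' beta) = F(1 - F) for the logistic
        hessian = -(X * w[:, None]).T @ X        # -sum_i f(X_i' beta) X_i X_i'
        step = np.linalg.solve(hessian, score)   # [d2l/db2]^{-1} [dl/db]
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    # Standard errors from the inverse negative Hessian at the final estimate,
    # i.e. the finite-sample normal approximation given above.
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = p * (1.0 - p)
    hessian = -(X * w[:, None]).T @ X
    se = np.sqrt(np.diag(np.linalg.inv(-hessian)))
    return beta, se

# Illustrative check on simulated data (seed and true values are arbitrary).
rng = np.random.default_rng(0)
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)
beta_hat, se = logit_mle(X, y)
```

With a few thousand observations, the recovered coefficients should sit within a few standard errors of the values used to simulate the data.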


2. The case of the Probit Model

Here, we assume,

\[f(x) = \frac{1}{\sqrt{2\pi}}e^{- \frac{1}{2}x^{2}}, \qquad -\infty < x < \infty\]

\[F(x) = \int_{- \infty}^{x}{f(t)\,dt}\]

Following the same procedure as for the logit model, we can get the required \({\widehat{\beta}}_{ML}\) for the probit model. 
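The probit score works out to \(\sum_i \frac{\{y_i - F(X_i'\beta)\} f(X_i'\beta)}{F(X_i'\beta)\{1 - F(X_i'\beta)\}} X_i\), with \(F\) and \(f\) the standard normal cdf and pdf. The observed Hessian is more involved, so the sketch below uses Fisher scoring, a standard variant of recursion (2) that replaces the negative Hessian with the expected information \(\sum_i \frac{f(X_i'\beta)^2}{F(X_i'\beta)\{1 - F(X_i'\beta)\}} X_i X_i'\). The function names and simulation settings are illustrative.

```python
import numpy as np
from math import erf, sqrt, pi

def norm_cdf(z):
    """Standard normal cdf via the error function: Phi(z) = (1 + erf(z/sqrt 2))/2."""
    return np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])

def norm_pdf(z):
    """Standard normal pdf."""
    return np.exp(-0.5 * z**2) / sqrt(2.0 * pi)

def probit_mle(X, y, tol=1e-8, max_iter=100):
    """Fisher scoring for the probit model: recursion (2) with the
    expected information in place of the observed Hessian."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        z = X @ beta
        Phi = np.clip(norm_cdf(z), 1e-10, 1.0 - 1e-10)  # guard the tails
        phi = norm_pdf(z)
        # Score: sum_i (y_i - Phi_i) phi_i / {Phi_i (1 - Phi_i)} X_i
        score = X.T @ ((y - Phi) * phi / (Phi * (1.0 - Phi)))
        # Expected information: sum_i phi_i^2 / {Phi_i (1 - Phi_i)} X_i X_i'
        w = phi**2 / (Phi * (1.0 - Phi))
        info = (X * w[:, None]).T @ X
        step = np.linalg.solve(info, score)      # beta_{n+1} = beta_n + I^{-1} s
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Illustrative check: y = 1 iff a N(0,1) draw falls below X'beta,
# so P(y = 1) = Phi(X'beta).  Seed and true values are arbitrary.
rng = np.random.default_rng(1)
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.3, -0.8])
y = (rng.normal(size=n) < X @ beta_true).astype(float)
beta_hat = probit_mle(X, y)
```

Fisher scoring and Newton-Raphson converge to the same maximizer here; the expected information is simply easier to write down for the probit model.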



Suggested Readings:

  1. Fomby, T. B. (2010). Maximum Likelihood Estimation of Logit and Probit Models. SMU Department of Economics.
  2. Taboga, M. (2021). Logistic classification model (logit or logistic regression). Fundamentals of Statistics.

Suggested Citation: Meitei, W. B. (2025). Maximum Likelihood Estimation of Regression with Binary Response. WBM STATS.

Maximum Likelihood Estimation

by W. B. Meitei, PhD


Maximum likelihood estimation is a fundamental statistical method for estimating the unknown parameters of a probability distribution from observed data. Given a parametric family of probability density (or mass) functions \(f(x;\theta)\) with parameter \(\theta \in \Theta\), the likelihood function for i.i.d. observations \(X_{1}, X_{2}, \ldots, X_{n}\) is defined as,

\[L\left( \theta;X_{1},X_{2},\ldots,X_{n} \right) = \prod_{i = 1}^{n}{f(X_{i};\theta)}\]

Or equivalently, the log likelihood function is,

\[l(\theta) = \log\left\{ L(\theta) \right\} = \sum_{i = 1}^{n}{\log\left\{ f(X_{i};\theta) \right\}}\]

MLE chooses the value \({\widehat{\theta}}_{MLE}\) that maximizes the likelihood or the log likelihood function, i.e.,

\[{\widehat{\theta}}_{MLE} = \arg\max_{\theta \in \Theta}{L\left( \theta;X_{1},X_{2},\ldots,X_{n} \right)} = \arg\max_{\theta \in \Theta}{l(\theta)}\]


Maximum likelihood estimation theorems

1. Consistency of MLE

Let \({\widehat{\theta}}_{n}\) denote the MLE based on n independent observations from a distribution with true parameter \(\theta_{0} \in \Theta\). Under standard regularity conditions (smoothness, identifiability, and existence of moments),

\[\lim_{n \rightarrow \infty}{P\left\lbrack |{\widehat{\theta}}_{n} - \theta_{0}| \leq \varepsilon \right\rbrack = 1}\]

i.e., MLE is consistent as the sample size increases.
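As a quick numerical sketch of consistency, consider i.i.d. Bernoulli(p) data, for which the MLE of p is the sample mean; the true value, seed, and sample sizes below are arbitrary choices.

```python
import numpy as np

# Bernoulli(p): the MLE of p is the sample mean.  Consistency predicts the
# estimation error |p_hat - p| shrinks as the sample size n grows.
rng = np.random.default_rng(42)
p_true = 0.3
errors = {}
for n in (100, 10_000, 1_000_000):
    draws = rng.random(n) < p_true      # n Bernoulli(p_true) draws
    p_hat = draws.mean()                # the MLE of p
    errors[n] = abs(p_hat - p_true)
```

At a million observations the error is on the order of the standard error \(\sqrt{p(1-p)/n} \approx 0.0005\), far smaller than at n = 100.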

2. Asymptotic normality

Under the same regularity conditions, the MLE is asymptotically normal,

\[\sqrt{n}\left( {\widehat{\theta}}_{n} - \theta_{0} \right)\overset{asy}{\rightarrow}N\left\lbrack 0,I^{- 1}\left( \theta_{0} \right) \right\rbrack\]

Where, \(I\left( \theta_{0} \right)\) is the Fisher Information,

\[I\left( \theta_{0} \right) = {Var}_{\theta_{0}}\left\lbrack \frac{\partial}{\partial\theta}\log\left\{ f(X_{i};\theta_{0}) \right\} \right\rbrack = - E\left\lbrack \frac{\partial^{2}}{\partial\theta^{2}}\log\left\{ f(X_{i};\theta_{0}) \right\} \right\rbrack\]

This theorem underpins the construction of asymptotic confidence intervals and hypothesis tests for MLEs.
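The two expressions for the Fisher information can be checked numerically. The sketch below uses a single Bernoulli(\(\theta\)) observation, for which \(I(\theta) = 1/\{\theta(1-\theta)\}\) in closed form; the value \(\theta = 0.3\) is arbitrary.

```python
import numpy as np

# Check the information identity I(theta) = Var[score] = -E[d^2 log f / d theta^2]
# for a single Bernoulli(theta) observation, where log f = x log(theta) + (1-x) log(1-theta).
theta = 0.3
x = np.array([0.0, 1.0])                      # support of a Bernoulli draw
probs = np.array([1.0 - theta, theta])        # P(X = 0), P(X = 1)

score = x / theta - (1.0 - x) / (1.0 - theta)                 # d log f / d theta
var_score = np.sum(probs * score**2) - np.sum(probs * score)**2

d2 = -x / theta**2 - (1.0 - x) / (1.0 - theta)**2             # d^2 log f / d theta^2
neg_exp_hess = -np.sum(probs * d2)

closed_form = 1.0 / (theta * (1.0 - theta))   # known I(theta) for the Bernoulli
```

All three quantities agree, as the identity requires (and the score has mean zero, which is why the variance reduces to the expected squared score).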

3. Efficiency

Asymptotically, the MLE achieves the Cramér-Rao lower bound, meaning it is asymptotically efficient, i.e., the variance of \({\widehat{\theta}}_{n}\) approaches the minimum possible variance for unbiased estimators as \(n \rightarrow \infty\),

\[Var\left( {\widehat{\theta}}_{n} \right) \approx I^{- 1}\left( \theta_{0} \right)\]


Regularity conditions

a. True parameter interior condition: The true parameter \(\theta_{0}\) lies in the interior of the parameter space Θ, i.e., \(\theta_{0} \in \operatorname{int}(\Theta)\).

b. Identifiability: Different parameter values correspond to different probability distributions, i.e.,

\[f\left( x;\theta_{1} \right) = f\left( x;\theta_{2} \right)\ \ \forall\ x\ \ \Rightarrow \theta_{1} = \theta_{2}\]

c. Differentiability of the likelihood: The likelihood function \(L(\theta)\), or the log likelihood function \(l(\theta)\), is continuously differentiable with respect to \(\theta\).

d. Existence of Fisher Information: The Fisher Information matrix, \(I\left( \theta_{0} \right)\) is finite and positive definite at \(\theta_{0}\).

e. Interchange of differentiation and expectation: It must be valid to differentiate under the integral,

\[\frac{\partial}{\partial\theta}\int{f(x;\theta)}\,dx = \int\frac{\partial}{\partial\theta}f(x;\theta)\,dx\]

f. Existence of third derivatives (for asymptotic normality): For the asymptotic expansions of the MLE (such as those underlying the Cramér-Rao lower bound and asymptotic normality), the log-likelihood should have continuous third derivatives, i.e.,

\[\frac{\partial^{3}}{\partial\theta_{i}\partial\theta_{j}\partial\theta_{k}}l(\theta)\]

should exist and be finite.

g. Dominance condition: There exists an integrable function \(M(x)\) such that, for all \(\theta\) in a neighbourhood of \(\theta_{0}\),

\[\left| \frac{\partial}{\partial\theta_{i}}l(\theta) \right| \leq M(x),\ \ \ \ \ \ E\left\lbrack M(x) \right\rbrack < \infty\]


For example,

For i.i.d. normal observations \(X_{i} \sim N(\mu,\sigma^{2})\), the MLEs are:

\[\widehat{\mu} = \frac{1}{n}\sum_{i = 1}^{n}X_{i}\] and \[{\widehat{\sigma}}^{2} = \frac{1}{n}\sum_{i = 1}^{n}\left( X_{i} - \widehat{\mu} \right)^{2}\]

These satisfy consistency, asymptotic normality, and efficiency as \(n \rightarrow \infty\).
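These closed forms can be verified numerically: the sketch below simulates normal data (the mean, scale, seed, and sample size are arbitrary), computes the two MLEs, and defines the log-likelihood so it can be checked that nearby parameter values do not do better.

```python
import numpy as np

# Simulate i.i.d. N(mu, sigma^2) data; the true values are arbitrary.
rng = np.random.default_rng(7)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)

mu_hat = x.mean()                        # MLE of mu: the sample mean
sigma2_hat = ((x - mu_hat)**2).mean()    # MLE of sigma^2 (note: divides by n, not n-1)

def loglik(mu, sigma2):
    """Normal log-likelihood l(mu, sigma^2) for the sample x."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + (x - mu)**2 / sigma2)
```

Perturbing either estimate away from its closed-form value lowers the log-likelihood, confirming that the pair is the maximizer.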



Suggested Readings:

  1. STATS 200. Lecture 15: Fisher information and the Cramer-Rao bound. Stanford University.
  2. DISCDown. Maximum Likelihood Estimation.
  3. Econ 620. Maximum Likelihood Estimation (MLE). Cornell University.
  4. Taboga, M. (2021). Maximum likelihood estimation. Fundamentals of Statistics.

Suggested Citation: Meitei, W. B. (2025). Maximum Likelihood Estimation. WBM STATS.