
Thursday, 13 November 2025

Maximum Likelihood Estimation of Regression with Binary Response

by W. B. Meitei, PhD


In the logit or probit model, the outcome variable \(y_{i}\) is a Bernoulli random variable, i.e.,

\[y_{i} = \left\{ \begin{matrix} 1 & with\ probability\ p_{i} \\ 0 & with\ probability\ 1 - p_{i} \end{matrix} \right.\ \]

Thus, if n observations are available, then the likelihood function is,

\[L = \prod_{i = 1}^{n}{p_{i}^{y_{i}}\left( 1 - p_{i} \right)^{1 - y_{i}}}\]

The logit or probit model arises when \(p_{i}\) is specified to be given by the logistic or normal cumulative distribution function evaluated at \(X_{i}'\beta\). Let \(F(X_{i}'\beta)\) denote this cumulative distribution function; then the likelihood function for both models is,

\[L = \prod_{i = 1}^{n}{{F(X_{i}'\beta)}^{y_{i}}\left\{ 1 - F(X_{i}'\beta) \right\}^{1 - y_{i}}}\]

Thus, the log-likelihood function is,

\[\ln(L) = l = \sum_{i = 1}^{n}\left\lbrack y_{i}\ln\left\{ F(X_{i}'\beta) \right\} + \left( 1 - y_{i} \right)\ln\left\{ 1 - F(X_{i}'\beta) \right\} \right\rbrack \tag{1}\]
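As a small side illustration (not part of the derivation), the log-likelihood (1) can be computed directly for any choice of F. The Python sketch below assumes a design matrix X of shape (n, k) and a 0/1 response vector y (hypothetical names introduced here); the clipping of the probabilities is only a numerical safeguard.

import numpy as np

def log_likelihood(beta, X, y, F):
    # Binary-response log-likelihood (1) for a user-supplied CDF F
    p = F(X @ beta)                          # p_i = F(X_i' beta)
    p = np.clip(p, 1e-12, 1 - 1e-12)         # keep probabilities away from 0 and 1
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Example choice of F: the logistic CDF used by the logit model below
logistic_cdf = lambda x: 1.0 / (1.0 + np.exp(-x))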

Now, the first-order conditions for maximising ln(L) are nonlinear in β and have no closed-form (analytic) solution. Therefore, the maximum likelihood estimates must be obtained by numerical optimisation methods.

Using a numerical optimisation method such as the Newton-Raphson method, we can obtain the maximum likelihood estimate of β iteratively as,

\[{\widehat{\beta}}_{n + 1} = {\widehat{\beta}}_{n} - \left\lbrack \frac{d^{2}l}{d\beta^{2}} \right\rbrack_{\beta = {\widehat{\beta}}_{n}}^{- 1}\left\lbrack \frac{dl}{d\beta} \right\rbrack_{\beta = {\widehat{\beta}}_{n}} \tag{2}\]

Here, \({\widehat{\beta}}_{n}\) is the estimate at the nth iteration, and the Hessian \(\left( \frac{d^{2}l}{d\beta^{2}} \right)\) and the score vector \(\left( \frac{dl}{d\beta} \right)\) are evaluated at this estimate.
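A minimal Python sketch of the update (2), assuming user-written functions score(beta) and hessian(beta) that return dl/dβ and d²l/dβ² (hypothetical names, not from the post), could look as follows; np.linalg.solve is used rather than an explicit matrix inverse purely for numerical stability.

import numpy as np

def newton_raphson(beta0, score, hessian, tol=1e-8, max_iter=100):
    # Iterate beta_{n+1} = beta_n - H(beta_n)^{-1} s(beta_n) until the change is below tol
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        beta_new = beta - np.linalg.solve(hessian(beta), score(beta))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta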

We know from the maximum likelihood theorem that

\[\sqrt{n}\left( {\widehat{\beta}}_{ML} - \beta \right)\overset{asy}{\rightarrow}N\left\lbrack 0, - E\left( \frac{d^{2}l}{d\beta^{2}} \right)^{- 1} \right\rbrack\]

or

\[\sqrt{n}\left( {\widehat{\beta}}_{ML} - \beta \right)\overset{asy}{\rightarrow}N\left\lbrack 0,I^{- 1}(\beta) \right\rbrack\]

where \(I(\beta) = - E\left( \frac{d^{2}l}{d\beta^{2}} \right)\), evaluated at \(\beta = \beta_{0}\), is the Fisher information, and \(\beta_{0}\) is an interior point of the parameter space.

Now, for a finite sample, the asymptotic distribution of \({\widehat{\beta}}_{ML}\) can be approximated by,

\[N\left\lbrack \beta, - \left( \frac{d^{2}l}{d\beta^{2}} \right)_{\beta = {\widehat{\beta}}_{ML}}^{- 1} \right\rbrack\]


1. The case of the Logit Model

Here, the cumulative distribution function is,

\[F(x) = \frac{1}{1 + e^{- x}}\]

And the probability density function is,

\[f(x) = \frac{e^{- x}}{\left( 1 + e^{- x} \right)^{2}}\]

Then,

\[1 - F(x) = 1 - \frac{1}{1 + e^{- x}} = \frac{e^{- x}}{1 + e^{- x}} = F( - x)\]

\[\frac{f(x)}{F(x)} = \frac{e^{- x}}{\left( 1 + e^{- x} \right)^{2}}\left( 1 + e^{- x} \right) = \frac{e^{- x}}{1 + e^{- x}} = F( - x)\]

Now,

\begin{align} \frac{d}{dx}f(x) &= \frac{d}{dx}\frac{e^{-x}}{(1 + e^{-x})^2} \\[0.5em] &= \frac{(1 + e^{-x})^2 \frac{d}{dx}e^{-x} - e^{-x} \frac{d}{dx}(1 + e^{-x})^2}{\{(1 + e^{-x})^2\}^2} \\[0.5em] &= \frac{(1 + e^{-x})^2(-e^{-x}) - e^{-x}\left\{2(1 + e^{-x}) \frac{d}{dx}e^{-x}\right\}}{(1 + e^{-x})^4} \\[0.5em] &= \frac{-e^{-x}(1 + e^{-x})^2 + 2e^{-2x}(1 + e^{-x})}{(1 + e^{-x})^4} \\[0.5em] &= \frac{e^{-x}(1 + e^{-x})(-1 - e^{-x} + 2e^{-x})}{(1 + e^{-x})^4} \\[0.5em] &= -\frac{e^{-x}(1 - e^{-x})}{(1 + e^{-x})^3} \\[0.5em] &= -\frac{e^{-x}}{(1 + e^{-x})^2} \frac{1}{1 + e^{-x}}(1 - e^{-x}) \\[0.5em] &= -f(x)F(x)(1 - e^{-x}) \end{align}
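This identity can be checked numerically (a purely illustrative sketch, not part of the derivation) by comparing the closed form \(-f(x)F(x)(1 - e^{-x})\) with a central finite-difference approximation of \(f'(x)\):

import numpy as np

F = lambda x: 1.0 / (1.0 + np.exp(-x))            # logistic CDF
f = lambda x: np.exp(-x) / (1.0 + np.exp(-x))**2  # logistic density

x = np.linspace(-4.0, 4.0, 9)
h = 1e-6
finite_diff = (f(x + h) - f(x - h)) / (2 * h)     # numerical derivative of f
closed_form = -f(x) * F(x) * (1 - np.exp(-x))     # expression derived above
print(np.max(np.abs(finite_diff - closed_form)))  # should be negligibly small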

Thus, the score vector for the logit model using (1) is,

\begin{align*} \frac{d l}{d \beta} &= \frac{d}{d \beta} \sum_{i=1}^{n} \left[ y_i \ln \{ F(X_i' \beta) \} + (1-y_i) \ln \{ 1 - F(X_i' \beta) \} \right] \\[0.5em] &= \frac{d}{d \beta} \sum_{i=1}^{n} \left[y_i \ln \{ F(X_i' \beta) \} + (1-y_i) \ln \{ F(-X_i' \beta) \}\right] \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{d}{d \beta} \ln \{ F(X_i' \beta) \} + \sum_{i=1}^{n} (1-y_i) \frac{d}{d \beta} \ln \{ F(-X_i' \beta) \} \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{d}{d \beta} \ln \left( \frac{1}{1+e^{-X_i' \beta}} \right) + \sum_{i=1}^{n} (1-y_i) \frac{d}{d \beta} \ln \left( \frac{1}{1+e^{X_i' \beta}} \right) \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{d}{d \beta} \left[ \ln(1) - \ln(1+e^{-X_i' \beta}) \right] + \sum_{i=1}^{n} (1-y_i) \frac{d}{d \beta} \left[ \ln(1) - \ln(1+e^{X_i' \beta}) \right] \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{d}{d \beta} \left[ -\ln(1+e^{-X_i' \beta}) \right] + \sum_{i=1}^{n} (1-y_i) \frac{d}{d \beta} \left[ -\ln(1+e^{X_i' \beta}) \right] \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{-1}{1+e^{-X_i' \beta}} \frac{d}{d \beta} (1+e^{-X_i' \beta}) + \sum_{i=1}^{n} (1-y_i) \frac{-1}{1+e^{X_i' \beta}} \frac{d}{d \beta} (1+e^{X_i' \beta}) \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{X_i e^{-X_i' \beta}}{1+e^{-X_i' \beta}} - \sum_{i=1}^{n} (1-y_i) \frac{X_i e^{X_i' \beta}}{1+e^{X_i' \beta}} \\[0.5em] &= \sum_{i=1}^{n} y_i \frac{X_i}{1+e^{X_i' \beta}} - \sum_{i=1}^{n} (1-y_i) \frac{X_i}{1+e^{-X_i' \beta}} \\[0.5em] &= \sum_{i=1}^{n} y_i F(-X_i' \beta) X_i - \sum_{i=1}^{n} (1-y_i) F(X_i' \beta) X_i \\[0.5em] &= \sum_{i=1}^{n} y_i \{ 1 - F(X_i' \beta) \} X_i - \sum_{i=1}^{n} (1-y_i) F(X_i' \beta) X_i \\[0.5em] &= \sum_{i=1}^{n} [ y_i - y_i F(X_i' \beta) - F(X_i' \beta) + y_i F(X_i' \beta) ] X_i \\[0.5em] &= \sum_{i=1}^{n} [ y_i - F(X_i' \beta) ] X_i \end{align*}

And the Hessian for the logit model using (1) is,

\begin{align*} \frac{d^2 l}{d \beta^2} &= \frac{d}{d \beta} \frac{d l}{d \beta} \\[0.5em] &= \frac{d}{d \beta} \sum_{i=1}^n [y_i - F(X_i' \beta)] X_i \\[0.5em] &= 0 - \sum_{i=1}^n \frac{d}{d \beta} F(X_i' \beta) X_i \\[0.5em] &= -\sum_{i=1}^n \frac{e^{-X_i' \beta}}{(1 + e^{-X_i' \beta})^2} X_i X_i' \\[0.5em] &= -\sum_{i=1}^n f(X_i' \beta) X_i X_i' \end{align*}

Substituting this score vector and Hessian into (2) and iterating until \(\left| {\widehat{\beta}}_{n + 1} - {\widehat{\beta}}_{n} \right| < \epsilon\), we obtain the required \({\widehat{\beta}}_{ML}\).
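A minimal end-to-end sketch in Python, using simulated (hypothetical) data purely for illustration, is given below. It plugs the derived score and Hessian into the Newton-Raphson update (2) and reports standard errors from the inverse of the negative Hessian, in line with the finite-sample approximation above.

import numpy as np

def logit_mle(X, y, tol=1e-8, max_iter=100):
    # Newton-Raphson for the logit model with the score and Hessian derived above
    F = lambda z: 1.0 / (1.0 + np.exp(-z))             # logistic CDF
    f = lambda z: np.exp(-z) / (1.0 + np.exp(-z))**2   # logistic density
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = F(X @ beta)
        score = X.T @ (y - p)                          # sum_i [y_i - F(X_i'b)] X_i
        hessian = -(X * f(X @ beta)[:, None]).T @ X    # -sum_i f(X_i'b) X_i X_i'
        beta_new = beta - np.linalg.solve(hessian, score)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    hessian = -(X * f(X @ beta)[:, None]).T @ X        # Hessian at the final estimate
    se = np.sqrt(np.diag(np.linalg.inv(-hessian)))     # SEs from -(Hessian)^{-1}
    return beta, se

# Hypothetical simulated data, only to demonstrate the iteration
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
beta_hat, se_hat = logit_mle(X, y)
print(beta_hat, se_hat)    # estimates should be close to (-0.5, 1.0)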


2. The case of the Probit Model

Here, we assume,

\[f(x) = \frac{1}{\sqrt{2\pi}}e^{- \frac{1}{2}x^{2}},\ \ \ \text{where } - \infty < x < \infty\]

\[F(x) = \int_{- \infty}^{x}{f(t)\,dt}\]

Following the same procedure as for the logit model, with F and f replaced by the standard normal cumulative distribution and density functions, we can get the required \({\widehat{\beta}}_{ML}\) for the probit model.
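A brief sketch of the probit fit is given below. Instead of hand-coding the probit score and Hessian, it simply maximises the log-likelihood (1) with the standard normal CDF by passing the negative log-likelihood to a general-purpose optimiser (scipy.optimize.minimize); the simulated data are hypothetical and only for illustration.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_negloglik(beta, X, y):
    # Negative of the log-likelihood (1) with F = standard normal CDF
    p = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical simulated data for demonstration
rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([0.3, -0.8])
y = (rng.uniform(size=n) < norm.cdf(X @ true_beta)).astype(float)

result = minimize(probit_negloglik, x0=np.zeros(X.shape[1]), args=(X, y), method="BFGS")
print(result.x)    # estimates should be close to (0.3, -0.8)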



Suggested Readings:

  1. Fomby, T. B. (2010). Maximum Likelihood Estimation of Logit and Probit Models. SMU Department of Economics.
  2. Taboga, M. (2021). Logistic classification model (logit or logistic regression). Fundamentals of Statistics.

Suggested Citation: Meitei, W. B. (2025). Maximum Likelihood Estimation of Regression with Binary Response. WBM STATS.

Maximum Likelihood Estimation

by W. B. Meitei, PhD


Maximum likelihood estimation is a fundamental statistical method used to estimate the unknown parameters of a probability distribution based on observed data. Given a parametric family of probability density (or mass) functions \(f(x;\theta)\) with parameter \(\theta \in \Theta\), the likelihood function for i.i.d. observations \(X_{1}, X_{2}, \ldots, X_{n}\) is defined as,

\[L\left( \theta;X_{1},X_{2},\ldots,X_{n} \right) = \prod_{i = 1}^{n}{f(X_{i};\theta)}\]

Or equivalently, the log likelihood function is,

\[l(\theta) = \log\left\{ L(\theta) \right\} = \sum_{i = 1}^{n}{\log\left\{ f(X_{i};\theta) \right\}}\]

MLE chooses the value \({\widehat{\theta}}_{MLE}\) that maximises the likelihood or the log likelihood function, i.e.,

\[{\widehat{\theta}}_{MLE} = \arg\max_{\theta \in \Theta} L\left( \theta;X_{1},X_{2},\ldots,X_{n} \right) = \arg\max_{\theta \in \Theta} l(\theta)\]
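As a toy illustration of this definition (an example added here, not from the text), the Python sketch below maximises the Bernoulli log-likelihood numerically and compares the result with the closed-form MLE, the sample mean.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
x = rng.binomial(1, 0.3, size=500)            # i.i.d. Bernoulli(0.3) sample

def negloglik(p):
    # Negative Bernoulli log-likelihood for success probability p
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

res = minimize_scalar(negloglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())                        # numerical MLE vs closed form (sample mean)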


Maximum likelihood estimation theorems

1. Consistency of MLE

Let \({\widehat{\theta}}_{n}\) denote the MLE based on n independent observations from a distribution with true parameter \(\theta_{0} \in \Theta\). Under standard regularity conditions (smoothness, identifiability, and existence of moments),

\[\lim_{n \rightarrow \infty}{P\left\lbrack |{\widehat{\theta}}_{n} - \theta_{0}| \leq \varepsilon \right\rbrack = 1}\]

for every \(\varepsilon > 0\); i.e., the MLE is consistent: it converges in probability to \(\theta_{0}\) as the sample size increases.

2. Asymptotic normality

Under the same regularity conditions, the MLE is asymptotically normal,

\[\sqrt{n}\left( {\widehat{\theta}}_{n} - \theta_{0} \right)\overset{asy}{\rightarrow}N\left\lbrack 0,I^{- 1}\left( \theta_{0} \right) \right\rbrack\]

where \(I\left( \theta_{0} \right)\) is the Fisher information,

\[I\left( \theta_{0} \right) = {Var}_{\theta_{0}}\left\lbrack \frac{\partial}{\partial\theta}\log\left\{ f(X_{i};\theta_{0}) \right\} \right\rbrack = - E\left\lbrack \frac{\partial^{2}}{\partial\theta^{2}}\log\left\{ f(X_{i};\theta_{0}) \right\} \right\rbrack\]

This theorem underpins the construction of asymptotic confidence intervals and hypothesis tests for MLEs.
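As a simple worked example (added here for illustration), for a single Bernoulli(θ) observation with \(f(x;\theta) = \theta^{x}(1 - \theta)^{1 - x}\), the two expressions for the Fisher information agree:

\[\frac{\partial}{\partial\theta}\log\left\{ f(X_{i};\theta) \right\} = \frac{X_{i}}{\theta} - \frac{1 - X_{i}}{1 - \theta},\ \ \ \ \ \frac{\partial^{2}}{\partial\theta^{2}}\log\left\{ f(X_{i};\theta) \right\} = - \frac{X_{i}}{\theta^{2}} - \frac{1 - X_{i}}{(1 - \theta)^{2}}\]

\[I(\theta) = - E\left\lbrack \frac{\partial^{2}}{\partial\theta^{2}}\log\left\{ f(X_{i};\theta) \right\} \right\rbrack = \frac{\theta}{\theta^{2}} + \frac{1 - \theta}{(1 - \theta)^{2}} = \frac{1}{\theta(1 - \theta)}\]

and the variance of the score gives the same value, since \({Var}_{\theta}\left( \frac{X_{i}}{\theta} - \frac{1 - X_{i}}{1 - \theta} \right) = \frac{\theta(1 - \theta)}{\theta^{2}(1 - \theta)^{2}} = \frac{1}{\theta(1 - \theta)}\).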

3. Efficiency

Asymptotically, the MLE achieves the Cramér-Rao lower bound, meaning it is asymptotically efficient, i.e., the variance of \({\widehat{\theta}}_{n}\) approaches the minimum possible variance for unbiased estimators as \(n \rightarrow \infty\).

\[Var\left( {\widehat{\theta}}_{n} \right) \approx \frac{1}{n}I^{- 1}\left( \theta_{0} \right)\]


Regularity conditions

a. True parameter interior condition

The true parameter \(\theta_{0}\) lies in the interior of the parameter space Θ, i.e., \(\theta_{0} \in \operatorname{int}(\Theta)\).

b. Identifiability

Different parameter values correspond to different probability distributions, i.e.,

\[f\left( x;\theta_{1} \right) = f\left( x;\theta_{2} \right)\ \ \forall\ x\ \ \Rightarrow\ \ \theta_{1} = \theta_{2}\]

c. Differentiability of the likelihood

The likelihood function \(L(\theta)\), or equivalently the log-likelihood function \(l(\theta)\), is continuously differentiable with respect to \(\theta\).

d. Existence of Fisher Information

The Fisher Information matrix, \(I\left( \theta_{0} \right)\) is finite and positive definite at \(\theta_{0}\).

e. Interchange of differentiation and expectation

It must be valid to differentiate under the integral,

\[\frac{\partial}{\partial\theta}\int_{}^{}{f(x;\theta)}dx = \int_{}^{}\frac{\partial}{\partial\theta}f(x;\theta)\ dx\]

f. Existence of third derivatives (for asymptotic normality)

For the asymptotic expansions of the MLE (used, for example, in establishing asymptotic normality and the Cramér-Rao bound), the log-likelihood should have continuous third derivatives, i.e.,

\[\frac{\partial^{3}}{\partial\theta_{i}\partial\theta_{j}\partial\theta_{k}}l(\theta)\]

exists and is finite.

g. Dominance condition

There exists an integrable function \(M(x)\) such that, for all \(\theta\) in a neighbourhood of \(\theta_{0}\),

\[\left| \frac{\partial}{\partial\theta_{i}}\log\left\{ f(x;\theta) \right\} \right| \leq M(x),\ \ \ \ \ \ E\left\lbrack M(X) \right\rbrack < \infty\]


For example,

For i.i.d. normal observations \(X_{i} \sim N(\mu,\sigma^{2})\), the MLEs are:

\[\widehat{\mu} = \frac{1}{n}\sum_{i = 1}^{n}X_{i}\ \ \ \ \text{and}\ \ \ \ {\widehat{\sigma}}^{2} = \frac{1}{n}\sum_{i = 1}^{n}\left( X_{i} - \widehat{\mu} \right)^{2}\]

These satisfy consistency, asymptotic normality, and efficiency as \(n \rightarrow \infty\).
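A short numerical check (an illustrative addition, using simulated data) computes the two estimators directly; note that the MLE of σ² divides by n, which corresponds to NumPy's variance with ddof=0.

import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)    # i.i.d. N(2, 1.5^2) sample

mu_hat = x.sum() / x.size                          # MLE of mu: the sample mean
sigma2_hat = np.sum((x - mu_hat) ** 2) / x.size    # MLE of sigma^2: divides by n

print(mu_hat, sigma2_hat)                                              # close to 2.0 and 2.25
print(np.isclose(mu_hat, x.mean()), np.isclose(sigma2_hat, x.var()))   # both True (np.var uses ddof=0)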



Suggested Readings:

  1. STATS 200. Lecture 15: Fisher information and the Cramér-Rao bound. Stanford University.
  2. DISCDown. Maximum Likelihood Estimation.
  3. Econ 620. Maximum Likelihood Estimation (MLE). Cornell University.
  4. Taboga, M. (2021). Maximum likelihood estimation. Fundamentals of Statistics.

Suggested Citation: Meitei, W. B. (2025). Maximum Likelihood Estimation. WBM STATS.