Logistic regression is a well known technique in statistics. Here we show how we can solve the Maximum Likelihood estimation problem in GAMS. We use the data from http://www.ats.ucla.edu/stat/r/dae/logit.htm.
Of course, here we don’t have a black box logistic regression tool so we need to do a few extra steps. First we need to add an explicit constant term to the data, and expand a categorical variable into a number of dummy variables. The second part is to write down the log likelihood function. The resulting model looks like:
Usually this is written down as an unconstrained optimization problem. In such a formulation we would repeat some linear expressions that I rather substitute out. Largely for aesthetical reasons. This adds a linear constraint block but makes the problem somewhat less nonlinear – depending how you count: the nonlinear code is smaller but the number of nonlinear nonzero elements is larger. For small problems it does not really matter: NLP solvers handle both versions quite well (more details on this below). Yet another way to solve this is to write down the first-order conditions and solve these. I suspect this is how glm inside R works.
Usually for NLP models like this I’d like to specify a starting point for the nonlinear variables. Here it seems that the default initial values bx(i)=0 are just fine.
The GAMS output is:
---- 464 VARIABLE beta.L coefficients to estimate
const -3.98998, gre 0.00226, gpa 0.80404, rank2 -0.67544
With R’s glm command we get the same results:
Using a little bit of extra code we can calculate some useful statistics. Just as R does. In fact we try to mimic their output. Here is the GAMS code:
We use a small CNS model (square nonlinear system solver) to invert the information matrix. The right-hand side of equation inverter is actually an identity matrix (ones on the diagonal, zero everywhere else). We solve here a linear system of equations with a non-linear solver. That is because GAMS does not have model type to solve a linear system of equations. It could be done with an LP solver and a dummy objective. As the size of the matrix is small, we choose for the shortest GAMS code. The final results when displaying the results parameter look like:
---- 487 PARAMETER results
Estimate Std. Error z value Pr(>|z|)
const -3.98998 1.13995 -3.50013 0.00047
The estimates are our beta’s. The standard errors are the square root of the diagonal elements of the inverse of the information matrix. The z-values are just the result of dividing the estimate by its standard error. Finally the p-values are found using the standard normal cdf function. This looks just like the R summary output (apart from the stars indicating the significance level):
It is actually quite instructive to do this exercise to understand a little bit better what R is reporting when doing a logistic regression. High-performance, convenient and well-tested black-box routines are great to use. But once in a while it may make sense to go back to basics and work on the problem directly.
Comparison between unconstrained and linearly constrained formulation
I was a little bit wondering about the relative performance of these two different formulations. Here is a summary:
For most solvers it is a wash.