Sunday, February 1, 2015

Logistic Regression

Logistic regression is a well-known technique in statistics. Here we show how we can solve the maximum likelihood estimation problem in GAMS. We use a well-known admission data set with predictors gre, gpa, and a categorical rank variable.

Of course, here we don’t have a black-box logistic regression tool, so we need a few extra steps. First we need to add an explicit constant term to the data and expand the categorical variable into a number of dummy variables. The second step is to write down the log-likelihood function. The resulting model looks like:


Usually this is written down as an unconstrained optimization problem. In such a formulation we would repeat some linear expressions that I would rather substitute out, largely for aesthetic reasons. This adds a linear constraint block but makes the problem somewhat less nonlinear, depending on how you count: the nonlinear code is smaller, but the number of nonlinear nonzero elements is larger. For small problems it does not really matter: NLP solvers handle both versions quite well (more details on this below). Yet another way to solve this is to write down the first-order conditions and solve those. I suspect this is essentially what glm inside R does.
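As a cross-check of the formulation, here is a minimal sketch in Python (NumPy/SciPy, with a small synthetic data set as a stand-in for the admission data) that maximizes the same log-likelihood, writing the linear expressions v = Xβ only once:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-in data: a design matrix X with an explicit constant
# column, and 0/1 outcomes y generated from a known coefficient vector.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([-0.5, 1.0, -1.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

def negloglik(beta):
    v = X @ beta  # the linear expressions the constrained formulation substitutes out
    # log L = sum_i [ y_i*v_i - log(1 + exp(v_i)) ]; logaddexp is a stable log(1+exp(.))
    return -np.sum(y * v - np.logaddexp(0.0, v))

# beta = 0 as starting point, as in the GAMS model
res = minimize(negloglik, np.zeros(X.shape[1]), method="BFGS")
print(res.x)
```

This is the unconstrained version; the constrained one would add v as extra variables with the linear equation v = Xβ.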

Usually for NLP models like this I’d like to specify a starting point for the nonlinear variables. Here it seems that the default initial values bx(i)=0 are just fine.

The GAMS output is:

----    464 VARIABLE beta.L  coefficients to estimate

const -3.98998,    gre    0.00226,    gpa    0.80404,    rank2 -0.67544
rank3 -1.34020,    rank4 -1.55146

With R’s glm command we get the same results:



Using a little bit of extra code we can calculate some useful statistics, just as R does. In fact, we try to mimic its output. Here is the GAMS code:


We use a small CNS model (square nonlinear system solver) to invert the information matrix. The right-hand side of equation inverter is actually an identity matrix (ones on the diagonal, zeros everywhere else). We are thus solving a linear system of equations with a nonlinear solver; that is because GAMS does not have a model type for linear systems of equations. It could also be done with an LP solver and a dummy objective, but as the matrix is small, we opt for the shortest GAMS code. The final results when displaying the results parameter look like:
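For reference, the same computation can be sketched in Python (with a hypothetical X and fitted beta, not the GAMS data): for logistic regression the information matrix is I = XᵀWX with W = diag(pᵢ(1−pᵢ)), and inverting it amounts to solving I·C = Identity, exactly the linear system the CNS model solves:

```python
import numpy as np

# Stand-in design matrix and fitted coefficients (not the admission data).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
beta_hat = np.array([-0.5, 1.0, -1.0])

p = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))   # fitted probabilities
w = p * (1.0 - p)
info = X.T @ (w[:, None] * X)               # information matrix X' W X

# "Invert" by solving info @ cov = Identity, as the CNS model does
cov = np.linalg.solve(info, np.eye(info.shape[0]))
std_err = np.sqrt(np.diag(cov))
print(std_err)
```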

----    487 PARAMETER results 

         Estimate  Std. Error     z value    Pr(>|z|)

const    -3.98998     1.13995    -3.50013     0.00047
gre       0.00226     0.00109     2.06986     0.03847
gpa       0.80404     0.33182     2.42312     0.01539
rank2    -0.67544     0.31649    -2.13417     0.03283
rank3    -1.34020     0.34531    -3.88120     0.00010
rank4    -1.55146     0.41783    -3.71313     0.00020

The estimates are our betas. The standard errors are the square roots of the diagonal elements of the inverse of the information matrix. The z-values are just the estimates divided by their standard errors. Finally, the p-values are found using the standard normal cdf. This looks just like the R summary output (apart from the stars indicating the significance level):
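For instance, the z and p values can be reproduced from the estimate and standard error alone; here for the const row, using only the Python standard library:

```python
from math import erf, sqrt

def norm_cdf(x):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# One row of the table above: the const coefficient
est, se = -3.98998, 1.13995
z = est / se                          # z value: estimate / standard error
p = 2.0 * (1.0 - norm_cdf(abs(z)))   # two-sided p value from the normal cdf
print(z, p)
```

(The last digit can differ slightly from the table, which is computed from the unrounded estimate.)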

             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *
gpa          0.804038   0.331819   2.423 0.015388 *
rank2       -0.675443   0.316490  -2.134 0.032829 *
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

It is actually quite instructive to do this exercise to understand a little better what R is reporting when doing a logistic regression. High-performance, convenient, and well-tested black-box routines are great to use, but once in a while it may make sense to go back to basics and work on the problem directly.

Comparison between unconstrained and linearly constrained formulation

I was curious about the relative performance of these two formulations. Here is a summary:


For most solvers it is a wash.