Yet Another Math Programming Consultant: Logistic Regression

Sunday, February 1, 2015

Logistic Regression

Logistic regression is a well known technique in statistics. Here we show how we can solve the Maximum Likelihood estimation problem in GAMS. We use the data from http://www.ats.ucla.edu/stat/r/dae/logit.htm.

Of course, here we don’t have a black box logistic regression tool so we need to do a few extra steps. First we need to add an explicit constant term to the data, and expand a categorical variable into a number of dummy variables. The second part is to write down the log likelihood function. The resulting model looks like:

Usually this is written down as an unconstrained optimization problem. In such a formulation we would repeat some linear expressions that I rather substitute out. Largely for aesthetical reasons. This adds a linear constraint block but makes the problem somewhat less nonlinear – depending how you count: the nonlinear code is smaller but the number of nonlinear nonzero elements is larger. For small problems it does not really matter: NLP solvers handle both versions quite well (more details on this below). Yet another way to solve this is to write down the first-order conditions and solve these. I suspect this is how glm inside R works.

Usually for NLP models like this I’d like to specify a starting point for the nonlinear variables. Here it seems that the default initial values bx(i)=0 are just fine.

The GAMS output is:

---- 464 VARIABLE beta.L coefficients to estimate

const -3.98998, gre 0.00226, gpa 0.80404, rank2 -0.67544
rank3 -1.34020, rank4 -1.55146

With R’s glm command we get the same results:

Statistics

Using a little bit of extra code we can calculate some useful statistics. Just as R does. In fact we try to mimic their output. Here is the GAMS code:

We use a small CNS model (square nonlinear system solver) to invert the information matrix. The right-hand side of equation inverter is actually an identity matrix (ones on the diagonal, zero everywhere else). We solve here a linear system of equations with a non-linear solver. That is because GAMS does not have model type to solve a linear system of equations. It could be done with an LP solver and a dummy objective. As the size of the matrix is small, we choose for the shortest GAMS code. The final results when displaying the results parameter look like:

---- 487 PARAMETER results

Estimate Std. Error z value Pr(>|z|)

const    -3.98998     1.13995    -3.50013     0.00047
gre       0.00226     0.00109     2.06986     0.03847
gpa       0.80404     0.33182     2.42312     0.01539
rank2    -0.67544     0.31649    -2.13417     0.03283
rank3    -1.34020     0.34531    -3.88120     0.00010
rank4    -1.55146     0.41783    -3.71313     0.00020

The estimates are our beta’s. The standard errors are the square root of the diagonal elements of the inverse of the information matrix. The z-values are just the result of dividing the estimate by its standard error. Finally the p-values are found using the standard normal cdf function. This looks just like the R summary output (apart from the stars indicating the significance level):

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *  
gpa          0.804038   0.331819   2.423 0.015388 *  
rank2       -0.675443   0.316490  -2.134 0.032829 *  
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

It is actually quite instructive to do this exercise to understand a little bit better what R is reporting when doing a logistic regression. High-performance, convenient and well-tested black-box routines are great to use. But once in a while it may make sense to go back to basics and work on the problem directly.