Logistic regression is a well known technique in statistics. Here we show how we can solve the Maximum Likelihood estimation problem in GAMS. We use the data from http://www.ats.ucla.edu/stat/r/dae/logit.htm.

Of course, here we don’t have a black box logistic regression tool so we need to do a few extra steps. First we need to add an explicit constant term to the data, and expand a categorical variable into a number of dummy variables. The second part is to write down the log likelihood function. The resulting model looks like:

Usually this is written down as an unconstrained optimization problem. In such a formulation we would repeat some linear expressions that I rather substitute out. Largely for aesthetical reasons. This adds a linear constraint block but makes the problem somewhat less nonlinear – depending how you count: the nonlinear code is smaller but the number of nonlinear nonzero elements is larger. For small problems it does not really matter: NLP solvers handle both versions quite well (more details on this below). Yet another way to solve this is to write down the first-order conditions and solve these. I suspect this is how **glm** inside R works.

Usually for NLP models like this I’d like to specify a starting point for the nonlinear variables. Here it seems that the default initial values bx(i)=0 are just fine.

The GAMS output is:

---- 464 VARIABLE beta.L coefficients to estimate const -3.98998, gre 0.00226, gpa 0.80404, rank2 -0.67544 |

With R’s **glm** command we get the same results:

#### Statistics

Using a little bit of extra code we can calculate some useful statistics. Just as R does. In fact we try to mimic their output. Here is the GAMS code:

We use a small **CNS** model (square nonlinear system solver) to invert the information matrix. The right-hand side of equation **inverter** is actually an identity matrix (ones on the diagonal, zero everywhere else). We solve here a linear system of equations with a non-linear solver. That is because GAMS does not have model type to solve a linear system of equations. It could be done with an LP solver and a dummy objective. As the size of the matrix is small, we choose for the shortest GAMS code. The final results when displaying the **results** parameter look like:

---- 487 PARAMETER results Estimate Std. Error z value Pr(>|z|) const -3.98998 1.13995 -3.50013 0.00047 |

The estimates are our beta’s. The standard errors are the square root of the diagonal elements of the inverse of the information matrix. The z-values are just the result of dividing the estimate by its standard error. Finally the p-values are found using the standard normal cdf function. This looks just like the R summary output (apart from the stars indicating the significance level):

Coefficients: |

It is actually quite instructive to do this exercise to understand a little bit better what R is reporting when doing a logistic regression. High-performance, convenient and well-tested black-box routines are great to use. But once in a while it may make sense to go back to basics and work on the problem directly.

#### Comparison between unconstrained and linearly constrained formulation

I was a little bit wondering about the relative performance of these two different formulations. Here is a summary:

For most solvers it is a wash.

## No comments:

## Post a Comment