Tuesday, March 31, 2015

Reliability of NLP solvers

NLP solvers can fail for a number of reasons. One way to increase the reliability of a production run is to use a back-up solver in case the main solver fails. In GAMS, primal and dual information is passed between SOLVE statements even if different solvers are used. This turned out to be very useful in a case I just observed:

Solver          Status
--------------  --------------------------
mosek           Near Optimal
ipopt           Solved to Acceptable Level
mosek + ipopt   Optimal Solution Found

Here both Mosek and IPOPT get into trouble and are not able to give us an optimal solution (in this case it was close, but I have seen cases where “near optimal” could be interpreted as “far away from optimal”). But the sequence of first Mosek, then IPOPT warm-started from Mosek's point, actually solves the problem to optimality.

We programmed this essentially as:

option nlp=mosek;
solve allocationmodel minimizing entropy using nlp;

* fall back to ipopt if we did not get a (locally) optimal solution
if (allocationmodel.modelstat > 2,
    option nlp=ipopt;
    solve allocationmodel minimizing entropy using nlp;
);
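The same fail-over pattern can be sketched outside GAMS as well. Below is a toy Python illustration; the gradient-descent "solvers" and all function names are mine, purely to show the mechanics of trying a back-up solver warm-started from the main solver's final point:

```python
def gradient_descent(f_grad, x, steps, lr=0.1, tol=1e-8):
    """Toy 'solver': fixed-step gradient descent on a 1-d function."""
    for _ in range(steps):
        g = f_grad(x)
        if abs(g) < tol:           # converged: report success
            return x, True
        x -= lr * g
    return x, False                # hit iteration limit: report failure

def solve_with_fallback(solvers, x0):
    """Try each solver in turn; each one warm-starts from the
    previous solver's final (possibly non-optimal) point."""
    x, ok = x0, False
    for solve in solvers:
        x, ok = solve(x)
        if ok:
            break
    return x, ok

# minimize f(x) = (x - 3)^2; its gradient is 2(x - 3)
grad = lambda x: 2.0 * (x - 3.0)
main   = lambda x: gradient_descent(grad, x, steps=5)    # too few steps: "fails"
backup = lambda x: gradient_descent(grad, x, steps=500)  # continues from main's point
x, ok = solve_with_fallback([main, backup], x0=0.0)
```

The main "solver" stops short of the optimum; the back-up picks up from where it left off and finishes the job, which is exactly what the GAMS fragment above relies on.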

k-means clustering heuristic in GAMS and the XOR operator

In http://yetanothermathprogrammingconsultant.blogspot.com/2015/03/k-means-clustering-formulated-as-miqcp.html we were quite underwhelmed by the performance of MIQCPs (Mixed Integer Quadratically Constrained Problems).

A standard heuristic is easily formulated in GAMS:

$ontext

K-means clustering heuristic

$offtext

option seed=101;

sets
   k          'clusters' /k1*k4/
   i          'data points' /i1*i100/
   ik(i,k)    'assignment of points to clusters'
   xy         'coordinates of points' /x,y/
;
parameter
   m(k,xy)      'random clusters to generate data points'
   p(i,*)       'data points'
   numk         'number of clusters'
   cluster(i)   'cluster number'
;

numk = card(k);
alias(ii,i);

*------------------------------------------------
* generate random data
* points are around some clusters
*------------------------------------------------
m(k,xy) = uniform(0,4);
cluster(i) = uniformint(1,numk);
ik(i,k) = cluster(i)=ord(k);
p(i,xy) = sum(ik(i,k),m(k,xy)) + uniform(0,1);
display m,p;

sets
  ik(i,k)       'assignment of points to clusters'
  ik_prev(i,k)  'previous assignment of points to clusters'
  ik_diff(i,k)  'difference between assignments'
  trial         'number of trials' /trial1*trial15/
  iter          'max number of iterations' /iter1*iter20/
;
parameters
  n(k)          'number of points assigned to cluster'
  c(k,xy)       'centroids'
  notconverged  '0=converged, else not converged'
  d(i,k)        'squared distance'
  dclose(i)     'distance closest cluster'
  trace(trial,iter) 'reporting'
;

loop(trial,

*------------------------------------------------
* Step 1
* Random assignment
*------------------------------------------------

   cluster(i) = uniformint(1,numk);
   ik(i,k) = cluster(i)=ord(k);

   notConverged = 1;

   loop(iter$notConverged,

*------------------------------------------------
* Step 2
* Calculate centroids
*------------------------------------------------
      n(k) = sum(ik(i,k), 1);
      c(k,xy)$n(k) = sum(ik(i,k), p(i,xy))/n(k);

*------------------------------------------------
* Step 3
* Re-assign points
*------------------------------------------------

      ik_prev(i,k) = ik(i,k);
      d(i,k) = sum(xy, sqr(p(i,xy)-c(k,xy)));
      dclose(i) = smin(k, d(i,k));
      ik(i,k) = yes$(dclose(i) = d(i,k));

*------------------------------------------------
* Step 4
* Check convergence
*------------------------------------------------

      ik_diff(i,k) = ik(i,k) xor ik_prev(i,k);
      notConverged = card(ik_diff);

      trace(trial,iter) = sum(ik(i,k),d(i,k));
   );
);

display trace;

The results show it is important to use different starting configurations (i.e. multiple trials):

----    100 PARAMETER trace  reporting

              iter1       iter2       iter3       iter4       iter5       iter6

trial1      366.178      28.135      14.566
trial2      271.458      26.741      14.566
trial3      316.975      23.912      14.566
trial4      299.522      24.511      14.566
trial5      346.148      77.747      14.566
trial6      313.978      64.730      14.865      14.566
trial7      310.735      27.412      14.566
trial8      330.829     213.308     184.504     102.191      76.611      74.210
trial9      356.897      90.286      16.112      14.566
trial10     345.086      82.716      76.154      76.146
trial11     354.545     169.648      75.626      74.350      71.929      67.385
trial12     310.257      52.175      18.800      14.566
trial13     289.293      23.505      14.566
trial14     337.024      20.783      14.566
trial15     336.969      28.747      14.566

      +       iter7       iter8       iter9      iter10

trial8       67.933      42.621      14.881      14.566
trial11      59.815      59.591

The consensus is that 14.566 is the optimal sum of the squared distances. Note that this was an extremely easy problem:

image
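For readers who do not use GAMS, here is the same heuristic sketched in plain Python. The function names are mine and this is an illustration of the algorithm rather than a translation of the model above:

```python
import random

def d2(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=20, rng=None):
    """One trial: random assignment, then alternate centroid update
    and re-assignment until the assignment stops changing."""
    rng = rng or random.Random(101)
    assign = [rng.randrange(k) for _ in points]        # step 1: random assignment
    cent = [points[0]] * k                             # placeholder centroid locations
    for _ in range(max_iter):
        for c in range(k):                             # step 2: centroids
            members = [p for p, a in zip(points, assign) if a == c]
            if members:                                # empty cluster keeps its location
                cent[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
        # step 3: re-assign each point to its closest centroid
        # (min picks a single cluster even on a tie)
        new = [min(range(k), key=lambda c: d2(p, cent[c])) for p in points]
        if new == assign:                              # step 4: converged
            break
        assign = new
    return assign, sum(d2(p, cent[a]) for p, a in zip(points, assign))

def best_of(points, k, trials=15, seed=101):
    """Multiple random starts; keep the assignment with the smallest SSQ."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng=rng) for _ in range(trials)),
               key=lambda r: r[1])
```

For example, `best_of(pts, 2)` on two well-separated point clouds recovers the two clusters; as in the GAMS run, the multiple-trials wrapper is what protects against bad local optima.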

Notes:

  1. We use an xor operator to find the difference between the sets ik_prev and ik. If there is no difference, card(ik_diff)=0 and thus notConverged=0. See below for a small example illustrating the xor.
  2. The assignment
    ik(i,k) = yes$(dclose(i) = d(i,k));
    is incorrect if point i is exactly halfway between two clusters: in that case we assign the point to two clusters. We can fix this with a more complicated loop (this assumes alias(k,kk)):
    ik(i,k) = no;
    loop((i,k)$(dclose(i) = d(i,k)),
      ik(i,k)$(sum(ik(i,kk),1)=0) = yes;
    );

  3. The construct
    cluster(i) = uniformint(1,numk); ik(i,k) = cluster(i)=ord(k);
    cannot be simplified to
    ik(i,k) = uniformint(1,numk)=ord(k);
    The latter version is incorrect: uniformint is re-evaluated for each (i,k) combination, so a point may end up in zero or in several clusters.
  4. The assignment
    c(k,xy)$n(k) = sum(ik(i,k), p(i,xy))/n(k); 
    will keep a cluster with zero points at its current location.
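The pitfall in note 2 is easy to reproduce outside GAMS as well. A tiny Python illustration (the distance values are made up):

```python
# squared distances from one point to two clusters, exactly tied
d = {"k1": 2.0, "k2": 2.0}

# analogue of ik(i,k) = yes$(dclose(i) = d(i,k)):
# every cluster at minimum distance is selected
tied = [k for k, v in d.items() if v == min(d.values())]
# tied == ["k1", "k2"]: the point lands in two clusters

# a pick-first rule resolves the tie to exactly one cluster
winner = min(d, key=d.get)
# winner == "k1"
```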
The XOR operator

This fragment illustrates a few different ways to calculate the difference between two sets:

set
   i       /a*z/
   j(i)    /a,b/
   k(i)    /b,c/
   diff(i) 'difference between sets: card(diff)=0 if equal'
;

diff(i) = (j(i)-k(i)) + (k(i)-j(i));
display diff;

diff(i) = j(i) xor k(i);
display diff;

diff(i) = 1$j(i) <> 1$k(i);
display diff;

----      9 SET diff  difference between sets: card(diff)=0 if equal

a,    c

----     12 SET diff  difference between sets: card(diff)=0 if equal

a,    c

----     15 SET diff  difference between sets: card(diff)=0 if equal

a,    c


My preferred syntax diff(i)=j(i)<>k(i) is not allowed. In my opinion, notation should really try to help readability, and not show off mathematical prowess.
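For comparison, Python's built-in symmetric difference does exactly this, and even reads like the syntax I wish GAMS had:

```python
j = {"a", "b"}
k = {"b", "c"}

diff = j ^ k            # symmetric difference, the analogue of GAMS xor
# diff == {"a", "c"}

# convergence test as in the k-means heuristic: two sets are
# equal iff their symmetric difference is empty (card(ik_diff)=0)
converged = len(j ^ k) == 0
```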

Monday, March 30, 2015

More oligopolistic behavior

Even very simple models require attention to detail. In [link] we described a simple model from the literature. A related model, with a coupling constraint that limits pollution emissions, is described in:

  1. Jacek Krawczyk and James Zuccollo, NIRA-3: An improved MATLAB package for finding Nash equilibria in infinite games, Victoria University of Wellington, December 2006
  2. Lars Mathiesen, Regulation of pollution in a Cournot equilibrium, The Norwegian School of Economics and Business Administration, Bergen, June 2008.

The Nash-Cournot model without pollution controls is simply:

image

Adding the pollution controls using taxes (or tradable quotas) is not that difficult:

image

Still for this simple model I get different results for the base case (no pollution controls). My results are:

----    135 PARAMETER results 

                     output      profit    emission       price

no control  .j1      55.351      61.274
no control  .j2      14.914      13.345
no control  .j3      53.684      57.639
no control  .k1                             419.978
no control  .k2                             301.125
no control  .-                                            1.761
with control.j1      21.145       8.942
with control.j2      16.028      15.414
with control.j3       2.726       0.149
with control.k1                             100.000
with control.k2                              81.164
with control.-                                            2.601

The paper [1] has the following results:

image

The second part corresponds to my solutions, but the first part shows differences. As the problem with controls is an extension of the problem without controls, there is actually a reasonable chance my solution is correct. 

Friday, March 27, 2015

MIP Links

  • Ed Klotz (Cplex): Identification, Assessment and Correction of Ill-Conditioning and Numerical Instability in Linear and Integer Programs [slides and recording].
  • Performance Variability
    • Performance Variability Defined [link]
    • Performance Variability: Good Readings [link]
    • Performance Variability: Consequences [link]
    • Cplex Performance Variability [link]

Thursday, March 26, 2015

K-means clustering formulated as MIQCP optimization problem

It is instructive to formulate the k-means clustering problem as a Mixed-Integer Quadratically Constrained Problem (MIQCP) in GAMS. It looks trivial at first sight but in fact requires a little bit of attention.

We start with a small data set. E.g. 3 clusters and 50 data points in two dimensions (x and y). I.e. our sets look like:

image

We generate some random clusters as follows:

image

Should be an easy problem:

image

The clustering model could look like:

image

We solve here for two things simultaneously:

  1. The location of the clusters (continuous variables c)
  2. The assignment of points to clusters (binary variables x)

The equation distances computes all (squared) distances between the points and the clusters. We multiply by x so that we only sum these distances when a point is assigned to a cluster (in that case x=1). Finally we make sure each point is assigned to exactly one cluster.

Unfortunately this model is non-convex, so we can solve it only with global solvers such as Baron or GloMIQO.

Convexification

We can make the model convex as follows:

image

We use a big-M construct where the upper bound of d forms a reasonably tight value for M. Here d(i,k)=0 if point i is not assigned to cluster k (note that d≥0). The objective has become linear in this case.
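The formulation itself is only shown as an image here; reconstructing it from the description (my notation, so treat it as a sketch), with a indexing the coordinates:

```latex
\begin{align*}
\min\ & \sum_{i,k} d_{i,k} \\
\text{s.t.}\ & d_{i,k} \ge \sum_{a \in \{x,y\}} (p_{i,a} - c_{k,a})^2 - M (1 - x_{i,k}) && \forall i,k \\
& \sum_k x_{i,k} = 1 && \forall i \\
& d_{i,k} \ge 0, \quad x_{i,k} \in \{0,1\}
\end{align*}
```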

Now we can solve the problem with commercial solvers like Cplex and Gurobi. The results (with the location of the clusters from variable c) are:

image

Performance

For larger data sets it turns out the performance is not very good. E.g. with 100 points and 4 clusters we have 400 binary variables, and it takes hours or days to prove optimality. Neither Cplex nor Gurobi is able to prove optimality, even when we load an optimal solution using MIPSTART. Xpress has even more trouble and returns bad solutions. We see more often that MIQP/MIQCP solvers are not as robust and fast as (linear) MIP solvers. It is too bad the MIQCP solvers are doing so poorly on this problem.

Using a solver would have opened up possibly interesting extensions like additional constraints on the clustering, such as “at least n points in each cluster”.

Bug in Gurobi

We were able to confuse Gurobi by writing the model as:

image

It now accepts the model (although it is non-convex) and produces a bogus solution: all points are assigned to cluster 1. Gurobi should have refused to solve the problem (with a message about the Q matrix not being positive semidefinite).

Wednesday, March 25, 2015

Oligopolistic producer behavior or why airline tickets are not cheaper


Small example from the literature:
image 
Note: John Nash just won the Abel prize (http://www.nature.com/news/beautiful-mind-john-nash-adds-abel-prize-to-his-nobel-1.17179).
image
We implement different behaviors with a GAMS model:
  • Oligopolistic (MCP): Nash-Cournot equilibrium.
    image
    image
  • Monopolistic (NLP): think of this as the firms being merged.
    image
    image
  • Competitive (NLP): optimize a social welfare objective; check the solution via price = marginal cost.
    image
    image

image
Here q(i) is quantity supplied by firm i. We see most profit is to be made by merging firms and reducing capacity, just as happened in the US airline industry (Europe is a very different story). See also: https://www.nytimes.com/2015/03/24/business/dealbook/as-oil-prices-fall-air-fares-still-stay-high.html.
Another small example from energy markets: https://yetanothermathprogrammingconsultant.blogspot.com/2013/07/microeconomic-principles.html

Sparse sum: set cover problem in GAMS

In a set cover problem (http://en.wikipedia.org/wiki/Set_cover_problem)  we can organize the sets in two different ways:

  1. Use a (sparse) two dimensional set. The cover equation will contain a construct like: sum[c(s,i), x(s)].
  2. Use a (sparse) zero-one parameter. In this case the cover equation can have something like: sum[s, c(s,i)*x(s)].
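For reference, both representations implement the standard cover model (sketched here in my notation):

```latex
\begin{align*}
\min\ & \sum_s x_s \\
\text{s.t.}\ & \sum_{s \,:\, i \in s} x_s \ge 1 && \forall i \\
& x_s \in \{0,1\}
\end{align*}
```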

The question came up: which is better? I like the set construct a bit better, but that is purely based on aesthetics. The performance is exactly the same.

Data structure: set
image
Data structure: 0/1 parameter
image

The sets are randomly generated. We solve here just as an LP as we are only interested in the performance of GAMS generating the problem.

Thursday, March 12, 2015

gdx2sqlite -fast option

The -fast option in the tool gdx2sqlite is sometimes really helpful. This is especially the case if there are many tables:

image

The column avg table size indicates the average number of rows in a table. It looks like the effect of -fast is most pronounced when this number is small.

The -fast option will issue PRAGMA synchronous = OFF and PRAGMA journal_mode = MEMORY to SQLite.
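The effect of those pragmas can be reproduced in any SQLite client. Here is a small Python sketch (in-memory database and made-up table, so the timing benefit will only show up on a real disk file):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# the two pragmas -fast issues: skip fsync and keep the
# rollback journal in memory (faster, but less crash-safe)
con.execute("PRAGMA synchronous = OFF")
con.execute("PRAGMA journal_mode = MEMORY")

con.execute("CREATE TABLE results(i TEXT, value REAL)")
con.executemany("INSERT INTO results VALUES (?, ?)",
                [("i%d" % n, float(n)) for n in range(1000)])
con.commit()

count = con.execute("SELECT count(*) FROM results").fetchone()[0]
```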


Here is a nice presentation on SQLite:

Interesting tidbit: statistics on SQLite: 2e9 running instances, 5e5 applications (May 2014).

Saturday, February 28, 2015

GDX2SQLite -small option

In the tool GDX2SQLITE we export a GAMS GDX file to a SQLite database file.

The -small option mimics how GAMS stores data internally. All strings (called UELs or Unique Elements) are stored in a pool and all sets, parameters, variables and equations use index numbers for that pool.

To see what such a pool looks like, you can do:

alias (*,pool);
display pool;

The same storage mechanism is used in GDX files: the strings are stored separately.

When we convert a GDX file to a SQLite database there is an option to use the same storage structure. With SQL we can hide the details behind a VIEW: a “virtual” table that is computed on the fly from an underlying SQL SELECT statement.

Here is the mechanism:

image

In the picture above, only the first 6 rows of each table are shown. In reality the tables have a significant number of rows.
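A small Python/SQLite sketch of the mechanism (the table and column names here are made up; gdx2sqlite's actual names may differ):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# the UEL pool: every unique string stored exactly once
con.execute("CREATE TABLE uels(id INTEGER PRIMARY KEY, uel TEXT)")
con.executemany("INSERT INTO uels VALUES (?, ?)",
                [(1, "seattle"), (2, "new-york")])

# the data table stores small integer ids instead of strings
con.execute("CREATE TABLE demand_ids(i INTEGER, value REAL)")
con.executemany("INSERT INTO demand_ids VALUES (?, ?)",
                [(1, 350.0), (2, 325.0)])

# a view reconstitutes the user-friendly table on the fly
con.execute("""
    CREATE VIEW demand AS
    SELECT u.uel AS i, d.value AS value
    FROM demand_ids d JOIN uels u ON d.i = u.id
""")

rows = con.execute("SELECT i, value FROM demand ORDER BY value DESC").fetchall()
# rows == [('seattle', 350.0), ('new-york', 325.0)]
```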

In practical models and data sets the number of UELs (i.e. the size of the string pool) is small (often surprisingly small) compared to the total number of rows in the data tables. The advantage of this storage scheme is that the database files are smaller.

Indeed here we have some statistics:

image

Artificial data often has a larger number of UELs than real world data, as is illustrated here (data.gdx is a synthetic data set while WaterCropData is from a real model).

Notes:

  1. A single UEL pool is used for multiple data tables.
  2. This is somewhat related to older LP solvers, which had a super-sparsity storage scheme: not only were zero elements not stored, but all values were kept in a pool and the coefficient matrix pointed into this pool. I don’t think this scheme is used anymore. The most recent reference I could find is http://www.sciencedirect.com/science/article/pii/0360835290900068. The original reference is:
        KALAN, J., "Aspects of Large-Scale In-Core Linear Programming," ACM Proceedings, 1971
    To be complete: Sparsity indicates we don’t store zero elements. Usually we don’t store zero elements in GDX files (as GAMS does not store and export them). This also means zero elements are not stored in the resulting database file either. E.g. in table Data, only combinations (i,v,value) exist where value ≠ 0.
  3. Version 0.5 of GDX2SQLITE adds better column names to the CREATE VIEW statements to make the results more compatible with the default storage scheme.
  4. The -small option is clearly beneficial for database size. What are the impacts on the performance of creating the database and of querying it? Creating the database is probably close to linear in the total size, so I expect some savings here. I have no good answer about querying speed. Note: the column uel in table UELS$ is a primary key.

Thursday, February 26, 2015

Gurobi vs R: subset selection

The R package leaps (http://cran.r-project.org/web/packages/leaps/leaps.pdf) provides a convenient way to perform feature selection. The methods are based on work by Alan Miller, described in:

Alan Miller, “Subset Selection in Regression”, Chapman and Hall/CRC; 2 edition (April 15, 2002)

In a related paper: Alan Miller, “Subsets of Regression Variables”, J. R. Statist. Soc. A (1984), 147, Part 3, pp. 389-425 (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.474.3223&rep=rep1&type=pdf) we even see some comments from Beale.

We should be able to reproduce the “exhaustive” method from the leaps package using an MIQP formulation. Something like:

 image

Given a value for n we want to minimize SSQ. The binary variables δ indicate whether a feature j can appear in the model (we exclude the constant term: that one remains in the model). We need to solve this model for n=1,2,3,… To be complete, the other symbols are β: the quantities we want to estimate, data X,y and our big M constant.
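The formulation referred to above is only shown as an image; reconstructing it from the description (my notation, a sketch):

```latex
\begin{align*}
\min_{\beta,\delta}\ & (y - X\beta)^T (y - X\beta) \\
\text{s.t.}\ & -M\,\delta_j \le \beta_j \le M\,\delta_j && \forall j \ne \text{const} \\
& \sum_{j \ne \text{const}} \delta_j = n \\
& \delta_j \in \{0,1\}
\end{align*}
```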

I would have guessed that a solver like Gurobi would do a better job than leaps. As it turns out, it was not easy to beat leaps.

To generate a random model we use a very simplistic scheme. Not sure if this has much in common with real world data sets, but a similar approach was used in Gatu, Kontoghiorghes, “Branch-and-Bound Algorithms for Computing the Best-Subset Regression Models”, Journal of Computational and Graphical Statistics, Volume 15, Number 1, Pages 139–156:

image

We use N=50 variables in total of which we want to select n=1,2,3,..,10 variables.

GAMS/Gurobi

Here are the results:

----     87 PARAMETER t                    =      112.897  elapsed time of solve loop

----     87 SET selected  selected variables

             v5          v6          v8         v11         v13         v19         v31         v32         v35

n1                      YES
n2                      YES
n3                      YES                                                         YES
n4                      YES                                                         YES
n5                      YES                                             YES         YES
n6                      YES                                             YES         YES                     YES
n7                      YES         YES                                 YES         YES                     YES
n8                      YES         YES                     YES         YES         YES                     YES
n9                      YES         YES                     YES         YES         YES         YES         YES
n10         YES         YES         YES         YES         YES         YES         YES                     YES

  +         v40         v43

n2          YES
n3          YES
n4          YES         YES
n5          YES         YES
n6          YES         YES
n7          YES         YES
n8          YES         YES
n9          YES         YES
n10         YES         YES

----     87 PARAMETER rss  residual sum of squares

n1  836.651,    n2  836.342,    n3  836.060,    n4  835.869,    n5  835.677,    n6  835.495,    n7  835.330
n8  835.189,    n9  835.075,    n10 834.937

----     87 PARAMETER solvetime  time for each solve

n1   0.080,    n2   0.135,    n3   0.609,    n4   1.951,    n5   3.876,    n6   6.878,    n7  15.971,    n8  21.016
n9  30.122,    n10 27.511

A few notes:

  • The model is very sensitive to the size of M. I used M=1000. With M=1e5 we get into trouble and see messages like: Warning: max constraint violation (1.2977e-02) exceeds tolerance (possibly due to large matrix coefficient range). This QP formulation is not so easy for Gurobi from a numerical standpoint.
  • The RSS is going down when we add variables. As expected.
  • Also as expected: the time to solve each sub-problem goes up as more variables enter the model. “Pick n out of N” constructs often cause performance problems.
    image
  • I used a small optcr (gap) and 4 threads to make Gurobi fast enough to beat Leaps. (Leaps takes 3 minutes and a loop of 10 solves with Gurobi takes 2 minutes).
  • I often like to formulate a separate fit equation as in:
    image
    This turns out to be very bad w.r.t. performance. Of course the fit equation forms a completely dense block in the coefficient matrix, something Gurobi was not designed for (it likes sparse structures much better).

R/Leaps

The R code is simply:

d<-dbGetQuery(db,"select * from data") %>% dcast(i ~ v) %>% select(-i)
system.time(r <- regsubsets(y~.,data=d,nvmax=10,really.big=T))

The first line gets the data from the database and applies a pivot operation (from a long, skinny data frame with three columns i, v, value to a wide one with a column per variable v1, v2, …, v50 plus y). This trick is useful to remember, as this happens a lot when reading data from a database (such data is typically organized in long, skinny tables as opposed to a “matrix” organization). The second line calls the regsubsets function from the leaps package.

The output is:



          v1  v10 v11 v12 v13 v14 v15 v16 v17 v18 v19 v2  v20 v21 v22 v23 v24 v25 v26 v27 v28
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " "
5 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " " " " " " " " " " " " " " " " " " " "
6 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " " " " " " " " " " " " " " " " " " " "
7 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " " " " " " " " " " " " " " " " " " " "
8 ( 1 ) " " " " " " " " "*" " " " " " " " " " " "*" " " " " " " " " " " " " " " " " " " " "
9 ( 1 ) " " " " " " " " "*" " " " " " " " " " " "*" " " " " " " " " " " " " " " " " " " " "
10 ( 1 ) " " " " "*" " " "*" " " " " " " " " " " "*" " " " " " " " " " " " " " " " " " " " "
v29 v3 v30 v31 v32 v33 v34 v35 v36 v37 v38 v39 v4 v40 v41 v42 v43 v44 v45 v46 v47
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " "*" " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " " " " " " " " " " " " " " "*" " " " " " " " " " " " " " "
4 ( 1 ) " " " " " " "*" " " " " " " " " " " " " " " " " " " "*" " " " " "*" " " " " " " " "
5 ( 1 ) " " " " " " "*" " " " " " " " " " " " " " " " " " " "*" " " " " "*" " " " " " " " "
6 ( 1 ) " " " " " " "*" " " " " " " "*" " " " " " " " " " " "*" " " " " "*" " " " " " " " "
7 ( 1 ) " " " " " " "*" " " " " " " "*" " " " " " " " " " " "*" " " " " "*" " " " " " " " "
8 ( 1 ) " " " " " " "*" " " " " " " "*" " " " " " " " " " " "*" " " " " "*" " " " " " " " "
9 ( 1 ) " " " " " " "*" "*" " " " " "*" " " " " " " " " " " "*" " " " " "*" " " " " " " " "
10 ( 1 ) " " " " " " "*" "*" " " " " "*" " " " " " " " " " " "*" " " " " "*" " " " " " " " "
v48 v49 v5 v50 v6 v7 v8 v9
1 ( 1 ) " " " " " " " " "*" " " " " " "
2 ( 1 ) " " " " " " " " "*" " " " " " "
3 ( 1 ) " " " " " " " " "*" " " " " " "
4 ( 1 ) " " " " " " " " "*" " " " " " "
5 ( 1 ) " " " " " " " " "*" " " " " " "
6 ( 1 ) " " " " " " " " "*" " " " " " "
7 ( 1 ) " " " " " " " " "*" " " "*" " "
8 ( 1 ) " " " " " " " " "*" " " "*" " "
9 ( 1 ) " " " " " " " " "*" " " "*" " "
10 ( 1 ) " " " " " " " " "*" " " "*" " "


   user  system elapsed 
194.41 0.00 194.44


> summary(r)$rss
[1] 836.6514 836.3416 836.1431 835.9519 835.7598 835.5764 835.4079 835.2662 835.1392 835.0116



The sums of squares start out the same but diverge just a little bit.



image



The selected variables are much the same as using Gurobi (but there are a few differences here and there). The time is actually close to that of Gurobi. This performance is much better than I expected.
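As an aside, the dcast pivot at the start of the R fragment has a direct pandas analogue; a small sketch with made-up data (column names chosen to match the R fragment):

```python
import pandas as pd

# long, skinny layout, as data typically comes out of a database
long_df = pd.DataFrame({
    "i":     ["i1", "i1", "i2", "i2"],
    "v":     ["v1", "v2", "v1", "v2"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# pivot to the wide "matrix" layout that statistical routines expect:
# one row per observation i, one column per variable v
wide = long_df.pivot(index="i", columns="v", values="value").reset_index()
```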