Yet Another Math Programming Consultant: January 2013

Wednesday, January 30, 2013

GAMS to OpenOpt too scary?

http://forum.openopt.org/viewtopic.php?id=579:

“This works though I'm worried it might look too scary to my students coming from GAMS.”

GAMS/VBScript Tool to view DBF table structure

When receiving DBF files I usually look at the structure by loading it into Excel. In this case I received DBF files that were much too large to be handled by Excel. Before writing queries it is useful to be able to look at the column names and types. That is what this GAMS job does:

$ontext

   Describe database tables

   Gives description of DBF files in project directory

$offtext

$set dir %system.fp%
$set connection "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=%dir%;Extended Properties=dBASE IV;"

$onecho > db.vbs
WScript.Echo "------------------------------------------------------------------------"

Const adSchemaTables = 20
Const adSchemaColumns = 4

Set objConnection = CreateObject("ADODB.Connection")
WScript.Echo "ADO Version:",objConnection.Version
objConnection.ConnectionString = "%connection%"
WScript.Echo "ADO Connection:",objConnection.ConnectionString
Wscript.Echo
objConnection.Open
Set rs1 = objConnection.OpenSchema(adSchemaTables)
if (rs1.Eof) then
   WScript.Echo "No tables found"
end if
Do Until rs1.Eof
    TableName = rs1("Table_Name")
    Set rs2 = objConnection.OpenSchema(adSchemaColumns,Array(Null, Null, TableName))
    Wscript.Echo UCase(TableName)

    Do While Not rs2.EOF

        ColumnName = rs2("Column_Name")
        ColTypeNum = rs2("Data_Type")
        ColClen    = rs2("Character_Maximum_Length")

        select case ColTypeNum
           case 20   ColTypeName = "BigInt"
           case 128 ColTypeName = "Binary"
           case 11   ColTypeName = "Boolean"
           case 129 ColTypeName = "Char"
           case 6    ColTypeName = "Currency"
           case 7    ColTypeName = "Date"
           case 135 ColTypeName = "TimeStamp"
           case 14   ColTypeName = "Decimal"
           case 5    ColTypeName = "Double"
           case 72   ColTypeName = "GUID"
           case 9    ColTypeName = "Dispatch"
           case 3    ColTypeName = "Integer"
           case 205 ColTypeName = "LongVarBinary(Image,Blob)"
           case 201 ColTypeName = "LongVarChar(Memo,Text)"
           case 203 ColTypeName = "LongVarWChar(Memo,Text)"
           case 131 ColTypeName = "Numeric"
           case 4    ColTypeName = "Single"
           case 2    ColTypeName = "SmallInt"
           case 17   ColTypeName = "UnsignedTinyInt(Byte)"
           case 204 ColTypeName = "VarBinary"
           case 200 ColTypeName = "VarChar"
           case 12   ColTypeName = "Variant"
           case 202 ColTypeName = "VarWChar"
           case 130 ColTypeName = "WChar"
           case else ColTypeName = ColTypeNum
        end select

        Wscript.Echo "    ", ColumnName & ", " & ColTypeName & ", " & ColClen
        rs2.MoveNext
    Loop
    rs2.Close

    Wscript.Echo
    rs1.MoveNext
Loop

rs1.Close
objConnection.Close

WScript.Echo "------------------------------------------------------------------------"

$offecho

$call cscript db.vbs

The output looks like:

--- Job describe.gms Start 01/30/13 15:24:38 WEX-VS8 23.9.5 x86/MS Windows
GAMS Rev 239 Copyright (C) 1987-2012 GAMS Development. All rights reserved
Licensee: Erwin Kalvelagen                               G120717/0001CV-WIN
          Amsterdam Optimization Modeling Group                      DC4572
--- Starting compilation
--- describe.gms(80) 2 Mb
--- call cscript db.vbs
Microsoft (R) Windows Script Host Version 5.8
Copyright (C) Microsoft Corporation. All rights reserved.

------------------------------------------------------------------------
ADO Version: 6.1
ADO Connection: Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\projects\ulrike - ifpri\gdx\;Extended Properties=dBASE IV;

SPAM2005_HARVA_TALL_BETA
     ALLOC_KEY, WChar, 8
     CELL5M, Double,
     CR_NAME_SH, WChar, 4
     CR_VALUE, Double,
     ISO3, WChar, 3
     PROD_LEVEL, WChar, 7
     REC_TYPE, WChar, 1
     TECH, WChar, 1
     UNIT, WChar, 6

SPAM2005_HARVA_TALL_BETA_1
     ALLOC_KEY, WChar, 8
     CELL5M, Double,
     CR_NAME_SH, WChar, 4
     CR_VALUE, Double,
     ISO3, WChar, 3
     PROD_LEVEL, WChar, 7
     REC_TYPE, WChar, 1
     TECH, WChar, 1
     UNIT, WChar, 6

SPAM2005_PROD_TALL_BETA_1
     ALLOC_KEY, WChar, 8
     CELL5M, Double,
     CR_NAME_SH, WChar, 4
     CR_VALUE, Double,
     ISO3, WChar, 3
     PROD_LEVEL, WChar, 7
     REC_TYPE, WChar, 1
     TECH, WChar, 1
     UNIT, WChar, 6

------------------------------------------------------------------------
--- describe.gms(81) 2 Mb
--- Starting execution - empty program
*** Status: Normal completion
--- Job describe.gms Stop 01/30/13 15:24:38 elapsed 0:00:00.316

I looked for some free DBF viewers, but they largely originate from obscure Russian web sites and Google warned not to open the downloaded zip files. This ADO based approach should at least create no problems like introducing viruses.

Monday, January 28, 2013

Reading a CSV file

This is a difficult CSV file to read using using the GAMS tool GDXXRW:

"BARID","BASIN","CAND","PASS1","PASS2","PASS3","USHAB","COST","DSID"

4980,"StJoseph",1,0,0,0,1563.44,0,-1

2217,"StJoseph",1,0,0,0,440.3,100000,4980

2307,"StJoseph",1,1,1,1,1577.67,0,4980

4992,"StJoseph",1,0,0,0,12.16,0,4980

…..

We can however read this with SQL2GMS using the following query:

Q1=select barid,dsid,'pass1',pass1 from %csvfile% where barid is not null \
   union \
   select barid,dsid,'pass2',pass2 from %csvfile% where barid is not null \
   union \
   select barid,dsid,'pass3',pass3 from %csvfile% where barid is not null \
   union \
   select barid,dsid,'cost',cost from %csvfile% where barid is not null

The empty lines are handled by the check for NULLs. Notice that the parameter will look like:

The advantage of SQL is that we have some flexibility in reorganizing and operating on the data before it is passed on to the GDX DLL. In some cases that can be very useful.

In virtually all of the recent projects I have been involved with, I used SQL2GMS to get (sometimes large) data into GAMS. Not all GAMS modelers are familiar with SQL but given the large amounts of data available in databases I would say that it is a worthwhile time investment to get some basic database knowledge.

SQL Server and PIVOT/UNPIVOT

http://msdn.microsoft.com/en-us/library/ms177410(v=sql.105).aspx

I think the syntax is somewhat confusing, but this thing can be handy.

Not all databases I encounter are normalized! In fact I see many tables that resemble spreadsheets (i.e. with many columns).

Sunday, January 27, 2013

GAMS sameas problem

I tried a construct that looks like the following code:

sets
   cell5m
   cell5m_spam
;

$gdxin d.gdx
$load cell5m cell5m_spam

set cellmap(cell5m,cell5m_spam);
cellmap(cell5m,cell5m_spam)$sameas(cell5m,cell5m_spam) = yes;   (1)

Basically I take the intersection of the two sets and create a mapping between the two base sets. Unfortunately, the sets are big:

scalars
n_cell5m
n_cell5m_spam
;

n_cell5m = card(cell5m);
n_cell5m_spam = card(cell5m_spam);

display n_cell5m, n_cell5m_spam;

---- 23 PARAMETER n_cell5m = 2294094.000
PARAMETER n_cell5m_spam = 806882.000

as a result expression (1) takes an extraordinary long time to execute:

---- 26 Assignment cellmap 2436.954 2437.032 SECS 383 Mb 806839

I believe the sameas function is just not properly implemented in GAMS. This performance is really not what one would expect.

Of course we can work around this:

sets
cell5m(*)
cell5m_spam(*)
;

$gdxin d.gdx
$load cell5m cell5m_spam

alias (*,u);
set intersect(*);
intersect(u) = cell5m(u)*cell5m_spam(u);

set cellmap(cell5m,cell5m_spam);
cellmap(cell5m(intersect),cell5m_spam(intersect)) = yes;

This now executes in much less than a second:

---- 27 Assignment intersect 0.187 0.203 SECS 333 Mb 806839
---- 33 Assignment cellmap 0.203 0.406 SECS 358 Mb 806839

Note that we had to change the declarations of the sets a little bit, in order to get this passed the domain checking of GAMS.

Monday, January 21, 2013

GAMS/GDX cannot read scalar from compressed GDX file

I encountered a problem when reading a scalar from a compressed GDX file:

GDXIN C:\projects\ulrike - ifpri\gdx\spam.gdx
--- LOAD cell5m = 2:cell5m
--- LOAD rowcount = 1:rowcount
**** Problem reading symbol rowcount from gdx file
**** GDXmsg: Expected data marker (DATA) not found in GDX file

I did not really need the rowcount scalar so for me the workaround was: just skip reading this scalar.

It is amazing that this bug was not found earlier. I guess not many users are using compressed GDX files (the default is no compression) and of those few even fewer will ever store a scalar in such a file.

Some possible workarounds:

skip reading scalars
store scalars as a 1 dimensional parameter
store scalars in a separate uncompress gdx file
read scalars using gdxdump.

As said a possible workaround could be to use gdxdump for the scalars. This means instead of using:

$set gdxfile spam.gdx

set cell5m;
scalar rowcount;

* read from compressed GDX file
$gdxin %gdxfile%
$load cell5m rowcount

we should implement something like:

$set gdxfile spam.gdx

set cell5m;
scalar rowcount;

* read from compressed GDX file
$gdxin %gdxfile%
$load cell5m

$call gdxdump %gdxfile% symb=rowcount output=rowcount.inc
$include rowcount.inc

display rowcount;

No doubt this will be fixed in the next release.

Friday, January 18, 2013

Excel 1 million row limit

Excel 7 (and later) has limit of 1,048,576 rows (before version 7 this limit was 65,536 rows). When using COM automation (using Excel from a program) one usually turns off DisplayAlerts to prevent popping up messages related to saving workbooks etc. However this also means that a large CSV file will be silently truncated by the GAMS utility GDXXRW:

C:\projects\tmp>tail x.csv

2999991

2999992

2999993

2999994

2999995

2999996

2999997

2999998

2999999

3000000

C:\projects\tmp>wc x.csv

3000000 3000000 25888896 x.csv

C:\projects\tmp>gdxxrw i=x.csv o=x.gdx set=x rng=a1 rdim=1 cdim=0

GDXXRW Dec 18, 2012 24.0.1 WIN 37366.37409 VS8 x86/MS Windows

Input file : C:\projects\tmp\x.csv

Output file: C:\projects\tmp\x.gdx

Total time = 7847 Ms

C:\projects\tmp>gdxdump x.gdx > x.gms

C:\projects\tmp>head x.gms

$onempty

Set x(*) /

'1',

'2',

'3',

'4',

'5',

'6',

'7',

C:\projects\tmp>tail x.gms

'1048569',

'1048570',

'1048571',

'1048572',

'1048573',

'1048574',

'1048575',

'1048576' /;

$offempty

C:\projects\tmp>

We can read unlimited size CSV files using SQL2GMS (see for instance http://yetanothermathprogrammingconsultant.blogspot.com/2010/04/gamsgdx-compression.html).

I talked with the GAMS people about this. The proposed fix for the next release: check if row 108576 is not empty. If that is the case issue a warning message.

Book: Complementarity Modeling in Energy Markets

Complementarity Modeling in Energy Markets (International Series in Operations Research & Management Science)

http://www.amazon.com/Complementarity-Modeling-International-Operations-Management/dp/1441961224

Looks interesting.

Wednesday, January 16, 2013

Loading large data set into GAMS GDX file

For a project we convert data from a humongous table with 23 million records stored in a remote SQL Server database (global spatial grid with data on agricultural production and land use from the Spatial Production Allocation Model). It takes about 1 hour to run a query and populate a GDX file. Probably the performance is related to limited bandwidth. This is truly an example of “big data”. In a subsequent model we load the large GDX file into GAMS for further processing. This loading is much faster: about 4 seconds. It takes about 750 MB RAM to load the data in GAMS.

For these type of applications the use of GDX file as a high-performance staging “database” is very convenient. It would be a nightmare to load this data directly from the SQL Server database into GAMS: every run would start with a 1 hour startup time.

Compression works quite well on this gdx file (see also: http://yetanothermathprogrammingconsultant.blogspot.com/2010/04/gamsgdx-compression.html). Here are the file sizes:

Compression	File size in bytes
Uncompressed	303,114,020
Compressed	77,879,752

Users are more and more throwing real large data at me. Besides the SQL Server database above, I was also delivered an 1.2 GB DBF file. DBF files are still popular with GIS applications. This file could be imported into a compressed GDX file with the following size reduction (also partly because we dropped some redundant columns but mainly because GDX stores strings more efficiently):

File format	Size
dbf	1.2 GB
compressed GDX	62 MB

Sunday, January 13, 2013

C vs javascript

The difference between standard C compiled GLPK and the Javascript version seems to be a factor 10. As one could expect the solution path is slightly different.

C based glpsol.exe

GLPK Simplex Optimizer, v4.47
2726 rows, 6569 columns, 39488 non-zeros
Preprocessing...
2463 rows, 5594 columns, 33424 non-zeros
Scaling...
A: min|aij| = 4.104e-004 max|aij| = 1.389e+004 ratio = 3.384e+007
GM: min|aij| = 9.610e-002 max|aij| = 1.041e+001 ratio = 1.083e+002
EQ: min|aij| = 9.235e-003 max|aij| = 1.000e+000 ratio = 1.083e+002
Constructing initial basis...
Size of triangular part = 2463
      0: obj = -2.893228867e+010 infeas = 4.389e+005 (0)
    500: obj = 4.288287651e+009 infeas = 1.013e+005 (0)
   1000: obj = -2.415760147e+009 infeas = 9.499e+002 (0)
* 1496: obj = -2.117757487e+009 infeas = 1.461e-013 (0)
* 1500: obj = -2.049813931e+009 infeas = 1.461e-013 (0)
* 2000: obj = -2.582182378e+007 infeas = 2.005e-012 (0)
* 2500: obj = -5.675534925e+006 infeas = 1.362e-013 (0)
* 3000: obj = 5.134369324e+004 infeas = 1.049e-013 (0)
* 3500: obj = 8.575779704e+004 infeas = 3.918e-014 (0)
* 4000: obj = 1.034583169e+005 infeas = 7.290e-014 (0)
* 4500: obj = 1.148573480e+005 infeas = 4.027e-014 (0)
* 4527: obj = 1.148736555e+005 infeas = 3.876e-014 (0)
OPTIMAL SOLUTION FOUND
Time used:   1.2 secs

javascript

GLPK Simplex Optimizer, v4.47
2726 rows, 6569 columns, 39488 non-zeros
Preprocessing...
2463 rows, 5594 columns, 33424 non-zeros
Scaling...
 A: min|aij| = 0.000410398983226989  max|aij| = 13886.775  ratio = 33837254.88500861
GM: min|aij| = 0.09609772925154227  max|aij| = 10.406073148538535  ratio = 108.2863583727347
EQ: min|aij| = 0.009234773567302722  max|aij| = 1.0000000000000002  ratio = 108.2863583727347
Constructing initial basis...
Size of triangular part = 2463
 0: obj = -28932288674.902164  infeas = 438884.18813663867 (0)
 500: obj = 2417695413.8613462  infeas = 91877.35020040894 (0)
 1000: obj = -3476603174.5055895  infeas = 712.4076522292974 (0)
 1500: obj = -3524070971.30494  infeas = 42.2144600534787 (0)
*1509: obj = -3523600633.5120773  infeas = 1.8645671302407697e-13 (0)
*2000: obj = -35554643.69264513  infeas = 2.521881663037279e-12 (0)
*2500: obj = -4067298.103876429  infeas = 1.475546792196709e-13 (0)
*3000: obj = 52841.771320373286  infeas = 8.990385481854632e-14 (0)
*3500: obj = 84846.60326744069  infeas = 1.5219189175881605e-14 (0)
*4000: obj = 99082.72629055602  infeas = 1.5758436355355321e-13 (0)
*4475: obj = 114873.65551217867  infeas = 3.595886618335783e-14 (0)
OPTIMAL SOLUTION FOUND
Elapsed time: 14.672 seconds

C version was downloaded from: http://sourceforge.net/projects/winglpk/

Javascript version was from: http://www3.nd.edu/~jeff/mathprog/mathprog.html

Friday, January 11, 2013

Links

New Gurobi version: http://www.gurobi.com/products/gurobi-optimizer/what's-new-in-v5.1. It is amazing how much progress is still being made in making solvers faster.
http://www3.nd.edu/~jeff/mathprog/mathprog.html. Front-end using a javascript (!!) version of GLPK. So this runs on the client. I am curious how performance compares with running C compiled .exe. See http://yetanothermathprogrammingconsultant.blogspot.com/2013/01/c-vs-javascript.html for an experiment with a reasonably sized LP.
AMPL’s Constraint Programming Interface: http://zverovich.net/ics2013/2013-01-06-ampl-interface-to-constraint-programming-solvers.html
Running modeling languages using an Excel interface: http://solverstudio.org/. Interesting approach, a little bit like Solver Foundations Excel front-end. Both seem to use VSTO:

Thursday, January 10, 2013

Clock based seed in GAMS

If you use a random numbers in GAMS, setting execseed to a clock based number can be used to make each random number sequence different. I sometimes see:

execseed = gmillisec(jnow);

The idea is to initialize the seed with the milliseconds part of the current date. This is not completely correct however. In very rare cases this will fail with the message:

**** Exec Error at line 4: Random seed has to be > 0.0 and < 10**9

A better version looks like:

execseed = 1+gmillisec(jnow);

I.e. the advice given by the GAMS people here is actually not that good: http://support.gams.com/doku.php?id=gams:random_number_generator_in_gams:

Note that runs with this clock based construct will make your runs irreproducable (and thus bugs difficult to handle).

Tuesday, January 8, 2013

Optimizing some GAMS code

Some part of a large GAMS model was running slow so we had to investigate. The first line in the box below was the culprit:

We tried two alternative formulations using implicit loops. The performance gains on this data set were enormous (now the GAMS part is no longer the bottleneck in the process, the bulk of the time has moved back to database operations). Explicit loops can be much more expensive than implicit loops, especially when dealing with big data.

Friday, January 4, 2013

Non-convex QCP

Is there any software available for non-convex QCP’s? (http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14925577).

You may want to try GloMIQO which targets these type of problems.

See:

Always check the data

Here some agricultural production statistics for Angola (source: XLS download from FAO/CountryStat):

I am sure the harvest of 2011 was not that abundant.

Looks like an import problem. This is the XLS file. Also available is CSV data. That shows:

All other production data are coded as integers, but here we have a number with a decimal comma (and many more digits than warranted). Decimal commas/decimal points are always a point of contention.

PS: Crops are denoted in Portuguese (Horticolas=vegetables) as Angola was once a Portuguese colony.

PS2. I dropped a note to Countrystats support about this.