Wednesday, January 30, 2013

GAMS to OpenOpt too scary?

http://forum.openopt.org/viewtopic.php?id=579:

“This works though I'm worried it might look too scary to my students coming from GAMS.”

GAMS/VBScript Tool to view DBF table structure

When receiving DBF files I usually look at the structure by loading it into Excel. In this case I received DBF files that were much too large to be handled by Excel. Before writing queries it is useful to be able to look at the column names and types. That is what this GAMS job does:

$ontext

  
Describe database tables

  
Gives description of DBF files in project directory

$offtext

$set dir  %system.fp%
$set connection "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=%dir%;Extended Properties=dBASE IV;"

$onecho > db.vbs

WScript.Echo "------------------------------------------------------------------------"

Const adSchemaTables = 20
Const adSchemaColumns = 4

Set objConnection = CreateObject("ADODB.Connection")
WScript.Echo "ADO Version:",objConnection.Version
objConnection.ConnectionString = "%connection%"
WScript.Echo "ADO Connection:",objConnection.ConnectionString
Wscript.Echo
objConnection.Open
Set rs1 = objConnection.OpenSchema(adSchemaTables)
if (rs1.Eof) then
  
WScript.Echo "No tables found"
end if
Do Until rs1.Eof
   
TableName = rs1("Table_Name")
   
Set rs2 = objConnection.OpenSchema(adSchemaColumns,Array(Null, Null, TableName))
   
Wscript.Echo UCase(TableName)


   
Do While Not rs2.EOF

       
ColumnName = rs2("Column_Name")
       
ColTypeNum = rs2("Data_Type")
       
ColClen    = rs2("Character_Maximum_Length")

       
select case ColTypeNum
          
case 20   ColTypeName = "BigInt"
          
case 128  ColTypeName = "Binary"
          
case 11   ColTypeName = "Boolean"
          
case 129  ColTypeName = "Char"
          
case 6    ColTypeName = "Currency"
          
case 7    ColTypeName = "Date"
          
case 135  ColTypeName = "TimeStamp"
          
case 14   ColTypeName = "Decimal"
          
case 5    ColTypeName = "Double"
          
case 72   ColTypeName = "GUID"
          
case 9    ColTypeName = "Dispatch"
          
case 3    ColTypeName = "Integer"
          
case 205  ColTypeName = "LongVarBinary(Image,Blob)"
          
case 201  ColTypeName = "LongVarChar(Memo,Text)"
          
case 203  ColTypeName = "LongVarWChar(Memo,Text)"
          
case 131  ColTypeName = "Numeric"
          
case 4    ColTypeName = "Single"
          
case 2    ColTypeName = "SmallInt"
          
case 17   ColTypeName = "UnsignedTinyInt(Byte)"
          
case 204  ColTypeName = "VarBinary"
          
case 200  ColTypeName = "VarChar"
          
case 12   ColTypeName = "Variant"
          
case 202  ColTypeName = "VarWChar"
          
case 130  ColTypeName = "WChar"
          
case else ColTypeName = ColTypeNum
       
end select

       
Wscript.Echo  "    ", ColumnName & ", " & ColTypeName & ", " & ColClen
       
rs2.MoveNext
   
Loop
   
rs2.Close

   
Wscript.Echo
   
rs1.MoveNext
Loop

rs1.Close
objConnection.Close

WScript.Echo "------------------------------------------------------------------------"

$offecho

$call cscript db.vbs

The output looks like:


--- Job describe.gms Start 01/30/13 15:24:38 WEX-VS8 23.9.5 x86/MS Windows       
GAMS Rev 239  Copyright (C) 1987-2012 GAMS Development. All rights reserved
Licensee: Erwin Kalvelagen                               G120717/0001CV-WIN
          Amsterdam Optimization Modeling Group                      DC4572
--- Starting compilation
--- describe.gms(80) 2 Mb
--- call cscript db.vbs
Microsoft (R) Windows Script Host Version 5.8
Copyright (C) Microsoft Corporation. All rights reserved.

------------------------------------------------------------------------
ADO Version: 6.1
ADO Connection: Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\projects\ulrike - ifpri\gdx\;Extended Properties=dBASE IV;

SPAM2005_HARVA_TALL_BETA
     ALLOC_KEY, WChar, 8
     CELL5M, Double,
     CR_NAME_SH, WChar, 4
     CR_VALUE, Double,
     ISO3, WChar, 3
     PROD_LEVEL, WChar, 7
     REC_TYPE, WChar, 1
     TECH, WChar, 1
     UNIT, WChar, 6

SPAM2005_HARVA_TALL_BETA_1
     ALLOC_KEY, WChar, 8
     CELL5M, Double,
     CR_NAME_SH, WChar, 4
     CR_VALUE, Double,
     ISO3, WChar, 3
     PROD_LEVEL, WChar, 7
     REC_TYPE, WChar, 1
     TECH, WChar, 1
     UNIT, WChar, 6

SPAM2005_PROD_TALL_BETA_1
     ALLOC_KEY, WChar, 8
     CELL5M, Double,
     CR_NAME_SH, WChar, 4
     CR_VALUE, Double,
     ISO3, WChar, 3
     PROD_LEVEL, WChar, 7
     REC_TYPE, WChar, 1
     TECH, WChar, 1
     UNIT, WChar, 6

------------------------------------------------------------------------
--- describe.gms(81) 2 Mb
--- Starting execution - empty program
*** Status: Normal completion
--- Job describe.gms Stop 01/30/13 15:24:38 elapsed 0:00:00.316

I looked for some free DBF viewers, but they largely originate from obscure Russian web sites and Google warned not to open the downloaded zip files. This ADO based approach should at least create no problems like introducing viruses.

Monday, January 28, 2013

Reading a CSV file

This is a difficult CSV file to read using using the GAMS tool GDXXRW:

"BARID","BASIN","CAND","PASS1","PASS2","PASS3","USHAB","COST","DSID"

4980,"StJoseph",1,0,0,0,1563.44,0,-1

2217,"StJoseph",1,0,0,0,440.3,100000,4980

2307,"StJoseph",1,1,1,1,1577.67,0,4980

4992,"StJoseph",1,0,0,0,12.16,0,4980

…..

We can however read this with SQL2GMS using the following query:

Q1=select barid,dsid,'pass1',pass1 from %csvfile% where barid is not null \
   union \
   select barid,dsid,'pass2',pass2 from %csvfile% where barid is not null \
   union \
   select barid,dsid,'pass3',pass3 from %csvfile% where barid is not null \
   union \
   select barid,dsid,'cost',cost from %csvfile% where barid is not null

The empty lines are handled by the check for NULLs. Notice that the parameter will look like:

image

The advantage of SQL is that we have some flexibility in reorganizing and operating on the data before it is passed on to the GDX DLL. In some cases that can be very useful.

In virtually all of the recent projects I have been involved with, I used SQL2GMS to get (sometimes large) data into GAMS. Not all GAMS modelers are familiar with SQL but given the large amounts of data available in databases I would say that it is a worthwhile time investment to get some basic database knowledge.

SQL Server and PIVOT/UNPIVOT

http://msdn.microsoft.com/en-us/library/ms177410(v=sql.105).aspx

I think the syntax is somewhat confusing, but this thing can be handy.

Not all databases I encounter are normalized! In fact I see many tables that resemble spreadsheets (i.e. with many columns).

Sunday, January 27, 2013

GAMS sameas problem

I tried a construct that looks like the following code:

sets
   cell5m
   cell5m_spam
;

$gdxin d.gdx
$load cell5m cell5m_spam


set
cellmap(cell5m,cell5m_spam);
cellmap(cell5m,cell5m_spam)$sameas(cell5m,cell5m_spam) =
yes
;   (1)

Basically I take the intersection of the two sets and create a mapping between the two base sets. Unfortunately, the sets are big:

scalars
  n_cell5m
  n_cell5m_spam
;

n_cell5m =
card(cell5m);
n_cell5m_spam =
card
(cell5m_spam);

display
n_cell5m, n_cell5m_spam;

----     23 PARAMETER n_cell5m             =  2294094.000 
            PARAMETER n_cell5m_spam        =   806882.000 

as a result expression (1) takes an extraordinary long time to execute:

----     26 Assignment cellmap    2436.954  2437.032 SECS    383 Mb  806839

I believe the sameas function is just not properly implemented in GAMS. This performance is really not what one would expect.

Of course we can work around this:

sets
   cell5m(*)
   cell5m_spam(*)
;

$gdxin d.gdx
$load cell5m cell5m_spam


alias (*,u);
set
intersect(*);
intersect(u) = cell5m(u)*cell5m_spam(u);


set
cellmap(cell5m,cell5m_spam);
cellmap(cell5m(intersect),cell5m_spam(intersect)) =
yes
;

This now executes in much less than a second:

----     27 Assignment intersect     0.187     0.203 SECS    333 Mb  806839
----     33 Assignment cellmap       0.203     0.406 SECS    358 Mb  806839

Note that we had to change the declarations of the sets a little bit, in order to get this passed the domain checking of GAMS.

Monday, January 21, 2013

GAMS/GDX cannot read scalar from compressed GDX file

I encountered a problem when reading a scalar from a compressed GDX file:

GDXIN   C:\projects\ulrike - ifpri\gdx\spam.gdx
--- LOAD  cell5m = 2:cell5m
--- LOAD  rowcount = 1:rowcount
**** Problem reading symbol rowcount from gdx file
**** GDXmsg: Expected data marker (DATA) not found in GDX file

I did not really need the rowcount scalar so for me the workaround was: just skip reading this scalar.

It is amazing that this bug was not found earlier. I guess not many users are using compressed GDX files (the default is no compression) and of those few even fewer will ever store a scalar in such a file.

Some possible workarounds:

  • skip reading scalars
  • store scalars as a 1 dimensional parameter
  • store scalars in a separate uncompress gdx file
  • read scalars using gdxdump.

As said a possible workaround could be to use gdxdump for the scalars. This means instead of using:

$set gdxfile spam.gdx

set
cell5m;
scalar
rowcount;

* read from compressed GDX file

$gdxin %gdxfile%
$load cell5m  rowcount

we should implement something like:

$set gdxfile spam.gdx

set
cell5m;
scalar
rowcount;

* read from compressed GDX file
$gdxin %gdxfile%
$load cell5m

$call gdxdump %gdxfile% symb=rowcount output=rowcount.inc
$include rowcount.inc


display
rowcount;

No doubt this will be fixed in the next release.

Friday, January 18, 2013

Excel 1 million row limit

Excel 7 (and later) has limit of 1,048,576 rows (before version 7 this limit was 65,536 rows). When using COM automation (using Excel from a program) one usually turns off DisplayAlerts to prevent popping up messages related to saving workbooks etc. However this also means that a large CSV file will be silently truncated by the GAMS utility GDXXRW:

C:\projects\tmp>tail x.csv

2999991

2999992

2999993

2999994

2999995

2999996

2999997

2999998

2999999

3000000

 

C:\projects\tmp>wc x.csv

3000000 3000000 25888896 x.csv

 

C:\projects\tmp>gdxxrw i=x.csv o=x.gdx set=x rng=a1 rdim=1 cdim=0

 

GDXXRW           Dec 18, 2012 24.0.1 WIN 37366.37409 VS8 x86/MS Windows

Input file : C:\projects\tmp\x.csv

Output file: C:\projects\tmp\x.gdx

Total time = 7847 Ms

 

C:\projects\tmp>gdxdump x.gdx > x.gms

 

C:\projects\tmp>head x.gms

$onempty

 

Set x(*) /

'1',

'2',

'3',

'4',

'5',

'6',

'7',

 

C:\projects\tmp>tail x.gms

'1048569',

'1048570',

'1048571',

'1048572',

'1048573',

'1048574',

'1048575',

'1048576' /;

 

$offempty

 

C:\projects\tmp>

We can read unlimited size CSV files using SQL2GMS (see for instance http://yetanothermathprogrammingconsultant.blogspot.com/2010/04/gamsgdx-compression.html).

I talked with the GAMS people about this. The proposed fix for the next release: check if row 108576 is not empty. If that is the case issue a warning message.

Wednesday, January 16, 2013

Loading large data set into GAMS GDX file

image

For a project we convert data from a humongous table with 23 million records stored in a remote SQL Server database (global spatial grid with data on agricultural production and land use from the Spatial Production Allocation Model). It takes about 1 hour to run a query and populate a GDX file. Probably the performance is related to limited bandwidth. This is truly an example of “big data”. In a subsequent model we load the large GDX file into GAMS for further processing. This loading is much faster: about 4 seconds. It takes about 750 MB RAM to load the data in GAMS.

For these type of applications the use of GDX file as a high-performance staging “database” is very convenient. It would be a nightmare to load this data directly from the SQL Server database into GAMS: every run would start with a 1 hour startup time.

Compression works quite well on this gdx file (see also: http://yetanothermathprogrammingconsultant.blogspot.com/2010/04/gamsgdx-compression.html). Here are the file sizes:

Compression File size in bytes
Uncompressed 303,114,020
Compressed 77,879,752

Users are more and more throwing real large data at me. Besides the SQL Server database above, I was also delivered an 1.2 GB DBF file. DBF files are still popular with GIS applications. This file could be imported into a compressed GDX file with the following size reduction (also partly because we dropped some redundant columns but mainly because GDX stores strings more efficiently):

File format Size
dbf 1.2 GB
compressed GDX 62 MB

Sunday, January 13, 2013

C vs javascript

The difference between standard C compiled GLPK and the Javascript version seems to be a factor 10. As one could expect the solution path is slightly different.



C based glpsol.exe

GLPK Simplex Optimizer, v4.47
2726 rows, 6569 columns, 39488 non-zeros
Preprocessing...
2463 rows, 5594 columns, 33424 non-zeros
Scaling...
A: min|aij| = 4.104e-004  max|aij| = 1.389e+004  ratio = 3.384e+007
GM: min|aij| = 9.610e-002  max|aij| = 1.041e+001  ratio = 1.083e+002
EQ: min|aij| = 9.235e-003  max|aij| = 1.000e+000  ratio = 1.083e+002
Constructing initial basis...
Size of triangular part = 2463
      0: obj = -2.893228867e+010  infeas = 4.389e+005 (0)
    500: obj =  4.288287651e+009  infeas = 1.013e+005 (0)
   1000: obj = -2.415760147e+009  infeas = 9.499e+002 (0)
*  1496: obj = -2.117757487e+009  infeas = 1.461e-013 (0)
*  1500: obj = -2.049813931e+009  infeas = 1.461e-013 (0)
*  2000: obj = -2.582182378e+007  infeas = 2.005e-012 (0)
*  2500: obj = -5.675534925e+006  infeas = 1.362e-013 (0)
*  3000: obj =  5.134369324e+004  infeas = 1.049e-013 (0)
*  3500: obj =  8.575779704e+004  infeas = 3.918e-014 (0)
*  4000: obj =  1.034583169e+005  infeas = 7.290e-014 (0)
*  4500: obj =  1.148573480e+005  infeas = 4.027e-014 (0)
*  4527: obj =  1.148736555e+005  infeas = 3.876e-014 (0)
OPTIMAL SOLUTION FOUND
Time used:   1.2 secs

javascript
GLPK Simplex Optimizer, v4.47
2726 rows, 6569 columns, 39488 non-zeros
Preprocessing...
2463 rows, 5594 columns, 33424 non-zeros
Scaling...
A: min|aij| = 0.000410398983226989 max|aij| = 13886.775 ratio = 33837254.88500861
GM: min|aij| = 0.09609772925154227 max|aij| = 10.406073148538535 ratio = 108.2863583727347
EQ: min|aij| = 0.009234773567302722 max|aij| = 1.0000000000000002 ratio = 108.2863583727347
Constructing initial basis...
Size of triangular part = 2463
0: obj = -28932288674.902164 infeas = 438884.18813663867 (0)
500: obj = 2417695413.8613462 infeas = 91877.35020040894 (0)
1000: obj = -3476603174.5055895 infeas = 712.4076522292974 (0)
1500: obj = -3524070971.30494 infeas = 42.2144600534787 (0)
*1509: obj = -3523600633.5120773 infeas = 1.8645671302407697e-13 (0)
*2000: obj = -35554643.69264513 infeas = 2.521881663037279e-12 (0)
*2500: obj = -4067298.103876429 infeas = 1.475546792196709e-13 (0)
*3000: obj = 52841.771320373286 infeas = 8.990385481854632e-14 (0)
*3500: obj = 84846.60326744069 infeas = 1.5219189175881605e-14 (0)
*4000: obj = 99082.72629055602 infeas = 1.5758436355355321e-13 (0)
*4475: obj = 114873.65551217867 infeas = 3.595886618335783e-14 (0)
OPTIMAL SOLUTION FOUND
Elapsed time: 14.672 seconds



Friday, January 11, 2013

Links

Thursday, January 10, 2013

Clock based seed in GAMS

If you use a random numbers in GAMS, setting execseed to a clock based number can be used to make each random number sequence different. I sometimes see:

execseed = gmillisec(jnow);

The idea is to initialize the seed with the milliseconds part of the current date. This is not completely correct however. In very rare cases this will fail with the message:

**** Exec Error at line 4: Random seed has to be > 0.0 and < 10**9

A better version looks like:

execseed = 1+gmillisec(jnow);

I.e. the advice given by the GAMS people here is actually not that good: http://support.gams.com/doku.php?id=gams:random_number_generator_in_gams:

image

Note that runs with this clock based construct will make your runs irreproducable (and thus bugs difficult to handle).

Tuesday, January 8, 2013

Optimizing some GAMS code

Some part of a large GAMS model was running slow so we had to investigate. The first line in the box below was the culprit:

image

We tried two alternative formulations using implicit loops. The performance gains on this data set were enormous (now the GAMS part is no longer the bottleneck in the process, the bulk of the time has moved back to database operations). Explicit loops can be much more expensive than implicit loops, especially when dealing with big data.

Friday, January 4, 2013

Always check the data

Here some agricultural production statistics for Angola (source: XLS download from FAO/CountryStat):

image 

I am sure the harvest of 2011 was not that abundant.

Looks like an import problem. This is the XLS file. Also available is CSV data. That shows:

image

All other production data are coded as integers, but here we have a number with a decimal comma (and many more digits than warranted). Decimal commas/decimal points are always a point of contention.

PS: Crops are denoted in Portuguese (Horticolas=vegetables) as Angola was once a Portuguese colony.

PS2. I dropped a note to Countrystats support about this.