http://forum.openopt.org/viewtopic.php?id=579:
“This works though I'm worried it might look too scary to my students coming from GAMS.”
I am a full-time consultant and provide services related to the design, implementation and deployment of mathematical programming, optimization and data-science applications. I also teach courses and workshops. Usually I cannot blog about projects I am doing, but there are many technical notes I'd like to share. Not in the least so I have an easy way to search and find them again myself. You can reach me at erwin@amsterdamoptimization.com.
http://forum.openopt.org/viewtopic.php?id=579:
“This works though I'm worried it might look too scary to my students coming from GAMS.”
When receiving DBF files I usually look at the structure by loading it into Excel. In this case I received DBF files that were much too large to be handled by Excel. Before writing queries it is useful to be able to look at the column names and types. That is what this GAMS job does:
$ontext |
The output looks like:
------------------------------------------------------------------------ SPAM2005_HARVA_TALL_BETA SPAM2005_HARVA_TALL_BETA_1 SPAM2005_PROD_TALL_BETA_1 ------------------------------------------------------------------------ |
I looked for some free DBF viewers, but they largely originate from obscure Russian web sites and Google warned not to open the downloaded zip files. This ADO based approach should at least create no problems like introducing viruses.
This is a difficult CSV file to read using using the GAMS tool GDXXRW:
"BARID","BASIN","CAND","PASS1","PASS2","PASS3","USHAB","COST","DSID" 4980,"StJoseph",1,0,0,0,1563.44,0,-1 2217,"StJoseph",1,0,0,0,440.3,100000,4980 2307,"StJoseph",1,1,1,1,1577.67,0,4980 4992,"StJoseph",1,0,0,0,12.16,0,4980 |
We can however read this with SQL2GMS using the following query:
Q1=select barid,dsid,'pass1',pass1 from %csvfile% where barid is not null \ |
The empty lines are handled by the check for NULLs. Notice that the parameter will look like:
The advantage of SQL is that we have some flexibility in reorganizing and operating on the data before it is passed on to the GDX DLL. In some cases that can be very useful.
In virtually all of the recent projects I have been involved with, I used SQL2GMS to get (sometimes large) data into GAMS. Not all GAMS modelers are familiar with SQL but given the large amounts of data available in databases I would say that it is a worthwhile time investment to get some basic database knowledge.
http://msdn.microsoft.com/en-us/library/ms177410(v=sql.105).aspx
I think the syntax is somewhat confusing, but this thing can be handy.
Not all databases I encounter are normalized! In fact I see many tables that resemble spreadsheets (i.e. with many columns).
I tried a construct that looks like the following code:
sets |
Basically I take the intersection of the two sets and create a mapping between the two base sets. Unfortunately, the sets are big:
scalars n_cell5m n_cell5m_spam ; n_cell5m = card(cell5m); n_cell5m_spam = card(cell5m_spam); display n_cell5m, n_cell5m_spam; ---- 23 PARAMETER n_cell5m = 2294094.000 |
as a result expression (1) takes an extraordinary long time to execute:
---- 26 Assignment cellmap 2436.954 2437.032 SECS 383 Mb 806839 |
I believe the sameas function is just not properly implemented in GAMS. This performance is really not what one would expect.
Of course we can work around this:
sets alias (*,u); set intersect(*); intersect(u) = cell5m(u)*cell5m_spam(u); set cellmap(cell5m,cell5m_spam); cellmap(cell5m(intersect),cell5m_spam(intersect)) = yes; |
This now executes in much less than a second:
---- 27 Assignment intersect 0.187 0.203 SECS 333 Mb 806839 ---- 33 Assignment cellmap 0.203 0.406 SECS 358 Mb 806839 |
Note that we had to change the declarations of the sets a little bit, in order to get this passed the domain checking of GAMS.
I encountered a problem when reading a scalar from a compressed GDX file:
GDXIN C:\projects\ulrike - ifpri\gdx\spam.gdx |
I did not really need the rowcount scalar so for me the workaround was: just skip reading this scalar.
It is amazing that this bug was not found earlier. I guess not many users are using compressed GDX files (the default is no compression) and of those few even fewer will ever store a scalar in such a file.
Some possible workarounds:
As said a possible workaround could be to use gdxdump for the scalars. This means instead of using:
we should implement something like:$set gdxfile spam.gdx
set cell5m;
scalar rowcount;
* read from compressed GDX file
$gdxin %gdxfile%
$load cell5m rowcount
$set gdxfile spam.gdx
set cell5m;
scalar rowcount;
* read from compressed GDX file
$gdxin %gdxfile%
$load cell5m
$call gdxdump %gdxfile% symb=rowcount output=rowcount.inc
$include rowcount.inc
display rowcount;
No doubt this will be fixed in the next release.
Excel 7 (and later) has limit of 1,048,576 rows (before version 7 this limit was 65,536 rows). When using COM automation (using Excel from a program) one usually turns off DisplayAlerts to prevent popping up messages related to saving workbooks etc. However this also means that a large CSV file will be silently truncated by the GAMS utility GDXXRW:
C:\projects\tmp>tail x.csv 2999991 2999992 2999993 2999994 2999995 2999996 2999997 2999998 2999999 3000000
C:\projects\tmp>wc x.csv 3000000 3000000 25888896 x.csv
C:\projects\tmp>gdxxrw i=x.csv o=x.gdx set=x rng=a1 rdim=1 cdim=0
GDXXRW Dec 18, 2012 24.0.1 WIN 37366.37409 VS8 x86/MS Windows Input file : C:\projects\tmp\x.csv Output file: C:\projects\tmp\x.gdx Total time = 7847 Ms
C:\projects\tmp>gdxdump x.gdx > x.gms
C:\projects\tmp>head x.gms $onempty
Set x(*) / '1', '2', '3', '4', '5', '6', '7',
C:\projects\tmp>tail x.gms '1048569', '1048570', '1048571', '1048572', '1048573', '1048574', '1048575', '1048576' /;
$offempty
C:\projects\tmp> |
We can read unlimited size CSV files using SQL2GMS (see for instance http://yetanothermathprogrammingconsultant.blogspot.com/2010/04/gamsgdx-compression.html).
I talked with the GAMS people about this. The proposed fix for the next release: check if row 108576 is not empty. If that is the case issue a warning message.
For a project we convert data from a humongous table with 23 million records stored in a remote SQL Server database (global spatial grid with data on agricultural production and land use from the Spatial Production Allocation Model). It takes about 1 hour to run a query and populate a GDX file. Probably the performance is related to limited bandwidth. This is truly an example of “big data”. In a subsequent model we load the large GDX file into GAMS for further processing. This loading is much faster: about 4 seconds. It takes about 750 MB RAM to load the data in GAMS.
For these type of applications the use of GDX file as a high-performance staging “database” is very convenient. It would be a nightmare to load this data directly from the SQL Server database into GAMS: every run would start with a 1 hour startup time.
Compression works quite well on this gdx file (see also: http://yetanothermathprogrammingconsultant.blogspot.com/2010/04/gamsgdx-compression.html). Here are the file sizes:
Compression | File size in bytes |
Uncompressed | 303,114,020 |
Compressed | 77,879,752 |
Users are more and more throwing real large data at me. Besides the SQL Server database above, I was also delivered an 1.2 GB DBF file. DBF files are still popular with GIS applications. This file could be imported into a compressed GDX file with the following size reduction (also partly because we dropped some redundant columns but mainly because GDX stores strings more efficiently):
File format | Size |
dbf | 1.2 GB |
compressed GDX | 62 MB |
The difference between standard C compiled GLPK and the Javascript version seems to be a factor 10. As one could expect the solution path is slightly different.
C based glpsol.exe | GLPK Simplex Optimizer, v4.47 |
javascript | GLPK Simplex Optimizer, v4.47 |
If you use a random numbers in GAMS, setting execseed to a clock based number can be used to make each random number sequence different. I sometimes see:
execseed = gmillisec(jnow);
The idea is to initialize the seed with the milliseconds part of the current date. This is not completely correct however. In very rare cases this will fail with the message:
**** Exec Error at line 4: Random seed has to be > 0.0 and < 10**9
A better version looks like:
execseed = 1+gmillisec(jnow);
I.e. the advice given by the GAMS people here is actually not that good: http://support.gams.com/doku.php?id=gams:random_number_generator_in_gams:
Note that runs with this clock based construct will make your runs irreproducable (and thus bugs difficult to handle).
Some part of a large GAMS model was running slow so we had to investigate. The first line in the box below was the culprit:
We tried two alternative formulations using implicit loops. The performance gains on this data set were enormous (now the GAMS part is no longer the bottleneck in the process, the bulk of the time has moved back to database operations). Explicit loops can be much more expensive than implicit loops, especially when dealing with big data.
Is there any software available for non-convex QCP’s? (http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14925577).
You may want to try GloMIQO which targets these type of problems.
See:
Here some agricultural production statistics for Angola (source: XLS download from FAO/CountryStat):
I am sure the harvest of 2011 was not that abundant.
Looks like an import problem. This is the XLS file. Also available is CSV data. That shows:
All other production data are coded as integers, but here we have a number with a decimal comma (and many more digits than warranted). Decimal commas/decimal points are always a point of contention.
PS: Crops are denoted in Portuguese (Horticolas=vegetables) as Angola was once a Portuguese colony.
PS2. I dropped a note to Countrystats support about this.