Tuesday, January 24, 2023

Export GAMS GDX file to different Python formats (CSV, Feather, Pickle)

A GAMS GDX file is a binary file containing a number of GAMS symbols with their data. When collaborating with colleagues who work with Python, it may be useful to convert a GDX file with relevant data into data files that Python can consume.

The idea here is that a GAMS user can run the GAMS script without knowing Python or having Python installed (GAMS ships with its own somewhat barebones Python, which is used inside the GAMS script). On the other hand, the Python user does not need GAMS installed to read the generated data files. This is in contrast to Python packages and tools that interact with GDX files directly. You can see a few in the PyPI directory [1]. Those require both a GAMS system and a Python environment.


The tool supports three different formats:
  1. A directory with CSV files. One file for each symbol.
  2. A directory with Feather files. These are like binary CSV files [2]. Again, one file per symbol.
  3. A single Pickle file.

The input consists of:
  1. GDX file
  2. name of output directory
  3. output format
The output is:
  1. Data files.
  2. Text file with a description of the content of the GDX file.
  3. Python script containing example code on how to read the data files. The data is read into Pandas dataframes.

This tool relies on embedded Python and the gamstransfer package.

Example 1: sparse matrix


In this example, we generate a sparse matrix in GAMS, and export this using our tool. The code to generate this data set is as follows:

Sets
  i /i1*i1000/
  j /j1*j1000/
;

Parameter A(i,j) 'sparse matrix';
A(i,j)$(uniform(0,1)<0.05) = uniform(1,1000);

execute_unload "SparseMatrix.gdx",i,j,A;

 

The generated matrix has a density of about 5%. The number of non-zero elements is 50,166. After running the tool in CSV mode, we have an output directory with the CSV files and some supporting files. The file gdxcontent.txt shows the symbols in the GDX file:

C:\Users\erwin\Downloads\sparsematrixCSV>type gdxcontent.txt
  Symbol Dim Type Records  Explanatory text
1 A        2  Par   50166  sparse matrix
2 i        1  Set    1000
3 j        1  Set    1000

 

The generated file howtoread.py shows how we can read the data files: 

C:\Users\erwin\Downloads\sparsematrixCSV>type howtoread.py
# example Python reader for data from sparsematrix.gdx
import pandas as pd

i = pd.read_csv('i.csv')
j = pd.read_csv('j.csv')
A = pd.read_csv('A.csv')

 

When we run this, we see dataframes i and j with columns uni_0 and element_text, and a dataframe A with columns i_0, j_1 and value.

This is a bit confusing. Let me try to interpret this.

  • The column name 'uni_0' probably indicates this is a subset of the universe of set elements in the GAMS model. The zero is probably added because there may be multiple indices. In this case, there are no multiple columns, so the 0 suffix is not needed. In GAMS terms, this would be a '*' column. 
  • The column 'element_text' is likely reserved for text associated with a set element. We don't have this kind of explanatory text in our model, so this column can be ignored. It probably should not even be there.
  • The column names 'i_0' and 'j_1' are mangled. They could just be 'i' and 'j'; the suffix is the position of the index. This is probably to guard against duplicate column names. GAMS allows something like A(i,i), so the concern is understandable. However, it would be better to mangle only when needed. 
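If the mangled names bother you, undoing them on the Python side is a one-liner. A small sketch, using a hypothetical row from the records of A as shown in the output below:

```python
import pandas as pd

# hypothetical records for A as produced by the tool
A = pd.DataFrame({'i_0': ['i1'], 'j_1': ['j44'], 'value': [339.211722]})

# undo the mangling: there are no duplicate domains here
A = A.rename(columns={'i_0': 'i', 'j_1': 'j'})
```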

The other formats store the same thing. The pickle format stores the dataframes in a single dictionary. The generated howtoread.py file looks like:

C:\Users\erwin\Downloads\sparsematrixPickle>type howtoread.py
# example Python reader for data from sparsematrix.gdx
import pandas as pd
import pickle

print('Pickle file was written with Pandas version 1.4.4')
print(f'Your version is {pd.__version__}. This should be the same or newer.')

with open('alldata.pickle', 'rb') as f:
   alldata = pickle.load(f)

print(f'Keys:{alldata.keys()}')

print(f'Summary of data:\n{alldata}')

 

Some info about versions is added: we can only reliably read Pickle files with a matching (or newer) Pandas version. The script also prints the keys and a summary, so you know a bit about what is inside the 'alldata' dictionary. The output looks like:

C:\Users\erwin\Downloads\sparsematrixPickle>py howtoread.py
Pickle file was written with Pandas version 1.4.4
Your version is 1.5.2. This should be the same or newer.
Keys:dict_keys(['i', 'j', 'A'])
Summary of data:
{'i':      uni_0 element_text
0       i1
1       i2
2       i3
3       i4
4       i5
..     ...          ...
995   i996
996   i997
997   i998
998   i999
999  i1000

[1000 rows x 2 columns], 'j':      uni_0 element_text
0       j1
1       j2
2       j3
3       j4
4       j5
..     ...          ...
995   j996
996   j997
997   j998
998   j999
999  j1000

[1000 rows x 2 columns], 'A':          i_0   j_1       value
0         i1   j44  339.211722
1         i1   j58  792.568281
2         i1   j64  585.546042
3         i1   j79  270.343439
4         i1   j98  401.826455
...      ...   ...         ...
50161  i1000  j955  866.032624
50162  i1000  j963  633.530151
50163  i1000  j974  300.124396
50164  i1000  j985  987.395802
50165  i1000  j991  341.163135

[50166 rows x 3 columns]}

 

We see that alldata['i'], alldata['j'] and alldata['A'] contain our dataframes.

The third format is a feather file. These are basically like binary CSV files. They can be read as:

C:\Users\erwin\Downloads\sparsematrixFeather>type howtoread.py
# example Python reader for data from sparsematrix.gdx
import pandas as pd

i = pd.read_feather('i.feather')
j = pd.read_feather('j.feather')
A = pd.read_feather('A.feather')

 

There is one big issue with this format: the crippled Python version that comes with GAMS has no support for this out of the box, and adding it takes a few manual steps. Here is a recipe:

>cd \gams\41\GMSPython    (or whatever your GAMS installation directory is)
>curl -O https://bootstrap.pypa.io/get-pip.py
>python get-pip.py
>scripts\pip install pyarrow


After this, you can let the script export dataframes into Feather files.

The dataframes have some properties you should be aware of:
  • They are in long format (database format). This is how things are typically stored in a database. In a spreadsheet or wide format, we would pivot one of the index columns to the right. Pivot and unpivot operations are some of the most important ones when using dataframes.
  • They use a sparse representation: the zero values are not stored.
  • The index columns form a unique key. There are no duplicates.
  • The index columns are strings, and the value column is a double-precision floating point value. When using CSV (text) files, these may appear as integers. 
If you want to convert the sparse representation of A into a proper 2d numpy matrix, you can do something like:

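In the original post this code is shown as an image. A self-contained sketch of how it could look, using a tiny made-up stand-in for the exported dataframes (in practice they come from pd.read_csv('i.csv') etc., with the column names uni_0, i_0, j_1, value produced by the tool):

```python
import pandas as pd

# tiny stand-in for the exported data
i = pd.DataFrame({'uni_0': ['i1', 'i2', 'i3']})
j = pd.DataFrame({'uni_0': ['j1', 'j2']})
A = pd.DataFrame({'i_0': ['i1', 'i3'],
                  'j_1': ['j2', 'j1'],
                  'value': [5.0, 7.0]})

# pivot the sparse long format to wide, impose the ordering of
# sets i and j, and fill the missing (zero) entries
Adense = (A.pivot(index='i_0', columns='j_1', values='value')
           .reindex(index=i['uni_0'], columns=j['uni_0'])
           .fillna(0.0)
           .to_numpy())
```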

The ordering is taken from sets i and j.


Example 2: sparse network


Here is a simple procedure to generate a sparse directed network or graph in GAMS:

set i 'nodes' /node1*node100/;
alias(i,j);
set a(i,j) 'arcs';

a(i,j) = uniform(0,1)<0.1;

*
* only display if small
*
display$(card(i)<50) i,a;

*
* export network as GDX file
*
execute_unload "network.gdx",i,a;

 

 

This data set has a 2-dimensional set 'a' representing the arcs (or edges). After converting the GDX file to CSV files, we can import this in Python as:

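The generated howtoread.py simply reads each CSV into a dataframe. A self-contained sketch (writing two tiny stand-in files first; the real i.csv and a.csv come from the tool):

```python
import pandas as pd

# tiny stand-in files; in practice these are written by gdx2python.gms
pd.DataFrame({'uni_0': ['node1', 'node2', 'node3'],
              'element_text': ['', '', '']}).to_csv('i.csv', index=False)
pd.DataFrame({'i_0': ['node1', 'node2'],
              'j_1': ['node2', 'node3'],
              'element_text': ['', '']}).to_csv('a.csv', index=False)

# this mirrors the generated howtoread.py
i = pd.read_csv('i.csv')
a = pd.read_csv('a.csv')
```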

The arcs dataframe a has one row per arc, with the mangled domain columns i_0 and j_1 plus an (empty) element_text column.
Let's do some simple operations on the network. In GAMS, we can do:

*
* size of network
*
parameter size(*) 'number of nodes and arcs';
size('nodes') = card(i);
size('arcs') = card(a);
option size:0; display size;

*
* print some statistics on the degrees of the nodes
*
parameter maxdegree(*) 'max in/out-degree';
maxdegree('in') = smax(i,sum(a(j,i),1));
maxdegree('out') = smax(i,sum(a(i,j),1));
maxdegree('in/out') = smax(i,sum(a(j,i),1)+sum(a(i,j),1));
option maxdegree:0; display maxdegree;

 

 

This shows:


----     30 PARAMETER size  number of nodes and arcs
nodes  100,    arcs  1002

----     39 PARAMETER maxdegree  max in/out-degree
in     20,    out    17,    in/out 36



Let's do the same thing in Python, using the networkx package.

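In the original post this code is shown as an image. A sketch of how it could look, using a hypothetical three-arc network as a stand-in for the arcs dataframe read from a.csv:

```python
import networkx as nx
import pandas as pd

# stand-in for the arcs dataframe read from a.csv
a = pd.DataFrame({'i_0': ['node1', 'node1', 'node2'],
                  'j_1': ['node2', 'node3', 'node3']})

# build a directed graph from the arc list
G = nx.from_pandas_edgelist(a, source='i_0', target='j_1',
                            create_using=nx.DiGraph)

# size of the network
print('nodes:', G.number_of_nodes())
print('arcs: ', G.number_of_edges())

# max degrees, mirroring the GAMS maxdegree parameter
print('max in-degree:    ', max(d for _, d in G.in_degree()))
print('max out-degree:   ', max(d for _, d in G.out_degree()))
print('max in/out-degree:', max(d for _, d in G.degree()))
```

Note that this only picks up nodes that appear in at least one arc; isolated nodes from set i would have to be added with G.add_nodes_from.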



Conclusion


The GAMS script shown in the appendix is a tool to convert a GAMS GDX data file into a representation that standalone Python can consume. This little tool has already proven useful for GAMS modelers producing data sets for further processing in a Python environment.


References


Appendix: gdx2python.gms


$onText

   Export GDX file to Python data files

   Supported export formats:

     - multiple CSV files (one per symbol)
     - multiple Feather files (binary, one per symbol)
     - Pickle file (single file)

   The Feather files are the fastest. However, you need to install
   support for this (package pyarrow).

$offText


*-----------------------------------------------------------------------
* User-provided settings
*-----------------------------------------------------------------------

$set gdxfile      network.gdx
$set outputdir    networkdata

* Output format
* can be: CSV,Feather,Pickle
$set format       CSV
*$set format       Feather
*$set format       Pickle


*-----------------------------------------------------------------------
* Body
*-----------------------------------------------------------------------

* quick checks
$if not exist %gdxfile%   $abort %gdxfile% does not exist
$call mkdir %outputdir%
$if errorlevel 1 $abort could not create %outputdir%
* you can comment out the previous line if you want to go ahead
* in the case the directory already exists.

* for reference, dump the table of contents of the GDX file to the output directory
$call gdxdump %gdxfile% symbols > %outputdir%\gdxcontent.txt
$call type %outputdir%\gdxcontent.txt

* Here we loop over all symbols in the GDX file and export them.

$onEmbeddedCode Python:

import gamstransfer as gt
import pandas as pd
import pickle
import sys
import time

gdxfile = r'%gdxfile%'
outputdir = r'%outputdir%'
format = '%format%'

start = time.time()
print(f'Reading {gdxfile}')
gdx = gt.Container(gdxfile)
print(f'Read time:{time.time()-start:.1f} seconds')

start = time.time()

howtoread = f'''# example Python reader for data from {gdxfile}
import pandas as pd
'''

if format=='Pickle':
   alldata = {}
   pandasversion = pd.__version__
   howtoread += 'import pickle'

# run through all symbols in gdx file
for sym in gdx.data.keys():
   df = gdx.data[sym].records
   if df is None: continue
   #print(df.head())
   if format=='CSV':
        fln = f'{outputdir}\\{sym}.csv'
        print(f'  {sym} --> {fln}')
        df.to_csv(fln,index=False)
        howtoread += f"\n{sym} = pd.read_csv('{sym}.csv')"
   if format=='Feather':
        fln = f'{outputdir}\\{sym}.feather'
        print(f'  {sym} --> {fln}')
        df.to_feather(fln)
        howtoread += f"\n{sym} = pd.read_feather('{sym}.feather')"
   if format=='Pickle':
        print(f'  Reading {sym}')
        alldata[sym] = df

if format=='Pickle':
   fln = f'{outputdir}\\alldata.pickle'
   print(f'  {gdxfile} --> {fln}')
   with open(fln, 'wb') as handle:
        pickle.dump(alldata, handle, protocol=pickle.HIGHEST_PROTOCOL)
   howtoread += f"\n\nprint('Pickle file was written with Pandas version {pandasversion}')"
   howtoread += "\nprint(f'Your version is {pd.__version__}. This should be the same or newer.')"
   howtoread += "\n\nwith open('alldata.pickle', 'rb') as f:"
   howtoread += "\n   alldata = pickle.load(f)"
   howtoread += "\n\nprint(f'Keys:{alldata.keys()}')"
   howtoread += "\n\nprint(f'Summary of data:\\n{alldata}')"

fln = f'{outputdir}\\howtoread.py'
print(f'  Writing {fln}')
with open(fln,'w') as f:
    f.write(howtoread)

print(f'Write time:{time.time()-start:.1f} seconds')

$offEmbeddedCode

 

 
