Tuesday, January 24, 2023

Export GAMS GDX file to different Python formats (CSV, Feather, Pickle)

A GAMS GDX file is a binary file containing a number of GAMS symbols with their data. When collaborating with colleagues who work with Python, it may be useful to convert a GDX file with relevant data into data files that Python can consume.

The idea here is that a GAMS user can run the GAMS script without knowing Python or having Python installed (GAMS ships with its own somewhat barebones Python, which is used inside the GAMS script). On the other hand, the Python user does not need GAMS installed to read the generated data files. This is in contrast to Python packages and tools that interact with GDX files directly. You can see a few in the PyPI directory [1]. Those require both a GAMS system and a Python environment.


The tool supports three different formats:
  1. A directory with CSV files. One file for each symbol.
  2. A directory with Feather files. These are like binary CSV files [2]. Again, one file per symbol.
  3. A single Pickle file.

The input consists of:
  1. GDX file
  2. name of output directory
  3. output format
The output is:
  1. Data files.
  2. Text file with a description of the content of the GDX file.
  3. Python script containing example code on how to read the data files. The data is read into Pandas dataframes.

This tool relies on embedded Python and the gamstransfer package.

Example 1: sparse matrix


In this example, we generate a sparse matrix in GAMS, and export this using our tool. The code to generate this data set is as follows:

Sets
  i /i1*i1000/
  j /j1*j1000/
;

Parameter A(i,j) 'sparse matrix';
A(i,j)$(uniform(0,1)<0.05) = uniform(1,1000);

execute_unload "SparseMatrix.gdx",i,j,A;

 

The generated matrix has a density of about 5%. The number of non-zero elements is 50,166. After running the tool in CSV mode, we have an output directory with the CSV files and some supporting files. The file gdxcontent.txt shows the symbols in the GDX file:

C:\Users\erwin\Downloads\sparsematrixCSV>type gdxcontent.txt
  Symbol Dim Type Records  Explanatory text
1 A        2  Par   50166  sparse matrix
2 i        1  Set    1000
3 j        1  Set    1000

 

The generated file howtoread.py shows how we can read the data files: 

C:\Users\erwin\Downloads\sparsematrixCSV>type howtoread.py
# example Python reader for data from sparsematrix.gdx
import pandas as pd

i = pd.read_csv('i.csv')
j = pd.read_csv('j.csv')
A = pd.read_csv('A.csv')

 

When we run this, we see dataframes i and j with columns uni_0 and element_text, and a dataframe A with columns i_0, j_1 and value.

This is a bit confusing. Let me try to interpret this.

  • The column name 'uni_0' probably indicates this is a subset of the universe of set elements in the GAMS model. The zero is probably added because there may be multiple indices. In this case, there are no multiple columns, so the 0 suffix is not needed. In GAMS terms, this would be a '*' column. 
  • The column 'element_text' is likely reserved for text associated with a set element. We don't have this kind of explanatory text in our model, so this column can be ignored. It probably should not even be there.
  • The column names 'i_0' and 'j_1' are mangled. They could just be 'i' and 'j'; the suffix is the position of the index. This is probably to guard against duplicate column names. GAMS allows something like A(i,i), so the concern is understandable. However, it would be better to mangle only when needed. 
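If the mangled names bother you, undoing them on the Python side is a one-liner. A small sketch, using a hypothetical row from the records of A as shown in the output below:

```python
import pandas as pd

# hypothetical records for A as produced by the tool
A = pd.DataFrame({'i_0': ['i1'], 'j_1': ['j44'], 'value': [339.211722]})

# undo the mangling: there are no duplicate domains here
A = A.rename(columns={'i_0': 'i', 'j_1': 'j'})
```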

The other formats store the same thing. The pickle format stores the dataframes in a single dictionary. The generated howtoread.py file looks like:

C:\Users\erwin\Downloads\sparsematrixPickle>type howtoread.py
# example Python reader for data from sparsematrix.gdx
import pandas as pd
import pickle

print('Pickle file was written with Pandas version 1.4.4')
print(f'Your version is {pd.__version__}. This should be the same or newer.')

with open('alldata.pickle', 'rb') as f:
   alldata = pickle.load(f)

print(f'Keys:{alldata.keys()}')

print(f'Summary of data:\n{alldata}')

 

Some info about versions is added: we can only reliably read Pickle files with a matching (or newer) Pandas version. The script also prints the keys and a summary, so you know a bit about what is inside the 'alldata' dictionary. The output looks like:

C:\Users\erwin\Downloads\sparsematrixPickle>py howtoread.py
Pickle file was written with Pandas version 1.4.4
Your version is 1.5.2. This should be the same or newer.
Keys:dict_keys(['i', 'j', 'A'])
Summary of data:
{'i':      uni_0 element_text
0       i1
1       i2
2       i3
3       i4
4       i5
..     ...          ...
995   i996
996   i997
997   i998
998   i999
999  i1000

[1000 rows x 2 columns], 'j':      uni_0 element_text
0       j1
1       j2
2       j3
3       j4
4       j5
..     ...          ...
995   j996
996   j997
997   j998
998   j999
999  j1000

[1000 rows x 2 columns], 'A':          i_0   j_1       value
0         i1   j44  339.211722
1         i1   j58  792.568281
2         i1   j64  585.546042
3         i1   j79  270.343439
4         i1   j98  401.826455
...      ...   ...         ...
50161  i1000  j955  866.032624
50162  i1000  j963  633.530151
50163  i1000  j974  300.124396
50164  i1000  j985  987.395802
50165  i1000  j991  341.163135

[50166 rows x 3 columns]}

 

We see that alldata['i'], alldata['j'] and alldata['A'] contain our dataframes.

The third format is a feather file. These are basically like binary CSV files. They can be read as:

C:\Users\erwin\Downloads\sparsematrixFeather>type howtoread.py
# example Python reader for data from sparsematrix.gdx
import pandas as pd

i = pd.read_feather('i.feather')
j = pd.read_feather('j.feather')
A = pd.read_feather('A.feather')

 

There is one big issue with this format: the crippled Python version that comes with GAMS has no support for this out of the box, and adding it takes a few manual steps. Here is a recipe:

>cd \gams\41\GMSPython    (or whatever your GAMS installation directory is)
>curl -O https://bootstrap.pypa.io/get-pip.py
>python get-pip.py
>scripts\pip install pyarrow


After this, you can let the script export dataframes into Feather files.

The dataframes have some properties you should be aware of:
  • They are in long format (database format). This is how things are typically stored in a database. In a spreadsheet or wide format, we would pivot one of the index columns to the right. Pivot and unpivot operations are some of the most important ones when using dataframes.
  • They use a sparse representation: the zero values are not stored.
  • The index columns form a unique key. There are no duplicates.
  • The index columns are strings, and the value column is a double-precision floating point value. When using CSV (text) files, these may appear as integers. 
If you want to convert the sparse representation of A into a proper 2d numpy matrix, you can do something like:

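In the original post this code is shown as an image. A self-contained sketch of how it could look, using a tiny made-up stand-in for the exported dataframes (in practice they come from pd.read_csv('i.csv') etc., with the column names uni_0, i_0, j_1, value produced by the tool):

```python
import pandas as pd

# tiny stand-in for the exported data
i = pd.DataFrame({'uni_0': ['i1', 'i2', 'i3']})
j = pd.DataFrame({'uni_0': ['j1', 'j2']})
A = pd.DataFrame({'i_0': ['i1', 'i3'],
                  'j_1': ['j2', 'j1'],
                  'value': [5.0, 7.0]})

# pivot the sparse long format to wide, impose the ordering of
# sets i and j, and fill the missing (zero) entries
Adense = (A.pivot(index='i_0', columns='j_1', values='value')
           .reindex(index=i['uni_0'], columns=j['uni_0'])
           .fillna(0.0)
           .to_numpy())
```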

The ordering is taken from sets i and j.


Example 2: sparse network


Here is a simple procedure to generate a sparse directed network or graph in GAMS:

set i 'nodes' /node1*node100/;
alias(i,j);
set a(i,j) 'arcs';

a(i,j) = uniform(0,1)<0.1;

*
* only display if small
*
display$(card(i)<50) i,a;

*
* export network as GDX file
*
execute_unload "network.gdx",i,a;

 

 

This data set has a 2-dimensional set 'a' representing the arcs (or edges). After converting the GDX file to CSV files, we can import this in Python as:

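The generated howtoread.py simply reads each CSV into a dataframe. A self-contained sketch (writing two tiny stand-in files first; the real i.csv and a.csv come from the tool):

```python
import pandas as pd

# tiny stand-in files; in practice these are written by gdx2python.gms
pd.DataFrame({'uni_0': ['node1', 'node2', 'node3'],
              'element_text': ['', '', '']}).to_csv('i.csv', index=False)
pd.DataFrame({'i_0': ['node1', 'node2'],
              'j_1': ['node2', 'node3'],
              'element_text': ['', '']}).to_csv('a.csv', index=False)

# this mirrors the generated howtoread.py
i = pd.read_csv('i.csv')
a = pd.read_csv('a.csv')
```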

The arcs dataframe a has one row per arc, with the mangled domain columns i_0 and j_1 plus an (empty) element_text column.
Let's do some simple operations on the network. In GAMS, we can do:

*
* size of network
*
parameter size(*) 'number of nodes and arcs';
size('nodes') = card(i);
size('arcs') = card(a);
option size:0; display size;

*
* print some statistics on the degrees of the nodes
*
parameter maxdegree(*) 'max in/out-degree';
maxdegree('in') = smax(i,sum(a(j,i),1));
maxdegree('out') = smax(i,sum(a(i,j),1));
maxdegree('in/out') = smax(i,sum(a(j,i),1)+sum(a(i,j),1));
option maxdegree:0; display maxdegree;

 

 

This shows:


----     30 PARAMETER size  number of nodes and arcs
nodes  100,    arcs  1002

----     39 PARAMETER maxdegree  max in/out-degree
in     20,    out    17,    in/out 36



Let's do the same thing in Python, using the networkx package.

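In the original post this code is shown as an image. A sketch of how it could look, using a hypothetical three-arc network as a stand-in for the arcs dataframe read from a.csv:

```python
import networkx as nx
import pandas as pd

# stand-in for the arcs dataframe read from a.csv
a = pd.DataFrame({'i_0': ['node1', 'node1', 'node2'],
                  'j_1': ['node2', 'node3', 'node3']})

# build a directed graph from the arc list
G = nx.from_pandas_edgelist(a, source='i_0', target='j_1',
                            create_using=nx.DiGraph)

# size of the network
print('nodes:', G.number_of_nodes())
print('arcs: ', G.number_of_edges())

# max degrees, mirroring the GAMS maxdegree parameter
print('max in-degree:    ', max(d for _, d in G.in_degree()))
print('max out-degree:   ', max(d for _, d in G.out_degree()))
print('max in/out-degree:', max(d for _, d in G.degree()))
```

Note that this only picks up nodes that appear in at least one arc; isolated nodes from set i would have to be added with G.add_nodes_from.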



Conclusion


The GAMS script shown in the appendix is a tool to convert a GAMS GDX data file into a representation that standalone Python can consume. This little tool has already proven useful for GAMS modelers producing data sets for further processing in a Python environment.


References


Appendix: gdx2python.gms


$onText

   Export GDX file to Python data files

   Supported export formats:

     - multiple CSV files (one per symbol)
     - multiple Feather files (binary, one per symbol)
     - Pickle file (single file)

   The Feather files are the fastest. However, you need to install
   support for this (package pyarrow).

$offText


*-----------------------------------------------------------------------
* User-provided settings
*-----------------------------------------------------------------------

$set gdxfile      network.gdx
$set outputdir    networkdata

* Output format
* can be: CSV,Feather,Pickle
$set format       CSV
*$set format       Feather
*$set format       Pickle


*-----------------------------------------------------------------------
* Body
*-----------------------------------------------------------------------

* quick checks
$if not exist %gdxfile%   $abort %gdxfile% does not exist
$call mkdir %outputdir%
$if errorlevel 1 $abort could not create %outputdir%
* you can comment out the previous line if you want to go ahead
* in the case the directory already exists.

* for reference, dump the table of contents of the GDX file to the output directory
$call gdxdump %gdxfile% symbols > %outputdir%\gdxcontent.txt
$call type %outputdir%\gdxcontent.txt

* Here we loop over all symbols in the GDX file and export them.

$onEmbeddedCode Python:

import gamstransfer as gt
import pandas as pd
import pickle
import sys
import time

gdxfile = r'%gdxfile%'
outputdir = r'%outputdir%'
format = '%format%'

start = time.time()
print(f'Reading {gdxfile}')
gdx = gt.Container(gdxfile)
print(f'Read time:{time.time()-start:.1f} seconds')

start = time.time()

howtoread = f'''# example Python reader for data from {gdxfile}
import pandas as pd
'''

if format=='Pickle':
   alldata = {}
   pandasversion = pd.__version__
   howtoread += 'import pickle'

# run through all symbols in gdx file
for sym in gdx.data.keys():
   df = gdx.data[sym].records
   if df is None: continue
   #print(df.head())
   if format=='CSV':
        fln = f'{outputdir}\\{sym}.csv'
        print(f'  {sym} --> {fln}')
        df.to_csv(fln,index=False)
        howtoread += f"\n{sym} = pd.read_csv('{sym}.csv')"
   if format=='Feather':
        fln = f'{outputdir}\\{sym}.feather'
        print(f'  {sym} --> {fln}')
        df.to_feather(fln)
        howtoread += f"\n{sym} = pd.read_feather('{sym}.feather')"
   if format=='Pickle':
        print(f'  Reading {sym}')
        alldata[sym] = df

if format=='Pickle':
   fln = f'{outputdir}\\alldata.pickle'
   print(f'  {gdxfile} --> {fln}')
   with open(fln, 'wb') as handle:
        pickle.dump(alldata, handle, protocol=pickle.HIGHEST_PROTOCOL)
   howtoread += f"\n\nprint('Pickle file was written with Pandas version {pandasversion}')"
   howtoread += "\nprint(f'Your version is {pd.__version__}. This should be the same or newer.')"
   howtoread += "\n\nwith open('alldata.pickle', 'rb') as f:"
   howtoread += "\n   alldata = pickle.load(f)"
   howtoread += "\n\nprint(f'Keys:{alldata.keys()}')"
   howtoread += "\n\nprint(f'Summary of data:\\n{alldata}')"

fln = f'{outputdir}\\howtoread.py'
print(f'  Writing {fln}')
with open(fln,'w') as f:
    f.write(howtoread)

print(f'Write time:{time.time()-start:.1f} seconds')

$offEmbeddedCode

 

 
