Thursday, September 7, 2017

Quantiles with Pandas

In [1] I showed how quantiles in GAMS are difficult, and how they can be calculated better in R. Here is some Python code to do the same:

import pandas as pd
df = pd.read_csv("p.csv")
q=df.groupby(['i','j']).quantile([0,.25,.5,.75,1])
print(q)
q.to_csv("q.csv")

The q data frame looks like:

                  Val
i  j                
i1 j1 0.00  18.002966
       0.25  28.242058
       0.50  33.222936
       0.75  62.736221
       1.00  85.770764
    j2 0.00   7.644259
       0.25  41.375281
       0.50  61.313381
       0.75  82.127640
       1.00  99.813645
    j3 0.00  14.017667
       0.25  16.559221
       0.50  30.775334
       0.75  38.482815
       1.00  67.223932
i2 j1 0.00  11.938737
       0.25  29.259331
       0.50  55.029177
       0.75  69.633259
       1.00  83.258388
    j2 0.00  16.857103
       0.25  28.783298
       0.50  53.358812
       0.75  65.534761
       1.00  87.373769
    j3 0.00   5.608600
       0.25  17.433855
       0.50  33.311746
       0.75  45.566497
       1.00  64.926986

This is pretty clean. Data frames can easily be read in from CSV files (see the example above) or databases.

The new GAMS Python scripting tool is not very friendly in this respect. We need to do a lot of transformations leading to a low signal-to-noise ratio:

embeddedCode Python:
   import pandas as pd
   p = list(gams.get('p', keyFormat=KeyFormat.FLAT))
   df = pd.DataFrame(p, columns=["i", "j", "k", "value"])
   print("");print("df");print(df)
   q=df.groupby(['i','j']).quantile([0,.25,.5,.75,1])
   print("q");print(q)
# q has one multi-level index
# convert to standard indices
   q2 = q.reset_index()
   print("q2");print(q2)
# to extract a set q: convert to list of strings
   qset = q2["level_2"].unique().astype('str').tolist()
   print("qset");print(qset)
   gams.set("q", qset)
# get data itself: list of tuples
   quantiles = list(zip(q2["i"],q2["j"],q2["level_2"].astype('str'),q2["value"]))
   print("quantiles");print(quantiles)
   gams.set("quantiles",quantiles)
endEmbeddedCode q,quantiles
display q,quantiles;

Note that I am trying to prevent looping over data frame rows here: all operations are on complete columns. The CSV input/output is actually much shorter and cleaner. In this code, there is really one line that does really some work; the rest can be considered as just overhead.

The GAMS interface should probably support data frames directly to overcome this “impedance mismatch.” When users need to mix and match different languages the interface should make things as easy as possible. A GAMS/Python interface that is too low level and stays too close to GAMS asks the user to be reasonably fluent in in both GAMS and Python (the intersection of knowledgeable GAMS and Python users is likely to be small), and write quite some glue-code dealing with getting data in and out. Choosing a better abstraction level would probably help here.  

References

  1. http://yetanothermathprogrammingconsultant.blogspot.nl/2017/08/quantiles.html