Tuesday, December 28, 2021

Importing JSON files into GAMS

JSON (JavaScript Object Notation) files [1] provide a flexible and simple way to store and exchange data. It is text, so we can easily look at it. JSON files typically have the extension .json. (using it may help syntax coloring in editors). JSON can represent rather complex tree-like data structures that may require some work to convert into simpler formats such as tabular data. 

EIA (U.S. Energy Information Administration) provides bulk data in JSON format [2]. These files are not strictly JSON. They have a .txt extension. Basically, each line is a JSON object. 


To make it a proper JSON file one would have to wrap this into an array.

If we reformat the first object a bit, we see:


To store this in a GAMS parameter, we can envision something like: 

parameter series(id,t) 'data for each series';

where t is a time/date index.

Apparently, a value of "-" means: not available. GAMS has a special value (NA) for this. However, we need to be a bit careful as these JSON data files have different conventions and codes for this purpose:

  • null: this is a JSON keyword indicating no data
  • "-", "--", "- -": different forms of dashes are used
  • "W", "NM", "NA": different strings indicating "withheld", "not measured", "not available".
Even the same file can contain a bunch of different ways to indicate not available. In order to not throw this information away, I also store an error set, indicating how the NA was signaled. 

Here is some GAMS+Python code that reads the file TOTAL.txt.

$ontext

  
Reads json file using embedded Python

  
Data from: https://www.eia.gov/opendata/bulkfiles.php

$offtext


$set dataset  TOTAL
$set datafile %dataset%.txt
$set gdxfile  %dataset%.gdx

$setenv gdxCompress=1

set
  t  
'date/year'
  id
'series_id from txt file, explanatory text has name'
  e 
'error strings where we have for NA (not available)'
  errors(id,t,e<)
'could not extract a value from these'
;

parameter
   series(id,t<)
'data for each series'
;

$onEmbeddedCode Python:
import json
print("")

fln =
'%datafile%'

print(f"read {fln}")
with open(fln) as f:
     lines = f.readlines()
print(f"lines:{len(lines)}")

ids = []
# 1D set
series = []
# 2d parameter
errors = []
# 3d set

print("processing")
for s in lines:
     obj = json.loads(s)
    
if 'series_id' in obj:
           id = obj[
'series_id']
           name = obj[
'name']
           ids.append((id,name))
          
for item in obj['data']:
                  date = item[
0]
                  value = item[
1]
                 
if type(value) in [float,int]:
                         series.append((id,date,value))
                 
else:
                         series.append((id,date,float(
'nan')))
                         errors.append((id,date,str(value)))

gams.set(
'id',ids)
gams.set(
'series',series)
gams.set(
'errors',errors)

$offEmbeddedCode id series errors

* count cases
parameter errCount(e) 'observations recoded as NA';
errCount(e) =
sum(errors(id,t,e),1);
option errCount:0;
display errCount;

execute_unload "%gdxfile%";


This will export a GAMS GDX file with the data in GAMS readable format. The errCount parameter shows how many times we saw a given string instead of a number:

----     67 PARAMETER errCount  observations recoded as NA

-    3431,    NA  36015,    NM     22,    W    1450,    - -  1787,    --      3

There are quite a few of "-", "NA" and "W".

Of course, in GAMS we can see:


The values with NA are replicated in the errors sets. The empty cells correspond to zero values: GAMS does not store zeros (it uses a sparse storage scheme).

Some (somewhat nerdy) notes:
  • Some of the files are very large. elec.txt is 1.1 GB. It takes a while to process this file.
  • Not all lines contain data series. We only look at lines with a series_id.
  • The series_id string has encoded information in it. I did not unpack that here.
  • The dates are not ordered from old to new. GAMS will order things based on how it receives set elements. If you want to order things say from old to new, then just create a dummy set with dates that are ordered, and declare that dummy set at the beginning of the model.
  • If you want to explicitly store the zeros in GAMS, you can use the EPS special value for that. It is numerically zero but is actually stored.
  • Reading JSON often requires a bit of programming outside GAMS (like here in Python). Writing JSON can be done with the PUT facility (JSON is just text). I often write JSON or JavaScript data files to visualize solutions in the browser. Examples are [3,4].


Conclusion


With just a little bit of Python code, we can read JSON objects and store them in a GAMS parameter.


References


  1. JSON, json.org.
  2. EIA Open Data Bulk Download Facility, https://www.eia.gov/opendata/bulkfiles.php
  3. Shortest path in GAMS, https://yetanothermathprogrammingconsultant.blogspot.com/2021/11/shortest-path-in-gams.html
  4. MacMahon Squares, https://yetanothermathprogrammingconsultant.blogspot.com/2021/12/macmahon-squares.html

No comments:

Post a Comment