Friday, December 8, 2017

Sparse Matrix export from GAMS to R


The problem

A colleague posed the question to me: how we can export a GAMS parameter to an R sparse matrix?. A sparse matrix is a matrix where most of the elements are zero. In large scale optimization, sparse matrices play an important role. Software (solvers, modeling systems) are exploiting sparsity to reduce memory requirements and to speed up calculations. In fact very large problems can only be solved by using sparse data structures. I actually don't think the user wants a sparse matrix, but let's assume we really want one. The R Matrix package [3] has facilities to create and manipulate sparse matrices.

We assume we have the following GAMS data:
set
   i
/a,b,c,d,e/
   j
/e,f,g,h,i,j/
;

table d(i,j)
   
e f g h i j
 
a  1   2   4
 
b
 
c      3
 
d
 
e          5
;
execute_unload "gamsdata";

Notice that will export three items: sets i and j and the parameter d.

The big issue here is that GAMS uses a sparse storage scheme throughout. This means only the nonzero elements of matrix d are stored. Unfortunately, this also means that empty columns and empty rows are discarded: they are simply not visible. We can see this when we display the matrix:

----     17 PARAMETER d  

            e           g           i

a       1.000       2.000       4.000
c                   3.000
e                               5.000

The trick is to export also the domains i and j. With the aid of these domain sets we can reconstruct the sparse matrix with named rows and columns and with all empty rows and columns preserved.

Although this example is a small matrix, we assume the matrix is very larger and sparse. This means we don't want to use in R a pivot operation on d to create a 2D dense matrix. In all operations in this write-up we store and operate on just the non-zero elements.

Below I show three different approaches to populate a sparse matrix in R with GAMS data. There are some interesting issues in each approach.

Method 1: GDXRRW and R


We start with reading in the data from the GAMS GDX file:


library(gdxrrw)
# read d,i,j into data frames
D <- rgdx.param("gamsdata","d")
I <- rgdx.set("gamsdata","i")
J <- rgdx.set("gamsdata","j")

When we print things we see they arrived as follows:
D
##   i j d
## 1 a e 1
## 2 a g 2
## 3 a i 4
## 4 c g 3
## 5 e i 5
I
##   i
## 1 a
## 2 b
## 3 c
## 4 d
## 5 e
J
##   i
## 1 e
## 2 f
## 3 g
## 4 h
## 5 i
## 6 j
Note that all these three identifiers represent data frames. Warning: note that the columns for the tables I and J are always called i. This is a quirk of gdxrrw.

We need to create mappings from name to index position for both the rows and columns. So "a" → 1, "b" → 2, etc. R has an interesting feature: named vectors allow us to index a vector by strings. We created these named vectors as follows:


# create named vectors for index positions
# this allows us to map from name -> index position
ordi <- seq(nrow(I))
names(ordi) <- I$i
ordj <- seq(nrow(J))
names(ordj) <- J$i

When we print these mapping vectors we see:
ordi
## a b c d e 
## 1 2 3 4 5
ordj
## e f g h i j 
## 1 2 3 4 5 6

Using these vectors we can easily convert columns i and j in the D data frame from strings to numbers.

ordi[ D$i ] # row indices
## a a a c e 
## 1 1 1 3 5
ordj[ D$j ] # column indices
## e g i g i 
## 1 3 5 3 5

With this we are ready to construct the sparse matrix in R:


library(Matrix)
# create sparse matrix
M <- sparseMatrix(
      i = ordi[ D$i ],      # row indices
      j = ordj[ D$j ],      # column indices 
      x = D$d,              # nonzero values
      dims = c(nrow(I),nrow(J)),  # size of matrix
      dimnames = list(I$i,J$i)    # names of rows and columns
  )
When we print this we see:
M
## 5 x 6 sparse Matrix of class "dgCMatrix"
##   e f g h i j
## a 1 . 2 . 4 .
## b . . . . . .
## c . . 3 . . .
## d . . . . . .
## e . . . . 5 .

In this method we used named vectors to be able to index vectors by strings. This trick helped to translate the strings forming the sets i and j into their ordinal values or index positions.

Method 2: Add some GAMS code


We can simplify the R code a bit by letting GAMS do part of the work. We can augment the matrix d with extra information about the index positions for non-zero values.

set v /value,ordi,ordj/;
parameter d2(i,j,v);
d2(i,j,
'value') = d(i,j);
d2(i,j,
'ordi')$d(i,j) = ord(i);
d2(i,j,
'ordj')$d(i,j) = ord(j);
display d2;

The displayed output looks like:

----     22 PARAMETER d2  

          value        ordi        ordj

a.e       1.000       1.000       1.000
a.g       2.000       1.000       3.000
a.i       4.000       1.000       5.000
c.g       3.000       3.000       3.000
e.i       5.000       5.000       5.000

Reading this data is basically the same as before:


# read d,i,j into data frames
D2 <- rgdx.param("gamsdata","d2")
I <- rgdx.set("gamsdata","i")
J <- rgdx.set("gamsdata","j")

The D2 data frame looks like:

D2
##    i j     v d2
## 1  a e value  1
## 2  a e  ordi  1
## 3  a e  ordj  1
## 4  a g value  2
## 5  a g  ordi  1
## 6  a g  ordj  3
## 7  a i value  4
## 8  a i  ordi  1
## 9  a i  ordj  5
## 10 c g value  3
## 11 c g  ordi  3
## 12 c g  ordj  3
## 13 e i value  5
## 14 e i  ordi  5
## 15 e i  ordj  5

The row and columns indices and the values can be extracted by standard subsetting:


# subset D2 to extract ordi, ordj and value data
ordi <- D2[D2$v=="ordi","d2"]
ordj <- D2[D2$v=="ordj","d2"]
v <- D2[D2$v=="value","d2"]

The sparse matrix is now easy to create:

# create sparse matrix
m <- sparseMatrix(
      i = ordi,        # row indices
      j = ordj,        # column indices
      x = v,           # nonzero values
      dims = c(nrow(I),nrow(J)),    # size of matrix
      dimnames = list(I$i,J$i)      # names of rows and columns  
  )

Output is as expected:
m
## 5 x 6 sparse Matrix of class "dgCMatrix"
##   e f g h i j
## a 1 . 2 . 4 .
## b . . . . . .
## c . . 3 . . .
## d . . . . . .
## e . . . . 5 .

Method 3: Using gdxrrw smarter


As pointed out in the comments, the gdxrrw tool can actually returns all the information we need:


library(Matrix)
library(gdxrrw)
d <- rgdx('gamsdata',list(name='d')) 
M2 <- sparseMatrix(
  i = d$val[,1], # row indices
  j = d$val[,2], # column indices 
  x = d$val[,3], # nonzero values
  dims = c(length(d$uels[[1]]),length(d$uels[[2]])), # size of matrix
  dimnames = d$uels # names of rows and columns
)

Method 4: SQLite and R


Here we use SQLite as an intermediate data store. This will allow us to do some SQL joins while importing the data. First we convert the GAMS data into SQLite database. This is very easy:


system("gdx2sqlite -i gamsdata.gdx -o gamsdata.db")

The reading of the data is straightforward:

library(RSQLite)
sqlite<-dbDriver("SQLite") 
db <- dbConnect(sqlite,"gamsdata.db")
D <- dbReadTable(db,"d")
I <- dbReadTable(db,"i")
J <- dbReadTable(db,"j")

The data arrives as:

D
##   i j value
## 1 a e     1
## 2 a g     2
## 3 a i     4
## 4 c g     3
## 5 e i     5
I
##   i
## 1 a
## 2 b
## 3 c
## 4 d
## 5 e
J
##   j
## 1 e
## 2 f
## 3 g
## 4 h
## 5 i
## 6 j

The row numbers in an SQLite table can be retrieved by the rowid. For example:
dbGetQuery(db,"select rowid,* from i")
##   rowid i
## 1     1 a
## 2     2 b
## 3     3 c
## 4     4 d
## 5     5 e

Note: the rowid is automatically incremented by SQLite when adding rows. However, after deleting rows the rowid has no longer a direct relation to the row number. Our data base is newly created, so in our case we can assume the rowid is the same as the row number.

We use this to map from index names for the sets i and j to their ordinal numbers. We do this by using a join on tables d and i (or d and j).

The row and column indices can be retrieved as:

ordi <- dbGetQuery(db,"select i.rowid from d inner join i on i.i=d.i")[[1]]
ordj <- dbGetQuery(db,"select j.rowid from d inner join j on j.j=d.j")[[1]]

These vectors look like:

ordi
## [1] 1 1 1 3 5
ordj
## [1] 1 3 5 3 5

By now creating the sparse matrix is simple:

m <- sparseMatrix(
      i = ordi,     # row indices
      j = ordj,     # column indices
      x = D$value,  # nonzero values
      dims = c(nrow(I),nrow(J)),    # size of matrix
      dimnames = list(I$i,J$j)      # names of rows and columns 
  )

The output is no surprise:
m
## 5 x 6 sparse Matrix of class "dgCMatrix"
##   e f g h i j
## a 1 . 2 . 4 .
## b . . . . . .
## c . . 3 . . .
## d . . . . . .
## e . . . . 5 .

Method 4: Pivoting data frames


It is my assumption the user just wanted to pivot from "long" format to "wide" format [1]. This is efficiently done with the function spread [2]:

D <- rgdx.param("gamsdata","d")
library(tidyr)
P <- spread(D,j,d,fill=0)

This yields:
D
##   i j d
## 1 a e 1
## 2 a g 2
## 3 a i 4
## 4 c g 3
## 5 e i 5
P
##   i e g i
## 1 a 1 2 4
## 2 c 0 3 0
## 3 e 0 0 5

If you don't specify the fill parameter, spread will impute NA values. Note this operation is dense: all zeros are explicitly stored.

The duplication in column names can be easily fixed by something like:
colnames(P)][1]=""
P
##     e g i
## 1 a 1 2 4
## 2 c 0 3 0
## 3 e 0 0 5

If we want to keep the empty rows and columns from the original matrix, it is easiest to fix this at the GAMS level:

parameter d3(i,j) 'dense version';
d3(i,j) = d(i,j) +
eps;
display d3;

The addition of the special value eps to a number is somewhat special in GAMS. We have:

\[x+\text{eps} = \begin{cases}x&\text{if }x\ne 0\\ \text{eps}&\text{if }x=0\end{cases} \]

This trick can be used to preserve all zero (and non-zero) values. Indeed, the values of d3 are:

----     26 PARAMETER d3  dense version

            e           f           g           h           i           j

a       1.000         EPS       2.000         EPS       4.000         EPS
b         EPS         EPS         EPS         EPS         EPS         EPS
c         EPS         EPS       3.000         EPS         EPS         EPS
d         EPS         EPS         EPS         EPS         EPS         EPS
e         EPS         EPS         EPS         EPS       5.000         EPS


We can import this into R as follows:

D3 <- rgdx.param("gamsdata","d3",squeeze=F)
P <- spread(D3,j,d)

The squeeze option helps to deal with the eps values: we have to make sure that we map eps back to a physical zero. The results look like:

P
##   i e f g h i j
## 1 a 1 0 2 0 4 0
## 2 b 0 0 0 0 0 0
## 3 c 0 0 3 0 0 0
## 4 d 0 0 0 0 0 0
## 5 e 0 0 0 0 5 0

Conclusion 


Tons of ways to do this. Using rgdxrrw::rgdx the smart way is the simplest.

References


  1. Pivoting a table: GAMS, SQLite, R and Python, http://yetanothermathprogrammingconsultant.blogspot.com/2016/01/pivoting-table-gams-sqlite-r-and-python.html
  2. Brad Boehmke, Data Processing with dplyr & tyidr, https://rpubs.com/bradleyboehmke/data_wrangling
  3. Martin Maechler and Douglas Bates, 2nd Introduction to the Matrix package, https://cran.r-project.org/web/packages/Matrix/vignettes/Intro2Matrix.pdf

No comments:

Post a Comment