Thursday, April 23, 2020

Random Sparse Arcs in GAMS

I was playing a bit with generating a large network (graph) with \(n=5000\) nodes and say \[\frac{n^2}{100}= 250,000\] arcs. The standard way to generate this in GAMS is:

*
* nodes
*
set i 'nodes' /node1*node5000/;
alias(i,j);

*
* random arcs (1%)
*
set A(i,j) 'arcs';
A(i,j) = uniform(0,1)
<0.01;

scalar numArcs 'number of arcs';
numArcs =
card(A);
display numArcs;


Basically, this approach generates \(n^2\) random numbers  \( r_{i,j} \sim U(0,1)\) (uniform distribution) and picks the arcs related to a value \(r_{i,j}\lt 0.01\). This is a randomized process, so we don't see exactly 250,000 arcs, but rather:


----     16 PARAMETER numArcs              =   249170.000  number of arcs


This operation is not exactly cheap. It takes 11.2 seconds on my laptop.

There are a few other approaches we can try.

Random locations


Instead of generating \(n^2\) random numbers, we can generate \(k=250,000\) pairs of \(i,j\) values (integers between 1 and \(n\)). A straigthforward implementation would be:

scalars k,n,nn,ni,nj;
n =
card(i);
nn = n*n/100;
for (k = 1 to nn,
   ni = uniformint(1,n);
   nj = uniformint(1,n);
   A(i,j)$(
ord(i)=ni and ord(j)=nj) = yes;
);

Unfortunately, this is extremely slow. I stopped it after 2,000 seconds: it was still not finished. GAMS is horribly slow when doing loops like this.

Crazy indexing


A variant of the above approach is to try to trick GAMS into a vectorized version of this. Well, this is not so easy. Here is a version that uses leads/lags when indexing:

sets
   A(*,*)
'arcs'
   k
/node1*node5000,k5001*k250000/
   ij
/i,j/
;
parameter r(k,ij) 'random offset from k';
r(k,ij) = uniformint(1,
card(i)) - ord(k);
A(k+r(k,
'i'),k+r(k,'j')) = yes;


This requires some explanation. 
  • The set A is no longer domain-checked. We use funky indexing here, so we need loosen things up.
  • The set k has 250,000 elements. The first 5000 are identical to \(i,j\).
  • The parameter r contains \(2\times 250,000\) random integers between 1 and \(n\). We store them as offsets from the index \(k\). This sounds crazy but the next statement explains why this is done.
  • The assignment to A is using leads. Think of it as a variant of A(k,k) = yes.  This would populate a diagonal. Similarly, A(k+1,k) = yes shifts the diagonal by one. Using A(k+r(k,'i'),k+r(k,'j')) we index exactly the correct location.
  • We may generate duplicates, so the number will be slightly below 250k arcs.
This beast actually works, but the performance is not great. The fragment takes about 13 seconds, so no gain compared to our original approach.

Python loop


Let's try some Python. We can build up sets using Python within a GAMS model:

set A(i,j) 'arcs';

$onEmbeddedCode Python:
from random import seed,randint
seed(
999)
n = len(list(gams.get(
"i")))
nn = n*n//
100
a = {}
for k in range(nn):
   ni = randint(
1,n)
   nj = randint(
1,n)
   elem = (
"node{}".format(ni),"node{}".format(nj))
   a[elem] =
1
gams.set(
"A",list(a))
$offEmbeddedCode A



This is just a k-loop. We use here a Python dictionary to store elements as this handles possible duplicates correctly. This approach also takes about 11 seconds, so no gain compared to our first approach.

R to the rescue


So let's see if we can use R to speed things up:

$set n    5000
$set inc  data.inc

*
* R script
*
$onecho > script.R
library(data.table)
n <- %n%
nn <- n^2/100
df <- data.frame(
        
ni=sample(n,nn,replace=TRUE),
        
nj=sample(n,nn,replace=TRUE))
df <- df[order(df$ni,df$nj),]
v <- unique(paste0("node",df$ni,".node",df$nj))
fwrite(list(v),"%inc%",col.names=F,quote=F)
$offecho
$call '"c:\program files\R\R-3.5.0\bin\Rscript.exe" script.R'

*
* nodes and arcs
*
set i 'nodes' /node1*node%n%/;
alias(i,j);

set A(i,j) 'arcs' /
$offlisting
$include %inc%
$onlisting
/;


We write from inside GAMS an R script that gets executed by rscript.exe. The steps are:
  • sample the arcs,
  • sort in the order that GAMS likes,
  • create a string representation of the tuple, e.g. "node1.node2",
  • remove duplicates (this means we end up with a number of arcs that is smaller than 250k),
  • write to an include file (note: the function fwrite from the data.table package seems a bit faster than the function writeLines from the base package)
The include file is then processed by GAMS. To speed things up we remove echoing to the listing file. 

This method is the winner: it takes just 3.5 seconds. Sometimes using a plain old text file as a communication channel is not so bad.


GDX file is slower


The code:

$set n    5000
$set gdx  data.gdx

*
* R script
*
$onecho > script.R
library(gdxrrw)
library(dplyr)
n <- 5000
nn <- n^2/100
df <- data.frame(
 
ni=paste0("node",sample(n,nn,replace=TRUE)),
 
nj=paste0("node",sample(n,nn,replace=TRUE)),
 
stringsAsFactors = F)
df <- distinct(df,ni,nj)
nr <- nrow(df)
wgdx("%gdx%",list(name="A", type="set", dim=2, form="sparse", ts="arcs", val=cbind(1:nr,1:nr), uels=c(list(df$ni),list(df$nj))))
$offecho
$call '"c:\program files\R\R-3.5.0\bin\Rscript.exe" script.R'

*
* nodes and arcs
*
set i 'nodes' /node1*node%n%/;
alias(i,j);

set A(i,j) 'arcs';
$gdxin %gdx%
$loaddc A


is slower than using a text file. The reason is that wgdx is not as efficient.

3 comments:

  1. So I tried the same in AMPL: generate 250000 pairs with elements drawn from [1,5000]. Here are three possibilities that I tested:

    (1)
    set A = {1..5000, 1..5000: Uniform01() < 0.01};

    (2)
    set A = setof {1..250000} (ceil(Uniform(0,5000)),ceil(Uniform(0,5000)));

    (3)
    set A dimen 2, default {};
    repeat {
    let A := A union setof {1..250000-card(A)} (ceil(Uniform(0,5000)),ceil(Uniform(0,5000)));
    } until card(A) = 250000;

    (1) generates approximately 250000 pairs. (2) always generates fewer than 250000, because some are created more than once, but it comes close when the set is only a small fraction of all possible pairs. (3) generates exactly the required number, by using a loop together with the idea of (2) to keep trying until the desired number of pairs is generated. Approximate execution times in seconds are as follows:

    (1) 1.1 (2) 0.5 (3) 0.8

    To generate the same number of pairs but with elements drawn from [1,50000], you just have to change 5000 to 50000 and 0.01 to 0.0001. Then (2) and (3) are much faster:

    (1) 107.4 (2) 1.0 (3) 1.1

    The work of (1) is proportional to the square of the size of the set from which elements are drawn, while the work of (2) and (3) is mainly influenced by the number of pairs to be drawn.

    ReplyDelete
  2. I do not have GAMS, but the Python example seems to be fast until the line gams.set("A",list(a)). You could also write the list to disk first like in R. I can imagine that the sparse data structure in GAMS benefits from receiving sorted data as to avoid random insertions in a binary tree or linked list.

    import numpy as np
    ni = np.random.randint(0, n, nn)
    nj = np.random.randint(0, n, nn)
    arcs = sorted(list(set( (i,j) for i,j in zip(ni,nj) )))
    arcs = [("node{}".format(i),"node{}".format(j)) for i,j in arcs]

    ReplyDelete