Thursday, April 23, 2020

Random Sparse Arcs in GAMS

I was playing a bit with generating a large network (graph) with \(n=5000\) nodes and say \[\frac{n^2}{100}= 250,000\] arcs. The standard way to generate this in GAMS is:

*
* nodes
*
set i 'nodes' /node1*node5000/;
alias(i,j);

*
* random arcs (1%)
*
set A(i,j) 'arcs';
A(i,j) = uniform(0,1)
<0.01;

scalar numArcs 'number of arcs';
numArcs =
card(A);
display numArcs;


Basically, this approach generates \(n^2\) random numbers  \( r_{i,j} \sim U(0,1)\) (uniform distribution) and picks the arcs related to a value \(r_{i,j}\lt 0.01\). This is a randomized process, so we don't see exactly 250,000 arcs, but rather:


----     16 PARAMETER numArcs              =   249170.000  number of arcs


This operation is not exactly cheap. It takes 11.2 seconds on my laptop.

There are a few other approaches we can try.

Random locations


Instead of generating \(n^2\) random numbers, we can generate \(k=250,000\) pairs of \(i,j\) values (integers between 1 and \(n\)). A straigthforward implementation would be:

scalars k,n,nn,ni,nj;
n =
card(i);
nn = n*n/100;
for (k = 1 to nn,
   ni = uniformint(1,n);
   nj = uniformint(1,n);
   A(i,j)$(
ord(i)=ni and ord(j)=nj) = yes;
);

Unfortunately, this is extremely slow. I stopped it after 2,000 seconds: it was still not finished. GAMS is horribly slow when doing loops like this.

Crazy indexing


A variant of the above approach is to try to trick GAMS into a vectorized version of this. Well, this is not so easy. Here is a version that uses leads/lags when indexing:

sets
   A(*,*)
'arcs'
   k
/node1*node5000,k5001*k250000/
   ij
/i,j/
;
parameter r(k,ij) 'random offset from k';
r(k,ij) = uniformint(1,
card(i)) - ord(k);
A(k+r(k,
'i'),k+r(k,'j')) = yes;


This requires some explanation. 
  • The set A is no longer domain-checked. We use funky indexing here, so we need loosen things up.
  • The set k has 250,000 elements. The first 5000 are identical to \(i,j\).
  • The parameter r contains \(2\times 250,000\) random integers between 1 and \(n\). We store them as offsets from the index \(k\). This sounds crazy but the next statement explains why this is done.
  • The assignment to A is using leads. Think of it as a variant of A(k,k) = yes.  This would populate a diagonal. Similarly, A(k+1,k) = yes shifts the diagonal by one. Using A(k+r(k,'i'),k+r(k,'j')) we index exactly the correct location.
  • We may generate duplicates, so the number will be slightly below 250k arcs.
This beast actually works, but the performance is not great. The fragment takes about 13 seconds, so no gain compared to our original approach.

Python loop


Let's try some Python. We can build up sets using Python within a GAMS model:

set A(i,j) 'arcs';

$onEmbeddedCode Python:
from random import seed,randint
seed(
999)
n = len(list(gams.get(
"i")))
nn = n*n//
100
a = {}
for k in range(nn):
   ni = randint(
1,n)
   nj = randint(
1,n)
   elem = (
"node{}".format(ni),"node{}".format(nj))
   a[elem] =
1
gams.set(
"A",list(a))
$offEmbeddedCode A



This is just a k-loop. We use here a Python dictionary to store elements as this handles possible duplicates correctly. This approach also takes about 11 seconds, so no gain compared to our first approach.

R to the rescue


So let's see if we can use R to speed things up:

$set n    5000
$set inc  data.inc

*
* R script
*
$onecho > script.R
library(data.table)
n <- %n%
nn <- n^2/100
df <- data.frame(
        
ni=sample(n,nn,replace=TRUE),
        
nj=sample(n,nn,replace=TRUE))
df <- df[order(df$ni,df$nj),]
v <- unique(paste0("node",df$ni,".node",df$nj))
fwrite(list(v),"%inc%",col.names=F,quote=F)
$offecho
$call '"c:\program files\R\R-3.5.0\bin\Rscript.exe" script.R'

*
* nodes and arcs
*
set i 'nodes' /node1*node%n%/;
alias(i,j);

set A(i,j) 'arcs' /
$offlisting
$include %inc%
$onlisting
/;


We write from inside GAMS an R script that gets executed by rscript.exe. The steps are:
  • sample the arcs,
  • sort in the order that GAMS likes,
  • create a string representation of the tuple, e.g. "node1.node2",
  • remove duplicates (this means we end up with a number of arcs that is smaller than 250k),
  • write to an include file (note: the function fwrite from the data.table package seems a bit faster than the function writeLines from the base package)
The include file is then processed by GAMS. To speed things up we remove echoing to the listing file. 

This method is the winner: it takes just 3.5 seconds. Sometimes using a plain old text file as a communication channel is not so bad.


GDX file is slower


The code:

$set n    5000
$set gdx  data.gdx

*
* R script
*
$onecho > script.R
library(gdxrrw)
library(dplyr)
n <- 5000
nn <- n^2/100
df <- data.frame(
 
ni=paste0("node",sample(n,nn,replace=TRUE)),
 
nj=paste0("node",sample(n,nn,replace=TRUE)),
 
stringsAsFactors = F)
df <- distinct(df,ni,nj)
nr <- nrow(df)
wgdx("%gdx%",list(name="A", type="set", dim=2, form="sparse", ts="arcs", val=cbind(1:nr,1:nr), uels=c(list(df$ni),list(df$nj))))
$offecho
$call '"c:\program files\R\R-3.5.0\bin\Rscript.exe" script.R'

*
* nodes and arcs
*
set i 'nodes' /node1*node%n%/;
alias(i,j);

set A(i,j) 'arcs';
$gdxin %gdx%
$loaddc A


is slower than using a text file. The reason is that wgdx is not as efficient.

Monday, April 6, 2020

Mondriaan Tilings 2

In [1], I showed a Mixed Integer Programming formulation for the Mondriaan Tiling Problem. To summarize, the problem can be stated as:

  1. Given an \(n\times n\) square area,
  2. Partition this area into differently sized rectangles,
  3. Such that the difference in size between the largest and the smallest rectangle is minimized.

Instead of partitioning, we can look at the problem as a tiling problem: given a set of differently sized tiles, place a subset of them on the area such that the whole area is covered and the difference in the size of used tiles (largest minus smallest) is minimized.

The model in [1] looked like:

Original MIP Model
\[\begin{align}\min\> & \color{darkred}u - \color{darkred}\ell\\ & \sum_{k,r,i,j|\color{darkblue}{\mathit{cover}}(k,r,i,j,i',j')}\color{darkred}x_{k,r,i',j'}=1 && \forall i',j'&&\text{cover cell $(i',j')$}\\ &\color{darkred}y_k = \sum_{r,i,j|\color{darkblue}{\mathit{ok}}(k,r,i,j)}\color{darkred}x_{k,r,i,j}&& \forall k && \text{select max one type of rectangle} \\ & \color{darkred}\ell \le \color{darkblue}{\mathit{area}}_k \cdot\color{darkred}y_k + \color{darkblue}M (1-\color{darkred}y_k) && \forall k && \text{bound smallest area}\\ & \color{darkred}u \ge \color{darkblue}{\mathit{area}}_k \cdot\color{darkred}y_k  && \forall k && \text{bound largest area}\\ & \sum_k \color{darkblue}{\mathit{area}}_k \cdot\color{darkred}y_k = \color{darkblue} n^2 && && \text{extra constraint}\\ & \color{darkred}x_{k,r,i,j} \in \{0,1\} \\&  \color{darkred}y_k \in \{0,1\} \\ & \color{darkred}u, \color{darkred}\ell \ge 0 \\& \color{darkblue}M = \max_k \color{darkblue}{\mathit{area}}_k\end{align}\]

The main (binary) variables are:

Decision Variables
\[\color{darkred} y_k = \begin{cases} 1 & \text{if tile $k$ is selected} \\ 0 & \text{otherwise} \end{cases}\]
\[\color{darkred} x_{k,r,i,j} = \begin{cases} 1 & \text{if tile $k$ (orientation $r$) is placed at location $(i,j)$} \\ 0 & \text{otherwise} \end{cases}\]


Here I want to discuss a different approach. Instead of solving the problem in one swoop for both \(y\) (select tiles) and \(x\) (place tiles), we can design a simple algorithm that iterates between two different models until we find a feasible integer solution. By ordering things from best to worse, as soon as we find a feasible solution, we have the optimal solution.

The first model is dedicated to select \(y\) variables (i.e. select tiles) such that \[ \sum_k \mathit{area}_k \cdot y_k = n^2\] and where the objective \[ \min \> u-\ell\] is optimized. The second model takes a solution for \(y\) and checks if it is feasible. If not feasible, we go back to model 1 and generate a new, next best solution for \(y\).

Model 1: selection of tiles, can look like:

MIP Model 1: Generate proposals for \(y\)
\[\begin{align}\min\> & \color{darkred}u - \color{darkred}\ell\\ & \color{darkred}\ell \le \color{darkblue}{\mathit{area}}_k \cdot\color{darkred}y_k + \color{darkblue}M (1-\color{darkred}y_k) && \forall k && \text{bound smallest area}\\ & \color{darkred}u \ge \color{darkblue}{\mathit{area}}_k \cdot\color{darkred}y_k  && \forall k && \text{bound largest area}\\ & \sum_k \color{darkblue}{\mathit{area}}_k \cdot\color{darkred}y_k = \color{darkblue} n^2 && && \text{area constraint}\\ & \sum_k (2 \color{darkblue}Y^*_{p,k} -1) \color{darkred}y_k \le \sum_k \color{darkblue}Y^*_{p,k}-1 && p \in P && \text{cut off previously found solutions} \\ &  \color{darkred}y_k \in \{0,1\} \\ & \color{darkred}u, \color{darkred}\ell \ge 0 \\& \color{darkblue}M = \max_k \color{darkblue}{\mathit{area}}_k\end{align}\]

Previously found solutions for \(y\) are stored in the data structure \(Y^*\). The cut will forbid any previously found solution to be considered again. In the first solve, we can use \(P=\emptyset\) (no previously found solutions).

The second model establishes whether the solution \(y^*\) in model 1 is feasible. This model, given a selection of tiles (\(y^*\)) will try to place the tiles such that all locations of the whole area are covered by exactly one tile.


MIP Model 2: Check feasibility of \(y^*\)
\[\begin{align}\min\> & 0 \\ & \sum_{k^*,r,i,j|\color{darkblue}{\mathit{cover}}(k^*,r,i,j,i',j')}\color{darkred}x_{k^*,r,i',j'}=1 && \forall i',j' \text{ and } k^* \in \color{darkblue} K^* &&\text{cover cell $(i',j')$}\\ &\sum_{r,i,j|\color{darkblue}{\mathit{ok}}(k^*,r,i,j)}\color{darkred}x_{k^*,r,i,j} = 1&& \text{where } k^* \in \color{darkblue} K^* && \text{select one type of rectangle}\\ & \color{darkred}x_{k^*,r,i,j} \in \{0,1\}  \\ & \color{darkblue} K^* = \{k|\color{darkblue}y_k^*=1\} && &&\text{set of selected tiles}\end{align}\]

The algorithm can look like:

  1. Initialization: \(P = \emptyset\).
  2. Solve model 1
  3. Record the solution \(y^*_k\), augment the set \(P\) and add this solution to \(Y^*_{p,k}\).
  4. Solve model 2
  5. If feasible: STOP. We have found the optimal solution.
  6. Go to step 2.

Small example


When we solve a \(10\times 10\) problem, we need 10 cycles:



----    227 PARAMETER ysolrep  solutions for m1 (ordered by obj)

              obj    smallest     largest      status

sol1            3          24          27  Infeasible
sol2            6          18          24  Infeasible
sol3            6          18          24  Infeasible
sol4            7           9          16  Infeasible
sol5            7           9          16  Infeasible
sol6            7          21          28  Infeasible
sol7            7          21          28  Infeasible
sol8            7          14          21  Infeasible
sol9            7          14          21  Infeasible
sol10           8           4          12    Feasible


The status column indicates what model 2 thinks about this proposal. The first 9 generated proposals were infeasible. The first tiling with a difference of 8 was feasible, and we can declare that an optimal solution. We have proven that no better solution exists.

The results look like:

Solution for 10x10 problem


Large example


The results for a \(25\times 25\) problem are:

Optimal solution for 25x25 problem

This problem required 246 cycles (i.e. 496 models to be solved).

References


  1. Mondriaan Tilings, https://yetanothermathprogrammingconsultant.blogspot.com/2020/04/mondriaan-tilings.html The problem and a mathematical model are detailed in this post. 

Thursday, April 2, 2020

On the importance of well-written error messages

This is funny:
"I also tried to get Fortran programs running on Imperial's IBM 7010, since I sort of knew Fortran, certainly better than I knew Cobol, and Fortran would have been better suited for data analysis. It was only after weeks of fighting JCL, IBM's Job Control Language, that I deduced that there was no Fortran compiler on the 7010, but the JCL error messages were so inscrutable that no one else had figured that out either." [1]

Indeed, even today error messages are often poorly formulated and leaving users in the dark.

IBM 7010


References

  1. Brian Kernighan, UNIX, A History and a Memoir, 2020. 

Wednesday, April 1, 2020

Mondriaan Tilings


Introduction



Piet Mondriaan (later Mondrian) (1872-1944) was a Dutch painter, famous for his abstract works [1]. 



A tiling problem is related to this [2]:



The problem can be stated as follows:
  • Given is a square \((n \times n)\) area,
  • Partition this square into multiple, integer-sided, non-congruent rectangles,
  • Minimize the difference in size between the largest and the smallest rectangle.
Non-congruent means: only one \((p \times q)\) or \((q\times p)\) rectangle can be used.

It is interesting to make some predictions.

  • We try not to use a combination of very small and very large rectangles. That is obvious.
  • Only using small rectangles will always lead to a larger difference, as we need many small rectangles to cover the complete square. This will lead automatically to a larger range of sizes.
  • Very large rectangles are probably no good either: high probability we need some very small rectangles to fill up any remaining holes.
  • So I guess, we will use medium-sized rectangles of similar size. 


Example


Consider a small problem with \(n=5\). The possible sides for each rectangle are \(1,\dots,n\). From this, we can form a set of possible rectangles (I automated this step):



----     65 SET s  sizes of rectangles (height or width)

s1,    s2,    s3,    s4,    s5


----     65 SET k  rectangle id

k1 ,    k2 ,    k3 ,    k4 ,    k5 ,    k6 ,    k7 ,    k8 ,    k9 ,    k10,    k11,    k12,    k13,    k14


----     65 SET rec  rectangles

                s1          s2          s3          s4

k1 .s1         YES
k2 .s2         YES
k3 .s2                     YES
k4 .s3         YES
k5 .s3                     YES
k6 .s3                                 YES
k7 .s4         YES
k8 .s4                     YES
k9 .s4                                 YES
k10.s4                                             YES
k11.s5         YES
k12.s5                     YES
k13.s5                                 YES
k14.s5                                             YES

We see there are 14 different rectangles we can use. Note that the rectangle with size \((5 \times 5)\) is excluded. It would completely fill the area, while we need to place multiple rectangles. Also note: we only have \((p \times q)\) but not \((q \times p)\).  We cannot use both such rectangles as they are congruent. We can rotate a rectangle when placing it.

Optimal solution


An optimal solution for this problem is:


These are rectangles \(k5, k6, k12\) with \(k5\) being rotated. Obviously, this solution is not unique.

Optimization model


To model this, I use a binary variable: \[ x_{k,r,i,j} = \begin{cases} 1 & \text{if rectangle $k$ (rotated if $r=\mathrm{y}$) is placed at location $(i,j)$} \\ 0 & \text{otherwise}\end{cases}\] Here \(r = \{\mathrm{n},\mathrm{y}\}\) indicates if a rectangle is rotated. Note that not all rectangles can be rotated (it does not make sense to rotate a square, or there may not be space left to rotate for a given position).

For convenience, I also introduce \[y_{k} = \begin{cases} 1 & \text{if rectangle $k$ is used} \\  0 & \text{otherwise}\end{cases}\] This is essentially an aggration of \(x\).

In addition, I use a few more complex data structures: \[\mathit{ok}_{k,r,i,j} = \begin{cases} \mathit{Yes} & \text{if we can place rectangle $k$ at location $(i,j)$} \\ \mathit{No} & \text{otherwise}\end{cases}\] Again, \(r\) indicates if rectangle \(k\) is rotated. I implemented this as a GAMS set. You can consider it as a Boolean data structure. Furthermore: \[\mathit{cover}_{k,r,i,j,i',j'} = \begin{cases} \mathit{Yes} & \text{if cell $(i',j')$ is covered by placing rectangle $k$ at location $(i,j)$} \\ \mathit{No} & \text{otherwise}\end{cases}\]

These sets are large. The set ok looks like:


----    103 SET ok  usable position

INDEX 1 = k1

              j1          j2          j3          j4          j5

n.i1         YES         YES         YES         YES         YES
n.i2         YES         YES         YES         YES         YES
n.i3         YES         YES         YES         YES         YES
n.i4         YES         YES         YES         YES         YES
n.i5         YES         YES         YES         YES         YES

INDEX 1 = k2

              j1          j2          j3          j4          j5

n.i1         YES         YES         YES         YES         YES
n.i2         YES         YES         YES         YES         YES
n.i3         YES         YES         YES         YES         YES
n.i4         YES         YES         YES         YES         YES
y.i1         YES         YES         YES         YES
y.i2         YES         YES         YES         YES
y.i3         YES         YES         YES         YES
y.i4         YES         YES         YES         YES
y.i5         YES         YES         YES         YES

...

INDEX 1 = k13

              j1          j2          j3

n.i1         YES         YES         YES
y.i1         YES
y.i2         YES
y.i3         YES

INDEX 1 = k14

              j1          j2

n.i1         YES         YES
y.i1         YES
y.i2         YES


The set cover is even larger:


----    103 SET cover  coverage of cells

INDEX 1 = k1  INDEX 2 = n  INDEX 3 = i1  INDEX 4 = j1

            j1

i1         YES

INDEX 1 = k1  INDEX 2 = n  INDEX 3 = i1  INDEX 4 = j2

            j2

i1         YES

...

INDEX 1 = k14  INDEX 2 = y  INDEX 3 = i1  INDEX 4 = j1

            j1          j2          j3          j4          j5

i1         YES         YES         YES         YES         YES
i2         YES         YES         YES         YES         YES
i3         YES         YES         YES         YES         YES
i4         YES         YES         YES         YES         YES

INDEX 1 = k14  INDEX 2 = y  INDEX 3 = i2  INDEX 4 = j1

            j1          j2          j3          j4          j5

i2         YES         YES         YES         YES         YES
i3         YES         YES         YES         YES         YES
i4         YES         YES         YES         YES         YES
i5         YES         YES         YES         YES         YES


The reason to develop these sets is to make the equations simpler. In addition, we can develop, test and debug the calculation of these sets in isolation of the test of the model and before the equations are written.

With this, we can formulate:

MIP Model
\[\begin{align}\min\> & \color{darkred}u - \color{darkred}\ell\\ & \sum_{k,r,i,j|\color{darkblue}{\mathit{cover}}(k,r,i,j,i',j')}\color{darkred}x_{k,r,i',j'}=1 && \forall i',j'&&\text{cover cell $(i',j')$}\\ &\color{darkred}y_k = \sum_{r,i,j|\color{darkblue}{\mathit{ok}}(k,r,i,j)}\color{darkred}x_{k,r,i,j}&& \forall k && \text{select max one type of rectangle} \\ & \color{darkred}\ell \le \color{darkblue}{\mathit{area}}_k \cdot\color{darkred}y_k + \color{darkblue}M (1-\color{darkred}y_k) && \forall k && \text{bound smallest area}\\ & \color{darkred}u \ge \color{darkblue}{\mathit{area}}_k \cdot\color{darkred}y_k  && \forall k && \text{bound largest area}\\ & \sum_k \color{darkblue}{\mathit{area}}_k \cdot\color{darkred}y_k = \color{darkblue} n^2 && && \text{extra constraint}\\ & \color{darkred}x_{k,r,i,j} \in \{0,1\} \\&  \color{darkred}y_k \in \{0,1\} \\ & \color{darkred}u, \color{darkred}\ell \ge 0 \\& \color{darkblue}M = \max_k \color{darkblue}{\mathit{area}}_k\end{align}\]


When solving this model, we see:


----    146 VARIABLE x.L  place tile

                  j1          j4

k5 .y.i4           1
k6 .n.i1           1
k12.n.i1                       1


----    149 VARIABLE y.L  tile is used

k5  1,    k6  1,    k12 1


----    157 PARAMETER v  tile numbers 

            j1          j2          j3          j4          j5

i1           6           6           6          12          12
i2           6           6           6          12          12
i3           6           6           6          12          12
i4           5           5           5          12          12
i5           5           5           5          12          12


----    159 VARIABLE smallest.L            =        6.000  smallest used tile
            VARIABLE largest.L             =       10.000  largest used tile
            VARIABLE z.L                   =        4.000  objective


It is noted that these types of formulations lead to very large models [5].

One remaining question is whether we can use some symmetry-breaking constraint to speed up solution times.

A larger problem


It looks like \(n=19\) is interesting [3]. This is about the limit that we can solve comfortably (a few hours to global optimality) with a MIP solver. This model had about 36,000 binary variables.

An optimal 19x19 problem

The colors are not part of the model: they were added by hand to make the picture more Mondrianic.

A list of optimal objective values can be found in [4].

References


  1. Piet Mondrian, https://en.wikipedia.org/wiki/Piet_Mondrian 
  2. Mondrian Puzzle -- Numberphile, https://www.youtube.com/watch?v=49KvZrioFB0
  3. Brady Haran, Mondrian Art Puzzlehttps://www.bradyharanblog.com/blog/mondrian-art-puzzle
  4. A276523 in The Online Encyclopedia of Sequenceshttps://oeis.org/A276523
  5. Tiling, https://yetanothermathprogrammingconsultant.blogspot.com/2020/03/tiling.html This post shows a comparison of the above formulation with a "no-overlap" formulation for a related problem.
  6. Mondriaan Tilings 2, http://yetanothermathprogrammingconsultant.blogspot.com/2020/04/mondriaan-tilings-2.html. This post demonstrates an alternative approach to this problem.