Wednesday, February 21, 2018

Modeling vs Programming, Python vs GAMS

In [1] a question is posed about formulating an LP model dealing with a shortest path problem. The suggested solution is programmed in Python. This is a good example where coding even in a high-level programming language is less compact than a formulation in a modeling language. Lots of code is needed to shoe-horn the model and data into an LP data structure. We see that a standard matrix form LP:

\[\begin{align} \min\>&c^Tx\\ &Ax=b\\&x\ge 0\end{align}\]

as data structure is not always a good abstraction level. In the GAMS model below, we actually use a model that is closer to the problem at hand:

\[\begin{align} \min\> & z = \sum_{(i,j) \in A} d_{i,j} x_{i,j} \\& \sum_{(j,i) \in A} x_{j,i} + b_i = \sum_{(i,j) \in A} x_{i,j} & \forall i\\& x_{i,j}\ge 0 \end{align}\]

Here \(A\) is the set of (directed) arcs. I.e. we lower the abstraction level. This looks like a bad idea, but it makes the "step" we need to make from problem to code much smaller. Quantitatively, we need between 2 and 3 times the number of lines of code in Python compared to the corresponding GAMS model. Python is well-known for its compactness: no need for declarations and a rich library of built-in functions for data manipulation lead to fewer lines of code than needed in many other languages. The GAMS model devotes quite a few lines to declarations. This requires some extra typing but it will provide type and domain checks that can help for larger, more complex models. In the end we need much more code in the Python model than in the GAMS model.  Besides just looking at compactness, I also think the Python version is much more obfuscated: it is difficult to even recognize this as a shortest path network LP model. The GAMS model is very close to the mathematical model. Readability is probably the most important feature when dealing with implementing maintainable, real optimization models. In the Python code this got lost. Note that this shortest path LP is a very simple model. The above observations are even more relevant in larger, more complex models that we see in practice. Finally, of course there are many Python modeling frameworks that provide better tools to create LP models.

I often do not understand the attractiveness of solving LP models using matrix oriented APIs. These type of APIs are found in Python and are prevalent in R and Matlab [2]. Only for some very structured models this interface is really easy to use. The argument between modeling systems vs matrix generators actually goes back a long time [3].

Python code

Code is taken from [1].

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
import numpy as np
from scipy.optimize import linprog

""" DATA"""
edges = [('A', 'B', 8),
         ('A', 'F', 10),
         ('B', 'C', 4),
         ('B', 'E', 10),
         ('C', 'D', 3),
         ('D', 'E', 25),
         ('D', 'F', 18),
         ('E', 'D', 9),
         ('E', 'G', 7),
         ('F', 'A', 5),
         ('F', 'B', 7),
         ('F', 'C', 3),
         ('F', 'E', 2),
         ('G', 'D', 2),
         ('G', 'H', 3),
         ('H', 'A', 4),
         ('H', 'B', 9)]
s, t = 'G', 'C'

""" Preprocessing """
nodes = sorted(set([i[0] for i in edges]))  # assumption: each node has an outedge
n_nodes = len(nodes)
n_edges = len(edges)

edge_matrix = np.zeros((len(nodes), len(nodes)), dtype=int)
for edge in edges:
    i, j, v = edge
    i_ind = nodes.index(i)  # slow
    j_ind = nodes.index(j)  # """
    edge_matrix[i_ind, j_ind] = v

nnz_edges = np.nonzero(edge_matrix)
edge_dict = {}
counter = 0
for e in range(n_edges):
    a, b = nnz_edges[0][e], nnz_edges[1][e]
    edge_dict[(a,b)] = counter
    counter += 1

s_ind = nodes.index(s)
t_ind = nodes.index(t)

""" LP """
bounds = [(0, 1) for i in range(n_edges)]
c = [i[2] for i in edges]

A_rows = []
b_rows = []
for source in range(n_nodes):
    out_inds = np.flatnonzero(edge_matrix[source, :])
    in_inds = np.flatnonzero(edge_matrix[:, source])

    rhs = 0
    if source == s_ind:
        rhs = 1
    elif source == t_ind:
        rhs = -1

    n_out = len(out_inds)
    n_in = len(in_inds)

    out_edges = [edge_dict[a, b] for a, b in np.vstack((np.full(n_out, source), out_inds)).T]
    in_edges = [edge_dict[a, b] for a, b in np.vstack((in_inds, np.full(n_in, source))).T]

    A_row = np.zeros(n_edges)
    A_row[out_edges] = 1
    A_row[in_edges] = -1

    A_rows.append(A_row)
    b_rows.append(rhs)

A = np.vstack(A_rows)
b = np.array(b_rows)
res = linprog(c, A_eq=A, b_eq=b, bounds=bounds)
print(res)

The reason this model is less than obvious is that we need to map an irregular collection of vertices \((i,j)\) to a one-dimensional decision variable \(x_k\). If the graph is dense we can easily do a simple calculation: \(k := i + j*n\). But in this case our graph is sparse so the mapping \((i,j) \rightarrow k\) is not straightforward. In addition we need to be able to find all arcs going into \(i\) or leaving \(i\). Some well-chosen dicts can help here. In the end we do a lot of programming not directly related to the mathematical model. This means the model itself gets obscured by large amounts of code. And of course, this code needs to be debugged, maintained etc. so it is certainly not without cost.

GAMS Model


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
set i 'nodes' /A*H/;
alias(i,j);
parameters
  d(i,j) 'distances' /
    A.(B 8, F 10)
    B.(C 4, E 10)
    C.D 3
    D.(E 25, F 18)
    E.(D 9, G 7)
    F.(A 5, B 7, C 3, E 2)
    G.(D 2, H 3)
    H.(A 4, B 9) /
  inflow(i) 'rhs' /
    G 1
    C -1
   /;

set a(i,j) 'arcs exist if we have a distance';
a(i,j) = d(i,j);

positive variables x(i,j) 'flow';
variable z 'obj';

equations
   obj   'objective'
   nodebal(i) 'node balance'
;

obj.. z =e= sum(a,d(a)*x(a));
nodebal(i)..  sum(a(j,i), x(j,i)) + inflow(i) =e= sum(a(i,j), x(i,j));

model m /all/;
solve m minimizing z using lp;
display x.l;

The operations that were so problematic in Python, are actually straightforward in GAMS. Just write the equations as stated in the mathematical formulation. The set a implements the set of arcs. This allows us to have quite a natural expression of the flow conservation constraints.

The GAMS model is actually well suited to solve large sparse instances. To test this we can generate random data:

set i 'nodes' /node1*node1000/;
alias(i,j);
parameters
  d(i,j) 
'distances'
  inflow(i) 
'rhs' /
    
node1 1
    
node1000 -1
   
/;
d(i,j)$(uniform(0,1)  < 0.05) = uniform(1,100);


This will give us a network with 1,000 nodes and about 50,000 arcs. We don't need to change any of the data representation as all parameters and (multi-dimensional) sets are automatically stored using sparse data structures. This model with 1,000 equations and 50k variables takes 0.3 seconds to generate and 0.3 seconds to solve (it is a very easy LP problem). Note that the solver called in the above Python code is not able to handle a model of this size. The Python code takes 5.5 seconds to generate the LP problem; then we are stuck in the LP solver for hours. Dense LP solvers are really only for very small problems.


References


  1. Using SciPy's minimize to find the shortest path in a graph, https://stackoverflow.com/questions/48898447/using-scipys-minimize-to-find-the-shortest-path-in-a-graph
  2. Matlab vs GAMS: linear programming, http://yetanothermathprogrammingconsultant.blogspot.com/2016/09/matlab-vs-gams-linear-programming.html
  3. R. Fourer, Modeling Languages versus Matrix Generators for Linear Programming. ACM Transactions on Mathematical Software 9 (1983) 143-183.