Thursday, January 28, 2021

Assignment problem with a wrinkle formulated as a network problem

In a previous post [1] I discussed a simple problem. But it turned out not so easy to solve for some of the larger data sets. Basically, it was an assignment problem with an extra condition. The problem was a follows:

Consider two arrays $$\color{darkblue}a_i$$ (length $$\color{darkblue}m$$) and $$\color{darkblue}b_j$$ (length $$\color{darkblue}n$$) with $$\color{darkblue}m \lt \color{darkblue}n$$. Assign all values $$\color{darkblue}a_i$$ to a $$\color{darkblue}b_j$$ such that:

• Each $$\color{darkblue}b_j$$ can have 0 or 1 $$\color{darkblue}a_i$$ assigned to it.
• The assignments need to maintain the original order of $$\color{darkblue}a_i$$. I.e. if $$\color{darkblue}a_i \rightarrow \color{darkblue}b_j$$ then $$\color{darkblue}a_{i+1}$$ must be assigned to a slot in $$\color{darkblue}b$$ that is beyond slot $$j$$. In the picture below that means that arrows cannot cross.
• Do this while minimizing the sum of the products.

In [1], I attacked this as a mixed-integer programming problem. In this post, I want to see if we can solve this as a network problem. This was largely inspired by the comments in [1].

Shortest path algorithms and models

The shortest path problem can be solved in different ways. Dijkstra's algorithm is a popular choice due to its simplicity and speed. In [2] we can read a quote by Edsger Dijkstra:

What is the shortest way to travel from Rotterdam to Groningen, in general: from given city to given city. It is the algorithm for the shortest path, which I designed in about twenty minutes. One morning I was shopping in Amsterdam with my young fiancée, and tired, we sat down on the café terrace to drink a cup of coffee and I was just thinking about whether I could do this, and I then designed the algorithm for the shortest path. As I said, it was a twenty-minute invention. In fact, it was published in '59, three years later. The publication is still readable, it is, in fact, quite nice. One of the reasons that it is so nice was that I designed it without pencil and paper. I learned later that one of the advantages of designing without pencil and paper is that you are almost forced to avoid all avoidable complexities. Eventually, that algorithm became to my great amazement, one of the cornerstones of my fame.
For an early implementation in a computer program, Dijkstra used a set of 64 nodes corresponding to Dutch cities. 64 was chosen so the node number could be encoded in 6 bits [2].

The shortest path can also be formulated as an LP problem. Standard LP solvers can handle quite large problems (each node becomes a constraint, and each arc a variable). In addition, a solver like Cplex contains a specialized network solver for such LPs.

Graph

We start with the nodes. We denote the nodes by $$\color{darkblue}n_{i,j}$$ representing: $$\color{darkblue}a_i$$ is assigned to $$\color{darkblue}b_j$$. Not all assignments are possible. For instance, we cannot assign $$\color{darkblue}a_2$$ to $$\color{darkblue}b_1$$. That means: node  $$\color{darkblue}n_{2,1}$$  does not exist.

We also need a source node and a sink node.

The arcs indicate: after assigning $$\color{darkblue}a_i \rightarrow \color{darkblue}b_j$$ we need to assign the next item $$\color{darkblue}a_{i+1}\rightarrow \color{darkblue}b_{j+k}$$ for some $$k\ge 1$$. In addition we need to connect the source node to all nodes with $$i=1$$ and the sink node to all nodes with $$i=3$$ (our last $$\color{darkblue}a_i$$).  So our network looks like:

 Network representation

Note how any arc not connected to the sink or source node, goes to the right and downwards.

We have costs for visiting each node: $$\color{darkblue}a_i\cdot\color{darkblue}b_j$$.  As we want to formulate this as a shortest path problem, we need to allocate these costs to arcs. I used the incoming arcs for this. So any arc $$e_{i,j,i',j'}$$ gets a cost $$\color{darkblue}a_{i'}\cdot\color{darkblue}b_{j'}$$.  The arcs to the sink node have a zero cost.

Note: this graph is acyclic, so negative arc lengths are ok. We will never see negative cycles.

As you can see: because the nodes have two indices, the arcs have four! That will be fun.

Implementation

I implemented the network model in GAMS and solved it as an LP.

The data for our tiny data set looks like:

----     15 SET i

i1,    i2,    i3

----     15 SET j

j1,    j2,    j3,    j4,    j5,    j6

----     15 PARAMETER a

i1 1.000,    i2 2.000,    i3 3.000

----     15 PARAMETER b

j1  4.000,    j2  9.000,    j3  5.000,    j4  3.000,    j5  2.000,    j6 10.000


From this, we generate our nodes and arcs:

----     31 SET n  nodes

j1          j2          j3          j4          j5          j6

src         YES
i1                      YES         YES         YES         YES
i2                                  YES         YES         YES         YES
i3                                              YES         YES         YES         YES
snk         YES

----     32 PARAMETER numnodes             =       14.000

----     41 SET e  arcs

src.  .i1 .j1,    src.  .i1 .j2,    src.  .i1 .j3,    src.  .i1 .j4,    i1 .j1.i2 .j2,    i1 .j1.i2 .j3
i1 .j1.i2 .j4,    i1 .j1.i2 .j5,    i1 .j2.i2 .j3,    i1 .j2.i2 .j4,    i1 .j2.i2 .j5,    i1 .j3.i2 .j4
i1 .j3.i2 .j5,    i1 .j4.i2 .j5,    i2 .j2.i3 .j3,    i2 .j2.i3 .j4,    i2 .j2.i3 .j5,    i2 .j2.i3 .j6
i2 .j3.i3 .j4,    i2 .j3.i3 .j5,    i2 .j3.i3 .j6,    i2 .j4.i3 .j5,    i2 .j4.i3 .j6,    i2 .j5.i3 .j6
i3 .j3.snk.  ,    i3 .j4.snk.  ,    i3 .j5.snk.  ,    i3 .j6.snk.

----     42 PARAMETER numarcs              =       28.000

----     47 PARAMETER c  cost of arcs

src.  .i1.j1  4.000,    src.  .i1.j2  9.000,    src.  .i1.j3  5.000,    src.  .i1.j4  3.000,    i1 .j1.i2.j2 18.000
i1 .j1.i2.j3 10.000,    i1 .j1.i2.j4  6.000,    i1 .j1.i2.j5  4.000,    i1 .j2.i2.j3 10.000,    i1 .j2.i2.j4  6.000
i1 .j2.i2.j5  4.000,    i1 .j3.i2.j4  6.000,    i1 .j3.i2.j5  4.000,    i1 .j4.i2.j5  4.000,    i2 .j2.i3.j3 15.000
i2 .j2.i3.j4  9.000,    i2 .j2.i3.j5  6.000,    i2 .j2.i3.j6 30.000,    i2 .j3.i3.j4  9.000,    i2 .j3.i3.j5  6.000
i2 .j3.i3.j6 30.000,    i2 .j4.i3.j5  6.000,    i2 .j4.i3.j6 30.000,    i2 .j5.i3.j6 30.000


Our nodes are two-dimensional, so somewhat artificially, the source and the sink node are denoted by $$\color{darkblue}n_{'src',''}$$ and $$\color{darkblue}n_{'snk',''}$$. Again, the costs of visiting a node are allocated to the incoming arcs. Note that zero costs are not printed (the parameter is stored as a sparse data structure, so zero and does not exist is the same).

The LP model for this problem can look like:

Shortest Path LP Model
\begin{align}\min& \sum_{i,j,i',j'}\color{darkred}f_{i,j,i',j'}\cdot\color{darkblue}c_{i,j,i',j'}\\ &\sum_{i',j'|e(i',j',i,j)} \color{darkred}f_{i',j',i,j} + \color{darkblue}g_{i,j} = \sum_{i',j'|e(i,j,i',j')} \color{darkred}f_{i,j,i',j'} &&\forall \color{darkblue}n_{i,j}\\ & \color{darkred}f_{i,j,i',j'} \in [0,1] \end{align}

where $$\color{darkred}f$$ is our flow variable and $$\color{darkblue}g$$ is exogenous inflow. In our case:$\color{darkblue}g_{i,j} = \begin{cases} 1 & \text{for the source node}\\ -1 & \text{for the sink node} \\ 0 & \text{for all other nodes}\end{cases}$

The results are:

----     75 VARIABLE cost.L                =       16.000  objective

----     75 VARIABLE f.L  flow

src.  .i1 .j1 1.000,    i1 .j1.i2 .j4 1.000,    i2 .j4.i3 .j5 1.000,    i3 .j5.snk.   1.000

----     79 SET assign  assignments recovered from flows

j1          j4          j5

i1         YES
i2                     YES
i3                                 YES


There are different ways we can solve a shortest path problem. Here we look at three:
1. Cplex default LP solver (dual simplex)
2. Cplex network solver
3. Sparse version of Dijkstra's algorithm from scipy.sparse.csgraph. This requires some programming to get the data organized in a form that is accepted by this function.
The $$m=100, n=1000$$ problem is a bit large for Cplex to handle as an LP, but let's do the $$m=50, n=500$$ data set. Here are the results:

MIP ModelNetwork LP ModelDijkstra
solverMIPLP defaultNetworkSparse
a/b length50/50050/50050/50050/500
nodes/arcs22,552/4,995,27622,552/4,995,27622,552/4,995,276
rows/columns/nz650/25,025/100,14722,553/4,995,277/14,985,37822,553/4,995,277/14,985,378
objective6466.6736466.6736466.6736466.673
time59721total:38
presolve:32
extract network:2
solve network:3
0.1
b&b nodes5,023
iterations60,19911,535100,446

This looks better. My setup is a bit slow using unoptimized Python code, but the raw solve is very fast. When trying the $$m=100,n=1000$$ data set, I got:

MIP ModelDijkstra
a/b length100/1000100/1000
variables/constraints1,300/100,100
nodes/arcs90,102/40,230,551
objective14,371.45514,371.455
time1,8530.9

Note that the shortest path version would result in an LP with 90k rows and 40 million columns. That is very large. So here a specialized shortest path algorithm is far superior to more general tools.

The code for setting up the problem for Dijkstra's algorithm is not completely trivial. It is easy to make mistakes here. In this case, it was very beneficial to have access to a working MIP and LP model. This makes debugging the Python code easier. In addition, we have a way to validate the results. Solving a model in (very) different ways is an excellent approach to make sure we produce correct results.

Negative lengths

It is noted that Dijkstra assumes non-negative lengths (costs). After talking to the author of the original post, it was observed that some of the $$\color{darkblue}b_j$$ can be negative. In that case, a different algorithm (like Johnson or Bellman-Ford) may be more appropriate.

Notes:
• When using Dijkstra with negative lengths, I see the message:
<ipython-input-121-4c8fd24b72ef>:67: UserWarning: Graph has negative weights: dijkstra will give inaccurate results if the graph contains negative cycles. Consider johnson or bellman_ford.
distmat,pred = scipy.sparse.csgraph.dijkstra(spmat,indices=source,return_predecessors=True)
• This message seems to indicate that negative lengths are only a problem if there are negative cycles. As there are no cycles in our problem, does this imply we can use Dijkstra? I suspect this is not the case.
• Q: Does Dijkstra require non-negative lengths or only non-negative cycles? I think the answer is: non-negative lengths.
• Even with Johnson we may see:
NegativeCycleError: Negative cycle detected on node 0
• This can't be true, as there are no cycles in our network. It is a DAG (Direct Acyclic Graph). I think that the function complains as soon as a path has a negative length. That is slightly different.
• This is a bit more delicate than I expected. Not in the least because the error and warning messages are not as precise as they could be. Formulating good messages requires some thought, and I think that often not much thought is going into this.

Appendix 1: GAMS code to solve network problem

The code to form a shortest path network for this problem is not completely trivial. So here is my GAMS model for this:

 $ontext Network version of assignment problem with ordering$offtext * macros for sizing the problem $set m 10$set n 100 $set irange i1*i%m%$set jrange  j1*j%n% sets   dummy 'ordering just for better looking output' /src,%irange%,%jrange%,snk/   i0  'includes src/snk' /%irange%,src,snk/   j0  'includes blank for use with src/snk' /%jrange%,''/   i(i0) 'core network' /%irange%/   j(j0) 'core network' /%jrange%/   ifirst(i) /i1/   ilast(i)  /i%m%/ ; alias(i,ii),(j,jj),(i0,ii0),(j0,jj0); * random data parameter   a(i)   b(j) ; a(i) = uniform(0,100); b(j) = uniform(0,100); display a,b; scalars   lena 'length of a' /%m%/   lenb 'length of b' /%n%/ ; *-------------------------------------------------------------- * network topology *-------------------------------------------------------------- scalars   numnodes   numarcs ; set n(i0,j0) 'nodes (correspond to assignments)'; n(i,j) = ord(j)>=ord(i) and ord(j)<=lenb-lena+ord(i); n('src','') = yes; n('snk','') = yes; numnodes = card(n); display$(numnodes<1000) n; display numnodes; alias(n,nn); set e(i0,j0,ii0,jj0) 'arcs'; e(n(i,j),nn(i+1,jj)) = ord(jj)>ord(j); e('src','',n(ifirst,j)) = yes; e(n(ilast,j),'snk','') = yes; option e:0:0:7; numarcs = card(e); display$(numarcs<10000) e; display numarcs; parameter c(i0,j0,ii0,jj0) 'cost of arcs'; c(e(n,i,j)) = a(i)*b(j); option c:3:0:7; display$(numarcs<1000) c; *-------------------------------------------------------------- * shortest path model *-------------------------------------------------------------- parameter exinflow(i0,j0) 'exogenous inflow' / src.'' 1 snk.'' -1 /; binary variable f(i0,j0,i0,j0) 'flow'; variable cost 'objective'; equations obj 'objective' flowbal(i0,j0) 'flow balance at node n' ; obj.. cost =e= sum(e, f(e)*c(e)); flowbal(n).. sum(e(nn,n),f(e)) + exinflow(n) =e= sum(e(n,nn),f(e)); model m /all/; option limrow=0,limcol=0,solprint=off; m.optfile=1; solve m minimizing cost using rmip; option f:3:0:7; display cost.l,f.l; set assign(i0,j0) 'assignments recovered from flows'; assign(n) = sum(e(nn,n),c(e)*f.l(e)); option assign:0:0:7; display assign;$onecho > cplex.opt * save some memory/time names no * select network solver lpmethod 3 \$offecho

Appendix B: Python code

This is the python code I used to solve the problem using Dijkstra's shortest path algorithm. Setting up the problem is way more expensive than solving it. Q: is there a way to make the setup really fast? And maybe get rid of the complex indexing?

import sys
import pickle
import scipy.sparse
import scipy.sparse.csgraph
import time

print(sys.version)

#
#
data = pickle.load( open( r"d:\projects\blog\R\save.p", "rb" ) )

a = data['a']
b = data['b']

m = len(a)
n = len(b)
print("m={},n={}".format(m,n))

#
# core network
#
nodes = {(i,j):((n-m+1)*i+j-i) for i in range(m) for j in range(i,n-m+i+1)}
arcs = {(i,j,i+1,k):a[i+1]*b[k] for i in range(m-1) for j in range(i,n-m+i+1) for k in range(j+1,n-m+i+2)}
# value for nodes is the node sequence number
# value for arcs is the cost coefficient

#
# add source and sink node
#
src = ("src","src")
snk = ("snk","snk")
nodes[src]=len(nodes)
nodes[snk]=len(nodes)

# add arcs related to src,snk nodes
for j in range(0,n-m+1):
arcs[('src','src',0,j)] = a[0]*b[j]
for j in range(m-1,n):
arcs[(m-1,j,'snk','snk')] = 0

numnodes,numarcs = len(nodes),len(arcs)
print("numnodes={},numarcs={}".format(numnodes,numarcs))

#
# setup sparse matrix
#
row = [nodes[(i,j)] for (i,j,k,l) in arcs ]
col = [nodes[(k,l)] for (i,j,k,l) in arcs ]
weight = list(arcs.values())
spmat = scipy.sparse.csr_matrix((weight,(row,col)),shape=(numnodes,numnodes))

#
# solve
#
source = [nodes[src]]
start = time.time()
sol = scipy.sparse.csgraph.dijkstra(spmat,indices=source)
t = time.time()-start
print("shortest path length: {}".format(sol[0,nodes[snk]]))
print("time: {0:.1f} seconds".format(t))


The output looks like:

3.9.1 (tags/v3.9.1:1e5d33e, Dec  7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)]
m=100,n=1000
numnodes=90102,numarcs=40230551
shortest path length: 14371.455440918733
time: 0.9 seconds


1. I can reduce the time a bit by adding a special variant of ShortestPath which takes advantage of the fact that we have a DAG. In this case, the user would need to tell the solver that the graph is acyclic. It will throw an error if it is not.

Before --
NOTE: Processing the shortest paths problem used 0.31 (cpu: 0.31) seconds.

After --
NOTE: Processing the shortest paths problem used 0.24 (cpu: 0.24) seconds.

1. That performance is mind-boggling. Not sure how to calculate the number of possible paths. Should be a little bit less than 1000^100.

2. Well, a lot less.

3. Are you asking how to enumerate all possible paths?

Or, are you asking how the shortest path algorithm is so fast - which obviously does not do full enumeration?

The standard approach uses Dijkstra's algorithm which is O(m + n log n). The specialized algorithm for a DAG uses topological sort and then a labeling pass similar to Dijkstra and is O(m + n).

4. Don't worry. I was just thinking out loud.