Tuesday, September 28, 2021

A covering problem: runners vs Ragnar relay legs

In  the following problem was posted.

1. We have a number of legs with their lengths (in miles). E.g.

legs = [3.3, 4.2, 5.2, 3, 2.7, 4,
5.3, 4.5, 3, 5.8, 3.3, 4.9,
3.1, 3.2, 4, 3.5, 4.9, 2.3,
3.2, 4.6, 4.5, 4, 5.3, 5.9,
2.8, 1.9, 2.1, 3, 2.5, 5.6,
1.3, 4.6, 1.5, 1.2, 4.1, 8.1]

2. We also have a number of runs performed by runners. Again, we have distances in miles:

runs = [3.2, 12.3, 5.2, 2.9, 2.9, 5.5]

A run can be allocated to different legs, as long a the sum of these legs is less than the length of the run.
3. We want to maximize the number of legs covered by the runs, but with a wrinkle: the legs should be consecutive: 1,2,3,... (we can't skip a leg).
4. A solution for this data set can look like:

----     60 PARAMETER report  results

run1        run2        run3        run4        run5        run6

leg1                       3.3
leg2                       4.2
leg3                                   5.2
leg4           3.0
leg5                                               2.7
leg6                       4.0
leg7                                                                       5.3
total          3.0        11.5         5.2         2.7                     5.3
runLen         3.2        12.3         5.2         2.9         2.9         5.5

It is noted that solutions are not necessarily unique. In this case, we could have swapped run 4 for run 5 to cover leg 5.

I am not sure what a Ragnar relay  is, but I suspect this has to do with how these legs are assigned to different runners.

The text in the original question  was not immediately clear to me. English prose is a poor vehicle to succinctly describe a problem. Mathematics is much better for this. It is (or should be) compact, precise and unambiguous. My point is that a mathematical model, besides the input for a solver, is also an excellent communication tool.

Mathematical model

We define two sets of binary variables: \begin{align}&\color{darkred}{\mathit{assign}}_{i,j} = \begin{cases}1 & \text{if leg i is assigned to run j} \\ 0 & \text{otherwise} \end{cases} \\ & \color{darkred}{\mathit{covered}}_i = \begin{cases}1 & \text{if leg i is covered by a run} \\ 0& \text{otherwise}\end{cases}\end{align} With this we can write:

MIP Model
\begin{align}\max&\sum_i \color{darkred}{\mathit{covered}}_i \\ & \sum_i \color{darkblue}{\mathit{legs}}_i \cdot \color{darkred}{\mathit{assign}}_{i,j} \le \color{darkblue}{\mathit{runs}}_j && \forall j \\ & \color{darkred}{\mathit{covered}}_i \le \sum_j \color{darkred}{\mathit{assign}}_{i,j} && \forall i \\ & \color{darkred}{\mathit{covered}}_{i-1} \ge \color{darkred}{\mathit{covered}}_i && \forall i \gt 1 \\ & \color{darkred}{\mathit{covered}}_i \in \{0,1\} \\& \color{darkred}{\mathit{assign}}_{i,j} \in \{0,1\} \end{align}

The GAMS model is reproduced in the appendix below. It follows this model quite literally.

It may be tempting just to print the optimal values for the variable $$\color{darkred}{\mathit{assign}}_{i,j}$$ as solution report. However, there may be some spurious assignments that are not part of the covered legs. Here is an example:

----     56 VARIABLE assign.L  assign leg to runner

run1        run2        run3        run4        run5        run6

leg1                        1
leg2                        1
leg3                                    1
leg4            1
leg5                                                1
leg6                        1
leg7                                                                        1
leg18                                                           1

----     56 VARIABLE covered.L  leg is covered

leg1 1,    leg2 1,    leg3 1,    leg4 1,    leg5 1,    leg6 1,    leg7 1

----     56 VARIABLE numLegs.L             =            7  number of covered legs

The assignment to leg 18 should not be part of the optimal solution presented to the user. For that reason the parameter report was calculated as: $\color{darkblue}{\mathit{report}}_{i,j} := \color{darkred}{\mathit{}assign}^*_{i,j}\cdot\color{darkred}{\mathit{covered}}^*_i \cdot\color{darkblue}{\mathit{legs}}_i$

CVXPY

This looks like a good candidate to formulate in CVXPY: we deal with vectors and (dense) matrices and there are no difficult exceptions in the model.

Model in Matrix Notation
\begin{align}\max\>&\color{darkblue}e^T\cdot\color{darkred}{\mathit{covered}}\\ & \color{darkred}{\mathit{assign}}^T\cdot\color{darkblue}{\mathit{legs}} \le \color{darkblue}{\mathit{runs}} \\ & \color{darkred}{\mathit{covered}} \le \color{darkred}{\mathit{assign}} \cdot\color{darkblue}e \\& \color{darkred}{\mathit{covered}}[1:n-1]\ge\color{darkred}{\mathit{covered}}[2:n]\\& \color{darkred}{\mathit{assign}},\color{darkred}{\mathit{covered}} \in \{0,1\} \end{align}

Here $$\color{darkblue}e$$ is a (column) vector of ones of appropriate size, The notation $$a[1:n]$$  indicates taking the first $$n$$ numbers from vector $$a$$. I personally find the indexed model more intuitive and easier to read and write. Fortunately, CVXPY has a few convenience functions to calculate sums, so we can write:

assign = cp.Variable((n,m),boolean=True)
covered = cp.Variable(n,boolean=True)

prob = cp.Problem(cp.Maximize(cp.sum(covered)),
[
assign.T@legs <= runs,
covered <= cp.sum(assign,axis=1),
covered[1:n]<=covered[0:(n-1)]
])

prob.solve()

The axis=1 argument on the sum indicates we take the sum "along the rows". (I am never sure what that means. Similar with row- and column sums: what exactly are we summing over). With an indexed version of the model. there is no need for a transpose, vectors with ones, or summing along an axis. Of course, Python has zero-based lists, while in mathematics we usually work with 1-based vectors and matrices.

Sidenote: I am not the only one confused by row and column sums. This is from the R documentation:

rowsum: Give Column Sums of a Matrix or Data Frame, Based on a Grouping Variable

The complete CVXPY model is reproduced in appendix B. The results look like:

Elapsed 0.033 sec.
Results:
0    1    2    3    4    5
0  0.0  3.3  0.0  0.0  0.0  0.0
1  0.0  4.2  0.0  0.0  0.0  0.0
2  0.0  0.0  5.2  0.0  0.0  0.0
3  3.0  0.0  0.0  0.0  0.0  0.0
4  0.0  0.0  0.0  0.0  2.7  0.0
5  0.0  4.0  0.0  0.0  0.0  0.0
6  0.0  0.0  0.0  0.0  0.0  5.3

The same model with a slightly different set of runs:

runs = [20.0, 5.4, 3.3, 26.4, 2.4, 8.6, 14.6, 20]

is much more difficult for GLPK:

Elapsed 3.2e+03 sec.
Results:
0    1    2    3    4    5    6    7
0   0.0  0.0  3.3  0.0  0.0  0.0  0.0  0.0
1   0.0  0.0  0.0  0.0  0.0  0.0  0.0  4.2
2   0.0  0.0  0.0  0.0  0.0  0.0  5.2  0.0
3   0.0  0.0  0.0  3.0  0.0  0.0  0.0  0.0
4   2.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0
5   4.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
6   0.0  5.3  0.0  0.0  0.0  0.0  0.0  0.0
7   4.5  0.0  0.0  0.0  0.0  0.0  0.0  0.0
8   3.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
9   5.8  0.0  0.0  0.0  0.0  0.0  0.0  0.0
10  0.0  0.0  0.0  0.0  0.0  3.3  0.0  0.0
11  0.0  0.0  0.0  0.0  0.0  0.0  0.0  4.9
12  0.0  0.0  0.0  0.0  0.0  0.0  0.0  3.1
13  0.0  0.0  0.0  3.2  0.0  0.0  0.0  0.0
14  0.0  0.0  0.0  4.0  0.0  0.0  0.0  0.0
15  0.0  0.0  0.0  0.0  0.0  0.0  3.5  0.0
16  0.0  0.0  0.0  4.9  0.0  0.0  0.0  0.0
17  0.0  0.0  0.0  0.0  2.3  0.0  0.0  0.0
18  0.0  0.0  0.0  0.0  0.0  0.0  0.0  3.2
19  0.0  0.0  0.0  0.0  0.0  0.0  0.0  4.6
20  0.0  0.0  0.0  4.5  0.0  0.0  0.0  0.0
21  0.0  0.0  0.0  4.0  0.0  0.0  0.0  0.0
22  0.0  0.0  0.0  0.0  0.0  5.3  0.0  0.0
23  0.0  0.0  0.0  0.0  0.0  0.0  5.9  0.0
24  0.0  0.0  0.0  2.8  0.0  0.0  0.0  0.0

Of course, Cplex has no problem: less than a second. Again, the solution is not unique:

----     60 PARAMETER report  results

run1        run2        run3        run4        run5        run6        run7        run8

leg1                                                                       3.3
leg2           4.2
leg3                                                                                   5.2
leg4                                                                                               3.0
leg5           2.7
leg6                                               4.0
leg7                                                                       5.3
leg8                                                                                   4.5
leg9                                               3.0
leg10                                                                                              5.8
leg11                                  3.3
leg12                                                                                  4.9
leg13                                                                                              3.1
leg14          3.2
leg15          4.0
leg16                                              3.5
leg17                                                                                              4.9
leg18                                                          2.3
leg19                                                                                              3.2
leg20                                              4.6
leg21                                              4.5
leg22                                              4.0
leg23                      5.3
leg24          5.9
leg25                                              2.8
total         20.0         5.3         3.3        26.4         2.3         8.6        14.6        20.0
runLen        20.0         5.4         3.3        26.4         2.4         8.6        14.6        20.0

Can we help glpk a bit? Yes, we can replace the constraint $\color{darkred}{\mathit{covered}}_i \le \sum_j \color{darkred}{\mathit{assign}}_{i,j}$ by $\color{darkred}{\mathit{covered}}_i = \sum_j \color{darkred}{\mathit{assign}}_{i,j}$ This will reduce the glpk solution time from 3200 seconds to 340 seconds. The new constraint reduces the size of the feasible region in two ways. First, we allow only one run $$j$$ to cover a leg $$i$$. Allow more than one run does not help us finding the maximum coverage of legs. Also, if there is an $$\color{darkred}{\mathit{assign}}_{i,j}=1$$ then immediately we force the corresponding $$\color{darkred}{\mathit{covered}}_i=1$$ instead of relying on the objective. This reformulation is slightly diverging from the description in , but still calculates the largest set of consecutive legs that we can cover.

Another idea is to do a bit of manual presolve: don't generate the variable  $$\color{darkred}{\mathit{assign}}_{i,j}$$ when $$\color{darkblue}{\mathit{legs}}_i \gt \color{darkblue}{\mathit{runs}}_j$$. In GAMS this is easy to do, but I am not sure how to approach this in CVXPY. Without trying, I am not sure if this has any effect on the performance of GLPK. It depends a bit on the quality of the presolver in GLPK.

Conclusions

• CVXPY models can look very clean and compact. But there are some problems:
• We can't see what CVXPY generates. How to debug CVXPY models?
• Jupyter notebooks may not show solvers logs. This is very unfortunate, as they produce valuable feedback on the solver's behavior.
• Some important modeling constructs are difficult or impossible to express in CVXPY.
• Matrix notation is not always easy to read and write and is limited to 0, 1, and 2-dimensional constructs.
• All indexing is positional.
• GAMS  uses indexed models. This provides great flexibility and readability.
• Indexing by strings and referential integrity (domain checking) is a boon for large models.
• Printing of the results in GAMS is cleaner and the output is easier to read.

References

1. Algorithm on how to put as many numbers of a list into different capacity of buckets,  https://stackoverflow.com/questions/69273836/algorithm-on-how-to-put-as-many-numbers-of-a-list-into-different-capacity-of-buc
2. Ragnar Relay Series, https://en.wikipedia.org/wiki/Ragnar_Relay_Series

Appendix A: GAMS code

 $ontext Assigning legs to runs in order to maximize total number of covered legs in a Ragnar relay. Extra condition: the covered legs should be 1,2,3... without holes.$offtext *--------------------------------------------------------------------- * data *--------------------------------------------------------------------- set   i 'legs' /leg1*leg36/   j 'runs' /run1*run6/ ; parameter    legs(i) 'lengths in miles' /        leg1 3.3,  leg2  4.2, leg3  5.2, leg4  3,   leg5  2.7, leg6  4,        leg7 5.3,  leg8  4.5, leg9  3,   leg10 5.8, leg11 3.3, leg12 4.9,        leg13 3.1, leg14 3.2, leg15 4,   leg16 3.5, leg17 4.9, leg18 2.3,        leg19 3.2, leg20 4.6, leg21 4.5, leg22 4,   leg23 5.3, leg24 5.9,        leg25 2.8, leg26 1.9, leg27 2.1, leg28 3,   leg29 2.5, leg30 5.6,        leg31 1.3, leg32 4.6, leg33 1.5, leg34 1.2, leg35 4.1, leg36 8.1        /     runs(j) 'lengths in miles'  /        run1  3.2, run2 12.3, run3 5.2, run4  2.9, run5 2.9, run6 5.5        / ; *--------------------------------------------------------------------- * model *--------------------------------------------------------------------- binary variable    assign(i,j) 'assign leg to runner'    covered(i)  'leg is covered' ; variable numLegs 'number of covered legs'; equations     objective   'max number of covered legs'     eassign     'assign legs to runs'     isCovered   'leg is covered by a run'     order       'cannot skip a leg' ; objective..     numLegs =e= sum(i, covered(i)); eassign(j)..    sum(i, legs(i)*assign(i,j)) =l= runs(j); isCovered(i)..  covered(i) =l= sum(j,assign(i,j)); order(i-1)..    covered(i-1) =g= covered(i); model m /all/; option optcr=0; solve m maximizing numLegs using mip; *--------------------------------------------------------------------- * reporting *--------------------------------------------------------------------- option assign:0,covered:0, numLegs:0; display assign.l, covered.l, numLegs.l; parameter report(*,*)  'results'; report(i,j) = assign.L(i,j)*covered.l(i)*legs(i); report('total',j) = sum(i,report(i,j)); report('runLen',j) = runs(j); option report:1; display report;

Notes:
1. The notation order(i-1) in the equation definition means we generate the equation for all but the first element of set $$i$$.
2. The parameter report is declared as report(*,*). This will disable domain checking so we can add totals to this parameter.

Appendix B: CVXPY model

import cvxpy as cp
import pandas as pd
import time

legs = [3.3, 4.2, 5.2, 3, 2.7, 4,
5.3, 4.5, 3, 5.8, 3.3, 4.9,
3.1, 3.2, 4, 3.5, 4.9, 2.3,
3.2, 4.6, 4.5, 4, 5.3, 5.9,
2.8, 1.9, 2.1, 3, 2.5, 5.6,
1.3, 4.6, 1.5, 1.2, 4.1, 8.1]

runs = [3.2, 12.3, 5.2, 2.9, 2.9, 5.5]
#runs = [20.0, 5.4, 3.3, 26.4, 2.4, 8.6, 14.6, 20]

n = len(legs)
m = len(runs)

#
# model
#
assign = cp.Variable((n,m),boolean=True)
covered = cp.Variable(n,boolean=True)
prob = cp.Problem(cp.Maximize(cp.sum(covered)),
[
assign.T@legs <= runs,
covered <= cp.sum(assign,axis=1),
covered[1:n]<=covered[0:(n-1)]
])

#
# solve
#
t0 = time.time()
prob.solve(solver=cp.GLPK_MI,verbose=True)
#prob.solve(solver=cp.CPLEX,verbose=True)
print()
# print(f"Solve time: {prob.solver_stats.solve_time}")  # does not seem to work
print(f"Elapsed {time.time()-t0:{0}.{2}} sec.")

#
# reporting
#
L = round(prob.value)
result = assign.value[0:L,]
for i in range(L):
for j in range(m):
result[i,j] *= covered.value[i]*legs[i]
print("Results:")
print(pd.DataFrame(result))

Notes:
1. When running under Jupyter (browser-based), we don't see the solver log for glpk (but cplex shows the log).
2. The solve_time attribute does not seem to work.
3. Cplex can be used after running: pip install cplex. This model (with both data sets) fits in the community edition of Cplex.