Saturday, February 24, 2018

On the scheduling of reading book chapters

In [1] an interesting problem is posted:

I want to read a book in a given number of days. I want to read an integer number of chapters each day (there are more chapters than days), no stopping part way through a chapter, and at least 1 chapter each day. The chapters are very non-uniform in length (some very short, a few very long, many in between) so I would like to come up with a reading schedule that minimizes the variance of the length of the days' readings (read multiple short chapters on the same day; long chapters are the only ones read that day). I also want to read through the book in order (no skipping ahead to combine short chapters that are not naturally next to each other).

The data set is the Book of Mormon [2]:

Chapter | Words | Letters | Verses | Chapter | Words | Letters | Verses
1 Nephi 1 | 908 | 3746 | 20 | Alma 26 | 1665 | 6840 | 37
1 Nephi 2 | 879 | 3640 | 24 | Alma 27 | 1260 | 5294 | 30
1 Nephi 3 | 1067 | 4295 | 31 | Alma 28 | 575 | 2471 | 14
1 Nephi 4 | 1262 | 4968 | 38 | Alma 29 | 708 | 2748 | 17
1 Nephi 5 | 761 | 3202 | 22 | Alma 30 | 2666 | 10762 | 60
1 Nephi 6 | 202 | 777 | 6 | Alma 31 | 1478 | 6269 | 38
1 Nephi 7 | 992 | 3941 | 22 | Alma 32 | 1837 | 7479 | 43
1 Nephi 8 | 1221 | 4769 | 38 | Alma 33 | 855 | 3416 | 23
1 Nephi 9 | 259 | 1075 | 6 | Alma 34 | 1576 | 6477 | 41
1 Nephi 10 | 924 | 3829 | 22 | Alma 35 | 699 | 3043 | 16
1 Nephi 11 | 1315 | 5159 | 36 | Alma 36 | 1229 | 4719 | 30
1 Nephi 12 | 860 | 3527 | 23 | Alma 37 | 2027 | 8764 | 47
1 Nephi 13 | 1899 | 7874 | 42 | Alma 38 | 651 | 2494 | 15
1 Nephi 14 | 1284 | 5236 | 30 | Alma 39 | 793 | 3168 | 19
1 Nephi 15 | 1488 | 6149 | 36 | Alma 40 | 1153 | 4790 | 26
1 Nephi 16 | 1618 | 6555 | 39 | Alma 41 | 701 | 2983 | 15
1 Nephi 17 | 2523 | 10180 | 55 | Alma 42 | 1229 | 5129 | 31
1 Nephi 18 | 1217 | 4960 | 25 | Alma 43 | 2221 | 9795 | 54
1 Nephi 19 | 1292 | 5399 | 24 | Alma 44 | 1243 | 5043 | 24
1 Nephi 20 | 698 | 2767 | 22 | Alma 45 | 911 | 3913 | 24
1 Nephi 21 | 945 | 3758 | 26 | Alma 46 | 1809 | 7635 | 41
1 Nephi 22 | 1506 | 6275 | 31 | Alma 47 | 1572 | 6693 | 36
2 Nephi 1 | 1543 | 6391 | 32 | Alma 48 | 1073 | 4669 | 25
2 Nephi 2 | 1460 | 6151 | 30 | Alma 49 | 1345 | 5968 | 30
2 Nephi 3 | 1170 | 4685 | 25 | Alma 50 | 1734 | 7508 | 40
2 Nephi 4 | 1300 | 5267 | 35 | Alma 51 | 1682 | 7271 | 37
2 Nephi 5 | 1169 | 4767 | 34 | Alma 52 | 1757 | 7543 | 40
2 Nephi 6 | 895 | 3780 | 18 | Alma 53 | 1085 | 4730 | 23
2 Nephi 7 | 405 | 1574 | 11 | Alma 54 | 988 | 4095 | 24
2 Nephi 8 | 812 | 3261 | 25 | Alma 55 | 1218 | 5152 | 35
2 Nephi 9 | 2388 | 9985 | 54 | Alma 56 | 2177 | 9129 | 57
2 Nephi 10 | 966 | 4106 | 25 | Alma 57 | 1517 | 6395 | 36
2 Nephi 11 | 338 | 1332 | 8 | Alma 58 | 1748 | 7357 | 41
2 Nephi 12 | 647 | 2557 | 22 | Alma 59 | 489 | 2166 | 13
2 Nephi 13 | 587 | 2454 | 26 | Alma 60 | 1777 | 7512 | 36
2 Nephi 14 | 203 | 810 | 6 | Alma 61 | 854 | 3544 | 21
2 Nephi 15 | 857 | 3540 | 30 | Alma 62 | 2220 | 9554 | 52
2 Nephi 16 | 370 | 1406 | 13 | Alma 63 | 593 | 2505 | 17
2 Nephi 17 | 687 | 2668 | 25 | Helaman 1 | 1418 | 6171 | 34
2 Nephi 18 | 570 | 2304 | 22 | Helaman 2 | 599 | 2568 | 14
2 Nephi 19 | 587 | 2474 | 21 | Helaman 3 | 1527 | 6542 | 37
2 Nephi 20 | 928 | 3715 | 34 | Helaman 4 | 1076 | 4781 | 26
2 Nephi 21 | 520 | 2090 | 16 | Helaman 5 | 2084 | 8655 | 52
2 Nephi 22 | 134 | 533 | 6 | Helaman 6 | 1797 | 7721 | 41
2 Nephi 23 | 587 | 2442 | 22 | Helaman 7 | 1142 | 4751 | 29
2 Nephi 24 | 891 | 3587 | 32 | Helaman 8 | 1210 | 5134 | 28
2 Nephi 25 | 1699 | 7294 | 30 | Helaman 9 | 1480 | 6032 | 41
2 Nephi 26 | 1483 | 6363 | 33 | Helaman 10 | 720 | 3013 | 19
2 Nephi 27 | 1461 | 5809 | 35 | Helaman 11 | 1561 | 6536 | 38
2 Nephi 28 | 1240 | 4997 | 32 | Helaman 12 | 873 | 3504 | 26
2 Nephi 29 | 804 | 3197 | 14 | Helaman 13 | 1855 | 7527 | 39
2 Nephi 30 | 708 | 2941 | 18 | Helaman 14 | 1241 | 5106 | 31
2 Nephi 31 | 988 | 4065 | 21 | Helaman 15 | 824 | 3607 | 17
2 Nephi 32 | 426 | 1743 | 9 | Helaman 16 | 1025 | 4247 | 25
2 Nephi 33 | 647 | 2539 | 15 | 3 Nephi 1 | 1340 | 5524 | 30
Jacob 1 | 719 | 3149 | 19 | 3 Nephi 2 | 769 | 3404 | 19
Jacob 2 | 1365 | 5815 | 35 | 3 Nephi 3 | 1353 | 5947 | 26
Jacob 3 | 619 | 2765 | 14 | 3 Nephi 4 | 1518 | 6506 | 33
Jacob 4 | 929 | 4027 | 18 | 3 Nephi 5 | 931 | 3891 | 26
Jacob 5 | 3758 | 15366 | 77 | 3 Nephi 6 | 1200 | 5256 | 30
Jacob 6 | 511 | 2091 | 13 | 3 Nephi 7 | 1132 | 4944 | 26
Jacob 7 | 1242 | 4988 | 27 | 3 Nephi 8 | 871 | 3700 | 25
Enos 1 | 1160 | 4730 | 27 | 3 Nephi 9 | 965 | 4006 | 22
Jarom 1 | 734 | 3149 | 15 | 3 Nephi 10 | 807 | 3392 | 19
Omni 1 | 1398 | 5906 | 30 | 3 Nephi 11 | 1434 | 5732 | 41
Words of Mormon 1 | 857 | 3704 | 18 | 3 Nephi 12 | 1294 | 5185 | 48
Mosiah 1 | 966 | 4188 | 18 | 3 Nephi 13 | 834 | 3402 | 34
Mosiah 2 | 2112 | 8675 | 41 | 3 Nephi 14 | 628 | 2451 | 27
Mosiah 3 | 1117 | 4771 | 27 | 3 Nephi 15 | 731 | 2962 | 24
Mosiah 4 | 1605 | 6586 | 30 | 3 Nephi 16 | 913 | 3679 | 20
Mosiah 5 | 740 | 2965 | 15 | 3 Nephi 17 | 879 | 3550 | 25
Mosiah 6 | 309 | 1313 | 7 | 3 Nephi 18 | 1344 | 5429 | 39
Mosiah 7 | 1555 | 6326 | 33 | 3 Nephi 19 | 1233 | 5083 | 36
Mosiah 8 | 938 | 3948 | 21 | 3 Nephi 20 | 1538 | 6354 | 46
Mosiah 9 | 864 | 3489 | 19 | 3 Nephi 21 | 1155 | 4659 | 29
Mosiah 10 | 957 | 3942 | 22 | 3 Nephi 22 | 507 | 2155 | 17
Mosiah 11 | 1271 | 5205 | 29 | 3 Nephi 23 | 415 | 1754 | 14
Mosiah 12 | 1291 | 5309 | 37 | 3 Nephi 24 | 620 | 2496 | 18
Mosiah 13 | 1073 | 4300 | 35 | 3 Nephi 25 | 183 | 721 | 6
Mosiah 14 | 391 | 1574 | 12 | 3 Nephi 26 | 793 | 3326 | 21
Mosiah 15 | 1056 | 4559 | 31 | 3 Nephi 27 | 1269 | 4917 | 33
Mosiah 16 | 560 | 2393 | 15 | 3 Nephi 28 | 1457 | 5856 | 39
Mosiah 17 | 650 | 2655 | 20 | 3 Nephi 29 | 394 | 1589 | 9
Mosiah 18 | 1382 | 5755 | 35 | 3 Nephi 30 | 130 | 564 | 2
Mosiah 19 | 978 | 4070 | 29 | 4 Nephi 1 | 1949 | 8300 | 49
Mosiah 20 | 963 | 4027 | 26 | Mormon 1 | 706 | 2987 | 19
Mosiah 21 | 1328 | 5649 | 36 | Mormon 2 | 1262 | 5317 | 29
Mosiah 22 | 620 | 2651 | 16 | Mormon 3 | 939 | 3820 | 22
Mosiah 23 | 1186 | 4978 | 39 | Mormon 4 | 822 | 3595 | 23
Mosiah 24 | 954 | 3970 | 25 | Mormon 5 | 1067 | 4483 | 24
Mosiah 25 | 837 | 3610 | 24 | Mormon 6 | 914 | 3664 | 22
Mosiah 26 | 1200 | 5112 | 39 | Mormon 7 | 454 | 1805 | 10
Mosiah 27 | 1601 | 6733 | 37 | Mormon 8 | 1710 | 6931 | 41
Mosiah 28 | 812 | 3513 | 20 | Mormon 9 | 1510 | 6225 | 37
Mosiah 29 | 1879 | 7990 | 47 | Ether 1 | 901 | 3508 | 43
Alma 1 | 1496 | 6356 | 33 | Ether 2 | 1370 | 5481 | 25
Alma 2 | 1433 | 6179 | 38 | Ether 3 | 1253 | 5007 | 28
Alma 3 | 1016 | 4340 | 27 | Ether 4 | 892 | 3623 | 19
Alma 4 | 1035 | 4432 | 20 | Ether 5 | 226 | 891 | 6
Alma 5 | 2787 | 11260 | 62 | Ether 6 | 1048 | 4265 | 30
Alma 6 | 390 | 1657 | 8 | Ether 7 | 899 | 3746 | 27
Alma 7 | 1443 | 5913 | 27 | Ether 8 | 1231 | 5041 | 26
Alma 8 | 1231 | 5033 | 32 | Ether 9 | 1424 | 5823 | 35
Alma 9 | 1511 | 6274 | 34 | Ether 10 | 1415 | 5726 | 34
Alma 10 | 1442 | 5835 | 32 | Ether 11 | 762 | 3238 | 23
Alma 11 | 1466 | 5865 | 46 | Ether 12 | 1536 | 6443 | 41
Alma 12 | 1858 | 7802 | 37 | Ether 13 | 1219 | 5053 | 31
Alma 13 | 1347 | 5776 | 31 | Ether 14 | 1132 | 4664 | 31
Alma 14 | 1526 | 6295 | 29 | Ether 15 | 1314 | 5370 | 34
Alma 15 | 711 | 3034 | 19 | Moroni 1 | 147 | 614 | 4
Alma 16 | 1010 | 4383 | 21 | Moroni 2 | 119 | 459 | 3
Alma 17 | 1848 | 7842 | 39 | Moroni 3 | 120 | 491 | 4
Alma 18 | 1623 | 6659 | 43 | Moroni 4 | 153 | 629 | 3
Alma 19 | 1701 | 6905 | 36 | Moroni 5 | 91 | 351 | 2
Alma 20 | 1279 | 5104 | 30 | Moroni 6 | 335 | 1457 | 9
Alma 21 | 955 | 4162 | 23 | Moroni 7 | 1929 | 7812 | 48
Alma 22 | 1871 | 7733 | 35 | Moroni 8 | 1064 | 4677 | 30
Alma 23 | 764 | 3321 | 18 | Moroni 9 | 1012 | 4305 | 26
Alma 24 | 1509 | 6241 | 30 | Moroni 10 | 1149 | 4603 | 34
Alma 25 | 752 | 3212 | 17

Some additional data:

  • The number of days available is \(T=128\).
  • The number of chapters to read is \(n=239\). 
  • We can minimize the variance of any of the quantities verses, words or letters. Here I use verses.
  • The total number of verses is \(\sum_i v_i = 6603\). 
  • The average number of verses we want read per day is:\[\mu = \frac{\sum_i v_i}{T}= 51.6\]
  • I add the constraint we want to read at least one chapter per day.
  • We define the objective to minimize as: \[\min z=\sum_t (nv_t-\mu)^2\] where \(nv_t\) is the number of verses we read during day \(t\).

This is not a very small problem.


Method 1: Dynamic Programming


A dynamic programming approach is found in an answer provided in [1]. Let:

\(f_{i,t} = \) SSQ of optimal schedule when reading chapters \(1,\dots,i\) in \(t\) days.

where SSQ means sum of squares objective. This leads to the recursion:
\[\bbox[lightcyan,10px,border:3px solid darkblue]{\large{f_{i,t} = \min_{k=t-1,\dots,i-1}  \left\{ f_{k,t-1} + \left( \sum_{j=k+1}^i v_j - \mu\right)^2 \right\} }}\]


Now we just need to evaluate \(f_{n,T}\). 
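For illustration, here is a minimal Python sketch of this recursion (my own sketch, not the code from [1]; the verse counts in the example call are made up toy data, while the real instance uses the 239 chapter verse counts with \(T=128\)):

```python
def schedule(v, T):
    """Minimize sum over days t of (verses read on day t - mu)^2."""
    n = len(v)
    mu = sum(v) / T
    # prefix sums so that sum of v over chapters k+1..i equals P[i] - P[k]
    P = [0] * (n + 1)
    for i in range(1, n + 1):
        P[i] = P[i - 1] + v[i - 1]
    INF = float("inf")
    f = [[INF] * (T + 1) for _ in range(n + 1)]   # f[i][t] as in the recursion
    back = [[0] * (T + 1) for _ in range(n + 1)]  # argmin k, for backtracking
    f[0][0] = 0.0
    for t in range(1, T + 1):
        # leave at least one chapter for each remaining day
        for i in range(t, n - (T - t) + 1):
            for k in range(t - 1, i):
                cand = f[k][t - 1] + (P[i] - P[k] - mu) ** 2
                if cand < f[i][t]:
                    f[i][t], back[i][t] = cand, k
    # recover the last chapter read on each day
    cuts, i = [], n
    for t in range(T, 0, -1):
        cuts.append(i)
        i = back[i][t]
    return f[n][T], cuts[::-1]

# toy instance: 8 chapters over 3 days (made-up verse counts)
obj, cuts = schedule([20, 24, 31, 38, 22, 6, 22, 38], T=3)
# obj == 114.0 for this toy instance
```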

The R code in [1] looks like:


f<-function(X=bom3$Verses,days=128){ 
# find optimal BOM reading schedule for Greg Snow 
# minimize variance of quantity to read per day over 128 days 
# 
N = length(X) 
SSQ<- matrix(NA,nrow=days,ncol=N) 
Cuts <- list() 
# 
#  SSQ[i,j]: the ssqs about the overall mean for the optimal partition 
#           for i days on the chapters 1 to j 
# 
M <- sum(X)/days 
SSQ[1,] <- (cumsum(X)-M)^2 
Cuts[[1]] <- as.list(1:N) 

for (m in 2:days){ 
   Cuts[[m]]=list() 
   for (n in m:N){ 
       CS = cumsum(X[n:1])[n:1] 
       SSQ1 = (CS-M)^2 
       j = (m-1):(n-1) 
       TS = SSQ[m-1,j]+(SSQ1[j+1]) 
       SSQ[m,n] = min(TS) 
       k = min(which((min(TS)== TS)))+m-1 
       Cuts[[m]][[n]] = c(Cuts[[m-1]][[k-1]],n) 
   } 
} 
list(SSQ=SSQ[days,N],Cuts=Cuts[[days]][[N]]) 
} 

This runs very fast:


> library(tictoc)
> tic()
> f()
$SSQ
[1] 11241.05

$Cuts
  [1]   2   4   7   9  11  13  15  16  17  19  21  23  25  27  30  31  34  37  39  41  44  46  48  50
 [25]  53  56  59  60  62  64  66  68  70  73  75  77  78  80  82  84  86  88  89  91  92  94  95  96
 [49]  97  99 100 103 105 106 108 110 112 113 115 117 119 121 124 125 126 127 129 131 132 135 137 138
 [73] 140 141 142 144 145 146 148 150 151 152 154 156 157 160 162 163 164 166 167 169 171 173 175 177
 [97] 179 181 183 185 186 188 190 192 193 194 196 199 201 204 205 207 209 211 213 214 215 217 220 222
[121] 223 225 226 228 234 236 238 239

> toc()
1.08 sec elapsed
> 

To see how we can interpret the solution, let's reproduce the objective function value:


> Days <- 128
> mu <- sum(bom3$Verses)/Days 
> 
> results <- f()
> start <- c(1, head(results$Cuts,-1)+1)
> end <- results$Cuts
> 
> verses <- rep(0,Days)
> for (i in 1:Days) {
+    verses[i] <- sum(bom3$Verses[start[i]:end[i]])
+ }
> 
> sum( (verses-mu)^2 )
[1] 11241.05


Method 2: A nonlinear Assignment Problem formulation


If we define binary variables:\[x_{i,t} = \begin{cases} 1 & \text{if chapter $i$ is read during day $t$}\\ 0 &\text{otherwise}\end{cases}\] we basically have an assignment problem. On top of this we add some ordering constraints and a nonlinear objective that measures the variance in the number of verses read.

The model can look like:

\[\bbox[lightcyan,10px,border:3px solid darkblue]{\begin{align} \min&\sum_t \left(nv_t - \mu\right)^2\\ &\sum_t x_{i,t} = 1 && \forall i\\ &\sum_i x_{i,t} \ge 1 && \forall t \\& x_{i,t} = 1 \Rightarrow x_{i+1,t} + x_{i+1,t+1}=1\\ &nv_t = \sum_i x_{i,t} v_i\\ & x_{1,1} = x_{n,T} = 1\\ & x_{i,t} \in \{0,1\}  \end{align} }\]

Besides assigning each chapter to a day, we need to establish the ordering:

A proper schedule

First we note that the first chapter is read on the first day and the last chapter is read on the last day. This is modeled by fixing \(x_{1,1} = x_{n,T} = 1\). Secondly we have the rule that if \(x_{i,t}=1\) (i.e. chapter \(i\) is read during day \(t\)), then the next chapter \(i+1\) is read during the same day or the next day. This constraint ensures we build a proper schedule. Formally: \(x_{i,t} = 1 \Rightarrow x_{i+1,t} + x_{i+1,t+1}=1\). The implication can be linearized as: \[ x_{i,t} \le x_{i+1,t} + x_{i+1,t+1} \le 2-x_{i,t}\]
Looking at the allowed patterns, we can actually strengthen this to:
\[ x_{i,t} \le x_{i+1,t} + x_{i+1,t+1} \le 1\]
(this does not seem to help).
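A quick brute-force check of this linearization (my illustration, not part of the original model): over all 0/1 values, the strengthened form admits exactly the patterns allowed by the implication combined with the assignment constraint on chapter \(i+1\), which already forces \(x_{i+1,t}+x_{i+1,t+1}\le 1\):

```python
from itertools import product

# x = x[i,t]; y0 = x[i+1,t]; y1 = x[i+1,t+1]
for x, y0, y1 in product([0, 1], repeat=3):
    linear = (x <= y0 + y1 <= 1)                 # strengthened constraint
    implication = (x == 0) or (y0 + y1 == 1)     # the logical rule
    # the linear form equals the implication plus "chapter i+1 on at most
    # one of the two days"
    assert linear == (implication and y0 + y1 <= 1)
```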

This MIQP model is very difficult to solve even with the best available solvers. One of the reasons is that the model is very large: it contains 30,592 binary variables. I am sorry to say this formulation is not very useful. Too bad: I think this formulation has a few interesting wrinkles.

Method 3: A Set Covering Approach


We can look at the problem in a different way. Let
\[x_{i,j} = \begin{cases} 1 & \text{if we read chapters $i$ through $j$}\\ 0 & \text{otherwise} \end{cases}\]

Of course we have \(i \le j \le n\). A set covering type of model can look like:

\[\bbox[lightcyan,10px,border:3px solid darkblue]{\begin{align}\min & \sum_{j\ge i} c_{i,j} x_{i,j}\\ &   \sum_{i \le k \le j} x_{i,j} = 1 && \forall k\\ & \sum_{j\ge i} x_{i,j} = T\\ & x_{i,j} \in \{0,1\}\end{align}}\]

Basically we do the following:

Covering

Select \(T\) variables \(x_{i,j}\) such that each chapter is covered exactly once. In the picture above: select \(T\) rows such that each column is covered by exactly one blue cell. On the right we see a feasible solution for \(T=3\) days. The constraint \(\sum_{i \le k \le j} x_{i,j} = 1, \forall k\) counts the number of ones in a column: it should be exactly one. The constraint \(\sum_{j\ge i} x_{i,j} = T\) selects \(T\) variables.

We can slightly optimize things by noting that we can never read more than \(n-T+1\) chapters in a single day: we would finish the book too quickly, since we want to read at least one chapter on each of the remaining days. So, in advance, we can remove from the problem all variables that span more than \(n-T+1\) chapters.

Before and after preprocessing

In the picture above we have \(n=4\) and \(T=3\). We can remove all \(x_{i,j}\)'s with length three and longer.
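This reduction is easy to quantify with a small Python check (my own snippet; the counts match the model sizes reported below):

```python
# count the set covering variables x[i,j] (1 <= i <= j <= n) before and
# after dropping all intervals that span more than n-T+1 chapters
n, T = 239, 128
total = sum(1 for i in range(1, n + 1) for j in range(i, n + 1))
max_len = n - T + 1
reduced = sum(1 for i in range(1, n + 1) for j in range(i, n + 1)
              if j - i + 1 <= max_len)
print(total, reduced)  # 28680 20552
```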

This is now a linear model with binary variables! The variables deal with a whole day, so the quadratic cost is now linear in the decision variables. It turns out not to be a very difficult MIP:


Iteration log . . .
Iteration:     1   Dual objective     =          2476.602417
Root relaxation solution time = 0.36 sec. (126.25 ticks)

        Nodes                                         Cuts/
   Node  Left     Objective  IInf  Best Integer    Best Bound    ItCnt     Gap

*     0+    0                       134595.0547        0.0000           100.00%
Found incumbent of value 134595.054688 after 9.50 sec. (3892.85 ticks)
*     0     0      integral     0    11241.0547    11241.0547      180    0.00%
Elapsed time = 9.52 sec. (3899.95 ticks, tree = 0.00 MB, solutions = 2)
Found incumbent of value 11241.054688 after 9.52 sec. (3899.95 ticks)

Root node processing (before b&c):
  Real time             =    9.53 sec. (3900.78 ticks)
Sequential b&c:
  Real time             =    0.00 sec. (0.00 ticks)
                          ------------
Total (root+branch&cut) =    9.53 sec. (3900.78 ticks)
MIP status(101): integer optimal solution
Cplex Time: 9.53sec (det. 3900.84 ticks)

Not bad, but slower than the Dynamic Programming method.

Note that our preprocessing step, removing all variables \(x_{i,j}\) that span more than \(n-T+1\) chapters, really helps:


Model | Constraints | Variables | Solution time
Original | 240 | 28,680 | 30 seconds
Optimized | 240 | 20,552 | 10 seconds

The presolver does not detect this, so we need to preprocess the problem ourselves.

This model also has a very large number of binary variables, but it performs vastly better than the previous model. One more indication that simply counting variables is not a good measure of difficulty.

Method 4: A Network Model


We can look at this problem as a network [3]:


  • The nodes are formed by the chapters plus a sink node. So we have \(n+1\) nodes.
  • The arcs \((i,j)\) are interpreted as follows: we have a link \((i,j)\) if we start reading at chapter \(i\) and stop at chapter \(j-1\). I.e. arcs only exist for \(j\gt i\).
  • In addition we have arcs \((i,\mathit{sink})\). These arcs correspond to reading all remaining chapters starting at \(i\).
  • Each arc \((i,j)\) has a cost \(c_{i,j}\) where \[c_{i,j} = \sum_{k=i}^{j-1} \left(v_k - \mu\right)^2\] The sink can be thought of being chapter \(n+1\).
An example with four chapters can look like:

Network Representation


Along the arcs we see which chapters are read. Note there are only forward arcs in this network. 

The idea is to solve this as a shortest path problem from "ch1" to "sink" under the condition that we visit \(T\) "blue" nodes. A standard shortest path is easily formulated as an LP [4]:

\[\begin{align} \min&\sum_{(i,j)\in A} c_{i,j} x_{i,j}\\ & \sum_{(j,i)\in A} x_{j,i} + b_i = \sum_{(i,j)\in A} x_{i,j} && \forall i\\ &x_{i,j} \ge 0 \end{align}\]

where \[b_i = \begin{cases} 1 &\text{if $i=\mathit{ch1}$}\\  -1 & \text{if $i=\mathit{sink}$}\\  0 & \text{otherwise} \end{cases}\]


To make sure we visit \(T\) nodes we split the node balance equation.


\[\bbox[lightcyan,10px,border:3px solid darkblue]{\begin{align} \min&\sum_{(i,j)\in A} c_{i,j} x_{i,j}\\ & \sum_{(j,i)\in A} x_{j,i} + b_i = y_i && \forall i\\ & y_i = \sum_{(i,j)\in A} x_{i,j} && \forall i\\ & \sum_i y_i = T \\ &x_{i,j}, y_i \ge 0 \end{align}}\]

Here \(y_i\) measures the total flow leaving node \(i\), and the set \(A\) is the set of arcs. Again this is a completely linear model. It also seems to give integer solutions automatically, so we don't need a MIP solver. (I am not exactly sure why; I expected we had destroyed enough of the network structure for this to no longer be the case. To be on the safe side we can of course solve it as a MIP.) The LP model solves very fast:


Tried aggregator 1 time.
LP Presolve eliminated 3 rows and 3 columns.
Aggregator did 2 substitutions.
Reduced LP has 477 rows, 28916 columns, and 58070 nonzeros.
Presolve time = 0.06 sec. (13.96 ticks)
Initializing dual steep norms . . .

Iteration log . . .
Iteration:     1   Dual objective     =          2509.398865
Iteration:   150   Dual objective     =          9379.320313
LP status(1): optimal
Cplex Time: 0.11sec (det. 35.60 ticks)

Optimal solution found.
Objective :       11241.054688

Finally, this version beats the Dynamic Programming algorithm.
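As an aside, the "shortest path visiting exactly \(T\) nodes" view also admits a direct dynamic program over (node, number of arcs used), which is essentially Method 1 in disguise. A small Python sketch on a toy instance (my own sketch; the four verse counts are made up):

```python
import math

v = [20, 24, 31, 38]   # hypothetical verse counts for n = 4 chapters
T = 3                  # days
n = len(v)
mu = sum(v) / T

def cost(i, j):
    # arc (i, j): read chapters i..j-1 in one day (node n is the sink)
    return (sum(v[i:j]) - mu) ** 2

INF = math.inf
# dist[t][j]: cheapest way to reach node j using exactly t arcs (= t days)
dist = [[INF] * (n + 1) for _ in range(T + 1)]
dist[0][0] = 0.0
for t in range(1, T + 1):
    for j in range(1, n + 1):
        dist[t][j] = min(dist[t - 1][i] + cost(i, j) for i in range(j))
print(dist[T][n])  # optimal sum of squares for the toy instance
```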

We can do slightly better by using the same optimization as in the set covering approach: ignore all \(x_{i,j}\) that correspond to reading more chapters than \(n-T+1\). This leads to a minor improvement in performance:


Tried aggregator 1 time.
LP Presolve eliminated 3 rows and 3 columns.
Aggregator did 2 substitutions.
Reduced LP has 477 rows, 20788 columns, and 41814 nonzeros.
Presolve time = 0.05 sec. (10.07 ticks)
Initializing dual steep norms . . .

Iteration log . . .
Iteration:     1   Dual objective     =          2580.195313
Iteration:   133   Dual objective     =          8804.367188
LP status(1): optimal
Cplex Time: 0.08sec (det. 24.21 ticks)

Optimal solution found.
Objective :       11241.054688


Conclusion


There is more than one way to skin a cat. These are four very different ways to optimize the reading schedule. If a certain model does not work very well, it makes sense to take a step back and see if you can develop something very different.


Resulting schedule



Day | Chapters | Verses | Cost
Day 1 | 1 Nephi 1, 1 Nephi 2 | 44 | 57.5464
Day 2 | 1 Nephi 3, 1 Nephi 4 | 69 | 303.2496
Day 3 | 1 Nephi 5, 1 Nephi 6, 1 Nephi 7 | 50 | 2.5152
Day 4 | 1 Nephi 8, 1 Nephi 9 | 44 | 57.5464
Day 5 | 1 Nephi 10, 1 Nephi 11 | 58 | 41.1402
Day 6 | 1 Nephi 12, 1 Nephi 13 | 65 | 179.9371
Day 7 | 1 Nephi 14, 1 Nephi 15 | 66 | 207.7652
Day 8 | 1 Nephi 16 | 39 | 158.4058
Day 9 | 1 Nephi 17 | 55 | 11.6558
Day 10 | 1 Nephi 18, 1 Nephi 19 | 49 | 6.6871
Day 11 | 1 Nephi 20, 1 Nephi 21 | 48 | 12.8589
Day 12 | 1 Nephi 22, 2 Nephi 1 | 63 | 130.2808
Day 13 | 2 Nephi 2, 2 Nephi 3 | 55 | 11.6558
Day 14 | 2 Nephi 4, 2 Nephi 5 | 69 | 303.2496
Day 15 | 2 Nephi 6, 2 Nephi 7, 2 Nephi 8 | 54 | 5.8277
Day 16 | 2 Nephi 9 | 54 | 5.8277
Day 17 | 2 Nephi 10, 2 Nephi 11, 2 Nephi 12 | 55 | 11.6558
Day 18 | 2 Nephi 13, 2 Nephi 14, 2 Nephi 15 | 62 | 108.4527
Day 19 | 2 Nephi 16, 2 Nephi 17 | 38 | 184.5777
Day 20 | 2 Nephi 18, 2 Nephi 19 | 43 | 73.7183
Day 21 | 2 Nephi 20, 2 Nephi 21, 2 Nephi 22 | 56 | 19.4839
Day 22 | 2 Nephi 23, 2 Nephi 24 | 54 | 5.8277
Day 23 | 2 Nephi 25, 2 Nephi 26 | 63 | 130.2808
Day 24 | 2 Nephi 27, 2 Nephi 28 | 67 | 237.5933
Day 25 | 2 Nephi 29, 2 Nephi 30, 2 Nephi 31 | 53 | 1.9996
Day 26 | 2 Nephi 32, 2 Nephi 33, Jacob 1 | 43 | 73.7183
Day 27 | Jacob 2, Jacob 3, Jacob 4 | 67 | 237.5933
Day 28 | Jacob 5 | 77 | 645.8746
Day 29 | Jacob 6, Jacob 7 | 40 | 134.2339
Day 30 | Enos 1, Jarom 1 | 42 | 91.8902
Day 31 | Omni 1, Words of Mormon 1 | 48 | 12.8589
Day 32 | Mosiah 1, Mosiah 2 | 59 | 54.9683
Day 33 | Mosiah 3, Mosiah 4 | 57 | 29.3121
Day 34 | Mosiah 5, Mosiah 6, Mosiah 7 | 55 | 11.6558
Day 35 | Mosiah 8, Mosiah 9 | 40 | 134.2339
Day 36 | Mosiah 10, Mosiah 11 | 51 | 0.3433
Day 37 | Mosiah 12 | 37 | 212.7496
Day 38 | Mosiah 13, Mosiah 14 | 47 | 21.0308
Day 39 | Mosiah 15, Mosiah 16 | 46 | 31.2027
Day 40 | Mosiah 17, Mosiah 18 | 55 | 11.6558
Day 41 | Mosiah 19, Mosiah 20 | 55 | 11.6558
Day 42 | Mosiah 21, Mosiah 22 | 52 | 0.1714
Day 43 | Mosiah 23 | 39 | 158.4058
Day 44 | Mosiah 24, Mosiah 25 | 49 | 6.6871
Day 45 | Mosiah 26 | 39 | 158.4058
Day 46 | Mosiah 27, Mosiah 28 | 57 | 29.3121
Day 47 | Mosiah 29 | 47 | 21.0308
Day 48 | Alma 1 | 33 | 345.4371
Day 49 | Alma 2 | 38 | 184.5777
Day 50 | Alma 3, Alma 4 | 47 | 21.0308
Day 51 | Alma 5 | 62 | 108.4527
Day 52 | Alma 6, Alma 7, Alma 8 | 67 | 237.5933
Day 53 | Alma 9, Alma 10 | 66 | 207.7652
Day 54 | Alma 11 | 46 | 31.2027
Day 55 | Alma 12, Alma 13 | 68 | 269.4214
Day 56 | Alma 14, Alma 15 | 48 | 12.8589
Day 57 | Alma 16, Alma 17 | 60 | 70.7964
Day 58 | Alma 18 | 43 | 73.7183
Day 59 | Alma 19, Alma 20 | 66 | 207.7652
Day 60 | Alma 21, Alma 22 | 58 | 41.1402
Day 61 | Alma 23, Alma 24 | 48 | 12.8589
Day 62 | Alma 25, Alma 26 | 54 | 5.8277
Day 63 | Alma 27, Alma 28, Alma 29 | 61 | 88.6246
Day 64 | Alma 30 | 60 | 70.7964
Day 65 | Alma 31 | 38 | 184.5777
Day 66 | Alma 32 | 43 | 73.7183
Day 67 | Alma 33, Alma 34 | 64 | 154.1089
Day 68 | Alma 35, Alma 36 | 46 | 31.2027
Day 69 | Alma 37 | 47 | 21.0308
Day 70 | Alma 38, Alma 39, Alma 40 | 60 | 70.7964
Day 71 | Alma 41, Alma 42 | 46 | 31.2027
Day 72 | Alma 43 | 54 | 5.8277
Day 73 | Alma 44, Alma 45 | 48 | 12.8589
Day 74 | Alma 46 | 41 | 112.0621
Day 75 | Alma 47 | 36 | 242.9214
Day 76 | Alma 48, Alma 49 | 55 | 11.6558
Day 77 | Alma 50 | 40 | 134.2339
Day 78 | Alma 51 | 37 | 212.7496
Day 79 | Alma 52, Alma 53 | 63 | 130.2808
Day 80 | Alma 54, Alma 55 | 59 | 54.9683
Day 81 | Alma 56 | 57 | 29.3121
Day 82 | Alma 57 | 36 | 242.9214
Day 83 | Alma 58, Alma 59 | 54 | 5.8277
Day 84 | Alma 60, Alma 61 | 57 | 29.3121
Day 85 | Alma 62 | 52 | 0.1714
Day 86 | Alma 63, Helaman 1, Helaman 2 | 65 | 179.9371
Day 87 | Helaman 3, Helaman 4 | 63 | 130.2808
Day 88 | Helaman 5 | 52 | 0.1714
Day 89 | Helaman 6 | 41 | 112.0621
Day 90 | Helaman 7, Helaman 8 | 57 | 29.3121
Day 91 | Helaman 9 | 41 | 112.0621
Day 92 | Helaman 10, Helaman 11 | 57 | 29.3121
Day 93 | Helaman 12, Helaman 13 | 65 | 179.9371
Day 94 | Helaman 14, Helaman 15 | 48 | 12.8589
Day 95 | Helaman 16, 3 Nephi 1 | 55 | 11.6558
Day 96 | 3 Nephi 2, 3 Nephi 3 | 45 | 43.3746
Day 97 | 3 Nephi 4, 3 Nephi 5 | 59 | 54.9683
Day 98 | 3 Nephi 6, 3 Nephi 7 | 56 | 19.4839
Day 99 | 3 Nephi 8, 3 Nephi 9 | 47 | 21.0308
Day 100 | 3 Nephi 10, 3 Nephi 11 | 60 | 70.7964
Day 101 | 3 Nephi 12 | 48 | 12.8589
Day 102 | 3 Nephi 13, 3 Nephi 14 | 61 | 88.6246
Day 103 | 3 Nephi 15, 3 Nephi 16 | 44 | 57.5464
Day 104 | 3 Nephi 17, 3 Nephi 18 | 64 | 154.1089
Day 105 | 3 Nephi 19 | 36 | 242.9214
Day 106 | 3 Nephi 20 | 46 | 31.2027
Day 107 | 3 Nephi 21, 3 Nephi 22 | 46 | 31.2027
Day 108 | 3 Nephi 23, 3 Nephi 24, 3 Nephi 25 | 38 | 184.5777
Day 109 | 3 Nephi 26, 3 Nephi 27 | 54 | 5.8277
Day 110 | 3 Nephi 28, 3 Nephi 29, 3 Nephi 30 | 50 | 2.5152
Day 111 | 4 Nephi 1 | 49 | 6.6871
Day 112 | Mormon 1, Mormon 2 | 48 | 12.8589
Day 113 | Mormon 3, Mormon 4 | 45 | 43.3746
Day 114 | Mormon 5, Mormon 6 | 46 | 31.2027
Day 115 | Mormon 7, Mormon 8 | 51 | 0.3433
Day 116 | Mormon 9 | 37 | 212.7496
Day 117 | Ether 1 | 43 | 73.7183
Day 118 | Ether 2, Ether 3 | 53 | 1.9996
Day 119 | Ether 4, Ether 5, Ether 6 | 55 | 11.6558
Day 120 | Ether 7, Ether 8 | 53 | 1.9996
Day 121 | Ether 9 | 35 | 275.0933
Day 122 | Ether 10, Ether 11 | 57 | 29.3121
Day 123 | Ether 12 | 41 | 112.0621
Day 124 | Ether 13, Ether 14 | 62 | 108.4527
Day 125 | Ether 15, Moroni 1, Moroni 2, Moroni 3, Moroni 4, Moroni 5 | 50 | 2.5152
Day 126 | Moroni 6, Moroni 7 | 57 | 29.3121
Day 127 | Moroni 8, Moroni 9 | 56 | 19.4839
Day 128 | Moroni 10 | 34 | 309.2652

References


Wednesday, February 21, 2018

Modeling vs Programming, Python vs GAMS

In [1] a question is posed about formulating an LP model dealing with a shortest path problem. The suggested solution is programmed in Python. This is a good example where coding even in a high-level programming language is less compact than a formulation in a modeling language. Lots of code is needed to shoe-horn the model and data into an LP data structure. We see that a standard matrix form LP:

\[\begin{align} \min\>&c^Tx\\ &Ax=b\\&x\ge 0\end{align}\]

as data structure is not always a good abstraction level. In the GAMS model below, we actually use a model that is closer to the problem at hand:

\[\begin{align} \min\> & z = \sum_{(i,j) \in A} d_{i,j} x_{i,j} \\& \sum_{(j,i) \in A} x_{j,i} + b_i = \sum_{(i,j) \in A} x_{i,j} & \forall i\\& x_{i,j}\ge 0 \end{align}\]

Here \(A\) is the set of (directed) arcs. I.e. we lower the abstraction level. This looks like a bad idea, but it makes the step we need to make from problem to code much smaller.

Quantitatively, we need between 2 and 3 times as many lines of code in Python as in the corresponding GAMS model. Python is well known for its compactness: no declarations and a rich library of built-in functions for data manipulation lead to fewer lines of code than in many other languages. The GAMS model, in contrast, devotes quite a few lines to declarations. This requires some extra typing, but it provides type and domain checks that help with larger, more complex models.

Besides compactness, I also think the Python version is much more obfuscated: it is difficult to even recognize it as a shortest path network LP model, while the GAMS model stays very close to the mathematical formulation. Readability is probably the most important feature when implementing maintainable, real optimization models, and in the Python code this got lost. Note that this shortest path LP is a very simple model; these observations are even more relevant for the larger, more complex models we see in practice. Finally, of course, there are many Python modeling frameworks that provide better tools to create LP models.

I often do not understand the attraction of solving LP models through matrix-oriented APIs. These types of APIs are found in Python and are prevalent in R and Matlab [2]. Only for very structured models is this interface really easy to use. The debate between modeling systems and matrix generators actually goes back a long time [3].

Python code

Code is taken from [1].

import numpy as np
from scipy.optimize import linprog

""" DATA"""
edges = [('A', 'B', 8),
         ('A', 'F', 10),
         ('B', 'C', 4),
         ('B', 'E', 10),
         ('C', 'D', 3),
         ('D', 'E', 25),
         ('D', 'F', 18),
         ('E', 'D', 9),
         ('E', 'G', 7),
         ('F', 'A', 5),
         ('F', 'B', 7),
         ('F', 'C', 3),
         ('F', 'E', 2),
         ('G', 'D', 2),
         ('G', 'H', 3),
         ('H', 'A', 4),
         ('H', 'B', 9)]
s, t = 'G', 'C'

""" Preprocessing """
nodes = sorted(set([i[0] for i in edges]))  # assumption: each node has an outedge
n_nodes = len(nodes)
n_edges = len(edges)

edge_matrix = np.zeros((len(nodes), len(nodes)), dtype=int)
for edge in edges:
    i, j, v = edge
    i_ind = nodes.index(i)  # slow
    j_ind = nodes.index(j)  # """
    edge_matrix[i_ind, j_ind] = v

nnz_edges = np.nonzero(edge_matrix)
edge_dict = {}
counter = 0
for e in range(n_edges):
    a, b = nnz_edges[0][e], nnz_edges[1][e]
    edge_dict[(a,b)] = counter
    counter += 1

s_ind = nodes.index(s)
t_ind = nodes.index(t)

""" LP """
bounds = [(0, 1) for i in range(n_edges)]
c = [i[2] for i in edges]

A_rows = []
b_rows = []
for source in range(n_nodes):
    out_inds = np.flatnonzero(edge_matrix[source, :])
    in_inds = np.flatnonzero(edge_matrix[:, source])

    rhs = 0
    if source == s_ind:
        rhs = 1
    elif source == t_ind:
        rhs = -1

    n_out = len(out_inds)
    n_in = len(in_inds)

    out_edges = [edge_dict[a, b] for a, b in np.vstack((np.full(n_out, source), out_inds)).T]
    in_edges = [edge_dict[a, b] for a, b in np.vstack((in_inds, np.full(n_in, source))).T]

    A_row = np.zeros(n_edges)
    A_row[out_edges] = 1
    A_row[in_edges] = -1

    A_rows.append(A_row)
    b_rows.append(rhs)

A = np.vstack(A_rows)
b = np.array(b_rows)
res = linprog(c, A_eq=A, b_eq=b, bounds=bounds)
print(res)

The reason this model is less than obvious is that we need to map an irregular collection of arcs \((i,j)\) to a one-dimensional decision variable \(x_k\). If the graph were dense we could use a simple calculation: \(k := i + j \cdot n\). But in this case our graph is sparse, so the mapping \((i,j) \rightarrow k\) is not straightforward. In addition we need to be able to find all arcs entering or leaving a given node \(i\). Some well-chosen dicts can help here. In the end we do a lot of programming not directly related to the mathematical model, which means the model itself gets obscured by large amounts of code. And of course this code needs to be debugged, maintained etc., so it is certainly not without cost.
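To illustrate the kind of bookkeeping involved (this snippet is mine, not from [1]): a dict comprehension can at least keep the sparse \((i,j)\rightarrow k\) mapping readable. The edge data here is a small hypothetical subset:

```python
# map sparse arcs (i, j) to LP column indices k
edges = [('A', 'B', 8), ('A', 'F', 10), ('F', 'B', 7)]
col = {(i, j): k for k, (i, j, _) in enumerate(edges)}   # (i,j) -> column
# arcs leaving / entering a node become simple comprehensions
out_cols = [c for (i, j), c in col.items() if i == 'A']  # arcs leaving 'A'
in_cols = [c for (i, j), c in col.items() if j == 'B']   # arcs entering 'B'
print(col, out_cols, in_cols)
```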

GAMS Model


set i 'nodes' /A*H/;
alias(i,j);
parameters
  d(i,j) 'distances' /
    A.(B 8, F 10)
    B.(C 4, E 10)
    C.D 3
    D.(E 25, F 18)
    E.(D 9, G 7)
    F.(A 5, B 7, C 3, E 2)
    G.(D 2, H 3)
    H.(A 4, B 9) /
  inflow(i) 'rhs' /
    G 1
    C -1
   /;

set a(i,j) 'arcs exist if we have a distance';
a(i,j) = d(i,j);

positive variables x(i,j) 'flow';
variable z 'obj';

equations
   obj   'objective'
   nodebal(i) 'node balance'
;

obj.. z =e= sum(a,d(a)*x(a));
nodebal(i)..  sum(a(j,i), x(j,i)) + inflow(i) =e= sum(a(i,j), x(i,j));

model m /all/;
solve m minimizing z using lp;
display x.l;

The operations that were so problematic in Python, are actually straightforward in GAMS. Just write the equations as stated in the mathematical formulation. The set a implements the set of arcs. This allows us to have quite a natural expression of the flow conservation constraints.

The GAMS model is actually well suited to solve large sparse instances. To test this we can generate random data:

set i 'nodes' /node1*node1000/;
alias(i,j);
parameters
  d(i,j) 'distances'
  inflow(i) 'rhs' /
    node1 1
    node1000 -1
   /;
d(i,j)$(uniform(0,1) < 0.05) = uniform(1,100);


This will give us a network with 1,000 nodes and about 50,000 arcs. We don't need to change any of the data representation as all parameters and (multi-dimensional) sets are automatically stored using sparse data structures. This model with 1,000 equations and 50k variables takes 0.3 seconds to generate and 0.3 seconds to solve (it is a very easy LP problem). Note that the solver called in the above Python code is not able to handle a model of this size. The Python code takes 5.5 seconds to generate the LP problem; then we are stuck in the LP solver for hours. Dense LP solvers are really only for very small problems.


References


  1. Using SciPy's minimize to find the shortest path in a graph, https://stackoverflow.com/questions/48898447/using-scipys-minimize-to-find-the-shortest-path-in-a-graph
  2. Matlab vs GAMS: linear programming, http://yetanothermathprogrammingconsultant.blogspot.com/2016/09/matlab-vs-gams-linear-programming.html
  3. R. Fourer, Modeling Languages versus Matrix Generators for Linear Programming. ACM Transactions on Mathematical Software 9 (1983) 143-183.

Wednesday, February 14, 2018

Van der Waerden Puzzle

In [1] a puzzle is mentioned:

Find a binary sequence \(x_1, \dots , x_8\) that has no three equally spaced \(0\)s and no three equally spaced \(1\)s

I.e. we want to forbid patterns like \(111\),\(1x1x1\), \(1xx1xx1\) and \(000\),\(0x0x0\), \(0xx0xx0\). It is solved as a SAT problem in [1], but we can also write this down as a MIP. Instead of just 0s and 1s we make it a bit more general: any color from a set \(c\) can be used (this allows us to use more colors than just two and we can make nice pictures of the results). A MIP formulation can look like the two-liner:

\[\begin{align}& x_{i,c} + x_{i+k,c} + x_{i+2k,c} \le 2 && \forall i,k,c  \text{ such that } i+2k \le n\\ & \sum_c x_{i,c}=1 && \forall i\end{align} \]

where \(x_{i,c} \in \{0,1\}\) indicates whether color \(c\) is selected for position \(i\). For \(n=8\) and two colors we find six solutions:


For \(n=9\) we cannot find any solution. This corresponds to the Van der Waerden number \(W(3,3)=9\).
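For such small \(n\) these counts are easy to confirm by brute-force enumeration (a quick independent check, not part of the MIP):

```python
from itertools import product

def has_no_triple(seq):
    """True if no three equally spaced positions carry the same value."""
    n = len(seq)
    return not any(seq[i] == seq[i + k] == seq[i + 2 * k]
                   for k in range(1, n) for i in range(max(0, n - 2 * k)))

n8 = sum(has_no_triple(s) for s in product([0, 1], repeat=8))
n9 = sum(has_no_triple(s) for s in product([0, 1], repeat=9))
print(n8, n9)  # 6 valid sequences for n=8, none for n=9
```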

For small problems we can enumerate these by adding cuts and resolving. For slightly larger problems the solution pool technique in solvers like Cplex and Gurobi can help.

For three colors \(W(3,3,3)=27\) i.e. for \(n=27\) we cannot find a solution without three equally spaced colors. For \(n=26\) we can enumerate all 48 solutions:


Making this problem only a little bigger brings our model to a screeching halt. E.g. \(W(3,3,3,3)=76\): I could not find solutions for \(n=75\) within 1,000 seconds, let alone enumerate them all.

MIP vs CP formulation


In the MIP formulation we used a binary representation of the decision variables \(x_{i,c}\). If we use a Constraint Programming solver (or an SMT solver), we can use integer variables \(x_i\) directly.

Binary MIP formulationInteger CP formulation
\[\large{\begin{align}\min\>&0\\& x_{i,c} + x_{i+k,c} + x_{i+2k,c} \le 2 &&  \forall  i+2k \le n\>\forall c \\ & \sum_c x_{i,c}=1 && \forall i\\ & x_{i,c} \in \{0,1\} \end{align}} \] \[ \large{\begin{align} &x_{i} \ne x_{i+k} \vee x_{i+k} \ne x_{i+2k} && \forall  i+2k \le n\\ & x_i \in \{1,\dots,nc\} \end{align} }\]


Python/Z3 version


This is to illustrate the integer CP formulation. Indexing in Python starts at 0 so we have to slightly adapt the model for this:

from z3 import *

n = 26
nc = 3

# integer variables X[i]
X = [ Int('x%s' % i ) for i in range(n) ]

# lower bounds X[i] >= 1
Xlo = And([X[i] >= 1 for i in range(n)])
# upper bounds X[i] <= nc
Xup = And([X[i] <= nc for i in range(n)])
# forbid equally spaced colors 
Xforbid = And( [ Or(X[i] != X[i+k+1], X[i+k+1] != X[i+2*(k+1)] ) for i in range(n) \
               for k in range(n) if i+2*(k+1) < n])
# combine all constraints
Cons = [Xlo,Xup,Xforbid]
       
# find all solutions    
s = Solver()
s.add(Cons)
k = 0
res = []
while s.check() == sat:
    m = s.model()
    k = k + 1
    sol = [ m.evaluate(X[i]) for i in range(n)]
    res = res + [(k,sol)]
    forbid = Or([X[i] != sol[i] for i in range(n) ])
    s.add(forbid)     
print(res)

Here we use And, Or and != to model our Xforbid constraint. We could have used the same binary variable approach as in the integer programming model. I tested that also: it turned out to be slower than using integers directly.

Performance


Here are some timings for generating these 48 solutions for the above example:

Model/Solver | Statistics | Solution time
Binary formulation with Cplex (solution pool) | 494 rows, 78 binary variables, 51,092 simplex iterations, 3,721 nodes | 1.6 seconds
Integer formulation with Z3 | 26 integer variables; 693 boolean variables, 5,158 clauses (first model); 871 boolean variables, 13,990 clauses (last model) | 5.5 seconds
Binary formulation with Z3 | 78 binary variables; 860 boolean variables, 2,974 clauses (first model); 1,015 boolean variables, 6,703 clauses (last model) | 137 seconds


The binary MIP formulation is not doing poorly at all when solved with a state-of-the-art MIP solver. Z3 really does not like the same binary formulation: we need to use the integer formulation instead to get good performance. Different formulations of the same problem can make a substantial difference in performance.


Van der Waerden (1903-1996)


References


  1. Donald E. Knuth, The Art of Computer Programming, Volume 4, Fascicle 6, Satisfiability, 2015
  2. Van der Waerden Number, https://en.wikipedia.org/wiki/Van_der_Waerden_number