Monday, September 18, 2017

Conference Scheduler

In [1,2] a Python-based conference scheduler is presented. The problem is as follows:

  • There are events (i.e. talks) denoted by \(i\) and slots denoted by \(j\)
  • Some restrictions include:
    • Do not schedule talks by the same speaker in the same slot
  • Assign events to slots such that one of the following objectives is optimized:
    • Minimize total room overflow (that is: demand exceeding capacity)
    • Minimize maximum room overflow
    • Minimize changes from previous schedule

In [2] the model is stated as follows:

\[\bbox[lightcyan,10px,border:3px solid darkblue]{ \begin{align}
\min\>&\sum_{i,j} (d_i-c_j) x_{i,j}&&\text{Sum of room overflow}\\
          &\sum_j x_{i,j} =1\>\forall i \\
          &\sum_i x_{i,j} \le 1\>\forall j \\
          &x_{i,j} = 0\> \forall i,j \in C_S && \text{Unavailability} \\
          &x_{i,j} + x_{i',j'} \le 1\> \forall i,j,i',j' \in C_E &&\text{Not at same day}\\
          &x_{i,j} \in \{0,1\}
\end{align}}\]

You can choose any one of the following objectives:

\[\begin{align}
1.&\sum_{i,j} (d_i-c_j) x_{i,j}&&\text{Sum of room overflow}\\
2.&\max_{i,j} (d_i-c_j) x_{i,j}&&\text{Max of room overflow}\\
3.&\sum_{i,j} \left[(1-x^0_{i,j})x_{i,j}+x^0_{i,j}(1-x_{i,j})\right]&&\text{Difference from a previous solution $x^0$}
\end{align}\]

The model can be solved with a MIP solver (via the PuLP Python package) or with a simulated annealing code.
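
For concreteness, here is a minimal PuLP sketch of the model above with objective 1. All data (events, slots, demands, capacities, and the sets \(C_S\) and \(C_E\)) is invented for illustration; this is not the conference-scheduler package's own API.

# Minimal PuLP sketch of the assignment model above (objective 1).
# All data below is made up; this is not the conference-scheduler API.
import pulp

events = ["e1", "e2", "e3"]
slots = ["s1", "s2", "s3", "s4"]
demand = {"e1": 100, "e2": 50, "e3": 200}                # d[i]: expected audience
capacity = {"s1": 80, "s2": 150, "s3": 300, "s4": 60}    # c[j]: room capacity of the slot
unavailable = [("e1", "s4")]                             # C_S: forbidden (event, slot) pairs
clashes = [(("e2", "s1"), ("e3", "s2"))]                 # C_E: assignments that may not both be used

m = pulp.LpProblem("conference", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (events, slots), cat="Binary")

# objective 1: sum of room overflow, with the raw (d[i] - c[j]) coefficients
m += pulp.lpSum((demand[i] - capacity[j]) * x[i][j] for i in events for j in slots)

for i in events:                                         # every event is scheduled exactly once
    m += pulp.lpSum(x[i][j] for j in slots) == 1
for j in slots:                                          # at most one event per slot
    m += pulp.lpSum(x[i][j] for i in events) <= 1
for (i, j) in unavailable:                               # unavailability
    m += x[i][j] == 0
for (i, j), (i2, j2) in clashes:                         # conflicting assignments
    m += x[i][j] + x[i2][j2] <= 1

m.solve()
print({(i, j): x[i][j].value() for i in events for j in slots if x[i][j].value() > 0.5})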

This is a nice, clean model. Of course there are always things I probably would do differently, largely based on taste.

  1. When I did something like this I used a variable \(x_{e,r,t}\) with \(r,t\) indicating the room and the time period. In this package \(j\) captures both \((r,t)\). As the package supports events and slots of different length (e.g. 30, 60, 90 minutes) that makes sense.
  2. Objective 1 seems to prefer assignments with \(d_i \ll c_j\): those terms are negative, so putting small events in big rooms is rewarded. This can be fixed by using: \(\sum_{i,j} \max(d_i-c_j,0) x_{i,j}\). Of course we can do even better by also penalizing assignments of small events to big rooms, as shown in the rightmost picture (see also the small coefficient sketch after this list):

    [figure]
  3. It is not always easy to choose between the sum and the max (objectives 1 and 2). Actually, a combination often works very well.
  4. The same holds for the last objective. In general I combine it with the other objectives, so that we can deviate from an existing schedule if it pays off for another objective.
  5. Instead of fixing variables \(x\) to zero if assignments are forbidden (and leaving it to the presolver to get rid of these variables), I prefer not to generate these variables at all. I guess I am old-fashioned.
  6. I probably would make the model very elastic: make sure we can always produce a feasible schedule (at the cost of renting very expensive “emergency” rooms). This way we can at least see a solution instead of just a message “infeasible”.
  7. I often prefer to present results in spreadsheet tables, in hopefully a meaningful layout. That seems to work better as a communication tool than a text file:
    [figure]
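
As mentioned in point 2, here is a tiny sketch of a two-sided objective coefficient that penalizes both overflow and wasted capacity. The weight alpha is a hypothetical tuning parameter, not something from the package:

alpha = 0.1   # hypothetical weight for wasted capacity (overflow is weighted 1)

def assignment_cost(d_i, c_j):
    overflow = max(d_i - c_j, 0)   # attendees that do not fit
    slack = max(c_j - d_i, 0)      # empty seats
    return overflow + alpha * slack

# use assignment_cost(demand[i], capacity[j]) as the coefficient of x[i][j]
# instead of the raw (demand[i] - capacity[j]) of objective 1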

OK, enough of the nitpicking.  In any case, interesting stuff.

References
  1. A Python tool to assist the task of scheduling a conference, https://github.com/PyconUK/ConferenceScheduler
  2. Documentation, http://conference-scheduler.readthedocs.io/en/latest/index.html

Thursday, September 14, 2017

Minimizing Standard Deviation


In [1] an attempt was made to model the following situation:

  • There are \(n=25\) bins (indicated by \(i\)), each with \(q_i\) parts (\(q_i\) is given).
  • There are \(m=5\) pallets, indicated by \(j\). We can load up to 5 bins onto a pallet. (Note: this implies each pallet will get exactly 5 bins.)
  • We want to minimize the standard deviation of the number of parts loaded onto a pallet.

The question of how to model the standard deviation thing comes up now and then, so let’s see how we can model this simple example.

First we notice that we need to introduce some kind of binary assignment variable to indicate on which pallet a bin is loaded:

\[x_{i,j}=\begin{cases}1 & \text{if bin $i$ is assigned to pallet $j$}\\0&\text{otherwise}\end{cases}\]

A first model can look like:

\[\bbox[lightcyan,10px,border:3px solid darkblue]{
\begin{align}\min\>& \sqrt{\frac{\sum_j (p_j-\mu)^2}{m-1}}&&\text{minimize standard deviation}\\
&\sum_j x_{i,j} = 1 \>\> \forall i && \text{assign bin $i$ exactly once}\\
&\sum_i x_{i,j} \le 5 \>\> \forall j && \text{at most 5 bins per pallet}\\
& p_j = \sum_i q_i x_{i,j}&&\text{number of parts on a pallet}\\
& \mu = \frac{\sum_j p_j}{m}&&\text{average number of parts on a pallet} \\
& x_{i,j} \in \{0,1\} \end{align} } \]

The variables \(p_j\) and \(\mu\) are (unrestricted) continuous variables (to be precise: \(p_j\) will automatically be integer valued when the \(q_i\) are integers).

The objective is complicated and would require an MINLP solver. However, the square root is monotone and \(m-1\) is a constant, so minimizing the standard deviation is equivalent to minimizing the sum of squared deviations:

\[\min\>\sum_j (p_j-\mu)^2 \]

Now we have a MIQP model that can be solved with a number of good solvers.

It is possible to look at the spread in a different way (there is nothing sacred about the standard deviation) and come up with a linear objective:

\[\begin{align}\min\>&p_{\text{max}}-p_{\text{min}}\\
&p_{\text{max}} \ge p_j \>\>\forall j\\
&p_{\text{min}} \le p_j \>\>\forall j \end{align}\]

This is the range of the \(p_j\)’s. In practice we might even try to use a simpler approach:

\[\begin{align}\min\>&p_{\text{max}}\\
&p_{\text{max}} \ge p_j \>\>\forall j \end{align}\]

This objective will also reduce the spread, although in an indirect way: the total \(\sum_j p_j = \sum_i q_i\) is fixed, so pushing down the maximum forces the loads to become more balanced. To be complete I also want to mention that I have seen cases where a quadratic objective of the form:

\[\min\>\sum_j p_j^2\]

was used. Indirectly, this objective will also minimize the spread: with \(\sum_j p_j\) fixed, the sum of squares is smallest when the \(p_j\) are as equal as possible.

The original post [1] suggests the following number of parts:

[data: \(q_i\) values drawn from a Normal distribution]

Obviously drawing from the Normal distribution will give us fractional values. In practice I would expect the number of parts to be a whole number. Of course if we consider another attribute such as weight, we could see fractional values. The model per se does not assume integer valued \(q_i\)'s, so let's stick with these numbers. We will see below that this is actually quite a difficult data set (even though we only have 125 discrete variables, which makes the model rather small), and that other data will make the model much easier to solve.

To reduce some symmetry in the model we can add:

\[p_j \ge p_{j-1}\>\>\forall j>1\]

Not only will this help with symmetry, it also makes the variable \(p\) easier to interpret in the solution. You can see this in the results below.
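
Here is a minimal PuLP sketch of this linear model, including the symmetry-breaking constraint. The \(q_i\) data below is made up; it is not the dataset used for the timings reported below.

# Minimal PuLP sketch of the linear model: minimize p_max - p_min with the
# symmetry-breaking constraint p[j] >= p[j-1]. The q data is made up here.
import pulp
import random

random.seed(123)
bins = list(range(25))
pallets = list(range(5))
q = {i: random.uniform(0.5, 1.5) for i in bins}   # parts (or weight) per bin

mdl = pulp.LpProblem("pallets", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (bins, pallets), cat="Binary")
p = pulp.LpVariable.dicts("p", pallets)           # parts loaded on each pallet
pmin = pulp.LpVariable("pmin")
pmax = pulp.LpVariable("pmax")

mdl += pmax - pmin                                # objective: the range

for i in bins:                                    # each bin goes on exactly one pallet
    mdl += pulp.lpSum(x[i][j] for j in pallets) == 1
for j in pallets:
    mdl += pulp.lpSum(x[i][j] for i in bins) <= 5             # at most 5 bins per pallet
    mdl += p[j] == pulp.lpSum(q[i] * x[i][j] for i in bins)   # parts on pallet j
    mdl += pmax >= p[j]
    mdl += pmin <= p[j]
    if j > 0:
        mdl += p[j] >= p[j - 1]                   # symmetry breaking: order the pallets

mdl.solve()
print([p[j].value() for j in pallets])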

The two quadratic models were difficult to solve to optimality. I stopped after 1000 seconds and both were still working. Interestingly just minimizing \(\sum p_j^2\) seems to get a somewhat better solution within the 1000 second time limit: it reduces the range from 0.1 to 0.052 and the standard deviation from 0.04 to 0.02.

[results of the two quadratic models]

The linear models solve to global optimality quickly and get better results.

[results of the two linear models]

By using a linear approximation we can reduce the range to 0.034 and 0.038 (and the standard deviation to 0.014 and 0.016). The model that minimizes \(p_{max}-p_{min}\) seems to perform best: the solution quality is very good and the solution time is the fastest (34 seconds).

This is an example of a model where MIQP solvers are not as fast as we want: they really fall behind their linear counterparts.

Optimality

Can we find the best standard deviation possible? We can use the following algorithm to give this a try:

  1. Solve the linear model with objective \(\min p_{max}-p_{min}\) to optimality.
  2. Solve the quadratic model with objective \(\min \sum_j (p_j-\mu)^2\) using MIPSTART.

This data set, although small, is still a big problem for the state-of-the-art MIQP solvers around. Step 2 took hours on a small laptop (somewhat faster on a faster machine after throwing extra threads at it). This step did not improve the solution found in step 1, but rather proved it has the best possible standard deviation. Still very poor performance for a very small model!
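
A sketch of step 2 using gurobipy (assuming Gurobi is available). All data and names are made up; in the real algorithm the MIP start would be the optimal assignment from step 1, rather than the round-robin start used here just to keep the sketch self-contained.

# Sketch of step 2: the quadratic model warm-started from a known good assignment.
# Assumes Gurobi/gurobipy is installed; all data and names here are made up.
import gurobipy as gp
from gurobipy import GRB
import random

random.seed(123)
bins = list(range(25))
pallets = list(range(5))
q = {i: random.uniform(0.5, 1.5) for i in bins}

mdl = gp.Model("pallets_miqp")
x = mdl.addVars(bins, pallets, vtype=GRB.BINARY, name="x")
p = mdl.addVars(pallets, lb=-GRB.INFINITY, name="p")
mu = mdl.addVar(lb=-GRB.INFINITY, name="mu")

mdl.addConstrs(x.sum(i, "*") == 1 for i in bins)          # each bin on one pallet
mdl.addConstrs(x.sum("*", j) <= 5 for j in pallets)       # at most 5 bins per pallet
mdl.addConstrs(p[j] == gp.quicksum(q[i] * x[i, j] for i in bins) for j in pallets)
mdl.addConstr(len(pallets) * mu == p.sum())               # mu is the average load
mdl.setObjective(gp.quicksum((p[j] - mu) * (p[j] - mu) for j in pallets), GRB.MINIMIZE)

# MIP start: in step 2 this would be the optimal assignment from step 1;
# here a simple round-robin assignment keeps the sketch runnable on its own.
x_start = {(i, j): 1.0 if i % len(pallets) == j else 0.0 for i in bins for j in pallets}
for (i, j), v in x_start.items():
    x[i, j].Start = v

mdl.optimize()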

Dataset

The dataset we used turned out to be very difficult. If we use a different dataset for \(q_i\): integer values from a uniform distribution, we get much better performance.

Using the following data:

----     35 PARAMETER q 

bin1  13.000,    bin2  27.000,    bin3  21.000,    bin4  16.000,    bin5  16.000
bin6  14.000,    bin7  17.000,    bin8  27.000,    bin9  11.000,    bin10 20.000
bin11 30.000,    bin12 22.000,    bin13 30.000,    bin14 26.000,    bin15 12.000
bin16 23.000,    bin17 13.000,    bin18 15.000,    bin19 24.000,    bin20 19.000
bin21 17.000,    bin22 17.000,    bin23 12.000,    bin24 13.000,    bin25 22.000

the model with quadratic objective \(\min \sum_j (p_j-\mu)^2\) solves this in just 0.2 seconds to optimality (see the solution below). This model has exactly the same size as before, just different data for \(q_i\). This is one more proof that problem size (measured by number of variables and equations) is not necessarily a good indication of how difficult a problem is to solve.

[solution of the quadratic model on this dataset]

References
  1. LPSolve API, Minimize a function of constraints, https://stackoverflow.com/questions/46203974/lpsolve-api-minimize-a-function-of-constraints

Thursday, September 7, 2017

Quantiles with Pandas

In [1] I showed how quantiles in GAMS are difficult, and how they can be calculated better in R. Here is some Python code to do the same:

import pandas as pd

df = pd.read_csv("p.csv")                                  # read the raw data
q = df.groupby(['i','j']).quantile([0, .25, .5, .75, 1])   # quantiles per (i,j) group
print(q)
q.to_csv("q.csv")                                          # write the result to CSV

The q data frame looks like:

                  Val
i  j                
i1 j1 0.00  18.002966
       0.25  28.242058
       0.50  33.222936
       0.75  62.736221
       1.00  85.770764
    j2 0.00   7.644259
       0.25  41.375281
       0.50  61.313381
       0.75  82.127640
       1.00  99.813645
    j3 0.00  14.017667
       0.25  16.559221
       0.50  30.775334
       0.75  38.482815
       1.00  67.223932
i2 j1 0.00  11.938737
       0.25  29.259331
       0.50  55.029177
       0.75  69.633259
       1.00  83.258388
    j2 0.00  16.857103
       0.25  28.783298
       0.50  53.358812
       0.75  65.534761
       1.00  87.373769
    j3 0.00   5.608600
       0.25  17.433855
       0.50  33.311746
       0.75  45.566497
       1.00  64.926986

This is pretty clean. Data frames can easily be read in from CSV files (see the example above) or databases.

The new GAMS Python scripting tool is not very friendly in this respect. We need to do a lot of transformations leading to a low signal-to-noise ratio:

embeddedCode Python:
   import pandas as pd
   p = list(gams.get('p', keyFormat=KeyFormat.FLAT))
   df = pd.DataFrame(p, columns=["i", "j", "k", "value"])
   print("");print("df");print(df)
   q=df.groupby(['i','j']).quantile([0,.25,.5,.75,1])
   print("q");print(q)
# q has one multi-level index
# convert to standard indices
   q2 = q.reset_index()
   print("q2");print(q2)
# to extract a set q: convert to list of strings
   qset = q2["level_2"].unique().astype('str').tolist()
   print("qset");print(qset)
   gams.set("q", qset)
# get data itself: list of tuples
   quantiles = list(zip(q2["i"],q2["j"],q2["level_2"].astype('str'),q2["value"]))
   print("quantiles");print(quantiles)
   gams.set("quantiles",quantiles)
endEmbeddedCode q,quantiles
display q,quantiles;

Note that I am trying to prevent looping over data frame rows here: all operations are on complete columns. The CSV input/output version is actually much shorter and cleaner. In this code there is really only one line that does real work; the rest can be considered overhead.

The GAMS interface should probably support data frames directly to overcome this “impedance mismatch.” When users need to mix and match different languages, the interface should make things as easy as possible. A GAMS/Python interface that is too low level and stays too close to GAMS asks the user to be reasonably fluent in both GAMS and Python (the intersection of knowledgeable GAMS and Python users is likely to be small), and to write quite a bit of glue code to get data in and out. Choosing a better abstraction level would probably help here.

References

  1. http://yetanothermathprogrammingconsultant.blogspot.nl/2017/08/quantiles.html

Wednesday, September 6, 2017

Book: Designing Data-Intensive Applications

 

Was reading this during a long flight. Interesting review of the design decisions behind applications dealing with large (very large!) amounts of data.

Martin Kleppmann, Designing Data-Intensive Applications, 2017.

Contents:

Foundations of Data Systems
Chapter 1 Reliable, Scalable, and Maintainable Applications
    Thinking About Data Systems
    Reliability
    Scalability
    Maintainability
Chapter 2 Data Models and Query Languages
    Relational Model Versus Document Model
    Query Languages for Data
    Graph-Like Data Models
Chapter 3 Storage and Retrieval
    Data Structures That Power Your Database
    Transaction Processing or Analytics?
    Column-Oriented Storage
Chapter 4 Encoding and Evolution
    Formats for Encoding Data
    Modes of Dataflow
Distributed Data
Chapter 5 Replication
    Leaders and Followers
    Problems with Replication Lag
    Multi-Leader Replication
    Leaderless Replication
Chapter 6 Partitioning
    Partitioning and Replication
    Partitioning of Key-Value Data
    Partitioning and Secondary Indexes
    Rebalancing Partitions
    Request Routing
Chapter 7 Transactions
    The Slippery Concept of a Transaction
    Weak Isolation Levels
    Serializability
Chapter 8 The Trouble with Distributed Systems
    Faults and Partial Failures
    Unreliable Networks
    Unreliable Clocks
    Knowledge, Truth, and Lies
Chapter 9 Consistency and Consensus
    Consistency Guarantees
    Linearizability
    Ordering Guarantees
    Distributed Transactions and Consensus
Derived Data
Chapter 10 Batch Processing
    Batch Processing with Unix Tools
    MapReduce and Distributed Filesystems
    Beyond MapReduce
Chapter 11 Stream Processing
    Transmitting Event Streams
    Databases and Streams
    Processing Streams
Chapter 12 The Future of Data Systems
    Data Integration
    Unbundling Databases
    Aiming for Correctness
    Doing the Right Thing

Monday, September 4, 2017

What is this EPS in a GAMS solution?

I get this question a lot.  The smallest example is model number 1 from the GAMS model library [1]: a tiny transportation model (originally from [2]). The linear programming (LP) model looks like:

\[\begin{align} \min \>&z \\ &z = \sum_{i,j} c_{i,j} x_{i,j}&&\text{(cost)}\\ &\sum_j x_{i,j} \le a_i \>\forall i &&\text{(supply)}\\ &\sum_i x_{i,j} \ge b_j \>\forall j &&\text{(demand)}\\ &x_{i,j} \ge 0 \end{align}\]

The model has only two source nodes and three destination nodes and the solution looks like:

                           LOWER          LEVEL          UPPER         MARGINAL

---- EQU cost                .              .              .             1.0000     

  cost  define objective function

---- EQU supply  observe supply limit at plant i

                 LOWER          LEVEL          UPPER         MARGINAL

seattle          -INF          350.0000       350.0000         EPS        
san-diego        -INF          550.0000       600.0000          .         

---- EQU demand  satisfy demand at market j

                LOWER          LEVEL          UPPER         MARGINAL

new-york       325.0000       325.0000        +INF            0.2250     
chicago        300.0000       300.0000        +INF            0.1530     
topeka         275.0000       275.0000        +INF            0.1260     

---- VAR x  shipment quantities in cases

                          LOWER          LEVEL          UPPER         MARGINAL

seattle  .new-york          .            50.0000        +INF             .         
seattle  .chicago           .           300.0000        +INF             .         
seattle  .topeka            .              .            +INF            0.0360     
san-diego.new-york          .           275.0000        +INF             .         
san-diego.chicago           .              .            +INF            0.0090     
san-diego.topeka            .           275.0000        +INF             .         

                           LOWER          LEVEL          UPPER         MARGINAL

---- VAR z                 -INF          153.6750        +INF             .         

  z  total transportation costs in thousands of dollars

There are quite a few things to say about this solution listing.

There are 6 equations in this model and 7 variables. The objective function is modeled as an equality here as GAMS does not really have the notion of an objective function: it has an objective variable. The objective variable \(z\) is a free variable i.e. it has bounds \(-\infty\) and \(\infty\). This is also a GAMS convention.

The dots indicate a zero value. In the cost equation we can see the lower- and upper-bound are the same (both zero) indicating an equality. Note that this equation is rewritten as \(z - \sum_{i,j} c_{i,j} x_{i,j} = 0\) hence the bounds having a zero value.

The column marked MARGINAL contains the duals (for the rows) and the reduced costs (for the variables).
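
For readers more at home in Python than in GAMS listings, here is a small PuLP version of the same transport model that prints these duals and reduced costs. Note that the reported values (and their signs) can differ between solvers.

# The same tiny transport model in PuLP, printing duals and reduced costs
# (what GAMS calls marginals).
import pulp

supply = {"seattle": 350, "san-diego": 600}
demand = {"new-york": 325, "chicago": 300, "topeka": 275}
dist = {("seattle", "new-york"): 2.5, ("seattle", "chicago"): 1.7, ("seattle", "topeka"): 1.8,
        ("san-diego", "new-york"): 2.5, ("san-diego", "chicago"): 1.8, ("san-diego", "topeka"): 1.4}
c = {k: 90 * d / 1000 for k, d in dist.items()}    # freight: $90 per case per 1000 miles

m = pulp.LpProblem("transport", pulp.LpMinimize)
x = {(i, j): pulp.LpVariable(f"x_{i}_{j}".replace("-", "_"), lowBound=0) for (i, j) in dist}
m += pulp.lpSum(c[k] * x[k] for k in dist)

for i in supply:
    m += pulp.lpSum(x[i, j] for j in demand) <= supply[i], "supply_" + i.replace("-", "_")
for j in demand:
    m += pulp.lpSum(x[i, j] for i in supply) >= demand[j], "demand_" + j.replace("-", "_")

m.solve()
print("objective:", pulp.value(m.objective))       # 153.675, as in the listing above
for name, con in m.constraints.items():
    print(name, "dual =", con.pi)                  # row marginals
for k in dist:
    print(k, "reduced cost =", x[k].dj)            # column marginals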

The row SUPPLY(‘SEATTLE’) has a marginal with a value of EPS. This means: this row is non-basic but with a zero dual. The EPS is used to signal that this is a non-basic row: if the value were zero we would have concluded the row is basic. All basic rows and variables have a marginal that is zero. To verify, we can count the basic and non-basic variables:

ROW/COLUMN                  MARGINAL      BASIS-STATUS

cost                           1.000         NB
supply('seattle')                EPS         NB
supply('san-diego')              .           B
demand('new-york')             0.2250        NB
demand('chicago')              0.1530        NB
demand('topeka')               0.1260        NB
x('seattle','new-york')          .           B
x('seattle','chicago')           .           B
x('seattle','topeka')          0.0360        NB
x('san-diego','new-york')        .           B
x('san-diego','chicago')       0.0090        NB
x('san-diego','topeka')          .           B
z                                .           B

All the variables and rows that have a marginal of zero are basic. There are 6 of them. This is identical to the number of equations. This conforms to what we would expect. The other 7 rows and columns have a non-zero marginal and are non-basic.

The occurrence of an EPS in the marginals indicates dual degeneracy. This means there are multiple optimal solutions (to be precise: multiple optimal bases). Indeed, here is an alternative solution found with a different LP solver:

                           LOWER          LEVEL          UPPER         MARGINAL

---- EQU cost                .              .              .             1.0000     

  cost  define objective function

---- EQU supply  observe supply limit at plant i

                 LOWER          LEVEL          UPPER         MARGINAL

seattle          -INF          300.0000       350.0000          .         
san-diego        -INF          600.0000       600.0000          .         

---- EQU demand  satisfy demand at market j

                LOWER          LEVEL          UPPER         MARGINAL

new-york       325.0000       325.0000        +INF            0.2250     
chicago        300.0000       300.0000        +INF            0.1530     
topeka         275.0000       275.0000        +INF            0.1260     

---- VAR x  shipment quantities in cases

                          LOWER          LEVEL          UPPER         MARGINAL

seattle  .new-york          .              .            +INF            EPS        
seattle  .chicago           .           300.0000        +INF             .         
seattle  .topeka            .              .            +INF            0.0360     
san-diego.new-york          .           325.0000        +INF             .         
san-diego.chicago           .              .            +INF            0.0090     
san-diego.topeka            .           275.0000        +INF             .         

                           LOWER          LEVEL          UPPER         MARGINAL

---- VAR z                 -INF          153.6750        +INF             .         

  z  total transportation costs in thousands of dollars

The term primal degeneracy is used to indicate that a basic row or column is at one of its bounds (usually basic rows and columns are strictly between their bounds). We see an example of that in row SUPPLY(‘SAN-DIEGO’) here: this row is basic but its level of 600 is equal to the upper bound.  This solution is also dual degenerate: we have an EPS marginal for variable X(‘SEATTLE’,’NEW-YORK’).

Notes

Q: Can we enumerate all possible optimal solutions? Not so easy, but in [3] an interesting approach is demonstrated on this model.

Q: Can we make \(z\) a positive variable? Well, GAMS is not very consistent here.
You cannot declare ‘positive variable z’ as GAMS will complain: Objective variable is not a free variable. But we can assign: ‘z.lo = 0;’. Don’t ask me why. In general I just keep the variable unbounded.

Q: Can we use ranged equations in GAMS? No.
The listing file seems to indicate we can set lower and upper bounds on rows, but this is not really true. If you assign values, say ‘supply.up(i) = INF;’, GAMS will ignore this and reset the bounds on the equations when generating the model.

Q: What about NLPs? Roughly the same. There is one major deviation: NLP solutions typically contain superbasic variables. These are non-basic (i.e. with a nonzero marginal) but are between their bounds.

Q: What about MIPs? By default GAMS will do the following after a MIP terminates: fix all discrete variables at their current levels and resolve the model as an LP. So you will see marginals from this final LP. Marginals for a real MIP do not really exist (nodes in the branch-and-bound tree are solved by, say, a dual simplex method with a number of discrete variables fixed; in addition, modern MIP solvers add cuts and apply heuristics).

Q: Can a free variable be non-basic? In almost all cases a free variable in an LP solution will be basic (a free basic variable will in general never leave the basis). However, it is possible for such a variable to be non-basic in which case it is usually zero.  

Q: Why do we need this EPS? The purpose of this EPS thing is to allow GAMS to set up an advanced basis for subsequent solves. These subsequent solves will typically be faster when we use an advanced basis. This method is solver independent. E.g. to make Gurobi look good we can do:
  option lp=cplex;
  solve m using lp minimizing z;
  option lp=gurobi;
  solve m using lp minimizing z;
This even works with different models that share many variables and equations:
  option lp=cplex;
  solve m1 using lp minimizing z1;
  option nlp=minos;
  solve m2 using nlp minimizing z2;

Q: My solver does not have a basis-preserving presolve, so I have to choose between presolving and using an advanced basis. Yes, unfortunately that is often the case with modern solvers. You need to make a decision here, hopefully after doing some tests. If the restarts solve very fast (i.e. need few iterations), bypassing the presolver is usually not a big deal.

Q: What is the meaning of a marginal when the model is infeasible? I am not sure. Is it with respect to a phase I, a phase II, or some composite objective? I would say: this is not well-defined.

Q: Any other uses for EPS? Yes. GAMS stores everything in a sparse format, so zeros are not stored (the exception is a scalar, which is stored even if zero). When exporting data we sometimes want explicit zeros to be exported; an example is preventing a matrix from missing a row or column that contains only zeros. We can assign an EPS for these cases. Many tools like GDXXRW or GDX2SQLITE can convert EPS values to physical 0.0 values.

References
  1. https://www.gams.com/latest/gamslib_ml/libhtml/gamslib_trnsport.html
  2. Dantzig, G. B., Linear Programming and Extensions. Princeton University Press, Princeton, New Jersey, 1963.
  3. http://yetanothermathprogrammingconsultant.blogspot.com/2016/01/finding-all-optimal-lp-solutions.html