## Monday, September 18, 2017

### Conference Scheduler

In [1,2] a Python-based conference scheduler is presented. The problem is as follows:

• There are events (i.e. talks) denoted by $$i$$ and slots denoted by $$j$$
• Some restrictions include:
• Do not schedule talks by the same speaker at the same slot
• Assign events to slots such that one of the following objectives is optimized:
• Minimize total room overflow (that is: demand exceeds capacity)
• Minimize maximum room overflow
• Minimize changes from previous schedule

In [2] the model is stated as follows:

 \bbox[lightcyan,10px,border:3px solid darkblue]{ \begin{align}\min\>&\sum_{i,j} (d_i-c_j) x_{i,j}&&\text{Sum of room overflow}\\ &\sum_j x_{i,j} =1\>\forall i \\ &\sum_i x_{i,j} \le 1\>\forall j \\ &x_{i,j} = 0\> \forall i,j \in C_S && \text{Unavailability} \\ &x_{i,j} + x_{i’,j’} \le 1\> \forall i,j,i’,j’ \in C_E &&\text{Not at same day}\\ &x_{i,j} \in \{0,1\}\end{align}}

You can choose any single one from the objectives:

 \begin{align}1.&\sum_{i,j} (d_i-c_j) x_{i,j}&&\text{Sum of room overflow}\\2.&\max_{i,j} (d_i-c_j) x_{i,j}&&\text{Max of room overflow}\\3.&\sum_{i,j} \left[(1-x^0_{i,j})x_{i,j}+x^0_{i,j}(1-x_{i,j})\right]&&\text{Difference from a previous solution x^0}\end{align}

The model can be solved with a MIP solver (via the PULP Python package) or using a simulated annealing code.

This is a nice, clean model. Of course there are always things I probably would do differently, largely based on taste.

1. When I did something like this I used a variable $$x_{e,r,t}$$ with $$r,t$$ indicating the room and the time period. In this package $$j$$ captures both $$(r,t)$$. As the package supports events and slots of different length (e.g. 30, 60, 90 minutes) that makes sense.
2. Objective 1 seems to prefer assignments with $$d_i \ll c_j$$. This can be fixed by using: $$\sum_{i,j} \max(d_i-c_j,0) x_{i,j}$$. Of course we can do even better by also penalizing assignments of small events to big rooms, as shown in the rightmost picture:

3. It is not always easy to choose between the sum vs the max (objectives 1 and 2). Actually often a combination works very well.
4. Similar for the last objective. In general I combine this with the other objectives, so that we can deviate from an existing schedule if it pays off for another objective.
5. Instead of fixing variables $$x$$ to zero if assignments are forbidden (and leave it to the presolver to get rid of these variables) I like to not even generate these variables. I guess I am old-fashioned.
6. I probably would make the model very elastic: make sure we can always produce a feasible schedule (at the cost of renting very expensive “emergency” rooms). This way we can at least see a solution instead of just a message “infeasible”.
7. I often prefer to present results in spreadsheet tables, in hopefully a meaningful layout. That seems to work better as a communication tool than a text file:

OK, enough of the nitpicking.  In any case, interesting stuff.

##### References
1. A Python tool to assist the task of scheduling a conference, https://github.com/PyconUK/ConferenceScheduler

## Thursday, September 14, 2017

### Minimizing Standard Deviation

In [1] an attempt was made to model the following situation:

• There are $$n=25$$ bins (indicated by $$i$$), each with $$q_i$$ parts ($$q_i$$ is given).
• There are $$m=5$$ pallets. We can load up to 5 bins onto a pallet. (Note: this implies each pallet will get exactly 5 bins.) The set of pallets is called $$j$$.
• We want to minimize the standard deviation of the number of parts loaded onto a pallet.

The question of how to model the standard deviation thing comes up now and then, so let’s see how we can model this simple example.

First we notice that we need to introduce some kind of binary assignment variable to indicate on which pallet a bin is loaded:

 $x_{i,j}=\begin{cases}1 & \text{if bin i is assigned to pallet j}\\0&\text{otherwise}\end{cases}$
A first model can look like:
 \bbox[lightcyan,10px,border:3px solid darkblue]{\begin{align}\min\>& \sqrt{\frac{\sum_j (p_j-\mu)^2}{m-1}}&&\text{minimize standard deviation}\\&\sum_j x_{i,j} = 1 \>\> \forall i && \text{assign bin i exactly once}\\&\sum_i x_{i,j} \le 5 \>\> \forall j && \text{at most 5 bins per pallet}\\& p_j = \sum_i q_i x_{i,j}&&\text{number of parts on a pallet}\\& \mu = \frac{\sum_j p_j}{m}&&\text{average number of parts on a pallet} \\& x_{i,j} \in \{0,1\} \end{align} }

The variables $$p_j$$ and $$\mu$$ are (unrestricted) continuous variables (to be precise: $$p_j$$ will be integer automatically).

The objective is complicated and would require an MINLP solver. We can simplify the objective as follows:

 $\min\>\sum_j (p_j-\mu)^2$

Now we have a MIQP model that can be solved with a number of good solvers.

It is possible to look at the spread in a different way (there is nothing sacred about the standard deviation) and come up with a linear objective:

 \begin{align}\min\>&p_{\text{max}}-p_{\text{min}}\\&p_{\text{max}} \ge p_j \>\>\forall j\\&p_{\text{min}} \le p_j \>\>\forall j \end{align}

This is the range of the $$p_j$$’s. In practice we might even try to use a simpler approach:

 \begin{align}\min\>&p_{\text{max}}\\&p_{\text{max}} \ge p_j \>\>\forall j \end{align}

This objective will also reduce the spread although in an indirect way. To be complete I also want to mention that I have seen cases where a quadratic objective of the form:

 $\min\>\sum_j p_j^2$

was used. Indirectly, this objective will also minimize the spread.

The original post [1] suggest to use as number of parts:

Obviously drawing from the Normal distribution will give us fractional values. In practice I would expect the number of parts to be a whole number. Of course if we would consider an other attribute such as weight, we could see fractional values. In the model per se, we don’t assume integer valued $$q_i$$’s so let’s stick with these numbers. We will see below this is actually quite a difficult data set (even though we only have 125 discrete variables, which makes the model rather small), and other data will make the model much easier to solve.

To reduce some symmetry in the model we can add:

 $p_j \ge p_{j-1}$

Not only will this help with symmetry, it also makes the variable $$p$$ easier to interpret in the solution. You can see this in the results below.

The two quadratic models were difficult to solve to optimality. I stopped after 1000 seconds and both were still working. Interestingly just minimizing $$\sum p_j^2$$ seems to get a somewhat better solution within the 1000 second time limit: it reduces the range from 0.1 to 0.052 and the standard deviation from 0.04 to 0.02.

The linear models solve to global optimality quickly and get better results.

By using a linear approximation we can reduce the range to 0.034 and 0.038 (and the standard deviation to 0.014 and 0.016). The model that minimizes $$p_{max}-p_{min}$$ seems to be the best performing: both the quality of the solution is very good and the solution time is the fastest (34 seconds).

This is an example of a model where MIQP solvers are not as fast as we want: they really fall behind their linear counterparts.

##### Optimality

Can we find the best standard deviation possible? We can use the following algorithm to give this a try:

1. Solve the linear model with objective $$\min p_{max}-p_{min}$$ to optimality.
2. Solve the quadratic model with objective $$\min \sum_j (p_j-\mu)^2$$ using MIPSTART.

This data set although small is still a big problem for state-of-the-art MIQP solvers around. Step 2 took hours on a small laptop (somewhat faster on a faster machine after throwing extra threads at it). This step did not improve the solution found in in step 1 but rather proved it has the best possible standard deviation. Still very poor performance for a very small model!

##### Dataset

The dataset we used turned out to be very difficult. If we use a different dataset for $$q_i$$: integer values from a uniform distribution, we get much better performance.

Using the following data:

 ----     35 PARAMETER q  bin1  13.000,    bin2  27.000,    bin3  21.000,    bin4  16.000,    bin5  16.000 bin6  14.000,    bin7  17.000,    bin8  27.000,    bin9  11.000,    bin10 20.000 bin11 30.000,    bin12 22.000,    bin13 30.000,    bin14 26.000,    bin15 12.000 bin16 23.000,    bin17 13.000,    bin18 15.000,    bin19 24.000,    bin20 19.000 bin21 17.000,    bin22 17.000,    bin23 12.000,    bin24 13.000,    bin25 22.000

the model with quadratic objective $$\min \sum_j (p_j-\mu)^2$$ solves this in just 0.2 seconds to optimality (see the solution below). This model has exactly the same size as before, just different data for $$q_i$$. This is one more proof that problem size (measured by number of variables and equations) is not necessarily a good indication of how difficult a problem is to solve.

##### References
1. LPSolve API, Minimize a function of constraints, https://stackoverflow.com/questions/46203974/lpsolve-api-minimize-a-function-of-constraints

## Thursday, September 7, 2017

### Quantiles with Pandas

In [1] I showed how quantiles in GAMS are difficult, and how they can be calculated better in R. Here is some Python code to do the same:

 import pandas as pd df = pd.read_csv("p.csv") q=df.groupby(['i','j']).quantile([0,.25,.5,.75,1]) print(q) q.to_csv("q.csv")

The q data frame looks like:

 Val i  j                 i1 j1 0.00  18.002966       0.25  28.242058       0.50  33.222936       0.75  62.736221       1.00  85.770764    j2 0.00   7.644259       0.25  41.375281       0.50  61.313381       0.75  82.127640       1.00  99.813645    j3 0.00  14.017667       0.25  16.559221       0.50  30.775334       0.75  38.482815       1.00  67.223932 i2 j1 0.00  11.938737       0.25  29.259331       0.50  55.029177       0.75  69.633259       1.00  83.258388    j2 0.00  16.857103       0.25  28.783298       0.50  53.358812       0.75  65.534761       1.00  87.373769    j3 0.00   5.608600       0.25  17.433855       0.50  33.311746       0.75  45.566497       1.00  64.926986

This is pretty clean. Data frames can easily be read in from CSV files (see the example above) or databases.

The new GAMS Python scripting tool is not very friendly in this respect. We need to do a lot of transformations leading to a low signal-to-noise ratio:

 embeddedCode Python:   import pandas as pd   p = list(gams.get('p', keyFormat=KeyFormat.FLAT))   df = pd.DataFrame(p, columns=["i", "j", "k", "value"])   print("");print("df");print(df)   q=df.groupby(['i','j']).quantile([0,.25,.5,.75,1])   print("q");print(q) # q has one multi-level index # convert to standard indices   q2 = q.reset_index()   print("q2");print(q2) # to extract a set q: convert to list of strings   qset = q2["level_2"].unique().astype('str').tolist()   print("qset");print(qset)   gams.set("q", qset) # get data itself: list of tuples   quantiles = list(zip(q2["i"],q2["j"],q2["level_2"].astype('str'),q2["value"]))   print("quantiles");print(quantiles)   gams.set("quantiles",quantiles) endEmbeddedCode q,quantiles display q,quantiles;

Note that I am trying to prevent looping over data frame rows here: all operations are on complete columns. The CSV input/output is actually much shorter and cleaner. In this code, there is really one line that does really some work; the rest can be considered as just overhead.

The GAMS interface should probably support data frames directly to overcome this “impedance mismatch.” When users need to mix and match different languages the interface should make things as easy as possible. A GAMS/Python interface that is too low level and stays too close to GAMS asks the user to be reasonably fluent in in both GAMS and Python (the intersection of knowledgeable GAMS and Python users is likely to be small), and write quite some glue-code dealing with getting data in and out. Choosing a better abstraction level would probably help here.

#### References

1. http://yetanothermathprogrammingconsultant.blogspot.nl/2017/08/quantiles.html

## Wednesday, September 6, 2017

### Book: Designing Data-Intensive Applications

Was reading this during a long flight. Interesting review of the design decisions behind applications dealing with much (very much!) data.

Martin Kleppmann, Designing Data-Intensive Applications, 2017.

##### Contents:

Foundations of Data Systems
Chapter 1 Reliable, Scalable, and Maintainable Applications
Reliability
Scalability
Maintainability
Chapter 2 Data Models and Query Languages
Relational Model Versus Document Model
Query Languages for Data
Graph-Like Data Models
Chapter 3 Storage and Retrieval
Data Structures That Power Your Database
Transaction Processing or Analytics?
Column-Oriented Storage
Chapter 4 Encoding and Evolution
Formats for Encoding Data
Modes of Dataflow
Distributed Data
Chapter 5 Replication
Problems with Replication Lag
Chapter 6 Partitioning
Partitioning and Replication
Partitioning of Key-Value Data
Partitioning and Secondary Indexes
Rebalancing Partitions
Request Routing
Chapter 7 Transactions
The Slippery Concept of a Transaction
Weak Isolation Levels
Serializability
Chapter 8 The Trouble with Distributed Systems
Faults and Partial Failures
Unreliable Networks
Unreliable Clocks
Knowledge, Truth, and Lies
Chapter 9 Consistency and Consensus
Consistency Guarantees
Linearizability
Ordering Guarantees
Distributed Transactions and Consensus
Derived Data
Chapter 10 Batch Processing
Batch Processing with Unix Tools
MapReduce and Distributed Filesystems
Beyond MapReduce
Chapter 11 Stream Processing
Transmitting Event Streams
Databases and Streams
Processing Streams
Chapter 12 The Future of Data Systems
Data Integration
Unbundling Databases
Aiming for Correctness
Doing the Right Thing

## Monday, September 4, 2017

### What is this EPS in a GAMS solution?

I get this question a lot.  The smallest example is model number 1 from the GAMS model library [1]: a tiny transportation model (originally from [2]). The linear programming (LP) model looks like:

 \begin{align} \min \>&z \\ &z = \sum_{i,j} c_{i,j} x_{i,j}&&\text{(cost)}\\ &\sum_j x_{i,j} \le a_i \>\forall i &&\text{(supply)}\\ &\sum_i x_{i,j} \ge b_j \>\forall j &&\text{(demand)}\\ &x_{i,j} \ge 0 \end{align}

The model has only two source nodes and three destination nodes and the solution looks like:

 LOWER          LEVEL          UPPER         MARGINAL ---- EQU cost                .              .              .             1.0000        cost  define objective function ---- EQU supply  observe supply limit at plant i                  LOWER          LEVEL          UPPER         MARGINAL seattle          -INF          350.0000       350.0000         EPS         san-diego        -INF          550.0000       600.0000          .          ---- EQU demand  satisfy demand at market j                 LOWER          LEVEL          UPPER         MARGINAL new-york       325.0000       325.0000        +INF            0.2250      chicago        300.0000       300.0000        +INF            0.1530      topeka         275.0000       275.0000        +INF            0.1260      ---- VAR x  shipment quantities in cases                           LOWER          LEVEL          UPPER         MARGINAL seattle  .new-york          .            50.0000        +INF             .          seattle  .chicago           .           300.0000        +INF             .          seattle  .topeka            .              .            +INF            0.0360      san-diego.new-york          .           275.0000        +INF             .          san-diego.chicago           .              .            +INF            0.0090      san-diego.topeka            .           275.0000        +INF             .                                     LOWER          LEVEL          UPPER         MARGINAL ---- VAR z                 -INF          153.6750        +INF             .            z  total transportation costs in thousands of dollars

There are 6 equations in this model and 7 variables. The objective function is modeled as an equality here as GAMS does not really have the notion of an objective function: it has an objective variable. The objective variable $$z$$ is a free variable i.e. it has bounds $$-\infty$$ and $$\infty$$. This is also a GAMS convention.

The dots indicate a zero value. In the cost equation we can see the lower- and upper-bound are the same (both zero) indicating an equality. Note that this equation is rewritten as $$z - \sum_{i,j} c_{i,j} x_{i,j} = 0$$ hence the bounds having a zero value.

The column marked MARGINAL are duals (for the rows) and reduced costs for the variables.

The row SUPPLY(‘SEATTLE’) has a marginal with a value of EPS. This means: this row is non-basic but with a zero dual. The EPS is used to signal this is a non-basic row: it the value was zero we could have deduced the row is basic. All basic rows and variables have a marginal that is zero. To verify we can count the basic and non-basic variables:

 ROW/COLUMN                  MARGINAL      BASIS-STATUS cost                           1.000         NB supply('seattle')                EPS         NB supply('san-diego')              .           B demand('new-york')             0.2250        NB demand('chicago')              0.1530        NB demand('topeka')               0.1260        NB x('seattle','new-york')          .           B x('seattle','chicago')           .           B x('seattle','topeka')          0.0360        NB x('san-diego','new-york')        .           B x('san-diego','chicago')       0.0090        NB x('san-diego','topeka')          .           B z                                .           B

All the variables and rows that have a marginal of zero are basic. There are 6 of them. This is identical to the number of equations. This conforms to what we would expect. The other 7 rows and columns have a non-zero marginal and are non-basic.

The occurrence of an EPS in the marginals indicates dual degeneracy. This indicates are multiple optimal solutions (to be precise: multiple optimal bases). Indeed here is an alternative solution found with a different LP solver:

 LOWER          LEVEL          UPPER         MARGINAL ---- EQU cost                .              .              .             1.0000        cost  define objective function ---- EQU supply  observe supply limit at plant i                  LOWER          LEVEL          UPPER         MARGINAL seattle          -INF          300.0000       350.0000          .          san-diego        -INF          600.0000       600.0000          .          ---- EQU demand  satisfy demand at market j                 LOWER          LEVEL          UPPER         MARGINAL new-york       325.0000       325.0000        +INF            0.2250      chicago        300.0000       300.0000        +INF            0.1530      topeka         275.0000       275.0000        +INF            0.1260      ---- VAR x  shipment quantities in cases                           LOWER          LEVEL          UPPER         MARGINAL seattle  .new-york          .              .            +INF            EPS         seattle  .chicago           .           300.0000        +INF             .          seattle  .topeka            .              .            +INF            0.0360      san-diego.new-york          .           325.0000        +INF             .          san-diego.chicago           .              .            +INF            0.0090      san-diego.topeka            .           275.0000        +INF             .                                     LOWER          LEVEL          UPPER         MARGINAL ---- VAR z                 -INF          153.6750        +INF             .            z  total transportation costs in thousands of dollars

The term primal degeneracy is used to indicate that a basic row or column is at its bound (usually they are in between bounds). We see an example of that in row SUPPLY(‘SEATTLE’) here: this row is basic but the value 600 is equal to the upper bound.  This solution is also dual degenerate: we have an EPS marginal for variable X(‘SEATTLE’,’NEW-YORK’).

##### Notes

Q: Can we enumerate all possible optimal solutions? Not so easy, but in [3] an interesting approach is demonstrated on this model.

Q: Can we make $$z$$ a positive variable? Well, GAMS is not very consistent here.
You can not declare ‘positive variable z’ as GAMS will complain: Objective variable is not a free variable. But we can assign: ‘z.lo = 0;’. Don’t ask me why. In general I just keep the variable unbounded.

Q: Can we used ranged equations in GAMS? No.
The listing file seems to indicate we can set lower and upper-bounds on rows. But this is not really true. If you would assign values say ‘supply.up(i) = INF;’, GAMS will ignore this and reset the bounds on the equations when generating the model.

Q: What about NLPs? Roughly the same. There is one major deviation: NLP solutions typically contain superbasic variables. These are non-basic (i.e. with a nonzero marginal) but are between their bounds.

Q: What about MIPs? By default GAMS will do the following after a MIP terminates: fix all discrete variables to their current level and resolve as an LP. So you will see marginals from this final LP. Marginals for a real MIP do not really exist (MIP nodes solved by say a dual simplex method have typically a number of discrete variables fixed by the branch-and-bound methods, cuts are added and heuristics are used also by modern MIP solvers).

Q: Can a free variable be non-basic? In almost all cases a free variable in an LP solution will be basic (a free basic variable will in general never leave the basis). However, it is possible for such a variable to be non-basic in which case it is usually zero.

Q: Why do we need this EPS? The purpose of this EPS thing is to allow GAMS to set up an advanced basis for subsequent solves. These subsequent solves will typically be faster when we use an advanced basis. This method is solver independent. E.g. to make Gurobi look good we can do:
option lp=cplex;
solve m using lp minimizing z;
option lp=gurobi;
solve m using lp minimizing z;
This even works with different models that share many variables and equations:
option lp=cplex;
solve m1 using lp minimizing z1;
option nlp=minos;
solve m2 using nlp minimizing z2;

Q: My solver does not have a basis-preserving presolve, so I have to choose between presolving and using an advanced basis. Yes, unfortunately that is much the case with modern solvers. You need to make a decision here, hopefully after doing some tests. If the restarts solve very fast (i.e. need few iterations), usually bypassing the presolver is not a big deal.

Q: What is the meaning of a marginal when the model is infeasible. I am not sure. Are they wrt to a Phase I, Phase 2 or some composite objective? I would say: this is not well-defined.

Q: Any other uses for EPS? Yes. GAMS stores everything in a sparse format so all zeros are not stored (the exception is a scalar which is stored even if zero). When exporting data sometimes we want explicit zeros to be exported. An example is to prevent a matrix to miss a row or column that has only zeros. We can assign an EPS for these cases. Many tools like GDXXRW or GDX2SQLITE can convert EPS values to physical 0.0 values.

##### References
1. https://www.gams.com/latest/gamslib_ml/libhtml/gamslib_trnsport.html
2. Dantzig, G. B., Linear Programming and Extensions. Princeton University Press, Princeton, New Jersey, 1963.
3. http://yetanothermathprogrammingconsultant.blogspot.com/2016/01/finding-all-optimal-lp-solutions.html

## Monday, August 28, 2017

In (1) a so-called Adaptive Plant Propagation or Adaptive Strawberry Algorithm is applied to a standard Economic Load Dispatch problem.

##### Abstract

Optimization problems in design engineering are complex by nature, often because of the involvement of critical objective functions accompanied by a number of rigid constraints associated with the products involved. One such problem is Economic Load Dispatch (ED) problem which focuses on the optimization of the fuel cost while satisfying some system constraints. Classical optimization algorithms are not sufficient and also inefficient for the ED problem involving highly nonlinear, and non-convex functions both in the objective and in the constraints. This led to the development of metaheuristic optimization approaches which can solve the ED problem almost efficiently. This paper presents a novel robust plant intelligence based Adaptive Plant Propagation Algorithm (APPA) which is used to solve the classical ED problem. The application of the proposed method to the 3-generator and 6-generator systems shows the efficiency and robustness of the proposed algorithm. A comparative study with another state-of-the-art algorithm (APSO) demonstrates the quality of the solution achieved by the proposed method along with the convergence characteristics of the proposed approach.

##### Problem

The economic dispatch problem used in this paper can be stated as a simple convex QP (Quadratic Programming) problem:

\bbox[lightcyan,10px,border:3px solid darkblue]{ \begin{align}\min\>&z = \sum_i (a_i + b_i x_i + c_i x_i^2)\\ & \sum_i x_i = D\\ & \ell_i \le x_i \le u_i\end{align} }

Basically with have $$n$$ generators that have a quadratic cost function w.r.t. to their output $$x_i$$. The constraints deal with meeting demand and operating limits.

Nitpick: the paper mentions: $$\ell_i < x_i < u_i$$ which is mathematically not really ok.

Opposed to what the abstract is saying this is not a non-convex problem, and standard QP algorithms will be able to solve this problem without any problem. There really is no good reason for using a heuristic for this problem.

##### Example 1

The first example is a tiny problem with three generators.

Note that the columns in table 1 are mislabeled. The first column represents $$c_i$$ while the last column is $$a_i$$.

The solution is reported as:

The solution calculated by hand seems correct (I guess the author was lucky here that no operating limits were binding). The heuristic gives a solution that is slightly better (lower cost) but that is because this solution is slightly infeasible: we don’t quite match the demand of 800 MW.

##### Example 2

This is another tiny problem with six generators.

The demand is 410 MW. The author shows two solutions:

These solutions are both wrong. The reported objective values (Cost per hour) should be an indication something is not right. The reason is that the author is confused about the (mislabeling of the) coefficients $$a_i$$, $$b_i$$, $$c_i$$ in table III. The large coefficients $$C_i$$ deal with the fixed cost while the small coefficients $$A_i$$ are for the quadratic terms.

The correct solution is:

A QP solver can solve this model to (proven) optimality in less than a second (Cplex takes 0.09 seconds).

##### Unit Commitment Problem

A more interesting problem is where we can turn generating units on and off. This is called the unit-commitment problem. Often exotic things like ramp-up and ramp-down limits, and other operational constraints, are added. These problems can actually become very difficult to solve, and heuristics pay traditionally an important role here. Having said that, there is somewhat of an increased interest to solve these models with MIP technology (after piecewise linear representation of the quadratic costs) due to substantial advances in state-of-the-art solvers. I have not seen many efforts to use MIQP algorithms here: often these linearized versions work quite well.

##### Conclusion

Writing a solid paper is not that easy.

## Monday, August 21, 2017

### Quantiles

Calculating quantiles is a standard way to summarize data. In the example below we have some small random data p(i,j,k) where we want to calculate quantiles at the 0, 25, 50, 75, and 100% level for each (i,j):

 ----      9 PARAMETER p                 k1          k2          k3          k4          k5          k6          k7          k8 i1.j1      18.003      84.483      55.487      30.813      29.929      23.181      35.633      85.771i1.j2       7.644      50.521      99.814      58.295      99.122      76.463      13.939      64.332i1.j3      16.792      25.758      67.224      44.100      36.610      35.793      14.018      15.860i2.j1      59.322      83.258      23.851      66.908      77.810      31.062      11.939      50.736i2.j2      16.857      87.374      27.246      29.296      59.802      72.549      63.197      46.916i2.j3      41.917      12.652      32.107       5.609      34.516      19.028      64.927      56.514 ----    143 PARAMETER quantiles                 0%         25%         50%         75%        100% i1.j1      18.003      28.242      33.223      62.736      85.771i1.j2       7.644      41.375      61.313      82.128      99.814i1.j3      14.018      16.559      30.775      38.483      67.224i2.j1      11.939      29.259      55.029      69.633      83.258i2.j2      16.857      28.783      53.359      65.535      87.374i2.j3       5.609      17.434      33.312      45.566      64.927

In GAMS this is not so easy. Here is one way:

 set i /i1*i2/    j /j1*j3/    k /k1*k8/    q /'0%', '25%', '50%', '75%', '100%'/;parameter p(i,j,k);p(i,j,k) = uniform(1,100);display p;parameter   v(k) 'vector to be sorted'   r(k) 'sorting rank'   perc0(q) /'0%' EPS, '25%' 25, '50%' 50, '75%' 75, '100%' 100 /   perc(q) 'percentiles'   quantiles(i,j,q);$libinclude rankloop((i,j), v(k) = p(i,j,k); perc(q) = perc0(q);$ libinclude rank v k r perc  quantiles(i,j,q) = perc(q););display quantiles;

We use here the rank utility by Tom Rutherford. A disadvantage is that it is designed to work on vectors. That means we need to loop to extract a k-vector. Under the hood a GDX file is written for each vector, a program called gdxrank is called, and another GDX file is written with the ranks of each vector element. For very large data sets this turns out to be very slow.

Note that in the data statement for perc0 we use EPS instead of 0. This on purpose: EPS will survive the trip to gdxrank, while a pure zero is not stored and not transmitted. If we would have used 0 instead of EPS we would have seen:

 ----    142 PARAMETER quantiles                25%         50%         75%        100% i1.j1      28.242      33.223      62.736      85.771i1.j2      41.375      61.313      82.128      99.814i1.j3      16.559      30.775      38.483      67.224i2.j1      29.259      55.029      69.633      83.258i2.j2      28.783      53.359      65.535      87.374i2.j3      17.434      33.312      45.566      64.927
##### How would this look in R?

R has the quantile function to calculate percentiles of a vector. We can use the by function to apply this on a slice of the data. This can look like:

> df

i  j  k     value

1  i1 j1 k1 18.002966

2  i1 j1 k2 84.483404

3  i1 j1 k3 55.487160

4  i1 j1 k4 30.812652

5  i1 j1 k5 29.929000

6  i1 j1 k6 23.181234

7  i1 j1 k7 35.633220

8  i1 j1 k8 85.770764

9  i1 j2 k1  7.644259

10 i1 j2 k2 50.520856

11 i1 j2 k3 99.813645

12 i1 j2 k4 58.294604

13 i1 j2 k5 99.122171

14 i1 j2 k6 76.462796

15 i1 j2 k7 13.938556

16 i1 j2 k8 64.332157

17 i1 j3 k1 16.792269

18 i1 j3 k2 25.757973

19 i1 j3 k3 67.223932

20 i1 j3 k4 44.100282

21 i1 j3 k5 36.610326

22 i1 j3 k6 35.792695

23 i1 j3 k7 14.017667

24 i1 j3 k8 15.860077

25 i2 j1 k1 59.322251

26 i2 j1 k2 83.258388

27 i2 j1 k3 23.850758

28 i2 j1 k4 66.907712

29 i2 j1 k5 77.809903

30 i2 j1 k6 31.062189

31 i2 j1 k7 11.938737

32 i2 j1 k8 50.736102

33 i2 j2 k1 16.857103

34 i2 j2 k2 87.373769

35 i2 j2 k3 27.246340

36 i2 j2 k4 29.295618

37 i2 j2 k5 59.801636

38 i2 j2 k6 72.549188

39 i2 j2 k7 63.196619

40 i2 j2 k8 46.915989

41 i2 j3 k1 41.917392

42 i2 j3 k2 12.651840

43 i2 j3 k3 32.107014

44 i2 j3 k4  5.608600

45 i2 j3 k5 34.516477

46 i2 j3 k6 19.027860

47 i2 j3 k7 64.926986

48 i2 j3 k8 56.513809

> q <- by(df$value,list(df$i,df$j),quantile) > q : i1 : j1 0% 25% 50% 75% 100% 18.00297 28.24206 33.22294 62.73622 85.77076 ------------------------------------------------------------------------ : i2 : j1 0% 25% 50% 75% 100% 11.93874 29.25933 55.02918 69.63326 83.25839 ------------------------------------------------------------------------ : i1 : j2 0% 25% 50% 75% 100% 7.644259 41.375281 61.313381 82.127640 99.813645 ------------------------------------------------------------------------ : i2 : j2 0% 25% 50% 75% 100% 16.85710 28.78330 53.35881 65.53476 87.37377 ------------------------------------------------------------------------ : i1 : j3 0% 25% 50% 75% 100% 14.01767 16.55922 30.77533 38.48282 67.22393 ------------------------------------------------------------------------ : i2 : j3 0% 25% 50% 75% 100% 5.60860 17.43385 33.31175 45.56650 64.92699  > We can compute all quantiles in a oneliner script. Although the results are correct, it would be nice if we can organize the results in a proper data frame for further processing. Here is my first attempt:  > cbind(unique(df[,1:2]), + matrix(unlist(q),ncol=5,byrow=T,dimnames=list(NULL,names(q[[1,1]]))) + ) i j 0% 25% 50% 75% 100% 1 i1 j1 18.002966 28.24206 33.22294 62.73622 85.77076 9 i1 j2 11.938737 29.25933 55.02918 69.63326 83.25839 17 i1 j3 7.644259 41.37528 61.31338 82.12764 99.81365 25 i2 j1 16.857103 28.78330 53.35881 65.53476 87.37377 33 i2 j2 14.017667 16.55922 30.77533 38.48282 67.22393 41 i2 j3 5.608600 17.43385 33.31175 45.56650 64.92699 The unique function populates the leftmost columns i and j. The unlist operation will make one long vector of the quantiles. We make the matrix of the correct size by using ncol=5. Finally, as the column names are lost in this operation, we re-establish them by dimnames. Unfortunately this is wrong as unique populates the columns i and j in a different order than by does. The easy fix is as follows:  > q <- by(df$value,list(df$j,df$i),quantile)> qdf <- cbind(unique(df[,1:2]),+          matrix(unlist(q),ncol=5,byrow=T,dimnames=list(NULL,names(q[[1,1]])))+       )> qdf    i  j        0%      25%      50%      75%     100%1  i1 j1 18.002966 28.24206 33.22294 62.73622 85.770769  i1 j2  7.644259 41.37528 61.31338 82.12764 99.8136517 i1 j3 14.017667 16.55922 30.77533 38.48282 67.2239325 i2 j1 11.938737 29.25933 55.02918 69.63326 83.2583933 i2 j2 16.857103 28.78330 53.35881 65.53476 87.3737741 i2 j3  5.608600 17.43385 33.31175 45.56650 64.92699

The change is in the order of list(df$j,df$i) passed on to by. Now unique and q have the same order.

##### Performance

If we change the data to:

 set i /i1*i50/    j /j1*j50/    k /k1*k50/    q /'0%', '25%', '50%', '75%', '100%'/;

we see the following timings:

• GAMS rank loop: 161 seconds
• R calculate quantiles: 1.5 seconds
• R convert results into a data frame: 0.08 seconds