Tuesday, July 18, 2017

Ubuntu Bash on Windows

Windows 10 has a beta feature: Bash.exe with an Ubuntu Linux subsystem (see (1) and (2)), also known as WSL (Windows Subsystem for Linux). This allows you to run Linux command line tools from within Windows. It is not the same as running Linux in a Virtual Machine (VM) using virtualization software like VirtualBox (3) or VMware (4). It is also different from Cygwin (5) or MINGW (6), which are Windows ports of GNU tools. With WSL, real, unmodified Ubuntu Linux binaries are executed.

image

LXSS diagram

There is a single Linux instance, so when invoking bash.exe several times, you talk to the same instance:


image


The main issue I noticed is that some networking commands (like ping) require bash.exe to be started as administrator.

References

  1. Jack Hammons, Bash on Ubuntu on Windows, https://msdn.microsoft.com/commandline/wsl
  2. Learning about Bash on Windows Subsystem for Linux, https://blogs.msdn.microsoft.com/commandline/learn-about-bash-on-windows-subsystem-for-linux/
  3. VirtualBox, https://www.virtualbox.org/
  4. VMware, https://www.vmware.com/
  5. Cygwin, https://www.cygwin.com/
  6. MINGW, Minimalist GNU for Windows, http://www.mingw.org/

Monday, July 17, 2017

Modeling flood damages

While working on an investment planning model to combat flood damages, I received results from a rainfall model that calculates damages caused by excessive rain and flooding. The picture the engineers produced looks like:

image 

This picture has the scenarios on the x-axis (sorted by damage) and the damages on the y-axis. It is very much like a load-duration curve in power generation.

For a more “statistical” picture we can use a standard histogram (after binning the data):

image

Gamma distribution

We can use standard techniques to fit a distribution. When considering a Gamma distribution (1), a simple approach is the method of moments. The mean and the variance of the Gamma distribution with parameters \(k\) and \(\theta\) are given by:

\[\begin{align} &E[X] = k \cdot \theta\\ & Var[X] = k \cdot \theta^2 \end{align} \]

Using sample mean \(\mu\) and standard deviation \(\sigma\), we can solve:

\[\begin{align} & k \cdot \theta = \mu\\ & \sqrt{k} \cdot \theta = \sigma \end{align} \]

This can be easily solved numerically and it actually seems to work:

image
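Note that dividing the first equation by the second gives \(\sqrt{k}=\mu/\sigma\), so this system even has a closed-form solution: \(k=(\mu/\sigma)^2\) and \(\theta=\sigma^2/\mu\). A minimal GAMS sketch (the random data is just a stand-in for the real damage data, which I cannot reproduce here):

set i 'observations' /i1*i1000/;
parameter x(i) 'damages (random stand-in for the real data)';
x(i) = uniform(0,100);

scalars mu 'sample mean', sigma2 'sample variance', k 'shape', theta 'scale';
mu     = sum(i, x(i))/card(i);
sigma2 = sum(i, sqr(x(i)-mu))/(card(i)-1);

* method of moments: k*theta = mu, k*theta^2 = sigma^2
k     = sqr(mu)/sigma2;
theta = sigma2/mu;
display mu, sigma2, k, theta;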

Weibull distribution

An alternative distribution that is sometimes suggested is the Weibull distribution (2).  The method of moments estimator for the Weibull distribution with parameters \(\lambda\) and \(k\) can be found by solving the system:

\[\begin{align} & \lambda \Gamma(1+1/k) = \mu\\ &\lambda \sqrt{\Gamma(1+2/k)-\left(\Gamma(1+1/k)\right)^2} = \sigma \end{align}\]
I was unable to get a solution from this system: solvers failed miserably on it.

An alternative approach would be to use an MLE (Maximum Likelihood Estimation) technique. This yields a system of equations:

\[\begin{align} & \lambda^k = \frac{1}{n}\sum_{i=1}^n x_i^k\\ &k^{-1} = \frac{\sum_{i=1}^n x_i^k \ln x_i}{\sum_{i=1}^n x_i^k} - \frac{1}{n}\sum_{i=1}^n \ln x_i \end{align}\]
Note that we can solve the second equation first for \(k\) and then calculate \(\lambda\) using the first equation. A solver like CONOPT will do this automatically by recognizing the triangular structure of this system. Note that our data set contains many \(x_i=0\). These are replaced by \(x_i=0.001\) so we can take the logarithm.
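Here is a minimal GAMS sketch of this MLE approach, solving the system as a square system (model type CNS) with CONOPT. The data is again a random stand-in; note the perturbation of the zeros:

set i 'observations' /i1*i1000/;
parameter x(i) 'damages (random stand-in for the real data)';
x(i) = max(uniform(-20,100),0);
* replace zeros so we can take logarithms
x(i)$(x(i)=0) = 0.001;

positive variables
   k      'shape'
   lambda 'scale';
equations
   eshape 'profile equation, determines k'
   escale 'recovers lambda once k is known';

eshape.. 1/k =e= sum(i, x(i)**k*log(x(i)))/sum(i, x(i)**k)
                 - sum(i, log(x(i)))/card(i);
escale.. lambda**k =e= sum(i, x(i)**k)/card(i);

* starting point and bounds (x**k needs x>0, k away from zero)
k.lo = 0.01;       k.l = 1;
lambda.lo = 0.001; lambda.l = 1;

model weibullMLE /eshape, escale/;
option cns = conopt;
solve weibullMLE using cns;
display k.l, lambda.l;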

This gives us a very similar picture:

image

References
  1. Gamma distribution, https://en.wikipedia.org/wiki/Gamma_distribution
  2. Weibull distribution, https://en.wikipedia.org/wiki/Weibull_distribution

Tuesday, July 11, 2017

Rectangles: no-overlap constraints

The question of how to formulate constraints that keep rectangles from overlapping comes up regularly (1). There are basically four cases to consider for two rectangles \(i\) and \(j\):

image

In the picture above, \((x_i,y_i)\) indicates the lower-left corner of a rectangle (these are typically decision variables). The height and width are \(h_i\) and \(w_i\) (these are typically constants). From the above, we can formulate the “no-overlap” constraints as:

\[
\bbox[lightcyan,10px,border:3px solid darkblue] {
\begin{align}
&x_i+w_i \le x_j  & \text{or} \\
&x_j+w_j \le x_i  & \text{or} \\
&y_i+h_i \le y_j  & \text{or} \\
&y_j+h_j \le y_i
\end{align}
}\]

The “or” condition can be modeled with the help of binary variables \(\delta\) and big-M constraints:

\[
\bbox[lightcyan,10px,border:3px solid darkblue] {
\begin{align}
&x_i+w_i \le x_j + M_1 \delta_{i,j,1}\\
&x_j+w_j \le x_i + M_2 \delta_{i,j,2}\\
&y_i+h_i \le y_j + M_3 \delta_{i,j,3}\\
&y_j+h_j \le y_i + M_4 \delta_{i,j,4}\\
&\sum_k \delta_{i,j,k} \le 3\\
& \delta_{i,j,k} \in \{0,1\}
\end{align}
}\]

The binary variables make sure at least one of the constraints is active (not relaxed).

If we have multiple rectangles we need to compare all combinations, i.e. we generate these constraints for all pairs \(i,j\). Except: we cannot compare a rectangle with itself. A first approach would be to generate the above constraints for all \(i\ne j\). This is correct, but we can apply an optimization: we actually compare things twice. If rectangles \(i\) and \(j\) are non-overlapping, we don’t need to check the pair \(j\) and \(i\). That means we only need constraints and variables \(\delta_{i,j,k}\) for \(i<j\).

It is important to make the constants \(M_k\) as small as possible. In general we have a good idea about reasonable values for these \(M_k\)’s. E.g., consider the case where we need to pack boxes in a container (2):

image

If the container has height and width \(H\) and \(W\), then we can easily see: \(M_1=M_2=W\) and \(M_3=M_4=H\).
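A minimal GAMS fragment for these no-overlap constraints (rectangle and container sizes are made up, and I leave out the packing objective):

set i 'rectangles' /r1*r5/;
alias (i,j);
set k 'disjunction terms' /k1*k4/;

scalars W 'container width' /10/, H 'container height' /6/;
parameters
   w(i) 'widths'  /r1 3, r2 4, r3 2, r4 5, r5 3/
   h(i) 'heights' /r1 2, r2 3, r3 4, r4 2, r5 3/;

positive variables x(i) 'lower-left x', y(i) 'lower-left y';
binary variable delta(i,j,k);

equations
   iLeftOfj(i,j), jLeftOfi(i,j), iBelowj(i,j), jBelowi(i,j), orterm(i,j);

iLeftOfj(i,j)$(ord(i)<ord(j)).. x(i)+w(i) =l= x(j) + W*delta(i,j,'k1');
jLeftOfi(i,j)$(ord(i)<ord(j)).. x(j)+w(j) =l= x(i) + W*delta(i,j,'k2');
iBelowj(i,j)$(ord(i)<ord(j)).. y(i)+h(i) =l= y(j) + H*delta(i,j,'k3');
jBelowi(i,j)$(ord(i)<ord(j)).. y(j)+h(j) =l= y(i) + H*delta(i,j,'k4');
orterm(i,j)$(ord(i)<ord(j)).. sum(k, delta(i,j,k)) =l= 3;

* stay inside the container; this also makes the big-Ms valid
x.up(i) = W - w(i);
y.up(i) = H - h(i);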

References
  1. How to write this constraint in LPSolve?, https://stackoverflow.com/questions/44858353/how-to-write-this-constraint-in-lp-solver-logical-and-and-doesnot-exist
  2. Filling up a container with boxes, http://yetanothermathprogrammingconsultant.blogspot.com/2016/06/filling-up-container-with-boxes.html

Monday, July 10, 2017

GLPK was used in the proof of the 400 year old Kepler conjecture

image

A lot of authors.

Apparently 43,078 LPs had to be solved.

References
  1. Thomas Hales e.a., A Formal Proof of the Kepler Conjecture, Forum of mathematics, Pi (2017), vol 5, e2, https://www.cambridge.org/core/services/aop-cambridge-core/content/view/78FBD5E1A3D1BCCB8E0D5B0C463C9FBC/S2050508617000014a.pdf/formal_proof_of_the_kepler_conjecture.pdf
  2. Original message appeared in the Help-GLPK mailing list,  https://lists.gnu.org/archive/html/help-glpk/2017-07/msg00001.html

Tuesday, June 27, 2017

Who needs Cplex or Gurobi: solving LPs using LU

The strange notion that linear programming problems can be solved by simply applying an LU decomposition (not as part of the Simplex method, but just LU) seems to be taking hold. The original idea is from (1), but now we can also use LU as a solver for “Fuzzy Linear Fractional Programming Problems” (2).


image 

I agree with the abstract: the proposed approach is simple.

References
  1. S.M.Chincole, A.P.Bhadane, LU Factorization Method To Solve Linear Programming Problem, International Journal of Emerging Technology and Advanced Engineering ,Volume 4, Issue 4, 176-180, http://www.ijetae.com/files/Volume4Issue4/IJETAE_0414_31.pdf. Warning: this method does not work except in some very special cases (e.g. all constraints binding in the optimal solution).
  2. S. Muruganandam and P. Ambika,  Solving Fuzzy Linear Fractional Programming Problem using LU Decomposition Method, Annals of Pure and Applied Mathematics, Vol. 13, No 1., 2017, pp 89-97.
  3. Solving LP by LU??, http://yetanothermathprogrammingconsultant.blogspot.com/2017/01/solving-lp-by-lu.html

Wednesday, June 21, 2017

Minimizing the k-th largest x

Minimizing the largest \(x_i\) is an easy exercise in LP modeling:
\[\bbox[lightcyan,10px,border:3px solid darkblue] { \begin{align} \min\>& z\\ & z \ge x_i \end{align}} \]
This is sometimes called MINIMAX.

What about minimizing the \(k^{\text{th}}\) largest value? Here is one possible MIP formulation:

\[\bbox[lightcyan,10px,border:3px solid darkblue] { \begin{align} \min\>& z\\ & z \ge x_i - \delta_i M\\ & \sum_i \delta_i = k-1\\ & \delta_i \in \{0,1\} \end{align}} \]

I.e., we allow \(k-1\) exceptions to the constraint \(z \ge x_i\). The objective will make sure the exceptions are the largest \(x_i\). We should think long and hard about making the big-\(M\)s as small as possible. If you have no clue about proper bounds on \(x_i\), indicator constraints are an alternative.
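A small GAMS experiment illustrates the formulation. Here the \(x_i\) are just random constants (in a real model they would be variables), so we can verify the result by sorting:

set i /i1*i25/;
scalar k 'which order statistic' /3/;
parameter x(i) 'random data';
x(i) = uniform(0,100);
scalar M 'big-M, valid because 0 <= x(i) <= 100' /100/;

variable z 'k-th largest value';
binary variable delta(i) 'exceptions';
equations ebound(i), ecount;

ebound(i).. z =g= x(i) - M*delta(i);
ecount..    sum(i, delta(i)) =e= k-1;

model kth /ebound, ecount/;
solve kth minimizing z using mip;
display x, z.l;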

Interestingly, minimizing the sum of the \(k\) largest can be modeled as a pure LP (1) :

\[\bbox[lightcyan,10px,border:3px solid darkblue] { \begin{align} \min\>& \sum_i v_i + k\cdot q\\ & v_i \ge x_i - q\\ &v_i \ge 0 \\ &q \text{ free } \end{align}} \]
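Continuing the little experiment from above (reusing the set i, the scalar k, and the data x), this LP is a few lines of GAMS:

variables sumk 'sum of the k largest', q 'free auxiliary';
positive variable v(i);
equations edefv(i), edefsum;

edefv(i).. v(i) =g= x(i) - q;
edefsum..  sumk =e= sum(i, v(i)) + k*q;

model sumklargest /edefv, edefsum/;
solve sumklargest minimizing sumk using lp;
display q.l, sumk.l;

At the optimum, \(q\) ends up at the \(k^{\text{th}}\) largest value (or anywhere between the \(k^{\text{th}}\) and \((k+1)^{\text{th}}\) largest).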

You would almost think there is an LP formulation for minimizing the \(k^{\text{th}}\) largest value. I don't see it, however.
References
  1. Comment by Michael Grant in http://orinanobworld.blogspot.se/2015/08/optimizingpartoftheobjectivefunction-ii.html.

Wednesday, June 14, 2017

Modeling production runs with length exactly three

In (1) the question was posed how to model production runs that have an exact length. Although the question was asked in terms of integer variables, this is much easier to deal with when we have binary variables. Let’s introduce a binary variable:

\[x_{t} = \begin{cases}1 & \text{when the unit is on in period $t$}\\0 &\text{otherwise}\end{cases}\]

The question was formulated as an implication:

\[x_t=1 \implies x_{t+1}=x_{t+2}=1\]

This implication is not 100% correct: applied repeatedly, it would keep \(x\) turned on forever. But we understand what the poster meant: if a unit is turned on, it should stay on for exactly three periods.

image

This condition can be more correctly stated as the implication:

\[x_{t-1}=0 \text{ and } x_{t}=1 \implies x_{t+1}=x_{t+2}=1 \text{ and } x_{t+3}=0\]

We can make linear inequalities out of this as follows:

\[\bbox[lightcyan,10px,border:3px solid darkblue]{\begin{align}&x_{t+1}\ge x_{t}-x_{t-1}\\&x_{t+2}\ge x_{t}-x_{t-1}\\&1-x_{t+3}\ge x_{t}-x_{t-1}\end{align}}\]

If we want to model the condition “a unit should stay on for at least three periods”, we can drop the last constraint and just keep:

\[\begin{align}&x_{t+1}\ge x_{t}-x_{t-1}\\&x_{t+2}\ge x_{t}-x_{t-1}\end{align}\]
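In GAMS the exactly-three version can be written as the fragment below. Leads and lags that fall outside the horizon are dropped (the variable is treated as zero), which amounts to assuming the unit is off before and after the planning window:

set t 'periods' /t1*t12/;
binary variable x(t) 'unit is on';
equations on1(t), on2(t), off3(t);

* a start (x(t)=1, x(t-1)=0) forces the next two periods on
on1(t)..  x(t+1) =g= x(t) - x(t-1);
on2(t)..  x(t+2) =g= x(t) - x(t-1);
* and the period after that off again
off3(t).. 1 - x(t+3) =g= x(t) - x(t-1);

* side effect: a run cannot start in the last two periods,
* since the terms x(t+1), x(t+2) vanish there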


References
  1. https://stackoverflow.com/questions/44496473/block-of-consecutive-variables-to-have-same-value-in-mixed-integer-linear-progra

Friday, June 9, 2017

A Staffing Problem

In (1) a "simple problem" is stated (problems are rarely as simple as they seem):

For the next 18 weeks there is some data about demand for staffing resources:

image

Actually we don’t need the data for the individual projects: just the totals. It is noted that the maximum number of “bodies” we need is 52 (in week 3).

We start with 48 temps available in the first period. We can let a temp staffer go, and we can hire new ones. However, a newly hired staffer needs 10 days (2 weeks) of training, during which he or she is not productive. We can also keep a staffer idling for a few periods.

There is a difference between idling and training, as shown here:

image

I will assume that hiring + training is slightly more expensive than keeping someone idle: it makes sense to assign some cost to the hiring and training process. This cost was not in the original post, but I have convinced myself it is a reasonable assumption.

To model this I introduce two sets of binary variables:

\[\begin{align}
& r_{i,t} = \begin{cases} 1 & \text{if a staffer $i$ is available for training or work during period $t$}\\
0 & \text{otherwise} \end{cases} \\
& h_{i,t} = \begin{cases} 1 & \text{if a staffer $i$ is hired at the beginning of period $t$}\\
0 & \text{otherwise} \end{cases}
\end{align}\]

We can link \(h_{i,t}\) and \(r_{i,t}\) as follows:

\[h_{i,t}\ge r_{i,t}-r_{i,t-1}\]

This implements the implication:

\[r_{i,t-1}=0 \text{ and } r_{i,t}=1 \implies h_{i,t}=1\]

that is: if \(r\) changes from 0 to 1, we have a new hire. We will add a (small) hiring cost to the objective, so we don’t need constraints that enforce the reverse implication:

\[ r_{i,t-1}=1 \text{ or } r_{i,t}=0 \implies h_{i,t}=0\]

There is one wrinkle here: we need to make sure that the 48 staffers already on board can work in the first period without being counted as new hires. We can model this explicitly:

\[h_{i,t}\ge \begin{cases} r_{i,t}-r_{i,t-1} &\text{if $t>1$}\\
r_{i,t}&\text{if $t=1$ and $i>48$}\end{cases}\]

Actually in the GAMS model below I approached this slightly differently, but with the same net result.

The main equation in the model is to make sure we have enough staffing in each period \(t\). This can be modeled as:

\[\sum_i \left( r_{i,t} - h_{i,t} - h_{i,t-1}\right) \ge \mathit{demand}_t \]
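A GAMS sketch of these two key equations, with made-up demand data (the actual model, shown below, is organized slightly differently but has the same net result):

set i 'staffers' /s1*s52/, t 'weeks' /w1*w18/;
parameter demand(t) 'total demand (made-up numbers)';
demand(t) = round(uniform(30,46));
demand('w3') = 52;

parameter rinit(i,t) 'staff already hired at the start';
rinit(i,'w1')$(ord(i) <= 48) = 1;

binary variables
   r(i,t) 'available for training or work'
   h(i,t) 'hired at the beginning of period t';
variable cost;

equations ehire(i,t), estaff(t), ecost;

* a hire is a 0->1 transition of r; rinit exempts the initial staff
ehire(i,t).. h(i,t) =g= r(i,t) - r(i,t-1) - rinit(i,t);
* staff in their first two weeks are in training and not productive
estaff(t).. sum(i, r(i,t) - h(i,t) - h(i,t-1)) =g= demand(t);
* pay for availability plus a small cost per hire
ecost..     cost =e= sum((i,t), r(i,t) + 0.1*h(i,t));

model staffing /ehire, estaff, ecost/;
solve staffing minimizing cost using mip;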

The complete model can look like:

image

Notes:

  • \(\mathit{rinit}_{i,t}\) is used to indicate initial staffing when we start period 1. It is sparse: it assumes the value 1 only if \(t=1\) and \(i \le 48\).
  • This allows us to write equation EHire as one single, clean equation.
  • Note that GAMS will assume a variable is zero outside its domain:  \(r_{i,t-1}\) is zero when \(t=1\).
  • I used the maximum demand (52 in week 3) to dimension the problem: set \(i\) has 52 members.
  • The model does not decide which workers are idle. We can see in each period how many workers we have and how many are in training. The remaining ones are working or sitting idle.
  • We reuse resource numbers \(i\). In the results below worker \(i=21\) is likely to be two different persons.

The results look like:

image

 image

Here a green cell with code=2 means hired and working (or idle, but not in training). A red cell with code=1 means in training.  The row “total” is the number of staffers (either working, idling or training). The row “Work” indicates the number of workers available (not in training). We are overshooting demand a bit: some workers are idling.

A picture is always a good idea, so here we see demand vs. staffing available for work.

image

We see two places where we increase the workforce: we hire in week 1 to handle demand in week 3, and we hire again in week 7 to deal with demand in week 9.

References

Wednesday, June 7, 2017

Minimum down- and up-time

In machine scheduling models we sometimes want to impose minimum up-time and minimum down-time restrictions. E.g., from (1):

My question is how do I add in additional constraints that if the factory switches off then its needs to stay off for 3 months, and if it switches back on then it needs to stay on for 4 months?

One possible solution is the following. Let us define our binary decision variable by

\[x_{i,t} = \begin{cases}1 & \text{if factory $i$ is operating in month $t$} \\
  0&\text{otherwise} \end{cases}\]
Method 1

We want to forbid short run patterns such as 010 (i.e. off-on-off), 0110, 01110 and 101, 1001. Forbidding the patterns 010, 0110, and 01110 ensures a factory stays up at least 4 consecutive periods. Disallowing 101 and 1001 makes sure a down time lasts at least three months. We can model these restrictions in a linear fashion as follows (2):

  • Forbid 010: \(-x_{i,t}+x_{i,t+1}-x_{i,t+2}\le 0\)
  • Forbid 0110: \(-x_{i,t}+x_{i,t+1}+x_{i,t+2}-x_{i,t+3}\le 1\)
  • Forbid 01110: \(-x_{i,t}+x_{i,t+1}+x_{i,t+2}+x_{i,t+3}-x_{i,t+4}\le 2\)
  • Forbid 101: \(x_{i,t}-x_{i,t+1}+x_{i,t+2}\le 1\)
  • Forbid 1001: \(x_{i,t}-x_{i,t+1}-x_{i,t+2}+x_{i,t+3}\le 1\)
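These pattern cuts are easy to state in GAMS. Leads that run past the end of the horizon are simply dropped, which amounts to assuming the factory is off after the last month:

set i 'factories' /f1*f3/, t 'months' /m1*m12/;
binary variable x(i,t) 'factory i operates in month t';
equations
   forbid010(i,t), forbid0110(i,t), forbid01110(i,t),
   forbid101(i,t), forbid1001(i,t);

forbid010(i,t)..   - x(i,t) + x(i,t+1) - x(i,t+2) =l= 0;
forbid0110(i,t)..  - x(i,t) + x(i,t+1) + x(i,t+2) - x(i,t+3) =l= 1;
forbid01110(i,t).. - x(i,t) + x(i,t+1) + x(i,t+2) + x(i,t+3) - x(i,t+4) =l= 2;
forbid101(i,t)..     x(i,t) - x(i,t+1) + x(i,t+2) =l= 1;
forbid1001(i,t)..    x(i,t) - x(i,t+1) - x(i,t+2) + x(i,t+3) =l= 1;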
Method 2

A different approach is as follows. First define binary variables:

\[ \begin{align} &\delta^{\mathit{on}}_{i,t} = \begin{cases}1 & \text{if factory $i$ is turned on in month $t$} \\
  0&\text{otherwise} \end{cases}\\ &\delta^{\mathit{off}}_{i,t} = \begin{cases}1 & \text{if factory $i$ is turned off in month $t$} \\
  0&\text{otherwise} \end{cases} \end{align} \]

Mathematically we write this as:

\[ \begin{align} &\delta^{\mathit{on}}_{i,t} = (1-x_{i,t-1})\cdot x_{i,t}\\ &\delta^{\mathit{off}}_{i,t} = x_{i,t-1}\cdot (1-x_{i,t})\\ \end{align} \]

We can linearize these non-linear equations by:

\[ \begin{align} &\delta^{\mathit{on}}_{i,t} \le 1-x_{i,t-1}\\ &\delta^{\mathit{on}}_{i,t} \le x_{i,t}\\ &\delta^{\mathit{on}}_{i,t} \ge x_{i,t}-x_{i,t-1}\\ &\delta^{\mathit{off}}_{i,t} \le x_{i,t-1}\\ &\delta^{\mathit{off}}_{i,t} \le 1-x_{i,t}\\ &\delta^{\mathit{off}}_{i,t} \ge x_{i,t-1}-x_{i,t}\\ \end{align} \]

With these variables we can implement the implications:

\[ \begin{align} &\delta^{\mathit{on}}_{i,t} = 1 \implies x_{i,t} + x_{i,t+1} + x_{i,t+2} + x_{i,t+3} = 4 \\ &\delta^{\mathit{off}}_{i,t} = 1 \implies x_{i,t} + x_{i,t+1} + x_{i,t+2} = 0 \end{align} \]

This can be linearized as:

\[ \begin{align} & x_{i,t+1} + x_{i,t+2} + x_{i,t+3} \ge 3 \delta^{\mathit{on}}_{i,t} \\ & x_{i,t+1} + x_{i,t+2} \le 2 (1-\delta^{\mathit{off}}_{i,t}) \end{align} \]

Note that I dropped \(x_{i,t}\) in both inequalities. These are already known from the definition of \(\delta^{\mathit{on}}_{i,t}\) and \(\delta^{\mathit{off}}_{i,t}\).

In the comments it is mentioned by Rob Pratt that we can strengthen this a bit (the math does not look very good in a comment, so I repeat it here):

\[\begin{align} & x_{i,t+1} \ge \delta^{\mathit{on}}_{i,t}\\
& x_{i,t+2} \ge \delta^{\mathit{on}}_{i,t}\\
& x_{i,t+3} \ge \delta^{\mathit{on}}_{i,t}\\
& x_{i,t+1} \le 1-\delta^{\mathit{off}}_{i,t} \\
& x_{i,t+2} \le 1-\delta^{\mathit{off}}_{i,t} \end{align}\]
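A GAMS rendering of method 2 with this strengthening can look like the fragment below (with i, t, and x(i,t) as in the method 1 fragment; again, out-of-horizon leads and lags drop out, i.e. the factory is assumed off before month 1 and after the last month):

binary variables
   don(i,t)  'turned on in month t'
   doff(i,t) 'turned off in month t';

equations
   don1(i,t), don2(i,t), don3(i,t)
   doff1(i,t), doff2(i,t), doff3(i,t)
   up1(i,t), up2(i,t), up3(i,t)
   dn1(i,t), dn2(i,t);

* linearization of don = (1-x(t-1))*x(t)
don1(i,t)..  don(i,t) =l= 1 - x(i,t-1);
don2(i,t)..  don(i,t) =l= x(i,t);
don3(i,t)..  don(i,t) =g= x(i,t) - x(i,t-1);
* linearization of doff = x(t-1)*(1-x(t))
doff1(i,t).. doff(i,t) =l= x(i,t-1);
doff2(i,t).. doff(i,t) =l= 1 - x(i,t);
doff3(i,t).. doff(i,t) =g= x(i,t-1) - x(i,t);
* minimum up-time of 4: after turning on, stay on 3 more months
up1(i,t).. x(i,t+1) =g= don(i,t);
up2(i,t).. x(i,t+2) =g= don(i,t);
up3(i,t).. x(i,t+3) =g= don(i,t);
* minimum down-time of 3: after turning off, stay off 2 more months
dn1(i,t).. x(i,t+1) =l= 1 - doff(i,t);
dn2(i,t).. x(i,t+2) =l= 1 - doff(i,t);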

References
  1. How do I add a constraint to keep a factory switched on or off for a certain period of time in PuLP? https://stackoverflow.com/questions/44281389/how-do-i-add-a-constraint-to-keep-a-factory-switched-on-or-off-for-a-certain-per/44293592
  2. Integer cuts, http://yetanothermathprogrammingconsultant.blogspot.com/2011/10/integer-cuts.html

Friday, May 26, 2017

Working on an R package….

The C++ code (parsing Excel formulas so we can for instance execute them) is not working 100% correctly as of now…

image

Time to make it work under a debugger (amazingly I did not need a debugger until now).

Update: found it (without debugger). Dereferencing a null pointer. Is it unreasonable to expect a (descriptive) exception to be raised when this happens?

Sunday, May 21, 2017

Indexing economic time series in R

When we want to compare different (economic) data series, an often-used approach is indexing. We choose one year (often the beginning of the time series) as the base year. We then normalize each time series such that its value at the base year is 100. Or:

\[\hat{x}_t = 100 \frac{x_t}{x_0}\]

When doing this in R, it is interesting to see that the implementation depends very much on the chosen data structure.

Below I consider three different data structures: the time series are stored (1) row-wise in a matrix, (2) column-wise in a matrix, and (3) unordered in a long-format data frame.

Matrix: row wise

Below is some (artificial) data organized as a matrix with the rows being the series. We have three series: a, b and c. The columns represent the years. Row and column names are used to make this visible:

str(A)

##  num [1:3, 1:20] 67.235 4743.289 0.871 64.69 5006.955 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3] "a" "b" "c"
##   ..$ : chr [1:20] "2011" "2012" "2013" "2014" ...

A

##           2011         2012         2013         2014         2015
## a   67.2353402   64.6901151   63.6518902   69.1449240   71.8494470
## b 4743.2887884 5006.9547226 5623.0654046 5912.6498320 5736.7239960
## c    0.8710604    0.9193449    0.8711697    0.8451556    0.8440797
##           2016         2017         2018         2019         2020
## a   77.2782169   75.5709375   83.1405320   82.0417121   85.6433929
## b 6302.6091905 6241.5527226 6140.4721640 5542.9254530 5627.5502851
## c    0.8496936    0.8739366    0.8955205    0.8681451    0.8202612
##           2021         2022         2023         2024         2025
## a   89.1381161   89.8803998   90.5211813    91.411429   94.4488615
## b 6006.0691966 5998.6549344 6121.5326208  6378.963851 6628.1064720
## c    0.8639629    0.8477383    0.8235255     0.887063    0.8998234
##           2026         2027         2028         2029         2030
## a   94.2000739   89.9766167   90.4627291   87.2632337   86.2154671
## b 7115.7111566 6410.3639302 7038.3387976 6765.7867813 7076.7998102
## c    0.8547652    0.9033014    0.9224435    0.9127362    0.9541839

To index this we can use the following R code:

Aindx <- 100*A/A[,1]
Aindx

##   2011      2012      2013      2014      2015      2016     2017     2018
## a  100  96.21445  94.67029 102.84015 106.86262 114.93690 112.3976 123.6560
## b  100 105.55872 118.54782 124.65296 120.94402 132.87425 131.5870 129.4560
## c  100 105.54318 100.01254  97.02606  96.90254  97.54703 100.3302 102.8081
##        2019      2020      2021      2022      2023     2024     2025
## a 122.02171 127.37854 132.57628 133.68029 134.63334 135.9574 140.4750
## b 116.85827 118.64237 126.62247 126.46615 129.05671 134.4840 139.7365
## c  99.66531  94.16812  99.18518  97.32255  94.54286 101.8371 103.3021
##        2026     2027     2028     2029     2030
## a 140.10500 133.8234 134.5464 129.7877 128.2294
## b 150.01640 135.1460 148.3852 142.6391 149.1961
## c  98.12926 103.7013 105.8989 104.7845 109.5428

Note that the expression 100*A/A[,1] is not as trivial as it seems. We divide a \(3 \times 20\) matrix by a vector of length 3. The division is done element-wise and column-by-column. We sometimes say the elements of A[,1] are recycled.

The recycling mechanism can be illustrated with a small example:

c(1,2,3,4)+c(1,2)

## [1] 2 4 4 6

I try to have a picture in each post, so here we go:

image

Matrix: column wise

If the matrix is organized column-wise (e.g. by taking the transpose), we have:

A

##             a        b         c
## 2011 67.23534 4743.289 0.8710604
## 2012 64.69012 5006.955 0.9193449
## 2013 63.65189 5623.065 0.8711697
## 2014 69.14492 5912.650 0.8451556
## 2015 71.84945 5736.724 0.8440797
## 2016 77.27822 6302.609 0.8496936
## 2017 75.57094 6241.553 0.8739366
## 2018 83.14053 6140.472 0.8955205
## 2019 82.04171 5542.925 0.8681451
## 2020 85.64339 5627.550 0.8202612
## 2021 89.13812 6006.069 0.8639629
## 2022 89.88040 5998.655 0.8477383
## 2023 90.52118 6121.533 0.8235255
## 2024 91.41143 6378.964 0.8870630
## 2025 94.44886 6628.106 0.8998234
## 2026 94.20007 7115.711 0.8547652
## 2027 89.97662 6410.364 0.9033014
## 2028 90.46273 7038.339 0.9224435
## 2029 87.26323 6765.787 0.9127362
## 2030 86.21547 7076.800 0.9541839

The expression to index the series now becomes much more complicated:

Aindx <- 100*A/rep(A[1,],each=nrow(A))
Aindx

##              a        b         c
## 2011 100.00000 100.0000 100.00000
## 2012  96.21445 105.5587 105.54318
## 2013  94.67029 118.5478 100.01254
## 2014 102.84015 124.6530  97.02606
## 2015 106.86262 120.9440  96.90254
## 2016 114.93690 132.8742  97.54703
## 2017 112.39764 131.5870 100.33019
## 2018 123.65600 129.4560 102.80808
## 2019 122.02171 116.8583  99.66531
## 2020 127.37854 118.6424  94.16812
## 2021 132.57628 126.6225  99.18518
## 2022 133.68029 126.4662  97.32255
## 2023 134.63334 129.0567  94.54286
## 2024 135.95741 134.4840 101.83714
## 2025 140.47503 139.7365 103.30206
## 2026 140.10500 150.0164  98.12926
## 2027 133.82340 135.1460 103.70135
## 2028 134.54640 148.3852 105.89891
## 2029 129.78775 142.6391 104.78448
## 2030 128.22939 149.1961 109.54279

In this case the automatic recycling does not work the way we want, and we have to do it by hand. Basically, in terms of our little example: before, we were happy with c(1,2) being extended automatically to c(1,2,1,2), while now we need something like c(1,1,2,2).

Data frame long format

Often data comes in a “long” format. Here is a picture to illustrate the difference between a “wide” and a “long” format:

image

Wide format data often comes from spreadsheets, while the long format is often used in databases. The operation to convert from long to wide is sometimes called “pivot” (and the reverse “unpivot”).

A long format data frame with the above data can look like (I show the first part only):

head(df)

##   series year        value
## 1      a 2011   67.2353402
## 2      b 2011 4743.2887884
## 3      c 2011    0.8710604
## 4      a 2012   64.6901151
## 5      b 2012 5006.9547226
## 6      c 2012    0.9193449

How can we index this?

Here is my solution:

# get first year
y0 <- min(df$year)
y0

## [1] 2011

# get values at first year
x0 <- df[df$year==y0,"value"]
x0

## [1]   67.2353402 4743.2887884    0.8710604

# allow x0 to be indexed by series name
names(x0) <- df[df$year==y0,"series"]
x0

##            a            b            c
##   67.2353402 4743.2887884    0.8710604

# indexing of the series
df$indexedvalue <- 100*df$value/x0[df$series]
head(df)

##   series year        value indexedvalue
## 1      a 2011   67.2353402    100.00000
## 2      b 2011 4743.2887884    100.00000
## 3      c 2011    0.8710604    100.00000
## 4      a 2012   64.6901151     96.21445
## 5      b 2012 5006.9547226    105.55872
## 6      c 2012    0.9193449    105.54318

The trick I used was to make the vector of values of the first year addressable by the series name. E.g.:

x0["a"]

##        a
## 67.23534

This allows us to calculate the column with indexed values in one vectorized operation.

dplyr

In the comments below, Ricardo Sanchez offered another, rather clean, approach for the last operation:

library(dplyr)
df <- df %>%
      group_by(series) %>%
      arrange(year) %>%
      mutate(indexedvalue = 100 * value / first(value))
df

## Source: local data frame [60 x 4]
## Groups: series [3]
##
##    series  year        value indexedvalue
##    <fctr> <int>        <dbl>        <dbl>
## 1       a  2011   67.2353402    100.00000
## 2       b  2011 4743.2887884    100.00000
## 3       c  2011    0.8710604    100.00000
## 4       a  2012   64.6901151     96.21445
## 5       b  2012 5006.9547226    105.55872
## 6       c  2012    0.9193449    105.54318
## 7       a  2013   63.6518902     94.67029
## 8       b  2013 5623.0654046    118.54782
## 9       c  2013    0.8711697    100.01254
## 10      a  2014   69.1449240    102.84015
## # ... with 50 more rows

sqldf

Of course, if you are familiar with SQL, you can also use that:

library(sqldf)
df <-  sqldf("
         select df.series,year,value,100*value/v0 as indexedvalue
         from df
         join (select min(year),value as v0, series
             from df
             group by series) df0
             on df.series = df0.series
       "
)
head(df)

##   series year        value indexedvalue
## 1      a 2011   67.2353402    100.00000
## 2      b 2011 4743.2887884    100.00000
## 3      c 2011    0.8710604    100.00000
## 4      a 2012   64.6901151     96.21445
## 5      b 2012 5006.9547226    105.55872
## 6      c 2012    0.9193449    105.54318

References
  1. Federal Reserve Bank of Dallas, Indexing to a Common Starting Point, https://www.dallasfed.org/research/basics/indexing.aspx

Saturday, May 20, 2017

Journalist explaining statistics

I would call this explanation, well, below average:

MATH, HORRIBLE MATH
[…]
Take that bit about the bell curve of IQ. It’s an unpleasant fact that half of all people are of below average IQ. It’s also true that half of all people are below average height, weight, and everything else. And the other half are above average. You know why? Because that’s what “average” means.
The italics are in the original text (they are not mine). Maybe alternative facts include alternative definitions of statistical terms: half of all people are below the median; for skewed quantities like weight, that is not the same as the average. It is ironic that the title of the section is about bad math.
References
  1. Jonah Goldberg, Can Trump be contained?, http://www.nationalreview.com/g-file/447797/donald-trump-robert-mueller-special-counsel-investigation-social-justice-math

Thursday, May 18, 2017

Simple piecewise linear problem, not so easy with binary variables

The following picture illustrates the problem:

image

The blue line is what we want to model:

\[\bbox[lightcyan,10px,border:3px solid darkblue]{
   y = \begin{cases}
    0 & \text{if $0 \le x \le a$}\\
   (x-a)\displaystyle\frac{H}{b-a} & \text{if $a < x < b$}\\
   H & \text{if $x\ge b$}\end{cases}}\]
 

Is there much we can exploit here, from this simple structure? I don’t believe so, and came up with:

\[\begin{align}
&x_1 \le a \delta_1\\
&a \delta_2 \le x_2 \le b \delta_2\\
&b \delta_3 \le x_3 \le U \delta_3\\
&\delta_1+\delta_2+\delta_3 = 1\\
&x = x_1+x_2+x_3\\
&y = (x_2 - a \delta_2)\displaystyle\frac{H}{b-a} + H \delta_3 \\
&\delta_k \in \{0,1\}\\
&x_i \ge 0\\
&0 \le x \le U
\end{align}\]
Update

Added missing \(\sum_k \delta_k=1\), see comments below.
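A small GAMS test of the formulation (made-up values for \(a\), \(b\), \(H\), \(U\); \(x\) is fixed so we can check \(y\) by hand: \(y=(3.5-2)\cdot 10/3 = 5\)):

scalars a /2/, b /5/, H /10/, U /8/;

binary variables d1, d2, d3;
positive variables x1, x2, x3;
variables x, y;

equations seg1, seg2lo, seg2up, seg3lo, seg3up, pick, defx, defy;

seg1..   x1 =l= a*d1;
seg2lo.. x2 =g= a*d2;
seg2up.. x2 =l= b*d2;
seg3lo.. x3 =g= b*d3;
seg3up.. x3 =l= U*d3;
pick..   d1 + d2 + d3 =e= 1;
defx..   x =e= x1 + x2 + x3;
defy..   y =e= (x2 - a*d2)*(H/(b-a)) + H*d3;

x.fx = 3.5;
model pw /all/;
solve pw minimizing y using mip;
display y.l;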

Wednesday, May 17, 2017

RStudio Tips and Tricks

From: https://youtu.be/kuSQgswZdr8

and then I read this:

Why R is Bad for You

Summary: Someone had to say it.  In my opinion R is not the best way to learn data science and not the best way to practice it either.  More and more large employers agree.

References
  1. Sean Lopp, RStudio Tips and Tricks, https://www.youtube.com/watch?v=kuSQgswZdr8&t=85s. Delivered by Sean Lopp (RStudio) at the 2017 New York R Conference on April 21st and 22nd at Work-Bench.
  2. William Vorhies, Why R is Bad for You, http://www.datasciencecentral.com/profiles/blogs/why-r-is-bad-for-you

Small GAMS trick: $eval

In GAMS we typically first declare sets and then deduce things like the number of elements in a set:

set i /i1*i20/;
scalar n;
n = card(i);

Often the question comes up: can I do it the other way around? First declare the scalar n and then create a set of that size? We can use a funny trick for that, using a variant of the $set construct:

scalar n /20/;
$eval n n
set i /i1*i%n%/;

In general I prefer to write this as:

$set n 20
scalar n /%n%/;
set i /i1*i%n%/;

(a little bit less of a “surprise factor” is probably good).

Notes:

  • The assignment n = card(i) is done at execution time. All other operations are at compile time. In some more advanced cases this can be an important distinction.
  • The construct $eval n n is like a $set. It will evaluate the rightmost n and use the result to populate the preprocessor identifier (the leftmost n).
  • To make sure a preprocessor identifier holds a numeric value, you can use scalar N /%id%/.  GAMS will complain if the macro contains a string that does not correspond to a number.

Sunday, May 7, 2017

Semi-Continuous Variables and the Solution Pool

In this post I consider the interaction between semi-continuous variables and the solution pool in Cplex and Gurobi. I received a question about this, and I had to try things out before I could give a definitive answer.

The solution pool can enumerate (some or all) different integer solutions. A difference in continuous variables is not recognized by the solution pool. A semi-continuous variable is somewhat of a hybrid (1): it is a continuous variable that assumes either the value \(x=0\) or a value between bounds: \(x \in [\ell,u]\). How does the solution pool deal with this?

To test this out, we construct a tiny knapsack-like problem but with semi-continuous variables. I.e.:

\[\bbox[lightcyan,10px,border:3px solid darkblue]{
\begin{align}
\min & \sum_ i c_i x_i \\
       & \sum_i x_i = b\\
       & x_i \in \{0\} \cup [\ell_i,u_i]
\end{align}
}\]

For this purpose I use a very small data set with just 4 variables:

----     19 PARAMETER c  obj coefficients

i1 1.000,    i2 2.000,    i3 3.000,    i4 4.000


----     19 PARAMETER b                    =       40.000  rhs

----     19 PARAMETER xlo  lower bounds

i1 1.717,    i2 8.433,    i3 5.504,    i4 3.011


----     19 PARAMETER xup  upper bounds

i1 12.922,    i2 12.241,    i3 13.498,    i4 18.563
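In GAMS, the model with this data is straightforward to write down using a semi-continuous variable. The sketch below includes a Cplex option file asking for the solution pool; the option names are what I believe the GAMS/Cplex interface uses, but check the solver manual:

set i /i1*i4/;
parameters
   c(i)   'obj coefficients' /i1 1, i2 2, i3 3, i4 4/
   xlo(i) 'lower bounds' /i1 1.717, i2 8.433, i3 5.504, i4 3.011/
   xup(i) 'upper bounds' /i1 12.922, i2 12.241, i3 13.498, i4 18.563/;
scalar b 'rhs' /40/;

semicont variable x(i) 'semi-continuous variables';
variable z 'objective';
x.lo(i) = xlo(i);
x.up(i) = xup(i);

equations obj, bal;
obj.. z =e= sum(i, c(i)*x(i));
bal.. sum(i, x(i)) =e= b;

model m /all/;

* ask Cplex to populate and save the solution pool
$onecho > cplex.opt
solnpoolintensity 4
solnpoolpop 2
solnpool solnpool.gdx
$offecho
m.optfile = 1;
option mip = cplex;
solve m minimizing z using mip;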

The optimal solution can look like:

---- VAR x  semi-continous variables

          LOWER          LEVEL          UPPER         MARGINAL

i1         1.7175        12.9221        12.9221         1.0000     
i2         8.4327        12.2405        12.2405         2.0000     
i3         5.5038        11.8260        13.4983         3.0000     
i4         3.0114         3.0114        18.5627         4.0000     

                           LOWER          LEVEL          UPPER         MARGINAL

---- VAR z                 -INF           84.9266        +INF             .         

  z  objective

Before running this model, I expected a few of the \(x\) variables to be zero.

All solutions

When we throw this into the solution pool and retrieve all solutions, we see:

----     49 PARAMETER xp 

                   i1          i2          i3          i4

soln_m_p1      12.922      12.241      11.826       3.011
soln_m_p2      12.922      12.241                  14.837
soln_m_p3      12.922                  13.498      13.580
soln_m_p4 -1.7764E-15      12.241      13.498      14.261


----     49 PARAMETER objs 

soln_m_p1  84.927,    soln_m_p2  96.753,    soln_m_p3 107.735,    soln_m_p4 122.021

We see that besides the optimal solution (with objective 84.927), we find three other feasible solutions, where we have some variable \(x_i=0\).

Binary variable formulation

Basically, Cplex and Gurobi consider semi-continuous variables different when their zero/nonzero status differs. I think this is the correct interpretation. It corresponds directly to a binary variable formulation, where we replace the semi-continuous variable \(x_i\) by a pair: a continuous variable \(x_i\) and a binary variable \(\delta_i\):

\[\bbox[lightcyan,10px,border:3px solid darkblue]{
\begin{align}
\min & \sum_ i c_i x_i \\
       & \sum_i x_i = b\\
       & \ell_i \delta_i \le x_i \le u_i \delta_i\\
       & x_i \ge 0\\
       & \delta_i \in \{0,1\}
\end{align}
}\]

Indeed when we try this binary variable formulation with the solution pool we see:

----     43 PARAMETER deltap 

                   i1          i2          i3          i4

soln_m_p1       1.000       1.000       1.000       1.000
soln_m_p2       1.000       1.000                   1.000
soln_m_p3       1.000                   1.000       1.000
soln_m_p4                   1.000       1.000       1.000


----     43 PARAMETER xp 

                   i1          i2          i3          i4

soln_m_p1      12.922      12.241      11.826       3.011
soln_m_p2      12.922      12.241                  14.837
soln_m_p3      12.922                  13.498      13.580
soln_m_p4 1.77636E-15      12.241      13.498      14.261


----     43 PARAMETER objs 

soln_m_p1  84.927,    soln_m_p2  96.753,    soln_m_p3 107.735,    soln_m_p4 122.021

References
  1. Semi-continuous variables, http://yetanothermathprogrammingconsultant.blogspot.com/2016/08/semi-continuous-variables.html