Friday, May 26, 2017

Working on an R package….

The C++ code (parsing Excel formulas so we can, for instance, execute them) is not working 100% correctly yet…


Time to make it work under a debugger (amazingly I did not need a debugger until now).

Update: found it (without a debugger). I was dereferencing a null pointer. Is it unreasonable to expect a (descriptive) exception to be raised when this happens?

Sunday, May 21, 2017

Indexing economic time series in R

When we want to compare different (economic) data series, a frequently used approach is indexing. We choose one year (often the start of the time series) as the base year and then normalize each time series such that its value in the base year is 100. In formula form:

\[\hat{x}_t = 100 \frac{x_t}{x_0}\]
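
For a single series this is a one-liner in R (the numbers are made up for illustration):

x <- c(250, 260, 255, 280)   # one hypothetical series
100 * x / x[1]               # indexed, base year = first observation

## [1] 100 104 102 112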

When doing this in R, it is interesting to see how much the implementation depends on the chosen data structure.

Below I consider three different data structures: the time series are stored (1) row-wise in a matrix, (2) column-wise in a matrix, and (3) unordered in a long-format data frame.

Matrix: row wise

Below is some (artificial) data organized as a matrix with the rows being the series. We have three series: a, b, and c. The columns represent the years. Row and column names are used to make this visible:

str(A)

##  num [1:3, 1:20] 67.235 4743.289 0.871 64.69 5006.955 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3] "a" "b" "c"
##   ..$ : chr [1:20] "2011" "2012" "2013" "2014" ...

A

##           2011         2012         2013         2014         2015
## a   67.2353402   64.6901151   63.6518902   69.1449240   71.8494470
## b 4743.2887884 5006.9547226 5623.0654046 5912.6498320 5736.7239960
## c    0.8710604    0.9193449    0.8711697    0.8451556    0.8440797
##           2016         2017         2018         2019         2020
## a   77.2782169   75.5709375   83.1405320   82.0417121   85.6433929
## b 6302.6091905 6241.5527226 6140.4721640 5542.9254530 5627.5502851
## c    0.8496936    0.8739366    0.8955205    0.8681451    0.8202612
##           2021         2022         2023         2024         2025
## a   89.1381161   89.8803998   90.5211813    91.411429   94.4488615
## b 6006.0691966 5998.6549344 6121.5326208  6378.963851 6628.1064720
## c    0.8639629    0.8477383    0.8235255     0.887063    0.8998234
##           2026         2027         2028         2029         2030
## a   94.2000739   89.9766167   90.4627291   87.2632337   86.2154671
## b 7115.7111566 6410.3639302 7038.3387976 6765.7867813 7076.7998102
## c    0.8547652    0.9033014    0.9224435    0.9127362    0.9541839
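
The data above is artificial. A minimal sketch of how comparable data could be generated randomly (this code is not from the original post, so the values will differ from those shown):

# three random-walk-like series with different scales, years 2011-2030
set.seed(123)
grow <- function(x0, n, mu, sd) cumprod(c(x0, 1 + rnorm(n - 1, mu, sd)))
A <- rbind(a = grow(67,   20, 0.02, 0.04),
           b = grow(4743, 20, 0.02, 0.05),
           c = grow(0.87, 20, 0.00, 0.03))
colnames(A) <- 2011:2030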

To index this we can use the following R code:

Aindx <- 100*A/A[,1]
Aindx

##   2011      2012      2013      2014      2015      2016     2017     2018
## a  100  96.21445  94.67029 102.84015 106.86262 114.93690 112.3976 123.6560
## b  100 105.55872 118.54782 124.65296 120.94402 132.87425 131.5870 129.4560
## c  100 105.54318 100.01254  97.02606  96.90254  97.54703 100.3302 102.8081
##        2019      2020      2021      2022      2023     2024     2025
## a 122.02171 127.37854 132.57628 133.68029 134.63334 135.9574 140.4750
## b 116.85827 118.64237 126.62247 126.46615 129.05671 134.4840 139.7365
## c  99.66531  94.16812  99.18518  97.32255  94.54286 101.8371 103.3021
##        2026     2027     2028     2029     2030
## a 140.10500 133.8234 134.5464 129.7877 128.2294
## b 150.01640 135.1460 148.3852 142.6391 149.1961
## c  98.12926 103.7013 105.8989 104.7845 109.5428

Note that the expression 100*A/A[,1] is not as trivial as it seems. We divide a \(3 \times 20\) matrix by a vector of length 3. The division is done element-wise and column-by-column. We sometimes say the elements of A[,1] are recycled.

The recycling mechanism can be illustrated with a small example:

c(1,2,3,4)+c(1,2)

## [1] 2 4 4 6

I try to have a picture in each post, so here we go:

image

Matrix: column wise

If the matrix is organized column-wise (e.g. by taking the transpose), we have:

A

##             a        b         c
## 2011 67.23534 4743.289 0.8710604
## 2012 64.69012 5006.955 0.9193449
## 2013 63.65189 5623.065 0.8711697
## 2014 69.14492 5912.650 0.8451556
## 2015 71.84945 5736.724 0.8440797
## 2016 77.27822 6302.609 0.8496936
## 2017 75.57094 6241.553 0.8739366
## 2018 83.14053 6140.472 0.8955205
## 2019 82.04171 5542.925 0.8681451
## 2020 85.64339 5627.550 0.8202612
## 2021 89.13812 6006.069 0.8639629
## 2022 89.88040 5998.655 0.8477383
## 2023 90.52118 6121.533 0.8235255
## 2024 91.41143 6378.964 0.8870630
## 2025 94.44886 6628.106 0.8998234
## 2026 94.20007 7115.711 0.8547652
## 2027 89.97662 6410.364 0.9033014
## 2028 90.46273 7038.339 0.9224435
## 2029 87.26323 6765.787 0.9127362
## 2030 86.21547 7076.800 0.9541839

The expression to index the series now becomes more complicated:

Aindx <- 100*A/rep(A[1,],each=nrow(A))
Aindx

##              a        b         c
## 2011 100.00000 100.0000 100.00000
## 2012  96.21445 105.5587 105.54318
## 2013  94.67029 118.5478 100.01254
## 2014 102.84015 124.6530  97.02606
## 2015 106.86262 120.9440  96.90254
## 2016 114.93690 132.8742  97.54703
## 2017 112.39764 131.5870 100.33019
## 2018 123.65600 129.4560 102.80808
## 2019 122.02171 116.8583  99.66531
## 2020 127.37854 118.6424  94.16812
## 2021 132.57628 126.6225  99.18518
## 2022 133.68029 126.4662  97.32255
## 2023 134.63334 129.0567  94.54286
## 2024 135.95741 134.4840 101.83714
## 2025 140.47503 139.7365 103.30206
## 2026 140.10500 150.0164  98.12926
## 2027 133.82340 135.1460 103.70135
## 2028 134.54640 148.3852 105.89891
## 2029 129.78775 142.6391 104.78448
## 2030 128.22939 149.1961 109.54279

In this case the automatic recycling does not work the way we want, and we have to do it by hand. Basically, in terms of our little example: before, we were happy with c(1,2) being extended automatically to c(1,2,1,2), while now we need something like c(1,1,2,2).
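
As an aside, instead of building the recycled vector by hand we can let sweep() do the work; a small sketch that should give the same result as the expression above:

# divide each column of A by the corresponding base-year value in A[1,]
Aindx <- 100 * sweep(A, 2, A[1, ], "/")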

Data frame long format

Often data comes in a “long” format. Here is a picture to illustrate the difference between a “wide” and a “long” format:

[figure: wide vs. long format]

Wide-format data often comes from spreadsheets, while the long format is common in databases. The operation that converts from long to wide is sometimes called a “pivot” (and the reverse an “unpivot”).
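
The original post does not show how the long data frame df was built; one way to obtain an equivalent data frame from the column-wise matrix A is sketched below (as.table/as.data.frame produces one row per (year, series) cell):

df <- as.data.frame(as.table(A))
names(df) <- c("year", "series", "value")
df$year <- as.integer(as.character(df$year))
df <- df[order(df$year, df$series), c("series", "year", "value")]
rownames(df) <- NULL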

A long format data frame with the above data can look like (I show the first part only):

head(df)

##   series year        value
## 1      a 2011   67.2353402
## 2      b 2011 4743.2887884
## 3      c 2011    0.8710604
## 4      a 2012   64.6901151
## 5      b 2012 5006.9547226
## 6      c 2012    0.9193449

How can we index this?

Here is my solution:

# get first year
y0 <- min(df$year)
y0

## [1] 2011

# get values at first year
x0 <- df[df$year==y0,"value"]
x0

## [1]   67.2353402 4743.2887884    0.8710604

# allow x0 to be indexed by series name
names(x0) <- df[df$year==y0,"series"]
x0

##            a            b            c
##   67.2353402 4743.2887884    0.8710604

# indexing of the series
df$indexedvalue <- 100*df$value/x0[df$series]
head(df)

##   series year        value indexedvalue
## 1      a 2011   67.2353402    100.00000
## 2      b 2011 4743.2887884    100.00000
## 3      c 2011    0.8710604    100.00000
## 4      a 2012   64.6901151     96.21445
## 5      b 2012 5006.9547226    105.55872
## 6      c 2012    0.9193449    105.54318

The trick I used was to make the vector of values of the first year addressable by the series name. E.g.:

x0["a"]

##        a
## 67.23534

This allows us to calculate the column with indexed values in one vectorized operation.
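
The same lookup can also be written without a named vector, using match(); a small alternative sketch (not in the original post), which only checks that it reproduces the column computed above:

# look up base-year values via match() instead of a named vector
base <- df[df$year == y0, c("series", "value")]
alt  <- 100 * df$value / base$value[match(df$series, base$series)]
all.equal(as.numeric(alt), as.numeric(df$indexedvalue))   # should be TRUE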

dplyr

In the comments below, Ricardo Sanchez offered another, rather clean, approach for the last operation:

library(dplyr)
df <- df %>%
      group_by(series) %>%
      arrange(year) %>%
      mutate(indexedvalue = 100 * value / first(value))
df

## Source: local data frame [60 x 4]
## Groups: series [3]
##
##    series  year        value indexedvalue
##    <fctr> <int>        <dbl>        <dbl>
## 1       a  2011   67.2353402    100.00000
## 2       b  2011 4743.2887884    100.00000
## 3       c  2011    0.8710604    100.00000
## 4       a  2012   64.6901151     96.21445
## 5       b  2012 5006.9547226    105.55872
## 6       c  2012    0.9193449    105.54318
## 7       a  2013   63.6518902     94.67029
## 8       b  2013 5623.0654046    118.54782
## 9       c  2013    0.8711697    100.01254
## 10      a  2014   69.1449240    102.84015
## # ... with 50 more rows

sqldf

Of course, if you are familiar with SQL, you can also use that:

library(sqldf)
df <-  sqldf("
         select df.series,year,value,100*value/v0 as indexedvalue
         from df
         join (select min(year),value as v0, series
             from df
             group by series) df0
             on df.series = df0.series
       "
)
head(df)

##   series year        value indexedvalue
## 1      a 2011   67.2353402    100.00000
## 2      b 2011 4743.2887884    100.00000
## 3      c 2011    0.8710604    100.00000
## 4      a 2012   64.6901151     96.21445
## 5      b 2012 5006.9547226    105.55872
## 6      c 2012    0.9193449    105.54318

References
  1. Federal Reserve Bank of Dallas, Indexing to a Common Starting Point, https://www.dallasfed.org/research/basics/indexing.aspx

Saturday, May 20, 2017

Journalist explaining statistics

I would call this explanation, well, below average:

MATH, HORRIBLE MATH
[…]
Take that bit about the bell curve of IQ. It’s an unpleasant fact that half of all people are of below average IQ. It’s also true that half of all people are below average height, weight, and everything else. And the other half are above average. You know why? Because that’s what “average” means.
The italics are in the original text (they are not mine). Maybe alternative facts include alternative definitions of statistical terms. It is ironic that the title of the section is about bad math.
References
  1. Jonah Goldberg, Can Trump be contained?, http://www.nationalreview.com/g-file/447797/donald-trump-robert-mueller-special-counsel-investigation-social-justice-math

Thursday, May 18, 2017

Simple piecewise linear problem, not so easy with binary variables

The following picture illustrates the problem:

image

The blue line is what we want to model:

\[\bbox[lightcyan,10px,border:3px solid darkblue]{
   y = \begin{cases}
    0 & \text{if $0 \le x \le a$}\\
   (x-a)\displaystyle\frac{H}{b-a} & \text{if $a < x < b$}\\
   H & \text{if $x\ge b$}\end{cases}}\]
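
Just to be explicit about what we are modeling: evaluating this function directly is trivial (a sketch in R, where a, b and H are the parameters from the picture); the binary-variable formulation below is only needed when y must live inside an optimization model:

# direct evaluation of the piecewise linear function (no MIP needed here)
pwl <- function(x, a, b, H) {
  ifelse(x <= a, 0,
         ifelse(x >= b, H, (x - a) * H / (b - a)))
}
pwl(c(0, 1, 2, 3, 4), a = 1, b = 3, H = 10)

## [1]  0  0  5 10 10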
 

Is there much we can exploit here from this simple structure? I don’t believe so, and I came up with:

\[\begin{align}
&x_1 \le a \delta_1\\
&a \delta_2 \le x_2 \le b \delta_2\\
&b \delta_3 \le x_3 \le U \delta_3\\
&\delta_1+\delta_2+\delta_3 = 1\\
&x = x_1+x_2+x_3\\
&y = (x_2 - a \delta_2)\displaystyle\frac{H}{b-a} + H \delta_3 \\
&\delta_k \in \{0,1\}\\
&x_i \ge 0\\
&0 \le x \le U
\end{align}\]
Update

Added missing \(\sum_k \delta_k=1\), see comments below.

Wednesday, May 17, 2017

RStudio Tips and Tricks

From: https://youtu.be/kuSQgswZdr8

and then I read this:

Why R is Bad for You

Summary: Someone had to say it.  In my opinion R is not the best way to learn data science and not the best way to practice it either.  More and more large employers agree.

References
  1. Sean Lopp, RStudio Tips and Tricks, https://www.youtube.com/watch?v=kuSQgswZdr8&t=85s. Delivered by Sean Lopp (RStudio) at the 2017 New York R Conference on April 21st and 22nd at Work-Bench.
  2. William Vorhies, Why R is Bad for You, http://www.datasciencecentral.com/profiles/blogs/why-r-is-bad-for-you

Small GAMS trick: $eval

In GAMS we typically first declare sets and then derive things like the number of elements in a set:

set i /i1*i20/;
scalar n;
n = card(i);

Often the question comes up: can I do it the other way around? First declare the scalar n and then create a set of that size? We can use a funny trick for that, using a variant of the $set construct:

scalar n /20/;
$eval n n
set i /i1 * i%n%/;

In general I prefer to write this as:

$set n 20
scalar n /%n%/;
set i /i1 * i%n%/;

(a little bit less of a “surprise factor” is probably good).

Notes:

  • The assignment n = card(i) is done at execution time. All other operations are at compile time. In some more advanced cases this can be an important distinction.
  • The construct $eval n n is like a $set. It evaluates the rightmost n and uses the result to populate the preprocessor identifier (the leftmost n).
  • To make sure a preprocessor identifier holds a numeric value, you can use scalar N /%id%/.  GAMS will complain if the macro contains a string that does not correspond to a number.

Sunday, May 7, 2017

Semi-Continuous Variables and the Solution Pool

In this post I consider the interaction between semi-continuous variables and the solution pool in Cplex and Gurobi. I received a question about this, and I had to try things out before I could give a definitive answer.

The solution pool can enumerate (some or all) different integer solutions. A difference in continuous variables only is not recognized by the solution pool. A semi-continuous variable is somewhat of a hybrid (1): it is a continuous variable that can either be zero, \(x=0\), or take a value between bounds, \(x \in [\ell,u]\). How would the solution pool deal with this?

To test this out, we construct a tiny knapsack-like problem but with semi-continuous variables. I.e.:

\[\bbox[lightcyan,10px,border:3px solid darkblue]{
\begin{align}
\min & \sum_ i c_i x_i \\
       & \sum_i x_i = b\\
       & x_i \in \{0\} \cup [\ell_i,u_i]
\end{align}
}\]

For this purpose I use a very small data set with just 4 variables. I used:

----     19 PARAMETER c  obj coefficients

i1 1.000,    i2 2.000,    i3 3.000,    i4 4.000


----     19 PARAMETER b                    =       40.000  rhs

----     19 PARAMETER xlo  lower bounds

i1 1.717,    i2 8.433,    i3 5.504,    i4 3.011


----     19 PARAMETER xup  upper bounds

i1 12.922,    i2 12.241,    i3 13.498,    i4 18.563

The optimal solution can look like:

---- VAR x  semi-continous variables

          LOWER          LEVEL          UPPER         MARGINAL

i1         1.7175        12.9221        12.9221         1.0000     
i2         8.4327        12.2405        12.2405         2.0000     
i3         5.5038        11.8260        13.4983         3.0000     
i4         3.0114         3.0114        18.5627         4.0000     

                           LOWER          LEVEL          UPPER         MARGINAL

---- VAR z                 -INF           84.9266        +INF             .         

  z  objective

Before running this model, I expected a few of the \(x\) variables to be zero.

All solutions

When we throw this into the solution pool and retrieve all solutions, we see:

----     49 PARAMETER xp 

                   i1          i2          i3          i4

soln_m_p1      12.922      12.241      11.826       3.011
soln_m_p2      12.922      12.241                  14.837
soln_m_p3      12.922                  13.498      13.580
soln_m_p4 -1.7764E-15      12.241      13.498      14.261


----     49 PARAMETER objs 

soln_m_p1  84.927,    soln_m_p2  96.753,    soln_m_p3 107.735,    soln_m_p4 122.021

We see that besides the optimal solution (with objective 84.927), we find three other feasible solutions, where we have some variable \(x_i=0\).

Binary variable formulation

Basically, Cplex and Gurobi consider a semi-continuous variable to be different between two solutions when it is zero in one and nonzero in the other. I think this is the correct interpretation. It corresponds directly to a binary variable formulation, where we replace the semi-continuous variable \(x_i\) by a pair of variables: a continuous variable \(x_i\) and a binary variable \(\delta_i\):

\[\bbox[lightcyan,10px,border:3px solid darkblue]{
\begin{align}
\min & \sum_ i c_i x_i \\
       & \sum_i x_i = b\\
       & \ell_i \delta_i \le x_i \le u_i \delta_i\\
       & x_i \ge 0\\
       & \delta_i \in \{0,1\}
\end{align}
}\]
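
As a small illustration (not from the original post), here is a sketch of this binary-variable reformulation in R with the lpSolve package, using the data shown earlier. It only reproduces the optimal solution; the solution pool itself is a Cplex/Gurobi feature that lpSolve does not offer:

library(lpSolve)

cost <- c(1, 2, 3, 4)                      # objective coefficients
lo   <- c(1.717, 8.433, 5.504, 3.011)      # semi-continuous lower bounds
up   <- c(12.922, 12.241, 13.498, 18.563)  # semi-continuous upper bounds
b    <- 40
n    <- length(cost)

# variable order: x_1..x_n, delta_1..delta_n
obj <- c(cost, rep(0, n))
A   <- rbind(c(rep(1, n), rep(0, n)),      # sum_i x_i = b
             cbind(diag(n), -diag(up)),    # x_i - u_i*delta_i <= 0
             cbind(diag(n), -diag(lo)))    # x_i - l_i*delta_i >= 0
dir <- c("=", rep("<=", n), rep(">=", n))
rhs <- c(b, rep(0, 2 * n))

res <- lp("min", obj, A, dir, rhs, binary.vec = (n + 1):(2 * n))
res$objval          # should be close to 84.927
res$solution[1:n]   # the x values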

Indeed when we try this binary variable formulation with the solution pool we see:

----     43 PARAMETER deltap 

                   i1          i2          i3          i4

soln_m_p1       1.000       1.000       1.000       1.000
soln_m_p2       1.000       1.000                   1.000
soln_m_p3       1.000                   1.000       1.000
soln_m_p4                   1.000       1.000       1.000


----     43 PARAMETER xp 

                   i1          i2          i3          i4

soln_m_p1      12.922      12.241      11.826       3.011
soln_m_p2      12.922      12.241                  14.837
soln_m_p3      12.922                  13.498      13.580
soln_m_p4 1.77636E-15      12.241      13.498      14.261


----     43 PARAMETER objs 

soln_m_p1  84.927,    soln_m_p2  96.753,    soln_m_p3 107.735,    soln_m_p4 122.021

References
  1. Semi-continuous variables, http://yetanothermathprogrammingconsultant.blogspot.com/2016/08/semi-continuous-variables.html

Saturday, May 6, 2017

Nonlinear equations and CONOPT

The problem

When we want to generate some random draws from an exotic probability distribution, the following algorithm may help:

  1. Generate \(u_i \sim U(0,1)\) from a uniform distribution.
  2. Solve \(F(x_i) = u_i\) where \(F(x)\) is the cumulative distribution function (cdf).

For most well-known distributions better methods exist, with better numerical properties, especially in the tails. An example of such a less well-known, exotic distribution is the Gamma/Gompertz distribution (1).

Implementation

Here is a model I suggested to do the above:

image

In the above model we form a system of nonlinear equations \(G(x)=F\) in GAMS. It is a diagonal system: the Jacobian has entries on the diagonal only.

In most cases it is better to solve \(n\) single equations instead of a system of \(n\) equations. However, there are a few reasons why this approach is not as crazy as it looks:

  • GAMS has a relatively high fixed cost overhead in setting up a model and calling a solver.
  • A single model is a little bit more compact.
  • I solved this model with CONOPT, and CONOPT will actually detect the diagonal structure and then solve the equations one by one. We can see this from some of the messages CONOPT produces.

Diagnostic messages

The log shows the following:

   C O N O P T 3   version 3.17C
   Copyright (C)   ARKI Consulting and Development A/S
                   Bagsvaerdvej 246 A
                   DK-2880 Bagsvaerd, Denmark


  Iter Phase Ninf   Infeasibility   RGmax    NSB   Step InItr MX OK
     0   0        4.3170875683E+01 (Input point)

                  Pre-triangular equations:   100
                  Post-triangular equations:  0

     1   0        0.0000000000E+00 (After pre-processing)
     2   0        0.0000000000E+00 (After scaling)
     2            0.0000000000E+00

** Feasible solution to a square system.

When CONOPT receives the model it searches for triangular pieces of the model. The following graphic illustrates this:

image

In this picture rows are equations and columns are variables. After reordering the rows and columns, we have a triangular part in the upper-left corner. Those equations we can solve one-by-one. In the middle we have the remaining simultaneous equations; this is the difficult part to solve. Finally we have a post-triangular part, which we can again solve one-by-one. Note that this presolve strategy is applied to optimization models as well (in that case the Jacobian matrix is usually not a square matrix).

For our “diagonal” example all equations fall in the pre-triangular part of the model. So indeed CONOPT does not solve this as a simultaneous equation problem.

Funny message

In the listing file we found one more message:

It may be more efficient to turn the Post-triangular equations
into ordinary equations with option:
 
Lspost=False

At first this looks like a silly message: we have 0 post-triangular equations. However, it is the search itself that is the expensive part. So even if there are eventually zero post-triangular equations, in some cases it is better to turn this search off. For our particular problem we don’t have to bother.

Results

The final results of this model look like:

----     21 PARAMETER F  cdf values

i1   0.172,    i2   0.843,    i3   0.550,    i4   0.301,    i5   0.292,    i6   0.224,    i7   0.350,    i8   0.856
i9   0.067,    i10  0.500,    i11  0.998,    i12  0.579,    i13  0.991,    i14  0.762,    i15  0.131,    i16  0.640
i17  0.160,    i18  0.250,    i19  0.669,    i20  0.435,    i21  0.360,    i22  0.351,    i23  0.131,    i24  0.150
i25  0.589,    i26  0.831,    i27  0.231,    i28  0.666,    i29  0.776,    i30  0.304,    i31  0.110,    i32  0.502
i33  0.160,    i34  0.872,    i35  0.265,    i36  0.286,    i37  0.594,    i38  0.723,    i39  0.628,    i40  0.464
i41  0.413,    i42  0.118,    i43  0.314,    i44  0.047,    i45  0.339,    i46  0.182,    i47  0.646,    i48  0.561
i49  0.770,    i50  0.298,    i51  0.661,    i52  0.756,    i53  0.627,    i54  0.284,    i55  0.086,    i56  0.103
i57  0.641,    i58  0.545,    i59  0.032,    i60  0.792,    i61  0.073,    i62  0.176,    i63  0.526,    i64  0.750
i65  0.178,    i66  0.034,    i67  0.585,    i68  0.621,    i69  0.389,    i70  0.359,    i71  0.243,    i72  0.246
i73  0.131,    i74  0.933,    i75  0.380,    i76  0.783,    i77  0.300,    i78  0.125,    i79  0.749,    i80  0.069
i81  0.202,    i82  0.005,    i83  0.270,    i84  0.500,    i85  0.151,    i86  0.174,    i87  0.331,    i88  0.317
i89  0.322,    i90  0.964,    i91  0.994,    i92  0.370,    i93  0.373,    i94  0.772,    i95  0.397,    i96  0.913
i97  0.120,    i98  0.735,    i99  0.055,    i100 0.576


----     21 VARIABLE x.L  variates

i1   0.115,    i2   0.743,    i3   0.399,    i4   0.205,    i5   0.199,    i6   0.151,    i7   0.240,    i8   0.767
i9   0.045,    i10  0.357,    i11  1.690,    i12  0.424,    i13  1.396,    i14  0.620,    i15  0.087,    i16  0.482
i17  0.107,    i18  0.169,    i19  0.512,    i20  0.305,    i21  0.248,    i22  0.242,    i23  0.088,    i24  0.101
i25  0.434,    i26  0.721,    i27  0.156,    i28  0.508,    i29  0.638,    i30  0.207,    i31  0.074,    i32  0.358
i33  0.107,    i34  0.799,    i35  0.180,    i36  0.194,    i37  0.438,    i38  0.571,    i39  0.471,    i40  0.327
i41  0.288,    i42  0.079,    i43  0.215,    i44  0.031,    i45  0.232,    i46  0.122,    i47  0.488,    i48  0.408
i49  0.630,    i50  0.203,    i51  0.503,    i52  0.612,    i53  0.470,    i54  0.193,    i55  0.058,    i56  0.069
i57  0.483,    i58  0.395,    i59  0.021,    i60  0.662,    i61  0.049,    i62  0.118,    i63  0.378,    i64  0.605
i65  0.120,    i66  0.023,    i67  0.430,    i68  0.464,    i69  0.270,    i70  0.247,    i71  0.164,    i72  0.167
i73  0.087,    i74  0.964,    i75  0.263,    i76  0.649,    i77  0.204,    i78  0.084,    i79  0.603,    i80  0.046
i81  0.136,    i82  0.003,    i83  0.183,    i84  0.356,    i85  0.101,    i86  0.117,    i87  0.226,    i88  0.216
i89  0.220,    i90  1.105,    i91  1.460,    i92  0.255,    i93  0.257,    i94  0.633,    i95  0.275,    i96  0.898
i97  0.080,    i98  0.586,    i99  0.037,    i100 0.422

Notes

As noted in the comments, this particular CDF can be inverted directly using:

\[
x_i = \frac{\ln\left(\beta (1-F_i)^{-1/s}-\beta+1 \right)}{b}
\]
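
A quick sketch in R of the whole sampling scheme, using this closed-form inverse and the CDF as given in reference (1) below (the parameter values b, s and beta are arbitrary choices for illustration):

# inverse-transform sampling from the Gamma/Gompertz distribution
b <- 1; s <- 0.5; beta <- 2
cdf    <- function(x) 1 - beta^s / (beta - 1 + exp(b * x))^s
cdfinv <- function(u) log(beta * (1 - u)^(-1/s) - beta + 1) / b

set.seed(42)
u <- runif(5)       # step 1: uniform draws
x <- cdfinv(u)      # step 2: solve cdf(x) = u
cdf(x) - u          # should be (numerically) zero
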
References
  1. Gamma/Gompertz distribution, https://en.wikipedia.org/wiki/Gamma/Gompertz_distribution

Friday, May 5, 2017

Upper bounds on integer variables

In a model I used to solve as a continuous problem, I have some variables \(0 \le x_i \le U_i\). In addition I have some cost \(c_i\) and a budget \(B\):

\[ \sum_i c_i x_i \le B \]

When moving towards an integer model, of course I want to tighten the upper bounds \(U_i\). I.e.:

\[ U_i := \min\left\{U_i,\frac{B}{c_i}\right\} \]

Now, GAMS does not accept non-integer bounds on integer variables:

**** Matrix error - bounds on discrete variables have to be integer

(it is debatable whether it is a good thing to require this). So we can do something like:

x.up(i) = trunc(min(u(i), budget/cost(i)));

However, we could end up with something like trunc(0.999999), which in all likelihood should be treated as 1. Numbers like that can easily result from round-off and truncation errors in floating point computations and from data transfer between systems. We need to apply a small tolerance to safeguard against this, e.g.:

x.up(i) = trunc(min(u(i), budget/cost(i))+1e-6);
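
The effect is easy to reproduce in R, whose trunc also truncates toward zero:

x <- 1 - 1e-9        # a bound that "should" be 1 but picked up some round-off
trunc(x)             # 0: the bound is tightened too much
trunc(x + 1e-6)      # 1: the small tolerance protects against this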

It looks like I am taking on a lot of responsibility here. It may have been better to leave things as they were.

Monday, May 1, 2017

Linearizing an average

In (1) a question is asked about averaging a certain allocation.

Let \(i\) be the set of prices to allocate to buckets \(j\). As each price can go to at most one bucket, it makes sense to use a binary variable:

\[x_{i,j} = \begin{cases} 1 & \text{if price $i$ is allocated to bucket $j$}\\
                                   0 & \text{otherwise}\end{cases}\]

The objective is to minimize the spread of the average prices seen in each bucket. A (nonlinear) model can look like:

\[
\bbox[lightcyan,10px,border:3px solid darkblue]
{\begin{align} \min\>& \left(\max_j  \mu_j - \min_j \mu_j \right)\\
                             & \mu_j = \frac{\displaystyle\sum_i p_i x_{i,j}}{\displaystyle\sum_i x_{i,j}}\\
                             &\sum_j x_{i,j} = 1 & \forall i\\
                             &x_{i,j} \in \{0,1\}
\end{align}}\]

Here the variable \(\mu_j\) is the average price in bucket \(j\). We have two nonlinearities in this model: the objective, and the calculation of \(\mu_j\).

The objective is easily linearized:

\[\begin{align}
\min\> & \mu_{max} - \mu_{min}\\
         & \mu_{max} \ge \mu_j & \forall j\\
         & \mu_{min} \le \mu_j & \forall j
\end{align}\]

But what about the calculation of the average? We can write:

\[\mu_j \sum_i x_{i,j} = \sum_i p_i x_{i,j} \]

This does not help much. If we introduce \(z_j = \sum_i x_{i,j}\) then the left-hand side is \(\mu_j z_j\) which is still not easy to linearize. However, a better approach is to write

\[\sum_i \mu_j x_{i,j} = \sum_i p_i x_{i,j} \]

The expressions \(\mu_j x_{i,j}\) are still nonlinear, but each is a product of a binary variable and a continuous variable. This we know how to linearize (2).

The complete linearized model can look like:

\[\bbox[lightcyan,10px,border:3px solid darkblue]{\begin{align}
\min\> & \mu_{max} - \mu_{min}\\
         & \mu_{max} \ge \mu_j & \forall j\\
         & \mu_{min} \le \mu_j & \forall j \\
         & \sum_i y_{i,j} = \sum_i p_i x_{i,j} & \forall j\\
         & y_{i,j} \le U x_{i,j}\\
         & y_{i,j} \le \mu_j\\
         & y_{i,j} \ge \mu_j - U(1-x_{i,j})\\
         &\sum_j x_{i,j} = 1 & \forall i\\
         &y_{i,j} \in [0,U]\\
         &x_{i,j} \in \{0,1\}
\end{align}}\]

where \(U=\displaystyle \max_i p_i\) (this is data). The variable \(y_{i,j}\) can be interpreted as \(y_{i,j}=\mu_j x_{i,j}\).

In a real application we would have other restrictions as well. The most obvious is that we want to allocate at least one price to a bucket. This is simply done by requiring \(\displaystyle\sum_i x_{i,j} \ge 1\). Here is some example output while including this constraint:

----     53 SET i  prices

i1 ,    i2 ,    i3 ,    i4 ,    i5 ,    i6 ,    i7 ,    i8 ,    i9
i10


----     53 SET j  buckets

j1,    j2,    j3,    j4,    j5


----     53 PARAMETER p  price

i1  1.717,    i2  8.433,    i3  5.504,    i4  3.011,    i5  2.922
i6  2.241,    i7  3.498,    i8  8.563,    i9  0.671,    i10 5.002


----     53 VARIABLE x.L  assignment

             j1          j2          j3          j4          j5

i1            1
i2                        1
i3                                    1
i4                                                1
i5            1
i6                                    1
i7                                                            1
i8            1
i9                        1
i10                                               1


----     53 VARIABLE mu.L  average

j1 4.401,    j2 4.552,    j3 3.872,    j4 4.007,    j5 3.498


----     53 VARIABLE y.L  average or zero

             j1          j2          j3          j4          j5

i1        4.401
i2                    4.552
i3                                3.872
i4                                            4.007
i5        4.401
i6                                3.872
i7                                                        3.498
i8        4.401
i9                    4.552
i10                                           4.007


----     53 VARIABLE mumin.L               =        3.498 
            VARIABLE mumax.L               =        4.552 

Side note: What would happen if we allowed buckets with zero prices (i.e. empty buckets)? The model would not restrict \(\mu_j\) for such a bucket. In extremis we can put all prices in the first bucket (with average price \(\mu_1\)) and leave all the remaining buckets empty. That would allow the model to pick \(\mu_j=\mu_1\), i.e. no spread. That is exactly what I got when I tried this (by dropping the constraint \(\displaystyle\sum_i x_{i,j} \ge 1\)). The results with the same data as before:

----     53 VARIABLE x.L  assignment

             j1

i1            1
i2            1
i3            1
i4            1
i5            1
i6            1
i7            1
i8            1
i9            1
i10           1


----     53 VARIABLE mu.L  average

j1 4.156,    j2 4.156,    j3 4.156,    j4 4.156,    j5 4.156


----     53 VARIABLE y.L  average or zero

             j1

i1        4.156
i2        4.156
i3        4.156
i4        4.156
i5        4.156
i6        4.156
i7        4.156
i8        4.156
i9        4.156
i10       4.156


----     53 VARIABLE mumin.L               =        4.156 
            VARIABLE mumax.L               =        4.156 

The same trick can be used to linearize the problem stated in (3).

References
  1. Fair allocations to buckets with constraints, http://stackoverflow.com/questions/43662264/fair-allocation-to-buckets-with-constraints
  2. Multiplication of a continuous and a binary variable, http://yetanothermathprogrammingconsultant.blogspot.com/2008/05/multiplication-of-continuous-and-binary.html 
  3. Convert nonlinear objective function to linear objective function, http://stackoverflow.com/questions/43591174/convert-nonlinear-objective-function-to-linear-objective-function


Appendix: GAMS Model

$ontext

   Assign items to buckets (bins) such that the
   average price is as equal as possible.

$offtext

sets
   i 'prices'  /i1*i10/
   j 'buckets' /j1*j5/;


*-------------------------------------------------------
* data: random prices
*-------------------------------------------------------

* random data
parameter p(i) 'price';
p(i) = uniform(0,10);
display p;

* upperbound on avg price in a bucket
scalar U;
U = smax(i,p(i));
display U;


*-------------------------------------------------------
* optimization model
*-------------------------------------------------------


binary variable x(i,j) 'assignment';
positive variable y(i,j) 'average or zero';
variables
   mu(j)  'average'
   mumax  'max of mu(j)'
   mumin  'min of mu(j)'
   z      'objective variable'
;

equations
   emumin(j)  'minimum of averages'
   emumax(j)  'maximum of averages'
   ratio(j)   'reformulation of ratio mu(j)=sum(i,p(i)*x(i,j))/sum(i,x(i,j))'
   ydef1(i,j) 'linearization of y(i,j) = mu(j)*x(i,j)'
   ydef2(i,j) 'linearization of y(i,j) = mu(j)*x(i,j)'
   ydef3(i,j) 'linearization of y(i,j) = mu(j)*x(i,j)'
   assign1(i) 'standard assignment constraint'
   assign2(j) 'minimum one item per bin'
   obj        'objective'
;


emumin(j).. mumin =l= mu(j);
emumax(j).. mumax =g= mu(j);
ratio(j)..   sum(i, y(i,j)) =e= sum(i, p(i)*x(i,j));

ydef1(i,j).. y(i,j) =l= U*x(i,j);
ydef2(i,j).. y(i,j) =l= mu(j);
ydef3(i,j).. y(i,j) =g= mu(j)-U*(1-x(i,j));

assign1(i).. sum(j, x(i,j)) =e= 1;
assign2(j).. sum(i, x(i,j)) =g= 1;


y.up(i,j) = U;

obj.. z =e= mumax-mumin;

model m /all/;
* global optimality
option optcr=0;
solve m  minimizing z using mip;


*-------------------------------------------------------
* print results
*-------------------------------------------------------

option x:0;
display i,j,p,x.l,mu.l,y.l,mumin.l,mumax.l;