Selecting \(k\) out of \(n\) items can be a computationally difficult task. Here we look at a special case and see how we can exploit the structure of the data.
Suppose we have \(n\) data points. We want to select \(k\) of them such that the average of the selected points is as close as possible to a target value \(\color{darkblue}t\) [1]. That is, we want \[\min_{\color{darkred}S} \left| \frac{\sum_{s \in \color{darkred}S}\color{darkblue}p_s}{\color{darkblue}k} - \color{darkblue}t \right| \] where \(|\color{darkred}S|=\color{darkblue}k\).
The special property of the data is that it is composed of just a few different values. Let's say our data looks like:
---- 21 PARAMETER p data
i1 1, i2 3, i3 2, i4 1, i5 1, i6 1, i7 2, i8 3
i9 1, i10 2, i11 3, i12 2, i13 3, i14 3, i15 1, i16 2
i17 1, i18 1, i19 3, i20 2, i21 2, i22 2, i23 1, i24 1
i25 2, i26 3, i27 1, i28 2, i29 3, i30 1, i31 1, i32 2
i33 1, i34 3, i35 1, i36 1, i37 2, i38 3, i39 2, i40 2
i41 2, i42 1, i43 1, i44 1, i45 2, i46 1, i47 2, i48 2
i49 3, i50 1, i51 2, i52 3, i53 2, i54 1, i55 1, i56 1
i57 2, i58 2, i59 1, i60 3, i61 1, i62 1, i63 2, i64 3
i65 1, i66 1, i67 2, i68 2, i69 2, i70 2, i71 1, i72 1
i73 1, i74 3, i75 2, i76 3, i77 1, i78 1, i79 3, i80 1
i81 1, i82 1, i83 1, i84 2, i85 1, i86 1, i87 1, i88 1
i89 1, i90 3, i91 3, i92 2, i93 2, i94 3, i95 2, i96 3
i97 1, i98 3, i99 1, i100 2
---- 21 PARAMETER k = 50 number to select
PARAMETER t = 2.12345 target for mean of selected values
In this post, I will start with a generic model using \(n=100\) binary variables. A second model exploits the fact that the data contains just a few unique values; this reduces the model to just 3 integer variables. Both models select exactly \(k\) points.
Things become more difficult when we relax the problem to allow up to \(kmax=50\) points to be selected. The obvious approach is to use a loop: try each \(k\) and pick the best solution. Alternatively, we can formulate a single, non-linear model. With some effort, that model can be linearized.
Model 1: binary variables
First, let's look at a generic model. Here we don't exploit the properties of the data. We introduce binary variables: \[\color{darkred}\delta_i = \begin{cases} 1 & \text{if point $i$ is selected} \\ 0 & \text{otherwise}\end{cases}\] The absolute value is easily modeled through variable splitting:
Generic Model |
\[
\begin{align} \min\> & \color{darkred}d^+ + \color{darkred}d^- \\
& \color{darkred}d^+ - \color{darkred}d^- = \sum_i \frac{\color{darkred}\delta_i \cdot \color{darkblue} p_i}{\color{darkblue} k} - \color{darkblue}t \\
& \sum_i \color{darkred}\delta_i = \color{darkblue}k \\
& \color{darkred}\delta_i \in \{0,1\} \\
& \color{darkred}d^+,\color{darkred}d^- \ge 0 \end{align}
\]
|
The output is:
---- 62 VARIABLE delta.L selection of points
i2 1, i5 1, i6 1, i7 1, i8 1, i9 1
i11 1, i12 1, i13 1, i14 1, i15 1, i16 1
i17 1, i18 1, i19 1, i22 1, i26 1, i27 1
i28 1, i29 1, i30 1, i31 1, i33 1, i34 1
i35 1, i36 1, i38 1, i41 1, i49 1, i52 1
i55 1, i56 1, i58 1, i60 1, i61 1, i62 1
i64 1, i67 1, i68 1, i74 1, i76 1, i79 1
i84 1, i90 1, i91 1, i92 1, i93 1, i94 1
i96 1, i98 1
---- 62 PARAMETER value selected values
i2 3, i5 1, i6 1, i7 2, i8 3, i9 1
i11 3, i12 2, i13 3, i14 3, i15 1, i16 2
i17 1, i18 1, i19 3, i22 2, i26 3, i27 1
i28 2, i29 3, i30 1, i31 1, i33 1, i34 3
i35 1, i36 1, i38 3, i41 2, i49 3, i52 3
i55 1, i56 1, i58 2, i60 3, i61 1, i62 1
i64 3, i67 2, i68 2, i74 3, i76 3, i79 3
i84 2, i90 3, i91 3, i92 2, i93 2, i94 3
i96 3, i98 3
---- 62 PARAMETER vcount count of each value
1 16, 2 12, 3 22
---- 62 VARIABLE dev.L deviation
- 0.003450
---- 62 VARIABLE z.L = 0.003450 objective
This model has \(n=100\) binary variables.
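To make the generic model concrete, here is a tiny pure-Python brute-force sketch of the same selection problem. As an illustrative assumption, I use only the first 12 data points of the example and a hypothetical \(k=5\): brute force enumerates \(\binom{12}{5}=792\) subsets, which is fine here, but for \(n=100, k=50\) it is hopeless. That is exactly why we use a MIP.

```python
from itertools import combinations

# First 12 data points from the example above; t as in the post,
# k = 5 chosen small so complete enumeration is feasible.
p = [1, 3, 2, 1, 1, 1, 2, 3, 1, 2, 3, 2]
k = 5
t = 2.12345

# Enumerate all C(12,5) = 792 subsets and keep the best one.
best = min(combinations(p, k), key=lambda s: abs(sum(s) / k - t))
print(best, abs(sum(best) / k - t))  # minimal deviation is 0.07655
```

The selected multiset has sum 11, the achievable sum closest to \(5\cdot t = 10.61725\).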
Note that we can scale the constraint by a constant value, and write: \[\color{darkred}d^+ - \color{darkred}d^- = \sum_i \color{darkred}\delta_i \cdot \color{darkblue} p_i - \color{darkblue}t\cdot\color{darkblue} k\] As we see later on, avoiding the division by \(\color{darkblue}k\) becomes essential once \(\color{darkblue}k\) is a decision variable, although we then need some extra care.
Model 2: Exploit special properties of the data
Instead of using binary variables, we can keep track of the number of occurrences of each unique value \(1,2,3\). So we define the variable: \[\color{darkred}n_j \in \{0,\dots,\color{darkblue}N_j\}\] indicating how many of unique value \(j\) we select. This reduces the size of the model to just three integer variables.
The model looks like:
Specialized Model |
\[
\begin{align} \min\> & \color{darkred}d^+ + \color{darkred}d^- \\
& \color{darkred}d^+ - \color{darkred}d^- = \sum_j \frac{\color{darkred}n_j \cdot \color{darkblue} j}{\color{darkblue} k} - \color{darkblue}t \\
& \sum_j \color{darkred}n_j = \color{darkblue}k \\
& \color{darkred}n_j \in \{0,1,\dots,\color{darkblue}N_j\} \\
& \color{darkred}d^+,\color{darkred}d^- \ge 0 \end{align}
\]
|
Here \(\color{darkblue}N_j\) is the number of data points with \(\color{darkblue}p_i = j\). For our data set these look like:
---- 70 PARAMETER tab tabulation
1 45, 2 33, 3 22
A solution is:
---- 88 VARIABLE num.L number selected from bin j
1 6, 2 32, 3 12
---- 88 VARIABLE dev.L deviation
- 0.003450
---- 88 VARIABLE z.L = 0.003450 objective
We note that this is a different solution, but with the same optimal objective function value.
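Because there are only three unique values, the feasible region of model 2 is so small that we can even verify the optimal objective by complete enumeration. A sketch in Python, using the counts \(N_j\) from the tabulation above and the \(k, t\) from the post:

```python
# Counts of the unique values 1, 2, 3 in the data (parameter tab above).
N = {1: 45, 2: 33, 3: 22}
k = 50
t = 2.12345

best = None
# Enumerate all (n1, n2, n3) with n1 + n2 + n3 = k and 0 <= nj <= Nj.
for n1 in range(min(k, N[1]) + 1):
    for n2 in range(min(k - n1, N[2]) + 1):
        n3 = k - n1 - n2
        if n3 < 0 or n3 > N[3]:
            continue
        dev = abs((n1 + 2 * n2 + 3 * n3) / k - t)
        if best is None or dev < best[0]:
            best = (dev, (n1, n2, n3))

print(best)  # deviation 0.00345 with counts (6, 32, 12), as in the MIP solution
```

The enumeration confirms the optimal deviation: the achievable sum closest to \(50\cdot t = 106.1725\) is 106, giving \(0.1725/50 = 0.00345\).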
Extension: select up to \(kmax\) points
A simple extension of this model is: instead of selecting exactly \(k\) points, select up to \(kmax\) points. This can only improve (or leave unchanged) the best objective. The simplest approach is to loop over \(k=1,\dots,kmax\), solve each model, and pick the best solution. This means solving 50 MIP models. When I do this, the results are:
---- 119 **** Loop
---- 119 PARAMETER traceObj trace objective values
i1 0.123450, i2 0.123450, i3 0.123450, i4 0.123450
i5 0.076550, i6 0.043217, i7 0.019407, i8 0.001550
i9 0.012339, i10 0.023450, i11 0.032541, i12 0.040117
i13 0.030396, i14 0.019407, i15 0.009883, i16 0.001550
i17 0.005803, i18 0.012339, i19 0.018187, i20 0.023450
i21 0.019407, i22 0.012914, i23 0.006985, i24 0.001550
i25 0.003450, i26 0.008065, i27 0.012339, i28 0.016307
i29 0.014481, i30 0.009883, i31 0.005582, i32 0.001550
i33 0.002238, i34 0.005803, i35 0.009164, i36 0.012339
i37 0.011685, i38 0.008129, i39 0.004755, i40 0.001550
i41 0.001499, i42 0.004402, i43 0.007171, i44 0.009814
i45 0.009883, i46 0.006985, i47 0.004210, i48 0.001550
i49 0.001001, i50 0.003450
---- 119 PARAMETER bestObj = 0.001001 best objective
---- 119 PARAMETER bestNum best selection
1 5, 2 33, 3 11
We see the objective goes up and down for different values of \(k\). The best result uses \(k=49\) points, with an objective value of 0.001001.
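The loop can be mimicked in a few lines of Python by wrapping the count enumeration of model 2. This is only a sketch; `solve_k` is a helper name of my own, not part of the GAMS model:

```python
N = {1: 45, 2: 33, 3: 22}  # counts of the unique values 1, 2, 3
t = 2.12345
kmax = 50

def solve_k(k):
    """Best deviation for exactly k selected points, by complete
    enumeration over the counts (n1, n2, n3). Returns (dev, counts)."""
    best = None
    for n1 in range(min(k, N[1]) + 1):
        for n2 in range(min(k - n1, N[2]) + 1):
            n3 = k - n1 - n2
            if 0 <= n3 <= N[3]:
                dev = abs((n1 + 2 * n2 + 3 * n3) / k - t)
                if best is None or dev < best[0]:
                    best = (dev, (n1, n2, n3))
    return best

# Try every k and keep the overall best (dev, counts, k) triple.
dev, counts, k = min(solve_k(k) + (k,) for k in range(1, kmax + 1))
print(k, counts, dev)  # best: k = 49, counts (5, 33, 11), deviation ~0.001001
```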
The obvious question is: Can we do this in one model instead?
The model for this could look like:
Incorrect Extension Model |
\[ \begin{align} \min\> & \color{darkred}d^+ + \color{darkred}d^- \\ & \color{darkred}d^+ - \color{darkred}d^- = \sum_j \color{darkred}n_j \cdot \color{darkblue} j - \color{darkred}k \cdot \color{darkblue}t \\ &\color{darkred}k = \sum_j \color{darkred}n_j \\ & \color{darkred}n_j \in \{0,1,\dots,\color{darkblue}N_j\} \\ & \color{darkred}d^+,\color{darkred}d^- \ge 0 \\ & \color{darkred}k \in \{1,\dots,\color{darkblue}{\mathit{kmax}}\} \end{align} \] |
Here \(\color{darkred}{k}\) is now a variable, so we prefer not to divide by \(\color{darkred}k\): that would create an MINLP model. It looks like we can instead multiply the objective by \(\color{darkred}k\), which makes the problem linear again. But we have changed the problem a bit: we now prefer smaller values of \(\color{darkred}k\) over larger ones. Multiplying the objective by \(\color{darkred}k\) this way is only allowed when \(k\) is a constant.
We need to be a bit more careful. In model 1, we measured the deviation as: \[\color{darkred}\Delta = \sum_i \frac{\color{darkblue}p_i\cdot \color{darkred}\delta_i}{\color{darkred}k}-\color{darkblue}t\] We can rewrite this as: \[\color{darkred}\Delta\cdot \color{darkred}k = \sum_i \color{darkblue}p_i\cdot \color{darkred}\delta_i -\color{darkblue}t\cdot\color{darkred}k\] or, substituting \(\color{darkred}k = \sum_i \color{darkred}\delta_i\) on the left: \[\color{darkred}\Delta\cdot \sum_i \color{darkred}\delta_i = \sum_i \color{darkblue}p_i\cdot \color{darkred}\delta_i -\color{darkblue}t\cdot\color{darkred}k\] The left-hand side is a sum of products of a continuous variable and a binary variable. This we know how to linearize. A slightly simpler linearization is available when the continuous variable is non-negative. We can obtain non-negative variables by splitting \(\color{darkred}\Delta = \color{darkred}d^+ - \color{darkred}d^-\) and writing the left-hand side as \[\sum_i \color{darkred}\delta_i\cdot \color{darkred}d^+ - \sum_i \color{darkred}\delta_i\cdot \color{darkred}d^-\] Linearizing the product of a binary variable and a bounded, non-negative continuous variable can be accomplished as follows: \[\begin{aligned} & \color{darkred}y=\color{darkred}x\cdot\color{darkred}\delta \\ &\color{darkred}x,\color{darkred}y \in [0,\color{darkblue}U] \\ & \color{darkred}\delta \in \{0,1\} \end{aligned} \Longleftrightarrow \begin{aligned} & \color{darkred}y \le \color{darkred}\delta\cdot\color{darkblue}U \\ & \color{darkred}y \le \color{darkred}x \\ & \color{darkred}y \ge \color{darkred}x - (1-\color{darkred}\delta) \cdot \color{darkblue}U\\ & \color{darkred}x,\color{darkred}y \in [0,\color{darkblue}U] \\ & \color{darkred}\delta \in \{0,1\}\end{aligned} \] When we assemble all this, we have:
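As a quick sanity check, we can verify numerically that the three linear inequalities indeed force \(y = x\cdot\delta\). A small sketch that scans a grid of candidate values (the bound \(U=10\) and the grid step are arbitrary choices of mine):

```python
# Verify: for each delta in {0, 1} and each x in [0, U], the only y in
# [0, U] satisfying the three linearization inequalities is y = x * delta.
U = 10.0
step = 0.5  # 0.5 is exactly representable, so the grid sums are exact

def feasible_y(x, delta):
    """All grid values of y in [0, U] satisfying the linearization."""
    ys = []
    y = 0.0
    while y <= U:
        if y <= delta * U and y <= x and y >= x - (1 - delta) * U:
            ys.append(y)
        y += step
    return ys

for delta in (0, 1):
    x = 0.0
    while x <= U:
        assert feasible_y(x, delta) == [x * delta]
        x += step
```

For \(\delta=1\) the first and third inequalities pin \(y\) between \(x\) and \(x\); for \(\delta=0\) the first forces \(y\le 0\) while the third is slack.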
Linearized Extension Model |
\[ \begin{align}\min\> & \color{darkred}d^+ + \color{darkred}d^- \\ & \sum_i \left(\color{darkred}y_i^+ - \color{darkred}y_i^- \right) = \sum_i \color{darkred}\delta_i\cdot \color{darkblue}p_i - \color{darkred}k \cdot \color{darkblue}t \\ & \color{darkred}k = \sum_i \color{darkred}\delta_i \\ &
\color{darkred}y_i^{pm} \le \color{darkred}\delta_i\cdot\color{darkblue}U && \forall i, {pm}\in\{+,-\} \\ & \color{darkred}y_i^{pm} \le \color{darkred}d^{pm} && \forall i, {pm}\in\{+,-\} \\ &\color{darkred}y_i^{pm} \ge \color{darkred}d^{pm} -(1-\color{darkred}\delta_i)\cdot \color{darkblue}U &&\forall i, {pm}\in\{+,-\} \\ & \color{darkred} k \in \{1,\dots,\color{darkblue}{\mathit{kmax}}\} \\ & \color{darkred}y_i^{pm} \in [0,\color{darkblue}U] \\ & \color{darkred}\delta_i \in \{0,1\}
\end{align}\] |
Before we can run this model we need to establish a value for \(\color{darkblue}U\). I used the safe value: \[ \color{darkblue}U := \color{darkblue}{\mathit{kmax}}\cdot \left(\max_i \color{darkblue}p_i-\min_i \color{darkblue}p_i\right)\]
Note that in this model we explicitly need to forbid \(\color{darkred}k=0\): with \(\color{darkred}k=0\) no points are selected, all \(\color{darkred}y_i^{pm}\) are forced to zero, and the deviation constraint reduces to \(0=0\), yielding a meaningless zero objective.
The solution is:
---- 162 VARIABLE kvar.L = 49 k as variable
---- 162 PARAMETER vcount count of each value
1 5, 2 33, 3 11
---- 162 VARIABLE z.L = 0.001001 objective
This is the same solution as we found using the loop approach.
Conclusions
The first two models deal with the "select exactly \(k\) points" case. The specialized model 2 exploits the property that the data contains just a few unique values. This can make the model easier to solve than the generic model 1. High-end MIP solvers handle binary variables quite well: for those I found no real gain. For less sophisticated solvers, however, this trick can help a lot.
The problem "allow up to \(kmax\) points" turned out to be way more challenging to formulate in a linear fashion than I anticipated. But with some effort, it can be done. This is an example where a small change in the model specification has a lot of consequences.
References
Appendix: GAMS Model
*------------------------------------------------------------
* data
*------------------------------------------------------------
set
   i  'data points'    /i1*i100/
   j  'unique values'  /1,2,3/
   pm                  /'+','-'/
;
*
* data drawn from set j
*
parameter p(i) 'data';
p(i) = uniformint(1,card(j));
p(i) = sum(j$(ord(j)=p(i)),j.val);
option p:0:0:8;
scalars
   k  'number to select'                    /50/
   t  'target for mean of selected values'  /2.12345/
;
option k:0,t:5;
display p,k,t;
*------------------------------------------------------------
* model 1
* binary variables delta(i)
*------------------------------------------------------------
variables
   z         'objective'
   delta(i)  'selection of points'
   dev(pm)   'deviation'
;
binary variable delta;
positive variable dev;
Equations
   obj        'objective'
   deviation  'deviation of mean from target'
   count      'number of selected points'
;
obj.. z =e= sum(pm,dev(pm));
deviation.. dev('+')-dev('-') =e= sum(i,delta(i)*p(i))/k-t;
count.. sum(i,delta(i)) =e= k;
model m1 /obj,deviation,count/;
solve m1 minimizing z using mip;
parameter value(i) 'selected values';
value(i) = p(i)*delta.l(i);
parameter vcount(j) 'count of each value';
vcount(j) = sum(i$(value(i)=j.val), 1);
option delta:0:0:6, value:0:0:6, vcount:0;
option dev:6,z:6;
display delta.l,value,vcount,dev.l,z.l;
*------------------------------------------------------------
* model 2
* integer variable num(j)
*------------------------------------------------------------
parameter tab(j) 'tabulation';
tab(j) = sum(i$(p(i)=j.val),1);
option tab:0;
display tab;
integer variable num(j) 'number selected from bin j';
equation idev, icount;
idev.. dev('+')-dev('-') =e= sum(j,num(j)*j.val)/k-t;
icount.. sum(j,num(j)) =e= k;
num.up(j) = tab(j);
model m2 /obj,idev,icount/;
solve m2 minimizing z using mip;
option num:0;
display num.l,dev.l,z.l;
*------------------------------------------------------------
* model 3
* best k=1...Kmax using a loop
*------------------------------------------------------------
scalar kmax;
kmax = k;
* fast solve loop
option solprint=silent, solvelink=5;
parameter
bestObj 'best objective' /INF/
bestNum 'best selection'
traceObj 'trace objective values'
;
for(k=1 to kmax,
solve m2 minimizing z using mip;
traceObj(i)$(ord(i)=k) = z.l;
if (z.l<bestObj,
bestObj=z.l;
bestNum(j) = num.l(j);
);
);
option bestObj:6,bestNum:0,traceObj:6:0:4;
display "**** Loop",traceObj,bestObj,bestNum;
option solprint = on;
*------------------------------------------------------------
* model 4
* best k=1...Kmax using single model
*------------------------------------------------------------
Scalar U 'maximum deviation';
U = kmax*(smax(i,p(i))-smin(i,p(i)));
positive variable
y(i,pm) 'dev(i,pm)*delta(i)'
;
integer Variable
kvar 'k as variable'
;
kvar.lo = 1;
kvar.up = kmax;
Equation
LinDev 'linearized version'
Linearization1(i,pm)
Linearization2(i,pm)
Linearization3(i,pm)
kdef
;
LinDev.. sum(i,y(i,'+')-y(i,'-')) =e= sum(i,delta(i)*p(i))-t*kvar;
Linearization1(i,pm).. y(i,pm) =l= delta(i)*U;
Linearization2(i,pm).. y(i,pm) =l= dev(pm);
Linearization3(i,pm).. y(i,pm) =g= dev(pm) - (1-delta(i))*U;
kdef.. sum(i,delta(i)) =e= kvar;
model m4 /obj,lindev,linearization1,linearization2,linearization3,kdef/;
solve m4 minimizing z using mip;
value(i) = p(i)*delta.l(i);
vcount(j) = sum(i$(value(i)=j.val), 1);
option kvar:0;
display kvar.l, vcount, z.l;