Thursday, June 2, 2022

A subset selection problem

Selecting \(k\) out of \(n\) items can be a computationally difficult task. Here we look at a special case and see how we can exploit the properties of the data. Suppose we have \(n\) data points \(\color{darkblue}p_i\). We want to select \(k\) points such that the average of these \(k\) points is as close to a target value \(\color{darkblue}t\) as possible [1]. I.e., we want \[\min_{\color{darkred}S} \left| \frac{\sum_{s \in \color{darkred}S}\color{darkblue}p_s}{\color{darkblue}k} - \color{darkblue}t \right| \] where \(|\color{darkred}S|=\color{darkblue}k\). The special property of the data is that it contains just a few distinct values. Let's say our data looks like:


----     21 PARAMETER p  data

i1   1,    i2   3,    i3   2,    i4   1,    i5   1,    i6   1,    i7   2,    i8   3
i9   1,    i10  2,    i11  3,    i12  2,    i13  3,    i14  3,    i15  1,    i16  2
i17  1,    i18  1,    i19  3,    i20  2,    i21  2,    i22  2,    i23  1,    i24  1
i25  2,    i26  3,    i27  1,    i28  2,    i29  3,    i30  1,    i31  1,    i32  2
i33  1,    i34  3,    i35  1,    i36  1,    i37  2,    i38  3,    i39  2,    i40  2
i41  2,    i42  1,    i43  1,    i44  1,    i45  2,    i46  1,    i47  2,    i48  2
i49  3,    i50  1,    i51  2,    i52  3,    i53  2,    i54  1,    i55  1,    i56  1
i57  2,    i58  2,    i59  1,    i60  3,    i61  1,    i62  1,    i63  2,    i64  3
i65  1,    i66  1,    i67  2,    i68  2,    i69  2,    i70  2,    i71  1,    i72  1
i73  1,    i74  3,    i75  2,    i76  3,    i77  1,    i78  1,    i79  3,    i80  1
i81  1,    i82  1,    i83  1,    i84  2,    i85  1,    i86  1,    i87  1,    i88  1
i89  1,    i90  3,    i91  3,    i92  2,    i93  2,    i94  3,    i95  2,    i96  3
i97  1,    i98  3,    i99  1,    i100 2


----     21 PARAMETER k                    =           50  number to select
            PARAMETER t                    =      2.12345  target for mean of selected values


In this post, I will start with a generic model using \(n=100\) binary variables. A second model will exploit the fact that our data contains just a few unique values: it needs only 3 integer variables. Both models select exactly \(k\) points.

Things become more difficult when we relax the problem to: allow up to \(\mathit{kmax}=50\) points to be selected. The obvious approach is to use a loop: try each \(k\) and pick the best solution. Alternatively, we can form a single, non-linear model. With some effort, that model can be linearized.
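Before building any model, it helps to see what the objective computes. For very small instances we can solve the problem by brute force; here is a sketch in Python on a tiny, made-up data set (not the 100-point set used below, which is far too large to enumerate):

```python
# Brute-force check of the subset-selection objective on a tiny,
# hypothetical instance (n=8, k=3). The real instance (n=100, k=50)
# has C(100,50) subsets, which is why we build a MIP instead.
from itertools import combinations

p = [1, 3, 2, 1, 1, 1, 2, 3]   # hypothetical small data set
k, t = 3, 2.12345

# pick the k-subset whose mean is closest to the target t
best = min(combinations(range(len(p)), k),
           key=lambda S: abs(sum(p[i] for i in S) / k - t))
obj = abs(sum(p[i] for i in best) / k - t)
# the best subset sums to 6, the attainable sum closest to k*t = 6.37
```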


Model 1: binary variables


First, let's look at a generic model. Here we don't exploit the properties of the data. We introduce binary variables: \[\color{darkred}\delta_i = \begin{cases} 1 & \text{if point $i$ is selected} \\ 0 & \text{otherwise}\end{cases}\] The absolute value is easily modeled through variable splitting:


Generic Model
\[ \begin{align} \min\> & \color{darkred}d^+ + \color{darkred}d^- \\ & \color{darkred}d^+ - \color{darkred}d^- = \sum_i \frac{\color{darkred}\delta_i \cdot \color{darkblue} p_i}{\color{darkblue} k} - \color{darkblue}t \\ & \sum_i \color{darkred}\delta_i = \color{darkblue}k \\ & \color{darkred}\delta_i \in \{0,1\} \\ & \color{darkred}d^+,\color{darkred}d^- \ge 0 \end{align} \]
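We can convince ourselves of the variable-splitting trick without a solver: for a fixed value of the right-hand side \(x\), the minimization drives at least one of \(\color{darkred}d^+,\color{darkred}d^-\) to zero, so the optimal objective equals \(|x|\). A small Python check:

```python
# Variable splitting:  min d+ + d-   s.t.  d+ - d- = x,  d+, d- >= 0.
# At the optimum one of the two variables is zero, so the objective
# value equals |x|.
def split_optimum(x):
    d_plus, d_minus = max(x, 0.0), max(-x, 0.0)   # optimal split
    assert abs((d_plus - d_minus) - x) < 1e-12    # feasibility check
    return d_plus + d_minus                       # objective value

for x in (-1.5, -0.00345, 0.0, 0.25):
    assert abs(split_optimum(x) - abs(x)) < 1e-12
```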

The output is:

 
----     62 VARIABLE delta.L  selection of points

i2  1,    i5  1,    i6  1,    i7  1,    i8  1,    i9  1
i11 1,    i12 1,    i13 1,    i14 1,    i15 1,    i16 1
i17 1,    i18 1,    i19 1,    i22 1,    i26 1,    i27 1
i28 1,    i29 1,    i30 1,    i31 1,    i33 1,    i34 1
i35 1,    i36 1,    i38 1,    i41 1,    i49 1,    i52 1
i55 1,    i56 1,    i58 1,    i60 1,    i61 1,    i62 1
i64 1,    i67 1,    i68 1,    i74 1,    i76 1,    i79 1
i84 1,    i90 1,    i91 1,    i92 1,    i93 1,    i94 1
i96 1,    i98 1


----     62 PARAMETER value  selected values

i2  3,    i5  1,    i6  1,    i7  2,    i8  3,    i9  1
i11 3,    i12 2,    i13 3,    i14 3,    i15 1,    i16 2
i17 1,    i18 1,    i19 3,    i22 2,    i26 3,    i27 1
i28 2,    i29 3,    i30 1,    i31 1,    i33 1,    i34 3
i35 1,    i36 1,    i38 3,    i41 2,    i49 3,    i52 3
i55 1,    i56 1,    i58 2,    i60 3,    i61 1,    i62 1
i64 3,    i67 2,    i68 2,    i74 3,    i76 3,    i79 3
i84 2,    i90 3,    i91 3,    i92 2,    i93 2,    i94 3
i96 3,    i98 3


----     62 PARAMETER vcount  count of each value

1 16,    2 12,    3 22


----     62 VARIABLE dev.L  deviation

- 0.003450


----     62 VARIABLE z.L                   =     0.003450  objective

This model has \(n=100\) binary variables.

Note that we can scale the objective by a constant value, and write: \[\color{darkred}d^+ - \color{darkred}d^- = \sum_i \color{darkred}\delta_i \cdot \color{darkblue} p_i - \color{darkblue}t\cdot\color{darkblue} k\] As we will see later on, this simple scaling is no longer valid once \(\color{darkblue}k\) becomes a variable.


Model 2: Exploit special properties of the data


Instead of using binary variables, we can keep track of the number of occurrences of each unique value \(1,2,3\) among the selected points. So we define the variables: \[\color{darkred}n_j \in \{0,\dots,\color{darkblue}N_j\}\] indicating how many points with value \(j\) we select. This reduces the model to just three integer variables.

The model looks like:

Specialized Model
\[ \begin{align} \min\> & \color{darkred}d^+ + \color{darkred}d^- \\ & \color{darkred}d^+ - \color{darkred}d^- = \sum_j \frac{\color{darkred}n_j \cdot \color{darkblue} j}{\color{darkblue} k} - \color{darkblue}t \\ & \sum_j \color{darkred}n_j = \color{darkblue}k \\ & \color{darkred}n_j \in \{0,1,\dots,\color{darkblue}N_j\} \\ & \color{darkred}d^+,\color{darkred}d^- \ge 0 \end{align} \]

Here \(\color{darkblue}N_j\) is the number of data points with \(\color{darkblue}p_i = j\). For our data set these look like:

 
----     70 PARAMETER tab  tabulation

1 45,    2 33,    3 22

A solution is:

----     88 VARIABLE num.L  number selected from bin j

1  6,    2 32,    3 12


----     88 VARIABLE dev.L  deviation

- 0.003450


----     88 VARIABLE z.L                   =     0.003450  objective

We note that this is a different solution from the one found by model 1, but with the same optimal objective function value.
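Because the specialized model has just three integer variables, we can verify it by complete enumeration. A Python sketch, using the tabulation \(N=(45,33,22)\) shown above:

```python
# Complete enumeration of the specialized model: pick n_j copies of
# unique value j (j = 1,2,3), with n_j <= N_j and n1+n2+n3 = k,
# minimizing |sum_j n_j*j / k - t|.
N = {1: 45, 2: 33, 3: 22}   # tabulation of the 100-point data set
k, t = 50, 2.12345

best = None
for n1 in range(min(N[1], k) + 1):
    for n2 in range(min(N[2], k - n1) + 1):
        n3 = k - n1 - n2
        if 0 <= n3 <= N[3]:
            obj = abs((n1 + 2 * n2 + 3 * n3) / k - t)
            if best is None or obj < best[0]:
                best = (obj, (n1, n2, n3))

obj, (n1, n2, n3) = best
# obj is 0.00345: the attainable sum closest to k*t = 106.17 is 106
```

The enumeration confirms the MIP result: an optimal selection sums to 106, giving a deviation of \(|106/50 - 2.12345| = 0.00345\).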


Extension: select up to \(kmax\) points


A simple extension of the problem: instead of selecting exactly \(k\) points, select up to \(\mathit{kmax}\) points. This can only reduce (or leave unchanged) the optimal objective. The simplest approach is to loop over \(k=1,\dots,\mathit{kmax}\), solve each model, and pick the best solution. This means solving 50 MIP models. When I do this, the results are:


----    119 **** Loop

----    119 PARAMETER traceObj  trace objective values

i1  0.123450,    i2  0.123450,    i3  0.123450,    i4  0.123450
i5  0.076550,    i6  0.043217,    i7  0.019407,    i8  0.001550
i9  0.012339,    i10 0.023450,    i11 0.032541,    i12 0.040117
i13 0.030396,    i14 0.019407,    i15 0.009883,    i16 0.001550
i17 0.005803,    i18 0.012339,    i19 0.018187,    i20 0.023450
i21 0.019407,    i22 0.012914,    i23 0.006985,    i24 0.001550
i25 0.003450,    i26 0.008065,    i27 0.012339,    i28 0.016307
i29 0.014481,    i30 0.009883,    i31 0.005582,    i32 0.001550
i33 0.002238,    i34 0.005803,    i35 0.009164,    i36 0.012339
i37 0.011685,    i38 0.008129,    i39 0.004755,    i40 0.001550
i41 0.001499,    i42 0.004402,    i43 0.007171,    i44 0.009814
i45 0.009883,    i46 0.006985,    i47 0.004210,    i48 0.001550
i49 0.001001,    i50 0.003450


----    119 PARAMETER bestObj              =     0.001001  best objective

----    119 PARAMETER bestNum  best selection

1  5,    2 33,    3 11


We see the objective goes up and down for different values of \(k\). The best result uses \(k=49\) points, with an objective value of 0.001001.
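The loop is cheap to emulate outside GAMS as well: wrap the enumeration from model 2 in a loop over \(k\). A Python sketch:

```python
# Loop over k = 1..kmax, solving the small enumeration model for each
# k, and keep the best k. Mirrors the GAMS loop in the appendix.
N = {1: 45, 2: 33, 3: 22}   # counts of each unique value
kmax, t = 50, 2.12345

def best_obj(k):
    """Smallest |mean - t| attainable when selecting exactly k points."""
    best = float('inf')
    for n1 in range(min(N[1], k) + 1):
        for n2 in range(min(N[2], k - n1) + 1):
            n3 = k - n1 - n2
            if 0 <= n3 <= N[3]:
                best = min(best, abs((n1 + 2 * n2 + 3 * n3) / k - t))
    return best

best_k = min(range(1, kmax + 1), key=best_obj)
# best_k is 49, with objective |104/49 - t| = 0.001001
```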


The obvious question is: Can we do this in one model instead?

The model for this could look like:


Incorrect Extension Model
\[ \begin{align} \min\> & \color{darkred}d^+ + \color{darkred}d^- \\ & \color{darkred}d^+ - \color{darkred}d^- = \sum_j \color{darkred}n_j \cdot \color{darkblue} j - \color{darkred}k \cdot \color{darkblue}t \\ &\color{darkred}k = \sum_j \color{darkred}n_j \\ & \color{darkred}n_j \in \{0,1,\dots,\color{darkblue}N_j\} \\ & \color{darkred}d^+,\color{darkred}d^- \ge 0 \\ & \color{darkred}k \in \{1,\dots,\color{darkblue}{\mathit{kmax}}\} \end{align} \]

Here \(\color{darkred}{k}\) is now a variable, so we prefer not to divide by it: that would create an MINLP model. It looks like we can multiply the objective by \(\color{darkred}k\) to make the problem linear again. But we have changed the problem: deviations are now weighted by \(\color{darkred}k\), so we prefer smaller values of \(\color{darkred}k\) over larger ones. Multiplying by \(\color{darkred}k\) this way is only allowed when \(k\) is a constant.

We need to be a bit more careful. In model 1, we measure the deviation as: \[\color{darkred}\Delta = \sum_i \frac{\color{darkblue}p_i\cdot \color{darkred}\delta_i}{\color{darkred}k}-\color{darkblue}t\] We can rewrite this as: \[\color{darkred}\Delta\cdot \color{darkred}k = \sum_i \color{darkblue}p_i\cdot \color{darkred}\delta_i -\color{darkblue}t\cdot\color{darkred}k\] or, using \(\color{darkred}k=\sum_i \color{darkred}\delta_i\), \[\color{darkred}\Delta\cdot \sum_i \color{darkred}\delta_i = \sum_i \color{darkblue}p_i\cdot \color{darkred}\delta_i -\color{darkblue}t\cdot\color{darkred}k\] The left-hand side is a sum of products of a continuous variable and a binary variable. This we know how to linearize. A slightly simpler linearization applies when the continuous variable is non-negative. We can get non-negative variables by splitting \(\color{darkred}\Delta\) and writing the left-hand side as \[\sum_i \color{darkred}\delta_i\cdot \color{darkred}d^+ - \sum_i \color{darkred}\delta_i\cdot \color{darkred}d^-\] Linearizing the product of a binary variable and a continuous, non-negative variable can be accomplished as follows:  \[\begin{aligned} & \color{darkred}y=\color{darkred}x\cdot\color{darkred}\delta \\ &\color{darkred}x,\color{darkred}y \in [0,\color{darkblue}U] \\ & \color{darkred}\delta \in \{0,1\} \end{aligned} \Longleftrightarrow \begin{aligned} & \color{darkred}y \le  \color{darkred}\delta\cdot\color{darkblue}U \\ &  \color{darkred}y \le \color{darkred}x \\ &    \color{darkred}y \ge   \color{darkred}x - (1-\color{darkred}\delta) \cdot   \color{darkblue}U\\ & \color{darkred}x,\color{darkred}y \in [0,\color{darkblue}U] \\ & \color{darkred}\delta \in \{0,1\}\end{aligned}  \] When we assemble all this, we have:
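It is easy to verify numerically that these three inequalities pin \(\color{darkred}y\) to exactly \(\color{darkred}x\cdot\color{darkred}\delta\): with \(\color{darkred}\delta=1\) they collapse to \(x \le y \le x\), and with \(\color{darkred}\delta=0\) to \(0 \le y \le 0\). A quick Python check (with an arbitrary bound \(U\)):

```python
# Big-M linearization of y = x*delta for x, y in [0, U], delta binary:
#   y <= delta*U,  y <= x,  y >= x - (1-delta)*U,  y in [0, U]
# For each (x, delta) the feasible interval for y degenerates to the
# single point x*delta.
U = 100.0   # arbitrary bound for this check

def y_interval(x, delta):
    lo = max(0.0, x - (1 - delta) * U)   # y >= 0, y >= x - (1-delta)*U
    hi = min(U, delta * U, x)            # y <= U, y <= delta*U, y <= x
    return lo, hi

for delta in (0, 1):
    for x in (0.0, 3.7, 42.0, U):
        lo, hi = y_interval(x, delta)
        assert lo == hi == x * delta     # interval collapses to x*delta
```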

Linearized Extension Model
\[ \begin{align}\min\> & \color{darkred}d^+ + \color{darkred}d^- \\ & \sum_i \left(\color{darkred}y_i^+ - \color{darkred}y_i^- \right) = \sum_i \color{darkred}\delta_i\cdot \color{darkblue}p_i - \color{darkred}k \cdot \color{darkblue}t \\ & \color{darkred}y_i^{pm} \le \color{darkred}\delta_i\cdot\color{darkblue}U && \forall i, {pm}\in\{+,-\} \\ & \color{darkred}y_i^{pm} \le \color{darkred}d^{pm} && \forall i, {pm}\in\{+,-\} \\ &\color{darkred}y_i^{pm} \ge \color{darkred}d^{pm} -(1-\color{darkred}\delta_i)\cdot \color{darkblue}U &&\forall i, {pm}\in\{+,-\} \\ & \color{darkred}k = \sum_i \color{darkred}\delta_i \\ & \color{darkred} k \in \{1,\dots,\color{darkblue}{\mathit{kmax}}\} \\ & \color{darkred}y_i^{pm} \in [0,\color{darkblue}U] \\ & \color{darkred}\delta_i \in \{0,1\} \end{align}\]

Before we can run this model we need to establish a value for \(\color{darkblue}U\). I used the safe value: \[ \color{darkblue}U := \color{darkblue}{\mathit{kmax}}\cdot \left(\max_i \color{darkblue}p_i-\min_i \color{darkblue}p_i\right)\]

Note that in this model we explicitly need to forbid \(\color{darkred}k=0\): with \(\color{darkred}k=0\) all \(\color{darkred}\delta_i=0\), so all \(\color{darkred}y_i^{pm}=0\) and the deviation constraint degenerates to \(0=0\), giving a meaningless objective of zero.

The solution is:

----    162 VARIABLE kvar.L                =           49  k as variable

----    162 PARAMETER vcount  count of each value

1  5,    2 33,    3 11


----    162 VARIABLE z.L                   =     0.001001  objective


This is the same solution as we found using the loop approach.


Conclusions


The first two models deal with the "select exactly \(k\) points" case. The specialized model 2 exploits the property that the data contains just a few unique values. This can make the model easier to solve than the generic model 1, especially with less sophisticated solvers. High-end solvers handle binary variables very well: for those I found no real gain. For less sophisticated solvers, however, this trick can help a lot.

The problem "allow up to \(kmax\) points" turned out to be way more challenging to formulate in a linear fashion than I anticipated. But with some effort, it can be done. This is an example where a small change in the model specification has a lot of consequences.

References


Appendix: GAMS Model

*------------------------------------------------------------
* data
*------------------------------------------------------------

set
   i  'data points'    /i1*i100/
   j  'unique values'  /1,2,3/
   pm                  /'+','-'/
;

*
* data drawn from set j
*
parameter p(i) 'data';
p(i) = uniformint(1,card(j));
p(i) = sum(j$(ord(j)=p(i)), j.val);
option p:0:0:8;

scalars
   k 'number to select'                   /50/
   t 'target for mean of selected values' /2.12345/
;
option k:0, t:5;

display p,k,t;


*------------------------------------------------------------
* model 1
* binary variables delta(i)
*------------------------------------------------------------

variables
   z        'objective'
   delta(i) 'selection of points'
   dev(pm)  'deviation'
;
binary variable delta;
positive variable dev;

Equations
   obj       'objective'
   deviation 'deviation of mean from target'
   count     'number of selected points'
;

obj..       z =e= sum(pm, dev(pm));
deviation.. dev('+') - dev('-') =e= sum(i, delta(i)*p(i))/k - t;
count..     sum(i, delta(i)) =e= k;

model m1 /obj,deviation,count/;
solve m1 minimizing z using mip;

parameter value(i) 'selected values';
value(i) = p(i)*delta.l(i);

parameter vcount(j) 'count of each value';
vcount(j) = sum(i$(value(i)=j.val), 1);

option delta:0:0:6, value:0:0:6, vcount:0;
option dev:6,z:6;
display delta.l,value,vcount,dev.l,z.l;

*------------------------------------------------------------
* model 2
* integer variable num(j)
*------------------------------------------------------------


parameter tab(j) 'tabulation';
tab(j) = sum(i$(p(i)=j.val), 1);
option tab:0;
display tab;

integer variable num(j) 'number selected from bin j';


equation idev, icount;
idev..   dev('+') - dev('-') =e= sum(j, num(j)*j.val)/k - t;
icount.. sum(j, num(j)) =e= k;

num.up(j) = tab(j);

model m2 /obj,idev,icount/;
solve m2 minimizing z using mip;

option num:0;
display num.l,dev.l,z.l;

*------------------------------------------------------------
* model 3
* best k=1...Kmax using a loop
*------------------------------------------------------------

scalar kmax;
kmax = k;

* fast solve loop
option solprint=silent, solvelink=5;

parameter
   bestObj  'best objective' /INF/
   bestNum  'best selection'
   traceObj 'trace objective values'
;

for(k = 1 to kmax,
   solve m2 minimizing z using mip;
   traceObj(i)$(ord(i)=k) = z.l;
   if (z.l < bestObj,
      bestObj = z.l;
      bestNum(j) = num.l(j);
   );
);

option bestObj:6,bestNum:0,traceObj:6:0:4;
display "**** Loop",traceObj,bestObj,bestNum;

option solprint = on;
*------------------------------------------------------------
* model 4
* best k=1...Kmax using single model
*------------------------------------------------------------

Scalar U 'maximum deviation';
U = kmax*(smax(i,p(i)) - smin(i,p(i)));

positive variable
   y(i,pm) 'dev(pm)*delta(i)'
;

integer variable
   kvar 'k as variable'
;
kvar.lo = 1;
kvar.up = kmax;

Equation
   LinDev 'linearized version'
   Linearization1(i,pm)
   Linearization2(i,pm)
   Linearization3(i,pm)
   kdef
;

LinDev.. sum(i, y(i,'+') - y(i,'-')) =e= sum(i, delta(i)*p(i)) - t*kvar;

Linearization1(i,pm).. y(i,pm) =l= delta(i)*U;
Linearization2(i,pm).. y(i,pm) =l= dev(pm);
Linearization3(i,pm).. y(i,pm) =g= dev(pm) - (1-delta(i))*U;

kdef.. sum(i, delta(i)) =e= kvar;

model m4 /obj,lindev,linearization1,linearization2,linearization3,kdef/;
solve m4 minimizing z using mip;

value(i) = p(i)*delta.l(i);
vcount(j) = sum(i$(value(i)=j.val), 1);

option kvar:0;
display kvar.l, vcount, z.l;

 
