Sunday, January 31, 2010

Solving an optimization problem: selecting an "optimal" subset

This problem is described in http://comments.gmane.org/gmane.comp.lang.r.general/177553. An obvious MIQP formulation in GAMS could look like:

$ontext

Given vector of numbers x, I wish to select an n-subset with sum closest to
fixed value s. Can anyone advise me how to approach this, in R?

I have considered Rcplex package, which handles integer/binary
linear/quadratic optimization problems, but have difficulty setting up the
quadratic form for [sum(x) - s]^2.

(Dynamic programming over [0, sum(x)]? A genetic algorithm? Can anyone
contribute a binary GA optimization sample?)

--------------------------------------------------------------------------

I need to pick 2-40 elements out of a 50-200-element-long vector.


$offtext

set i /i1*i200/;
scalars
   k
/40/
   target
/3.1415926535/
;

parameter p(i);
p(i) = uniform(-100,100);

binary variable x(i);
variable z,s;

equation e1,e2,obj;

e1.. s =e=
sum(i, p(i)*x(i));
e2..
sum(i,x(i)) =e= k;
obj.. z =e= sqr(s-target);

model m /all/;
option optcr=0,miqcp=cplex;
solve m minimizing z using miqcp;

option x:1;
display p,x.l;

scalar value,sqdeviation,num;
value =
sum(i, p(i)*x.l(i));
sqdeviation = sqr(value  - target);
num =
sum(i, x.l(i));
display target,value,sqdeviation,num;


The results look like:

----     54 PARAMETER p 

i1   -65.651,    i2    68.653,    i3    10.075,    i4   -39.772,    i5   -41.558,    i6   -55.189,    i7   -30.034
i8    71.254,    i9   -86.577,    i10    0.042,    i11   99.624,    i12   15.747,    i13   98.227,    i14   52.450
i15  -73.862,    i16   27.944,    i17  -68.096,    i18  -49.984,    i19   33.786,    i20  -12.929,    i21  -28.060
i22  -29.712,    i23  -73.702,    i24  -69.980,    i25   17.823,    i26   66.179,    i27  -53.837,    i28   33.147
i29   55.172,    i30  -39.268,    i31  -77.902,    i32    0.477,    i33  -67.965,    i34   74.492,    i35  -46.977
i36  -42.837,    i37   18.791,    i38   44.544,    i39   25.650,    i40   -7.240,    i41  -17.339,    i42  -76.461
i43  -37.158,    i44  -90.690,    i45  -32.290,    i46  -63.580,    i47   29.145,    i48   12.149,    i49   53.992
i50  -40.439,    i51   32.221,    i52   51.164,    i53   25.489,    i54  -43.227,    i55  -82.715,    i56  -79.497
i57   28.250,    i58    9.062,    i59  -93.695,    i60   58.472,    i61  -85.447,    i62  -64.868,    i63    5.127
i64   50.042,    i65  -64.375,    i66  -93.172,    i67   17.026,    i68   24.246,    i69  -22.128,    i70  -28.257
i71  -51.393,    i72  -50.716,    i73  -73.899,    i74   86.690,    i75  -24.012,    i76   56.680,    i77  -39.993
i78  -74.903,    i79   49.775,    i80  -86.154,    i81  -59.597,    i82  -98.987,    i83  -46.077,    i84   -0.030
i85  -69.743,    i86  -65.166,    i87  -33.872,    i88  -36.619,    i89  -35.583,    i90   92.795,    i91   98.720
i92  -26.019,    i93  -25.422,    i94   54.396,    i95  -20.663,    i96   82.619,    i97  -76.084,    i98   47.096
i99  -88.916,    i100  15.260,    i101 -89.719,    i102 -98.798,    i103 -19.754,    i104   3.976,    i105  25.775
i106 -54.850,    i107 -20.776,    i108 -44.799,    i109 -69.525,    i110  87.265,    i111 -15.468,    i112 -73.067
i113 -22.789,    i114 -25.073,    i115 -46.304,    i116  89.674,    i117 -62.212,    i118 -40.498,    i119 -85.089
i120 -19.731,    i121 -79.662,    i122 -23.222,    i123 -35.181,    i124 -61.573,    i125 -77.526,    i126  19.312
i127   2.290,    i128 -90.987,    i129  56.620,    i130  89.150,    i131  19.293,    i132  21.468,    i133 -27.498
i134  18.814,    i135  35.971,    i136   1.318,    i137 -68.149,    i138  31.378,    i139   4.776,    i140 -75.121
i141  97.344,    i142 -54.375,    i143  35.131,    i144  55.355,    i145  86.490,    i146 -59.752,    i147 -40.573
i148 -60.554,    i149 -50.731,    i150  29.295,    i151  46.995,    i152 -82.913,    i153 -69.930,    i154 -13.162
i155 -62.612,    i156  38.539,    i157  52.595,    i158 -69.039,    i159 -22.124,    i160  39.086,    i161  69.162
i162  22.544,    i163  95.194,    i164 -94.622,    i165 -62.510,    i166 -82.576,    i167   8.080,    i168 -74.627
i169  46.800,    i170 -77.354,    i171  -2.329,    i172  59.120,    i173  -1.591,    i174   6.712,    i175 -97.875
i176   8.774,    i177  -9.774,    i178  95.066,    i179 -63.231,    i180 -67.294,    i181 -95.073,    i182 -64.435
i183 -87.736,    i184 -96.671,    i185  67.131,    i186  20.332,    i187 -94.597,    i188 -60.781,    i189  90.142
i190 -32.892,    i191  18.852,    i192 -48.162,    i193  28.127,    i194 -68.950,    i195  -7.997,    i196 -21.332
i197  61.092,    i198   8.198,    i199 -21.856,    i200  11.564

----     54 VARIABLE x.L 

i1           1.0,    i8           1.0,    i11          1.0,    i13          1.0,    i31          1.0
i34          1.0,    i41          1.0,    i44          1.0,    i59          1.0,    i66          1.0
i74          1.0,    i75          1.0,    i82          1.0,    i90          1.0,    i91          1.0
i96          1.0,    i101         1.0,    i102         1.0,    i107         1.0,    i110         1.0
i111         1.0,    i116         1.0,    i121         1.0,    i125         1.0,    i128         1.0
i130         1.0,    i141         1.0,    i144         1.0,    i157 4.625855E-6,    i161         1.0
i163         1.0,    i164         1.0,    i170         1.0,    i173         1.0,    i178         1.0
i184         1.0,    i187         1.0,    i192         1.0,    i193         1.0,    i196         1.0
i197         1.0

----     60 PARAMETER target               =        3.142 
            PARAMETER value                =        3.142 
            PARAMETER sqdeviation          =  2.55591E-28 
            PARAMETER num                  =       40.000 

The model solves very quickly: 2 seconds, but that is very much dependent on the data. The results show one binary variable with a small value. The obvious repair in a MIP model would be to set the integer feasibility tolerance (option epint) to zero. This causes the solution time to increase to 11 seconds, but unfortunately does not really solve the problem for this MIQP problem:

----     55 VARIABLE x.L 

i1           1.0,    i8           1.0,    i11          1.0,    i13          1.0,    i14  -7.57698E-7
i31          1.0,    i34          1.0,    i44          1.0,    i59          1.0,    i66          1.0
i74          1.0,    i82          1.0,    i90          1.0,    i91          1.0,    i95          1.0
i96          1.0,    i101         1.0,    i102         1.0,    i103         1.0,    i110         1.0
i116         1.0,    i121         1.0,    i122         1.0,    i125         1.0,    i128         1.0
i130         1.0,    i141         1.0,    i144         1.0,    i154         1.0,    i159         1.0
i161         1.0,    i163         1.0,    i164         1.0,    i170         1.0,    i173         1.0
i178         1.0,    i184         1.0,    i187         1.0,    i192         1.0,    i193         1.0
i197         1.0

We still have a tolerance issue. The log indicates there is some issue with scaling:

Warning: integer solution contains unscaled infeasibilities.
MIQP status(115): integer optimal with unscaled infeasibilities

I tried to disable scaling by scaind = -1, but that gave the same message. It looks like we have some numerical issues here (although the data looks quite benign).

This model can actually be solved as a MIP. The quadratic objective can be replaced by a linear one by minimizing the absolute deviation. The optimal solution should be the same for both objectives (apart from the issue of existence of multiple optimal solutions). The linear model can look like:

$ontext

 
Linear version

$offtext

*
* data
*
set i /i1*i200/;
scalars
   k
/40/
   target
/3.1415926535/
;

parameter p(i);
p(i) = uniform(-100,100);

*
* model
*
binary variable x(i);
variable z,s,d1,d2;
positive variable d1,d2;

equation e1,e2,e3,obj;

e1.. s =e=
sum(i, p(i)*x(i));
e2..
sum(i,x(i)) =e= k;
e3.. d1-d2 =e= s-target;
obj.. z =e= d1+d2;

model m /all/;
option optcr=0, mip=cplex;
solve m minimizing z using mip;

*
* do some reporting
*

option x:1;
display p,x.l;

scalar value,sqdeviation,num;
value =
sum(i, p(i)*x.l(i));
sqdeviation = sqr(value  - target);
num =
sum(i, x.l(i));
display target,value,sqdeviation,num;

The result are now:

----     42 VARIABLE x.L 

i11  1.0,    i19  1.0,    i63  1.0,    i92  1.0,    i101 1.0,    i102 1.0,    i103 1.0,    i104 1.0,    i105 1.0
i106 1.0,    i107 1.0,    i108 1.0,    i110 1.0,    i111 1.0,    i112 1.0,    i113 1.0,    i120 1.0,    i122 1.0
i123 1.0,    i126 1.0,    i127 1.0,    i129 1.0,    i130 1.0,    i131 1.0,    i132 1.0,    i133 1.0,    i134 1.0
i135 1.0,    i138 1.0,    i140 1.0,    i141 1.0,    i143 1.0,    i144 1.0,    i145 1.0,    i146 1.0,    i147 1.0
i148 1.0,    i149 1.0,    i150 1.0,    i167 1.0

----     48 PARAMETER target               =        3.142 
            PARAMETER value                =        3.142 
            PARAMETER sqdeviation          =  1.09841E-26 
            PARAMETER num                  =       40.000 

This solves faster and seems numerically a little bit more stable.