## Sunday, January 31, 2010

### Solving an optimization problem: selecting an "optimal" subset

This problem is described in http://comments.gmane.org/gmane.comp.lang.r.general/177553. An obvious MIQP formulation in GAMS could look like:

\$ontext

Given vector of numbers x, I wish to select an n-subset with sum closest to
fixed value s. Can anyone advise me how to approach this, in R?

I have considered Rcplex package, which handles integer/binary
linear/quadratic optimization problems, but have difficulty setting up the
quadratic form for [sum(x) - s]^2.

(Dynamic programming over [0, sum(x)]? A genetic algorithm? Can anyone
contribute a binary GA optimization sample?)

--------------------------------------------------------------------------

I need to pick 2-40 elements out of a 50-200-element-long vector.

\$offtext

set i /i1*i200/;
scalars
k
/40/
target
/3.1415926535/
;

parameter p(i);
p(i) = uniform(-100,100);

binary variable x(i);
variable z,s;

equation e1,e2,obj;

e1.. s =e=
sum(i, p(i)*x(i));
e2..
sum(i,x(i)) =e= k;
obj.. z =e= sqr(s-target);

model m /all/;
option optcr=0,miqcp=cplex;
solve m minimizing z using miqcp;

option x:1;
display p,x.l;

scalar value,sqdeviation,num;
value =
sum(i, p(i)*x.l(i));
sqdeviation = sqr(value  - target);
num =
sum(i, x.l(i));
display target,value,sqdeviation,num;

The results look like:
 ----     54 PARAMETER p  i1   -65.651,    i2    68.653,    i3    10.075,    i4   -39.772,    i5   -41.558,    i6   -55.189,    i7   -30.034 i8    71.254,    i9   -86.577,    i10    0.042,    i11   99.624,    i12   15.747,    i13   98.227,    i14   52.450 i15  -73.862,    i16   27.944,    i17  -68.096,    i18  -49.984,    i19   33.786,    i20  -12.929,    i21  -28.060 i22  -29.712,    i23  -73.702,    i24  -69.980,    i25   17.823,    i26   66.179,    i27  -53.837,    i28   33.147 i29   55.172,    i30  -39.268,    i31  -77.902,    i32    0.477,    i33  -67.965,    i34   74.492,    i35  -46.977 i36  -42.837,    i37   18.791,    i38   44.544,    i39   25.650,    i40   -7.240,    i41  -17.339,    i42  -76.461 i43  -37.158,    i44  -90.690,    i45  -32.290,    i46  -63.580,    i47   29.145,    i48   12.149,    i49   53.992 i50  -40.439,    i51   32.221,    i52   51.164,    i53   25.489,    i54  -43.227,    i55  -82.715,    i56  -79.497 i57   28.250,    i58    9.062,    i59  -93.695,    i60   58.472,    i61  -85.447,    i62  -64.868,    i63    5.127 i64   50.042,    i65  -64.375,    i66  -93.172,    i67   17.026,    i68   24.246,    i69  -22.128,    i70  -28.257 i71  -51.393,    i72  -50.716,    i73  -73.899,    i74   86.690,    i75  -24.012,    i76   56.680,    i77  -39.993 i78  -74.903,    i79   49.775,    i80  -86.154,    i81  -59.597,    i82  -98.987,    i83  -46.077,    i84   -0.030 i85  -69.743,    i86  -65.166,    i87  -33.872,    i88  -36.619,    i89  -35.583,    i90   92.795,    i91   98.720 i92  -26.019,    i93  -25.422,    i94   54.396,    i95  -20.663,    i96   82.619,    i97  -76.084,    i98   47.096 i99  -88.916,    i100  15.260,    i101 -89.719,    i102 -98.798,    i103 -19.754,    i104   3.976,    i105  25.775 i106 -54.850,    i107 -20.776,    i108 -44.799,    i109 -69.525,    i110  87.265,    i111 -15.468,    i112 -73.067 i113 -22.789,    i114 -25.073,    i115 -46.304,    i116  89.674,    i117 -62.212,    i118 -40.498,    i119 -85.089 i120 -19.731,    i121 -79.662,    i122 -23.222,    i123 -35.181,    i124 -61.573,    i125 -77.526,    i126  19.312 i127   2.290,    i128 -90.987,    i129  56.620,    i130  89.150,    i131  19.293,    i132  21.468,    i133 -27.498 i134  18.814,    i135  35.971,    i136   1.318,    i137 -68.149,    i138  31.378,    i139   4.776,    i140 -75.121 i141  97.344,    i142 -54.375,    i143  35.131,    i144  55.355,    i145  86.490,    i146 -59.752,    i147 -40.573 i148 -60.554,    i149 -50.731,    i150  29.295,    i151  46.995,    i152 -82.913,    i153 -69.930,    i154 -13.162 i155 -62.612,    i156  38.539,    i157  52.595,    i158 -69.039,    i159 -22.124,    i160  39.086,    i161  69.162 i162  22.544,    i163  95.194,    i164 -94.622,    i165 -62.510,    i166 -82.576,    i167   8.080,    i168 -74.627 i169  46.800,    i170 -77.354,    i171  -2.329,    i172  59.120,    i173  -1.591,    i174   6.712,    i175 -97.875 i176   8.774,    i177  -9.774,    i178  95.066,    i179 -63.231,    i180 -67.294,    i181 -95.073,    i182 -64.435 i183 -87.736,    i184 -96.671,    i185  67.131,    i186  20.332,    i187 -94.597,    i188 -60.781,    i189  90.142 i190 -32.892,    i191  18.852,    i192 -48.162,    i193  28.127,    i194 -68.950,    i195  -7.997,    i196 -21.332 i197  61.092,    i198   8.198,    i199 -21.856,    i200  11.564 ----     54 VARIABLE x.L  i1           1.0,    i8           1.0,    i11          1.0,    i13          1.0,    i31          1.0 i34          1.0,    i41          1.0,    i44          1.0,    i59          1.0,    i66          1.0 i74          1.0,    i75          1.0,    i82          1.0,    i90          1.0,    i91          1.0 i96          1.0,    i101         1.0,    i102         1.0,    i107         1.0,    i110         1.0 i111         1.0,    i116         1.0,    i121         1.0,    i125         1.0,    i128         1.0 i130         1.0,    i141         1.0,    i144         1.0,    i157 4.625855E-6,    i161         1.0 i163         1.0,    i164         1.0,    i170         1.0,    i173         1.0,    i178         1.0 i184         1.0,    i187         1.0,    i192         1.0,    i193         1.0,    i196         1.0 i197         1.0 ----     60 PARAMETER target               =        3.142              PARAMETER value                =        3.142              PARAMETER sqdeviation          =  2.55591E-28              PARAMETER num                  =       40.000

The model solves very quickly: 2 seconds, but that is very much dependent on the data. The results show one binary variable with a small value. The obvious repair in a MIP model would be to set the integer feasibility tolerance (option epint) to zero. This causes the solution time to increase to 11 seconds, but unfortunately does not really solve the problem for this MIQP problem:

 ----     55 VARIABLE x.L  i1           1.0,    i8           1.0,    i11          1.0,    i13          1.0,    i14  -7.57698E-7 i31          1.0,    i34          1.0,    i44          1.0,    i59          1.0,    i66          1.0 i74          1.0,    i82          1.0,    i90          1.0,    i91          1.0,    i95          1.0 i96          1.0,    i101         1.0,    i102         1.0,    i103         1.0,    i110         1.0 i116         1.0,    i121         1.0,    i122         1.0,    i125         1.0,    i128         1.0 i130         1.0,    i141         1.0,    i144         1.0,    i154         1.0,    i159         1.0 i161         1.0,    i163         1.0,    i164         1.0,    i170         1.0,    i173         1.0 i178         1.0,    i184         1.0,    i187         1.0,    i192         1.0,    i193         1.0 i197         1.0

We still have a tolerance issue. The log indicates there is some issue with scaling:

 Warning: integer solution contains unscaled infeasibilities. MIQP status(115): integer optimal with unscaled infeasibilities

I tried to disable scaling by scaind = -1, but that gave the same message. It looks like we have some numerical issues here (although the data looks quite benign).

This model can actually be solved as a MIP. The quadratic objective can be replaced by a linear one by minimizing the absolute deviation. The optimal solution should be the same for both objectives (apart from the issue of existence of multiple optimal solutions). The linear model can look like:

\$ontext

Linear version

\$offtext

*
* data
*
set i /i1*i200/;
scalars
k
/40/
target
/3.1415926535/
;

parameter p(i);
p(i) = uniform(-100,100);

*
* model
*
binary variable x(i);
variable z,s,d1,d2;
positive variable d1,d2;

equation e1,e2,e3,obj;

e1.. s =e=
sum(i, p(i)*x(i));
e2..
sum(i,x(i)) =e= k;
e3.. d1-d2 =e= s-target;
obj.. z =e= d1+d2;

model m /all/;
option optcr=0, mip=cplex;
solve m minimizing z using mip;

*
* do some reporting
*

option x:1;
display p,x.l;

scalar value,sqdeviation,num;
value =
sum(i, p(i)*x.l(i));
sqdeviation = sqr(value  - target);
num =
sum(i, x.l(i));
display target,value,sqdeviation,num;

The result are now:

 ----     42 VARIABLE x.L  i11  1.0,    i19  1.0,    i63  1.0,    i92  1.0,    i101 1.0,    i102 1.0,    i103 1.0,    i104 1.0,    i105 1.0 i106 1.0,    i107 1.0,    i108 1.0,    i110 1.0,    i111 1.0,    i112 1.0,    i113 1.0,    i120 1.0,    i122 1.0 i123 1.0,    i126 1.0,    i127 1.0,    i129 1.0,    i130 1.0,    i131 1.0,    i132 1.0,    i133 1.0,    i134 1.0 i135 1.0,    i138 1.0,    i140 1.0,    i141 1.0,    i143 1.0,    i144 1.0,    i145 1.0,    i146 1.0,    i147 1.0 i148 1.0,    i149 1.0,    i150 1.0,    i167 1.0 ----     48 PARAMETER target               =        3.142              PARAMETER value                =        3.142              PARAMETER sqdeviation          =  1.09841E-26              PARAMETER num                  =       40.000

This solves faster and seems numerically a little bit more stable.