Monday, August 17, 2020

A variant of the max-sum-subarray problem

In [1] a variant of the max-sum-subarray problem is posed: 

Suppose you have two arrays each containing the same amount of "1"s and "-1"s. Additionally, suppose each "1" in the first array has a corresponding or sibling "1" in the second array. Same for each "-1". The task is to find the optimal subarrays, one in the first array and one in the second, such that their combined sum is the largest (maximal) with the added constraint that an element in one subarray only counts towards the sum if the other subarray contains its sibling.


Example


In [1] the following example is given:


An optimal solution is: select elements 3 through 10 in the first array and elements 5 through 10 in the second array, with a total objective of 8.


Reviewing the max-sum-subarray problem


Let's first review the max-sum-subarray problem [2]. We have data aiai with both positive and negative numbers. For example:


----      5 SET i  number of elements

i1 ,    i2 ,    i3 ,    i4 ,    i5 ,    i6 ,    i7 ,    i8 ,    i9 ,    i10,    i11,    i12,    i13,    i14,    i15
i16,    i17,    i18,    i19,    i20,    i21,    i22,    i23,    i24,    i25,    i26,    i27,    i28,    i29,    i30
i31,    i32,    i33,    i34,    i35,    i36,    i37,    i38,    i39,    i40,    i41,    i42,    i43,    i44,    i45
i46,    i47,    i48,    i49,    i50


----      5 PARAMETER a  data

i1  -5.280,    i2   8.640,    i3   2.006,    i4  -0.865,    i5  -8.367,    i6   1.227,    i7   2.775,    i8   0.607
i9   5.950,    i10  6.512,    i11  4.469,    i12 -5.984,    i13 -5.207,    i14  0.889,    i15  2.652,    i16  6.418
i17 -5.499,    i18 -9.400,    i19 -6.543,    i20 -6.948,    i21 -1.842,    i22 -6.746,    i23  5.136,    i24  8.559
i25 -6.108,    i26  9.186,    i27 -5.854,    i28  6.325,    i29  6.854,    i30 -8.525,    i31 -8.762,    i32  2.467
i33 -3.522,    i34  3.210,    i35 -2.523,    i36 -2.771,    i37  1.839,    i38 -1.234,    i39  2.501,    i40  4.430
i41  3.220,    i42 -9.646,    i43  6.979,    i44 -4.781,    i45 -6.631,    i46  9.047,    i47  0.211,    i48  6.068
i49 -7.734,    i50  7.059


A standard approach is to introduce two parallel binary variables: xi={1if data item ai is selected to be included in the subarray0otherwise and starti={1when we switch from xi1=0 to xi=10otherwise The integer programming model can look like: 


max-sum-subarray MIP formulation
maxz=iaixistartixixi1istarti1xi{0,1}starti{0,1}

We ignored here the else part in the definition of start and just implement the implication: xi1=0andxi=1starti=1 Note that the missing else part leaves variables starti unrestricted when we do not switch from 0 to 1. As we only have an upper limit on the number of starts this is no problem. Remarkably, if we want, we can relax the variable starti to be continuous between 0 and 1.

When we apply the model on our data set, we see the following results:


----     28 VARIABLE x.L  select a number

i23 1.000,    i24 1.000,    i25 1.000,    i26 1.000,    i27 1.000,    i28 1.000,    i29 1.000


----     28 VARIABLE start.L  start of subarray

i23 1.000


----     28 VARIABLE z.L                   =       24.099  objective

This shows that our optimal sub-array consists of elements 23 through 29.

Original problem


In the original problem, we have two data arrays ai and bi.  In addition, we have a mapping from a to b. We can encode the original data set as:


----     12 SET i  number of elements

i1 ,    i2 ,    i3 ,    i4 ,    i5 ,    i6 ,    i7 ,    i8 ,    i9 ,    i10


----     12 SET k  array

a,    b


----     12 PARAMETER d  data arrays

           i1          i2          i3          i4          i5          i6          i7          i8          i9

a       1.000      -1.000       1.000       1.000      -1.000      -1.000       1.000      -1.000       1.000
b      -1.000       1.000      -1.000      -1.000       1.000       1.000      -1.000       1.000       1.000

+         i10

a       1.000
b       1.000


----     12 SET map  mapping between a and b

             i1          i2          i3          i4          i5          i6          i7          i8          i9

i1                      YES
i2          YES
i3                                                          YES
i4                                                                      YES
i5                                  YES
i6                                                                                  YES
i7                                                                                                          YES
i8                                              YES
i9                                                                                              YES

  +         i10

i10         YES


The data is organized so we can use the "start" formulation for both arrays in one swoop: startk,ixk,ixk,i1k,iistartk,i1kxk,i,startk,i{0,1} Adding an index to a data-structure is a trick that can help preventing duplicating equations.

The objective is a bit complicated as only entries are counted when both siblings are selected. So, we can write: i,j|map(i,j)(da,i+db,j)xa,ixb,j This is a quadratic objective. Some solvers can handle this directly. Note that we don't sum over all (i,j) combinations but only over (i,j)|mapi,j, which has just 10 elements.

The complete model looks like: 

MIQP formulation
maxz=i,j|map(i,j)(da,i+db,j)xa,ixb,jstartk,ixk,ixk,i1k,iistartk,i1kxk,i{0,1}startk,i{0,1}


When we run this, we may get the following solution:


----     39 VARIABLE x.L  select a number

           i1          i2          i3          i4          i5          i6          i7          i8          i9

a       1.000       1.000       1.000       1.000       1.000       1.000       1.000       1.000       1.000
b                                                       1.000       1.000       1.000       1.000       1.000

+         i10

a       1.000
b       1.000


----     39 VARIABLE start.L  start of subarray

           i1          i5

a       1.000
b                   1.000


----     39 VARIABLE z.L                   =        8.000  objective


This is a slightly different solution but with the same objective. One way to get the same solution as described in [1] is to add a second objective: minimize the number of selected numbers (i.e. a sparser solution).  We can do this with a weighted objective: maxi,j|map(i,j)(da,i+db,j)xa,ixb,jλk,ixk,i for a small value for λ, e.g. λ=0.001. When we run this, we see: 


----     39 VARIABLE x.L  select a number

           i3          i4          i5          i6          i7          i8          i9         i10

a       1.000       1.000       1.000       1.000       1.000       1.000       1.000       1.000
b                               1.000       1.000       1.000       1.000       1.000       1.000


----     39 VARIABLE start.L  start of subarray

           i3          i5

a       1.000
b                   1.000


----     39 VARIABLE z.L                   =        7.986  objective


This corresponds to the solution reported by [1].

Maybe a better approach is to use a lexicographic model. In this case, we don't need a weight λ, but instead, solve two problems:

  1. Solve the problem with objective  maxz1=i,j|map(i,j)(da,i+db,j)xa,ixb,j 
  2. Record the objective function value of the solution z1.
  3. Add the constraint z1=i,j|map(i,j)(da,i+db,j)xa,ixb,j 
  4. Solve the problem with objective  minz2=k,ixk,i
A disadvantage is that we now have a quadratic constraint in the second problem.


It is not very difficult to linearize this problem. Notice that we only have n products xa,ixb,j as we need only to consider combinations (i,j)|mapi,j. We can write:


Linear formulation
maxz=i,j|map(i,j)(da,i+db,j)yi,jstartk,ixk,ixk,i1k,iistartk,i1kyi,jxa,ii,j|map(i,j)yi,jxb,ji,j|map(i,j)yi,jxa,i+xb,j1i,j|map(i,j)xk,i{0,1}startk,i{0,1}yi,j{0,1}

A larger data set


Let's create a larger data set:
  • 50 random numbers for a and b (not restricted to -1, 1),
  • For simplicity use an identity mapping: ai is mapped to bi

----     11 SET i  number of elements

i1 ,    i2 ,    i3 ,    i4 ,    i5 ,    i6 ,    i7 ,    i8 ,    i9 ,    i10,    i11,    i12,    i13,    i14,    i15
i16,    i17,    i18,    i19,    i20,    i21,    i22,    i23,    i24,    i25,    i26,    i27,    i28,    i29,    i30
i31,    i32,    i33,    i34,    i35,    i36,    i37,    i38,    i39,    i40,    i41,    i42,    i43,    i44,    i45
i46,    i47,    i48,    i49,    i50


----     11 SET k  array

a,    b


----     11 PARAMETER d  data arrays

           i1          i2          i3          i4          i5          i6          i7          i8          i9

a      -6.565       6.865       1.008      -3.977      -4.156      -5.519      -3.003       7.125      -8.658
b       3.222       5.116       2.549      -4.323      -8.272      -7.950       2.825       0.906      -9.370

+         i10         i11         i12         i13         i14         i15         i16         i17         i18

a       0.004       9.962       1.575       9.823       5.245      -7.386       2.794      -6.810      -4.998
b       5.847      -8.545      -6.487       0.513       5.004      -6.438      -9.317       1.703       2.425

+         i19         i20         i21         i22         i23         i24         i25         i26         i27

a       3.379      -1.293      -2.806      -2.971      -7.370      -6.998       1.782       6.618      -5.384
b      -2.213      -2.826      -5.139      -5.072      -7.390       8.669      -2.401       5.668      -3.999

+         i28         i29         i30         i31         i32         i33         i34         i35         i36

a       3.315       5.517      -3.927      -7.790       0.048      -6.797       7.449      -4.698      -4.284
b      -7.490       4.977      -8.615      -5.960      -9.899      -4.608      -0.003      -6.974      -6.517

+         i37         i38         i39         i40         i41         i42         i43         i44         i45

a       1.879       4.454       2.565      -0.724      -1.734      -7.646      -3.716      -9.069      -3.229
b      -3.387      -3.662      -3.558       9.280       9.872      -2.602      -2.542       5.440      -2.066

+         i46         i47         i48         i49         i50

a      -6.358       2.915       1.215       5.399      -4.044
b       8.262      -7.608       4.710      -8.892       1.526


----     11 SET map  mapping between a and b

             i1          i2          i3          i4          i5          i6          i7          i8          i9

i1          YES
i2                      YES
i3                                  YES
i4                                              YES
i5                                                          YES
i6                                                                      YES
i7                                                                                  YES
i8                                                                                              YES
i9                                                                                                          YES

  +         i10         i11         i12         i13         i14         i15         i16         i17         i18

i10         YES
i11                     YES
i12                                 YES
i13                                             YES
i14                                                         YES
i15                                                                     YES
i16                                                                                 YES
i17                                                                                             YES
i18                                                                                                         YES

  +         i19         i20         i21         i22         i23         i24         i25         i26         i27

i19         YES
i20                     YES
i21                                 YES
i22                                             YES
i23                                                         YES
i24                                                                     YES
i25                                                                                 YES
i26                                                                                             YES
i27                                                                                                         YES

  +         i28         i29         i30         i31         i32         i33         i34         i35         i36

i28         YES
i29                     YES
i30                                 YES
i31                                             YES
i32                                                         YES
i33                                                                     YES
i34                                                                                 YES
i35                                                                                             YES
i36                                                                                                         YES

  +         i37         i38         i39         i40         i41         i42         i43         i44         i45

i37         YES
i38                     YES
i39                                 YES
i40                                             YES
i41                                                         YES
i42                                                                     YES
i43                                                                                 YES
i44                                                                                             YES
i45                                                                                                         YES

  +         i46         i47         i48         i49         i50

i46         YES
i47                     YES
i48                                 YES
i49                                             YES
i50                                                         YES

The solution is found quickly using a good solver. E.g. Cplex needs 0 nodes to solve this to optimality: it finds the optimal solution during preprocessing.


----     44 VARIABLE x.L  select a number

          i10         i11         i12         i13         i14         i15         i16

a       1.000       1.000       1.000       1.000       1.000       1.000       1.000
b       1.000       1.000       1.000       1.000       1.000


----     44 VARIABLE start.L  start of subarray

          i10

a       1.000
b       1.000


----     44 VARIABLE y.L  x('a',i)*x('b',j) for map(i,j)

            i10         i11         i12         i13         i14

i10       1.000
i11                   1.000
i12                               1.000
i13                                           1.000
i14                                                       1.000


----     44 VARIABLE z.L                   =       22.941  objective


We see that a has more selected numbers than b, The extra ones have no value, but do not incur a cost either. If we use a lexicographic approach, we can find the sparsest optimal solution:


----     55 VARIABLE x.L  select a number

          i10         i11         i12         i13         i14

a       1.000       1.000       1.000       1.000       1.000
b       1.000       1.000       1.000       1.000       1.000


----     55 VARIABLE start.L  start of subarray

          i10

a       1.000
b       1.000


----     55 VARIABLE y.L  x('a',i)*x('b',j) for map(i,j)

            i10         i11         i12         i13         i14

i10       1.000
i11                   1.000
i12                               1.000
i13                                           1.000
i14                                                       1.000


----     55 VARIABLE z.L                   =       22.941  objective
            VARIABLE z2.L                  =       10.000  second level objective: number of elements selected


With the linearized model, we do not generate a quadratic constraint when using the lexicographic algorithm. This removes a complication. The implementation is shown below.

GAMS model


The GAMS implementation of the last model (lexicographic, linearized) can look like:


set
  i
'number of elements' /i1*i50/
  k
'array' /a, b/
;
alias(i,j);

parameter d(k, i) 'data arrays';
d(k,i) = uniform(-10,10);

set map(i,j) 'mapping between a and b';
map(i,i) =
yes;

display i,k,d,map;

binary variables
   x(k,i)
'select a number'
   start(k,i)
'start of subarray'
   y(i,j)
"x('a',i)*x('b',j) for map(i,j)"
;

variable z 'objective: total value';

equations
  obj  
'level 1 obj: total value'
  mul1(i,j) 
'linearization'
  mul2(i,j) 
'linearization'
  mul3(i,j) 
'linearization'
  startbnd(k,i)
'bound on start variable'
  sumstart(k)  
'allow only one start'
;

obj.. z =e=
sum(map(i,j), (d('a',i)+d('b',j))*y(i,j));
mul1(map(i,j)).. y(i,j) =l= x(
'a',i);
mul2(map(i,j)).. y(i,j) =l= x(
'b',j);
mul3(map(i,j)).. y(i,j) =g= x(
'a',i)+x('b',j)-1;
startbnd(k,i).. start(k,i) =g= x(k,i)-x(k,i-1);
sumstart(k)..
sum(i, start(k,i)) =l= 1;

model m /all/;
option optcr=0;
solve m maximizing z using mip;

display x.l, start.l, y.l, z.l;

* fix z to its optimal value
z.fx = z.l;

equation obj2 'second level obj: sparsity';
variable z2 'second level objective: number of elements selected';

obj2.. z2 =e=
sum((k,i), x(k,i));

model m2 /all/;
solve m2 minimizing z2 using mip;
display x.l, start.l, y.l, z.l, z2.l;



More optimization


There is one more optimization we can use.  If the objective coefficient (da,i+db,j) is positive the model will push yi,j upwards, so we can drop the constraints yi,jxa,i+xb,j1. If on the other hand, (da,i+db,j) is negative the model will push yi,j downwards to zero. That means we can drop the constraints: yi,jxb,j and yi,jxb,j,  Implementing this can look like: yi,jxa,ii,j|map(i,j),da,i+db,j>0yi,jxb,ji,j|map(i,j),da,i+db,j>0yi,jxa,i+xb,j1i,j|map(i,j),da,i+db,j<0 For the special case when da,i+db,j=0 we can remove the corresponding yi,j from the model. For very large data sets this can make a difference.

A second thing to look at is the type of variables y and start. Instead of being a binary variable, we can relax them to be continuous between 0 and 1.

I did some limited tests, and it looks that we are not really helping Cplex with these tricks. For other solvers, things may be different.  
 

References


2 comments:

  1. Nice post. One correction: in the description of the implication, replace $a$ with $x$.

    ReplyDelete