Monday, August 17, 2020

A variant of the max-sum-subarray problem

In [1] a variant of the max-sum-subarray problem is posed: 

Suppose you have two arrays each containing the same amount of "1"s and "-1"s. Additionally, suppose each "1" in the first array has a corresponding or sibling "1" in the second array. Same for each "-1". The task is to find the optimal subarrays, one in the first array and one in the second, such that their combined sum is the largest (maximal) with the added constraint that an element in one subarray only counts towards the sum if the other subarray contains its sibling.


Example


In [1] the following example is given:


An optimal solution is: select elements 3 through 10 in the first array and elements 5 through 10 in the second array, with a total objective of 8.


Reviewing the max-sum-subarray problem


Let's first review the max-sum-subarray problem [2]. We have data \(a_i\) with both positive and negative numbers. For example:


----      5 SET i  number of elements

i1 ,    i2 ,    i3 ,    i4 ,    i5 ,    i6 ,    i7 ,    i8 ,    i9 ,    i10,    i11,    i12,    i13,    i14,    i15
i16,    i17,    i18,    i19,    i20,    i21,    i22,    i23,    i24,    i25,    i26,    i27,    i28,    i29,    i30
i31,    i32,    i33,    i34,    i35,    i36,    i37,    i38,    i39,    i40,    i41,    i42,    i43,    i44,    i45
i46,    i47,    i48,    i49,    i50


----      5 PARAMETER a  data

i1  -5.280,    i2   8.640,    i3   2.006,    i4  -0.865,    i5  -8.367,    i6   1.227,    i7   2.775,    i8   0.607
i9   5.950,    i10  6.512,    i11  4.469,    i12 -5.984,    i13 -5.207,    i14  0.889,    i15  2.652,    i16  6.418
i17 -5.499,    i18 -9.400,    i19 -6.543,    i20 -6.948,    i21 -1.842,    i22 -6.746,    i23  5.136,    i24  8.559
i25 -6.108,    i26  9.186,    i27 -5.854,    i28  6.325,    i29  6.854,    i30 -8.525,    i31 -8.762,    i32  2.467
i33 -3.522,    i34  3.210,    i35 -2.523,    i36 -2.771,    i37  1.839,    i38 -1.234,    i39  2.501,    i40  4.430
i41  3.220,    i42 -9.646,    i43  6.979,    i44 -4.781,    i45 -6.631,    i46  9.047,    i47  0.211,    i48  6.068
i49 -7.734,    i50  7.059


A standard approach is to introduce two parallel binary variables: \[x_i = \begin{cases} 1 & \text{if data item $a_i$ is selected to be included in the subarray} \\  0 &\text{otherwise}\end{cases}\] and \[\mathit{start}_i = \begin{cases} 1 & \text{when we switch from $x_{i-1}=0$ to $x_{i}=1$} \\  0 &\text{otherwise}\end{cases}\] The integer programming model can look like: 


max-sum-subarray MIP formulation
\[\begin{align}\max\>& \color{darkred} z = \sum_i \color{darkblue}a_i \cdot \color{darkred}x_i \\ & \color{darkred}{\mathit{start}}_i \ge \color{darkred}x_i - \color{darkred}x_{i-1} \\ & \sum_i \color{darkred}{\mathit{start}}_i \le 1 \\ & \color{darkred}x_i \in \{0,1\} \\ & \color{darkred}{\mathit{start}}_i \in \{0,1\} \end{align}\]

We ignored here the else part in the definition of \(\mathit{start}\) and just implement the implication: \[x_{i-1}=0\>{\bf and}\>x_i=1 \Rightarrow \mathit{start}_i=1\] Note that the missing else part leaves variables \(\mathit{start}_i\) unrestricted when we do not switch from 0 to 1. As we only have an upper limit on the number of starts this is no problem. Remarkably, if we want, we can relax the variable \(\mathit{start}_i \) to be continuous between 0 and 1.

When we apply the model on our data set, we see the following results:


----     28 VARIABLE x.L  select a number

i23 1.000,    i24 1.000,    i25 1.000,    i26 1.000,    i27 1.000,    i28 1.000,    i29 1.000


----     28 VARIABLE start.L  start of subarray

i23 1.000


----     28 VARIABLE z.L                   =       24.099  objective

This shows that our optimal sub-array consists of elements 23 through 29.

Original problem


In the original problem, we have two data arrays \(a_i\) and \(b_i\).  In addition, we have a mapping from \(a\) to \(b\). We can encode the original data set as:


----     12 SET i  number of elements

i1 ,    i2 ,    i3 ,    i4 ,    i5 ,    i6 ,    i7 ,    i8 ,    i9 ,    i10


----     12 SET k  array

a,    b


----     12 PARAMETER d  data arrays

           i1          i2          i3          i4          i5          i6          i7          i8          i9

a       1.000      -1.000       1.000       1.000      -1.000      -1.000       1.000      -1.000       1.000
b      -1.000       1.000      -1.000      -1.000       1.000       1.000      -1.000       1.000       1.000

+         i10

a       1.000
b       1.000


----     12 SET map  mapping between a and b

             i1          i2          i3          i4          i5          i6          i7          i8          i9

i1                      YES
i2          YES
i3                                                          YES
i4                                                                      YES
i5                                  YES
i6                                                                                  YES
i7                                                                                                          YES
i8                                              YES
i9                                                                                              YES

  +         i10

i10         YES


The data is organized so we can use the "start" formulation for both arrays in one swoop: \[\begin{align} & \mathit{start}_{k,i} \ge x_{k,i} - x_{k,i-1} && \forall k,i\\ &\sum_i \mathit{start}_{k,i} \le 1 && \forall k \\ & x_{k,i}, \mathit{start}_{k,i} \in \{0,1\} \end{align}\] Adding an index to a data-structure is a trick that can help preventing duplicating equations.

The objective is a bit complicated as only entries are counted when both siblings are selected. So, we can write: \[\sum_{i,j|\mathit{map}(i,j)} (d_{a,i}+d_{b,j})\cdot x_{a,i} \cdot x_{b,j} \] This is a quadratic objective. Some solvers can handle this directly. Note that we don't sum over all \((i,j)\) combinations but only over \((i,j)|\mathit{map}_{i,j}\), which has just 10 elements.

The complete model looks like: 

MIQP formulation
\[\begin{align}\max\>& \color{darkred} z = \sum_{i,j|\color{darkblue}{\mathit{map}}(i,j)} (\color{darkblue}d_{\color{darkgreen}a,i}+\color{darkblue}d_{\color{darkgreen}b,j}) \cdot \color{darkred}x_{\color{darkgreen}a,i}\cdot \color{darkred}x_{\color{darkgreen}b,j} \\ & \color{darkred}{\mathit{start}}_{k,i} \ge \color{darkred}x_{k,i} - \color{darkred}x_{k,i-1} && \forall k,i \\ & \sum_i \color{darkred}{\mathit{start}}_{k,i} \le 1 && \forall k\\ & \color{darkred}x_{k,i} \in \{0,1\} \\ & \color{darkred}{\mathit{start}}_{k,i} \in \{0,1\} \end{align}\]


When we run this, we may get the following solution:


----     39 VARIABLE x.L  select a number

           i1          i2          i3          i4          i5          i6          i7          i8          i9

a       1.000       1.000       1.000       1.000       1.000       1.000       1.000       1.000       1.000
b                                                       1.000       1.000       1.000       1.000       1.000

+         i10

a       1.000
b       1.000


----     39 VARIABLE start.L  start of subarray

           i1          i5

a       1.000
b                   1.000


----     39 VARIABLE z.L                   =        8.000  objective


This is a slightly different solution but with the same objective. One way to get the same solution as described in [1] is to add a second objective: minimize the number of selected numbers (i.e. a sparser solution).  We can do this with a weighted objective: \[\max \sum_{i,j|\mathit{map}(i,j)} (d_{a,i}+d_{b,j})\cdot x_{a,i} \cdot x_{b,j} - \lambda \sum_{k,i} x_{k,i} \] for a small value for \(\lambda\), e.g. \(\lambda = 0.001\). When we run this, we see: 


----     39 VARIABLE x.L  select a number

           i3          i4          i5          i6          i7          i8          i9         i10

a       1.000       1.000       1.000       1.000       1.000       1.000       1.000       1.000
b                               1.000       1.000       1.000       1.000       1.000       1.000


----     39 VARIABLE start.L  start of subarray

           i3          i5

a       1.000
b                   1.000


----     39 VARIABLE z.L                   =        7.986  objective


This corresponds to the solution reported by [1].

Maybe a better approach is to use a lexicographic model. In this case, we don't need a weight \(\lambda\), but instead, solve two problems:

  1. Solve the problem with objective  \[\max\> z_1 = \sum_{i,j|\mathit{map}(i,j)} (d_{a,i}+d_{b,j})\cdot x_{a,i} \cdot x_{b,j} \] 
  2. Record the objective function value of the solution \(z^*_1\).
  3. Add the constraint \[ z^*_1 = \sum_{i,j|\mathit{map}(i,j)} (d_{a,i}+d_{b,j})\cdot x_{a,i} \cdot x_{b,j} \] 
  4. Solve the problem with objective  \[\min\>z_2 = \sum_{k,i} x_{k,i}\]
A disadvantage is that we now have a quadratic constraint in the second problem.


It is not very difficult to linearize this problem. Notice that we only have \(n\) products \(x_{a,i} \cdot x_{b,j}\) as we need only to consider combinations \((i,j)|\mathit{map}_{i,j}\). We can write:


Linear formulation
\[\begin{align}\max\>& \color{darkred} z = \sum_{i,j|\color{darkblue}{\mathit{map}}(i,j)} (\color{darkblue}d_{\color{darkgreen}a,i}+\color{darkblue}d_{\color{darkgreen}b,j}) \cdot \color{darkred}y_{i,j} \\ & \color{darkred}{\mathit{start}}_{k,i} \ge \color{darkred}x_{k,i} - \color{darkred}x_{k,i-1} && \forall k,i \\ & \sum_i \color{darkred}{\mathit{start}}_{k,i} \le 1 && \forall k \\ & \color{darkred}y_{i,j} \le \color{darkred}x_{\color{darkgreen}a,i} && \forall i,j|\color{darkblue}{\mathit{map}}(i,j)\\& \color{darkred}y_{i,j} \le \color{darkred}x_{\color{darkgreen}b,j} && \forall i,j|\color{darkblue}{\mathit{map}}(i,j) \\& \color{darkred}y_{i,j} \ge \color{darkred}x_{\color{darkgreen}a,i}+ \color{darkred}x_{\color{darkgreen}b,j} -1 && \forall i,j|\color{darkblue}{\mathit{map}}(i,j) \\ & \color{darkred}x_{k,i} \in \{0,1\} \\ & \color{darkred}{\mathit{start}}_{k,i} \in \{0,1\} \\ & \color{darkred}y_{i,j} \in \{0,1\} \end{align}\]

A larger data set


Let's create a larger data set:
  • 50 random numbers for \(a\) and \(b\) (not restricted to -1, 1),
  • For simplicity use an identity mapping: \(a_i\) is mapped to \(b_i\)

----     11 SET i  number of elements

i1 ,    i2 ,    i3 ,    i4 ,    i5 ,    i6 ,    i7 ,    i8 ,    i9 ,    i10,    i11,    i12,    i13,    i14,    i15
i16,    i17,    i18,    i19,    i20,    i21,    i22,    i23,    i24,    i25,    i26,    i27,    i28,    i29,    i30
i31,    i32,    i33,    i34,    i35,    i36,    i37,    i38,    i39,    i40,    i41,    i42,    i43,    i44,    i45
i46,    i47,    i48,    i49,    i50


----     11 SET k  array

a,    b


----     11 PARAMETER d  data arrays

           i1          i2          i3          i4          i5          i6          i7          i8          i9

a      -6.565       6.865       1.008      -3.977      -4.156      -5.519      -3.003       7.125      -8.658
b       3.222       5.116       2.549      -4.323      -8.272      -7.950       2.825       0.906      -9.370

+         i10         i11         i12         i13         i14         i15         i16         i17         i18

a       0.004       9.962       1.575       9.823       5.245      -7.386       2.794      -6.810      -4.998
b       5.847      -8.545      -6.487       0.513       5.004      -6.438      -9.317       1.703       2.425

+         i19         i20         i21         i22         i23         i24         i25         i26         i27

a       3.379      -1.293      -2.806      -2.971      -7.370      -6.998       1.782       6.618      -5.384
b      -2.213      -2.826      -5.139      -5.072      -7.390       8.669      -2.401       5.668      -3.999

+         i28         i29         i30         i31         i32         i33         i34         i35         i36

a       3.315       5.517      -3.927      -7.790       0.048      -6.797       7.449      -4.698      -4.284
b      -7.490       4.977      -8.615      -5.960      -9.899      -4.608      -0.003      -6.974      -6.517

+         i37         i38         i39         i40         i41         i42         i43         i44         i45

a       1.879       4.454       2.565      -0.724      -1.734      -7.646      -3.716      -9.069      -3.229
b      -3.387      -3.662      -3.558       9.280       9.872      -2.602      -2.542       5.440      -2.066

+         i46         i47         i48         i49         i50

a      -6.358       2.915       1.215       5.399      -4.044
b       8.262      -7.608       4.710      -8.892       1.526


----     11 SET map  mapping between a and b

             i1          i2          i3          i4          i5          i6          i7          i8          i9

i1          YES
i2                      YES
i3                                  YES
i4                                              YES
i5                                                          YES
i6                                                                      YES
i7                                                                                  YES
i8                                                                                              YES
i9                                                                                                          YES

  +         i10         i11         i12         i13         i14         i15         i16         i17         i18

i10         YES
i11                     YES
i12                                 YES
i13                                             YES
i14                                                         YES
i15                                                                     YES
i16                                                                                 YES
i17                                                                                             YES
i18                                                                                                         YES

  +         i19         i20         i21         i22         i23         i24         i25         i26         i27

i19         YES
i20                     YES
i21                                 YES
i22                                             YES
i23                                                         YES
i24                                                                     YES
i25                                                                                 YES
i26                                                                                             YES
i27                                                                                                         YES

  +         i28         i29         i30         i31         i32         i33         i34         i35         i36

i28         YES
i29                     YES
i30                                 YES
i31                                             YES
i32                                                         YES
i33                                                                     YES
i34                                                                                 YES
i35                                                                                             YES
i36                                                                                                         YES

  +         i37         i38         i39         i40         i41         i42         i43         i44         i45

i37         YES
i38                     YES
i39                                 YES
i40                                             YES
i41                                                         YES
i42                                                                     YES
i43                                                                                 YES
i44                                                                                             YES
i45                                                                                                         YES

  +         i46         i47         i48         i49         i50

i46         YES
i47                     YES
i48                                 YES
i49                                             YES
i50                                                         YES

The solution is found quickly using a good solver. E.g. Cplex needs 0 nodes to solve this to optimality: it finds the optimal solution during preprocessing.


----     44 VARIABLE x.L  select a number

          i10         i11         i12         i13         i14         i15         i16

a       1.000       1.000       1.000       1.000       1.000       1.000       1.000
b       1.000       1.000       1.000       1.000       1.000


----     44 VARIABLE start.L  start of subarray

          i10

a       1.000
b       1.000


----     44 VARIABLE y.L  x('a',i)*x('b',j) for map(i,j)

            i10         i11         i12         i13         i14

i10       1.000
i11                   1.000
i12                               1.000
i13                                           1.000
i14                                                       1.000


----     44 VARIABLE z.L                   =       22.941  objective


We see that \(a\) has more selected numbers than \(b\), The extra ones have no value, but do not incur a cost either. If we use a lexicographic approach, we can find the sparsest optimal solution:


----     55 VARIABLE x.L  select a number

          i10         i11         i12         i13         i14

a       1.000       1.000       1.000       1.000       1.000
b       1.000       1.000       1.000       1.000       1.000


----     55 VARIABLE start.L  start of subarray

          i10

a       1.000
b       1.000


----     55 VARIABLE y.L  x('a',i)*x('b',j) for map(i,j)

            i10         i11         i12         i13         i14

i10       1.000
i11                   1.000
i12                               1.000
i13                                           1.000
i14                                                       1.000


----     55 VARIABLE z.L                   =       22.941  objective
            VARIABLE z2.L                  =       10.000  second level objective: number of elements selected


With the linearized model, we do not generate a quadratic constraint when using the lexicographic algorithm. This removes a complication. The implementation is shown below.

GAMS model


The GAMS implementation of the last model (lexicographic, linearized) can look like:


set
  i
'number of elements' /i1*i50/
  k
'array' /a, b/
;
alias(i,j);

parameter d(k, i) 'data arrays';
d(k,i) = uniform(-10,10);

set map(i,j) 'mapping between a and b';
map(i,i) =
yes;

display i,k,d,map;

binary variables
   x(k,i)
'select a number'
   start(k,i)
'start of subarray'
   y(i,j)
"x('a',i)*x('b',j) for map(i,j)"
;

variable z 'objective: total value';

equations
  obj  
'level 1 obj: total value'
  mul1(i,j) 
'linearization'
  mul2(i,j) 
'linearization'
  mul3(i,j) 
'linearization'
  startbnd(k,i)
'bound on start variable'
  sumstart(k)  
'allow only one start'
;

obj.. z =e=
sum(map(i,j), (d('a',i)+d('b',j))*y(i,j));
mul1(map(i,j)).. y(i,j) =l= x(
'a',i);
mul2(map(i,j)).. y(i,j) =l= x(
'b',j);
mul3(map(i,j)).. y(i,j) =g= x(
'a',i)+x('b',j)-1;
startbnd(k,i).. start(k,i) =g= x(k,i)-x(k,i-1);
sumstart(k)..
sum(i, start(k,i)) =l= 1;

model m /all/;
option optcr=0;
solve m maximizing z using mip;

display x.l, start.l, y.l, z.l;

* fix z to its optimal value
z.fx = z.l;

equation obj2 'second level obj: sparsity';
variable z2 'second level objective: number of elements selected';

obj2.. z2 =e=
sum((k,i), x(k,i));

model m2 /all/;
solve m2 minimizing z2 using mip;
display x.l, start.l, y.l, z.l, z2.l;



More optimization


There is one more optimization we can use.  If the objective coefficient \((d_{a,i}+d_{b,j})\) is positive the model will push \(y_{i,j}\) upwards, so we can drop the constraints \(y_{i,j} \ge x_{a,i}+ x_{b,j} -1\). If on the other hand, \((d_{a,i}+d_{b,j})\) is negative the model will push \(y_{i,j}\) downwards to zero. That means we can drop the constraints: \(y_{i,j} \le x_{b,j}\) and \(y_{i,j} \le x_{b,j}\),  Implementing this can look like: \[\begin{align}& y_{i,j} \le x_{a,i} && \forall i,j|\mathit{map}(i,j), d_{a,i}+d_{b,j} \gt 0\\& y_{i,j} \le x_{b,j} && \forall i,j|\mathit{map}(i,j), d_{a,i}+d_{b,j} \gt 0 \\& y_{i,j} \ge x_{a,i}+ x_{b,j} -1 && \forall i,j|\mathit{map}(i,j), d_{a,i}+d_{b,j} \lt 0\end{align}\] For the special case when \(d_{a,i}+d_{b,j}=0\) we can remove the corresponding \(y_{i,j}\) from the model. For very large data sets this can make a difference.

A second thing to look at is the type of variables \(y\) and \(\mathit{start}\). Instead of being a binary variable, we can relax them to be continuous between 0 and 1.

I did some limited tests, and it looks that we are not really helping Cplex with these tricks. For other solvers, things may be different.  
 

References


2 comments:

  1. Nice post. One correction: in the description of the implication, replace $a$ with $x$.

    ReplyDelete