In [1] a variant of
the max-sum-subarray problem is posed:
Suppose you have two arrays each containing the same amount of "1"s and "-1"s. Additionally, suppose each "1" in the first array has a corresponding or sibling "1" in the second array. Same for each "-1". The task is to find the optimal subarrays, one in the first array and one in the second, such that their combined sum is the largest (maximal) with the added constraint that an element in one subarray only counts towards the sum if the other subarray contains its sibling.
Example
In [1] the following example is given:
An optimal solution is: select elements 3 through 10 in the first array and elements 5 through 10 in the second array, with a total objective of 8.
Reviewing the max-sum-subarray problem
Let's first review the max-sum-subarray problem [2]. We have data \(a_i\) with both positive and negative numbers. For example:
---- 5 SET i number of elements
i1 , i2 , i3 , i4 , i5 , i6 , i7 , i8 , i9 , i10, i11, i12, i13, i14, i15
i16, i17, i18, i19, i20, i21, i22, i23, i24, i25, i26, i27, i28, i29, i30
i31, i32, i33, i34, i35, i36, i37, i38, i39, i40, i41, i42, i43, i44, i45
i46, i47, i48, i49, i50
---- 5 PARAMETER a data
i1 -5.280, i2 8.640, i3 2.006, i4 -0.865, i5 -8.367, i6 1.227, i7 2.775, i8 0.607
i9 5.950, i10 6.512, i11 4.469, i12 -5.984, i13 -5.207, i14 0.889, i15 2.652, i16 6.418
i17 -5.499, i18 -9.400, i19 -6.543, i20 -6.948, i21 -1.842, i22 -6.746, i23 5.136, i24 8.559
i25 -6.108, i26 9.186, i27 -5.854, i28 6.325, i29 6.854, i30 -8.525, i31 -8.762, i32 2.467
i33 -3.522, i34 3.210, i35 -2.523, i36 -2.771, i37 1.839, i38 -1.234, i39 2.501, i40 4.430
i41 3.220, i42 -9.646, i43 6.979, i44 -4.781, i45 -6.631, i46 9.047, i47 0.211, i48 6.068
i49 -7.734, i50 7.059
A standard approach is to introduce two parallel binary variables: \[x_i = \begin{cases} 1 & \text{if data item $a_i$ is selected to be included in the subarray} \\ 0 &\text{otherwise}\end{cases}\] and \[\mathit{start}_i = \begin{cases} 1 & \text{when we switch from $x_{i-1}=0$ to $x_{i}=1$} \\ 0 &\text{otherwise}\end{cases}\] The integer programming model can look like:
max-sum-subarray MIP formulation |
\[\begin{align}\max\>& \color{darkred} z = \sum_i \color{darkblue}a_i \cdot \color{darkred}x_i \\ & \color{darkred}{\mathit{start}}_i \ge \color{darkred}x_i - \color{darkred}x_{i-1} \\ & \sum_i \color{darkred}{\mathit{start}}_i \le 1 \\ & \color{darkred}x_i \in \{0,1\} \\ & \color{darkred}{\mathit{start}}_i \in \{0,1\} \end{align}\] |
We ignored here the
else part in the definition of \(\mathit{start}\) and just implement the implication: \[x_{i-1}=0\>{\bf and}\>x_i=1 \Rightarrow \mathit{start}_i=1\] Note that the missing else part leaves variables \(\mathit{start}_i\) unrestricted when we do not switch from 0 to 1. As we only have an upper limit on the number of starts this is no problem. Remarkably, if we want, we can relax the variable \(\mathit{start}_i \) to be continuous between 0 and 1.
When we apply the model on our data set, we see the following results:
---- 28 VARIABLE x.L select a number
i23 1.000, i24 1.000, i25 1.000, i26 1.000, i27 1.000, i28 1.000, i29 1.000
---- 28 VARIABLE start.L start of subarray
i23 1.000
---- 28 VARIABLE z.L = 24.099 objective
This shows that our optimal sub-array consists of elements 23 through 29.
Original problem
In the original problem, we have two data arrays \(a_i\) and \(b_i\). In addition, we have a mapping from \(a\) to \(b\). We can encode the original data set as:
---- 12 SET i number of elements
i1 , i2 , i3 , i4 , i5 , i6 , i7 , i8 , i9 , i10
---- 12 SET k array
a, b
---- 12 PARAMETER d data arrays
i1 i2 i3 i4 i5 i6 i7 i8 i9
a 1.000 -1.000 1.000 1.000 -1.000 -1.000 1.000 -1.000 1.000
b -1.000 1.000 -1.000 -1.000 1.000 1.000 -1.000 1.000 1.000
+ i10
a 1.000
b 1.000
---- 12 SET map mapping between a and b
i1 i2 i3 i4 i5 i6 i7 i8 i9
i1 YES
i2 YES
i3 YES
i4 YES
i5 YES
i6 YES
i7 YES
i8 YES
i9 YES
+ i10
i10 YES
The data is organized so we can use the "start" formulation for both arrays in one swoop: \[\begin{align} & \mathit{start}_{k,i} \ge x_{k,i} - x_{k,i-1} && \forall k,i\\ &\sum_i \mathit{start}_{k,i} \le 1 && \forall k \\ & x_{k,i}, \mathit{start}_{k,i} \in \{0,1\} \end{align}\] Adding an index to a data-structure is a trick that can help preventing duplicating equations.
The objective is a bit complicated as only entries are counted when both siblings are selected. So, we can write: \[\sum_{i,j|\mathit{map}(i,j)} (d_{a,i}+d_{b,j})\cdot x_{a,i} \cdot x_{b,j} \] This is a quadratic objective. Some solvers can handle this directly. Note that we don't sum over all \((i,j)\) combinations but only over \((i,j)|\mathit{map}_{i,j}\), which has just 10 elements.
The complete model looks like:
MIQP formulation |
\[\begin{align}\max\>& \color{darkred} z = \sum_{i,j|\color{darkblue}{\mathit{map}}(i,j)} (\color{darkblue}d_{\color{darkgreen}a,i}+\color{darkblue}d_{\color{darkgreen}b,j}) \cdot \color{darkred}x_{\color{darkgreen}a,i}\cdot \color{darkred}x_{\color{darkgreen}b,j} \\ & \color{darkred}{\mathit{start}}_{k,i} \ge \color{darkred}x_{k,i} - \color{darkred}x_{k,i-1} && \forall k,i \\ & \sum_i \color{darkred}{\mathit{start}}_{k,i} \le 1 && \forall k\\ & \color{darkred}x_{k,i} \in \{0,1\} \\ & \color{darkred}{\mathit{start}}_{k,i} \in \{0,1\} \end{align}\] |
When we run this, we may get the following solution:
---- 39 VARIABLE x.L select a number
i1 i2 i3 i4 i5 i6 i7 i8 i9
a 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
b 1.000 1.000 1.000 1.000 1.000
+ i10
a 1.000
b 1.000
---- 39 VARIABLE start.L start of subarray
i1 i5
a 1.000
b 1.000
---- 39 VARIABLE z.L = 8.000 objective
This is a slightly different solution but with the same objective. One way to get the same solution as described in [1] is to add a second objective: minimize the number of selected numbers (i.e. a sparser solution). We can do this with a weighted objective: \[\max \sum_{i,j|\mathit{map}(i,j)} (d_{a,i}+d_{b,j})\cdot x_{a,i} \cdot x_{b,j} - \lambda \sum_{k,i} x_{k,i} \] for a small value for \(\lambda\), e.g. \(\lambda = 0.001\). When we run this, we see:
---- 39 VARIABLE x.L select a number
i3 i4 i5 i6 i7 i8 i9 i10
a 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
b 1.000 1.000 1.000 1.000 1.000 1.000
---- 39 VARIABLE start.L start of subarray
i3 i5
a 1.000
b 1.000
---- 39 VARIABLE z.L = 7.986 objective
This corresponds to the solution reported by [1].
Maybe a better approach is to use a lexicographic model. In this case, we don't need a weight \(\lambda\), but instead, solve two problems:
- Solve the problem with objective \[\max\> z_1 = \sum_{i,j|\mathit{map}(i,j)} (d_{a,i}+d_{b,j})\cdot x_{a,i} \cdot x_{b,j} \]
- Record the objective function value of the solution \(z^*_1\).
- Add the constraint \[ z^*_1 = \sum_{i,j|\mathit{map}(i,j)} (d_{a,i}+d_{b,j})\cdot x_{a,i} \cdot x_{b,j} \]
- Solve the problem with objective \[\min\>z_2 = \sum_{k,i} x_{k,i}\]
A disadvantage is that we now have a quadratic constraint in the second problem.
It is not very difficult to linearize this problem. Notice that we only have \(n\) products \(x_{a,i} \cdot x_{b,j}\) as we need only to consider combinations \((i,j)|\mathit{map}_{i,j}\). We can write:
Linear formulation |
\[\begin{align}\max\>& \color{darkred} z = \sum_{i,j|\color{darkblue}{\mathit{map}}(i,j)} (\color{darkblue}d_{\color{darkgreen}a,i}+\color{darkblue}d_{\color{darkgreen}b,j}) \cdot \color{darkred}y_{i,j} \\ & \color{darkred}{\mathit{start}}_{k,i} \ge \color{darkred}x_{k,i} - \color{darkred}x_{k,i-1} && \forall k,i \\ & \sum_i \color{darkred}{\mathit{start}}_{k,i} \le 1 && \forall k \\ & \color{darkred}y_{i,j} \le \color{darkred}x_{\color{darkgreen}a,i} && \forall i,j|\color{darkblue}{\mathit{map}}(i,j)\\& \color{darkred}y_{i,j} \le \color{darkred}x_{\color{darkgreen}b,j} && \forall i,j|\color{darkblue}{\mathit{map}}(i,j) \\& \color{darkred}y_{i,j} \ge \color{darkred}x_{\color{darkgreen}a,i}+ \color{darkred}x_{\color{darkgreen}b,j} -1 && \forall i,j|\color{darkblue}{\mathit{map}}(i,j) \\ & \color{darkred}x_{k,i} \in \{0,1\} \\ & \color{darkred}{\mathit{start}}_{k,i} \in \{0,1\} \\
& \color{darkred}y_{i,j} \in \{0,1\} \end{align}\] |
A larger data set
Let's create a larger data set:
- 50 random numbers for \(a\) and \(b\) (not restricted to -1, 1),
- For simplicity use an identity mapping: \(a_i\) is mapped to \(b_i\)
---- 11 SET i number of elements
i1 , i2 , i3 , i4 , i5 , i6 , i7 , i8 , i9 , i10, i11, i12, i13, i14, i15
i16, i17, i18, i19, i20, i21, i22, i23, i24, i25, i26, i27, i28, i29, i30
i31, i32, i33, i34, i35, i36, i37, i38, i39, i40, i41, i42, i43, i44, i45
i46, i47, i48, i49, i50
---- 11 SET k array
a, b
---- 11 PARAMETER d data arrays
i1 i2 i3 i4 i5 i6 i7 i8 i9
a -6.565 6.865 1.008 -3.977 -4.156 -5.519 -3.003 7.125 -8.658
b 3.222 5.116 2.549 -4.323 -8.272 -7.950 2.825 0.906 -9.370
+ i10 i11 i12 i13 i14 i15 i16 i17 i18
a 0.004 9.962 1.575 9.823 5.245 -7.386 2.794 -6.810 -4.998
b 5.847 -8.545 -6.487 0.513 5.004 -6.438 -9.317 1.703 2.425
+ i19 i20 i21 i22 i23 i24 i25 i26 i27
a 3.379 -1.293 -2.806 -2.971 -7.370 -6.998 1.782 6.618 -5.384
b -2.213 -2.826 -5.139 -5.072 -7.390 8.669 -2.401 5.668 -3.999
+ i28 i29 i30 i31 i32 i33 i34 i35 i36
a 3.315 5.517 -3.927 -7.790 0.048 -6.797 7.449 -4.698 -4.284
b -7.490 4.977 -8.615 -5.960 -9.899 -4.608 -0.003 -6.974 -6.517
+ i37 i38 i39 i40 i41 i42 i43 i44 i45
a 1.879 4.454 2.565 -0.724 -1.734 -7.646 -3.716 -9.069 -3.229
b -3.387 -3.662 -3.558 9.280 9.872 -2.602 -2.542 5.440 -2.066
+ i46 i47 i48 i49 i50
a -6.358 2.915 1.215 5.399 -4.044
b 8.262 -7.608 4.710 -8.892 1.526
---- 11 SET map mapping between a and b
i1 i2 i3 i4 i5 i6 i7 i8 i9
i1 YES
i2 YES
i3 YES
i4 YES
i5 YES
i6 YES
i7 YES
i8 YES
i9 YES
+ i10 i11 i12 i13 i14 i15 i16 i17 i18
i10 YES
i11 YES
i12 YES
i13 YES
i14 YES
i15 YES
i16 YES
i17 YES
i18 YES
+ i19 i20 i21 i22 i23 i24 i25 i26 i27
i19 YES
i20 YES
i21 YES
i22 YES
i23 YES
i24 YES
i25 YES
i26 YES
i27 YES
+ i28 i29 i30 i31 i32 i33 i34 i35 i36
i28 YES
i29 YES
i30 YES
i31 YES
i32 YES
i33 YES
i34 YES
i35 YES
i36 YES
+ i37 i38 i39 i40 i41 i42 i43 i44 i45
i37 YES
i38 YES
i39 YES
i40 YES
i41 YES
i42 YES
i43 YES
i44 YES
i45 YES
+ i46 i47 i48 i49 i50
i46 YES
i47 YES
i48 YES
i49 YES
i50 YES
The solution is found quickly using a good solver. E.g. Cplex needs 0 nodes to solve this to optimality: it finds the optimal solution during preprocessing.
---- 44 VARIABLE x.L select a number
i10 i11 i12 i13 i14 i15 i16
a 1.000 1.000 1.000 1.000 1.000 1.000 1.000
b 1.000 1.000 1.000 1.000 1.000
---- 44 VARIABLE start.L start of subarray
i10
a 1.000
b 1.000
---- 44 VARIABLE y.L x('a',i)*x('b',j) for map(i,j)
i10 i11 i12 i13 i14
i10 1.000
i11 1.000
i12 1.000
i13 1.000
i14 1.000
---- 44 VARIABLE z.L = 22.941 objective
We see that \(a\) has more selected numbers than \(b\), The extra ones have no value, but do not incur a cost either. If we use a lexicographic approach, we can find the sparsest optimal solution:
---- 55 VARIABLE x.L select a number
i10 i11 i12 i13 i14
a 1.000 1.000 1.000 1.000 1.000
b 1.000 1.000 1.000 1.000 1.000
---- 55 VARIABLE start.L start of subarray
i10
a 1.000
b 1.000
---- 55 VARIABLE y.L x('a',i)*x('b',j) for map(i,j)
i10 i11 i12 i13 i14
i10 1.000
i11 1.000
i12 1.000
i13 1.000
i14 1.000
---- 55 VARIABLE z.L = 22.941 objective
VARIABLE z2.L = 10.000 second level objective: number of elements selected
With the linearized model, we do not generate a quadratic constraint when using the lexicographic algorithm. This removes a complication. The implementation is shown below.
GAMS model
The GAMS implementation of the last model (lexicographic, linearized) can look like:
set
i 'number of elements' /i1*i50/
k 'array' /a, b/
;
alias(i,j);
parameter d(k, i) 'data
arrays';
d(k,i) = uniform(-10,10);
set map(i,j) 'mapping
between a and b';
map(i,i) = yes;
display i,k,d,map;
binary variables
x(k,i) 'select a number'
start(k,i) 'start of subarray'
y(i,j) "x('a',i)*x('b',j) for
map(i,j)"
;
variable z 'objective:
total value';
equations
obj
'level 1 obj: total value'
mul1(i,j) 'linearization'
mul2(i,j) 'linearization'
mul3(i,j) 'linearization'
startbnd(k,i) 'bound on start variable'
sumstart(k) 'allow
only one start'
;
obj.. z =e= sum(map(i,j), (d('a',i)+d('b',j))*y(i,j));
mul1(map(i,j)).. y(i,j) =l= x('a',i);
mul2(map(i,j)).. y(i,j) =l= x('b',j);
mul3(map(i,j)).. y(i,j) =g= x('a',i)+x('b',j)-1;
startbnd(k,i).. start(k,i) =g= x(k,i)-x(k,i-1);
sumstart(k).. sum(i, start(k,i)) =l= 1;
model m /all/;
option optcr=0;
solve m maximizing z using mip;
display x.l, start.l, y.l, z.l;
* fix z to its optimal value
z.fx = z.l;
equation obj2 'second
level obj: sparsity';
variable z2 'second
level objective: number of elements selected';
obj2.. z2 =e= sum((k,i), x(k,i));
model m2 /all/;
solve m2 minimizing z2 using mip;
display x.l, start.l, y.l, z.l, z2.l;
|
More optimization
There is one more optimization we can use. If the objective coefficient \((d_{a,i}+d_{b,j})\) is positive the model will push \(y_{i,j}\) upwards, so we can drop the constraints \(y_{i,j} \ge x_{a,i}+ x_{b,j} -1\). If on the other hand, \((d_{a,i}+d_{b,j})\) is negative the model will push \(y_{i,j}\) downwards to zero. That means we can drop the constraints: \(y_{i,j} \le x_{b,j}\) and \(y_{i,j} \le x_{b,j}\), Implementing this can look like: \[\begin{align}& y_{i,j} \le x_{a,i} && \forall i,j|\mathit{map}(i,j), d_{a,i}+d_{b,j} \gt 0\\& y_{i,j} \le x_{b,j} && \forall i,j|\mathit{map}(i,j), d_{a,i}+d_{b,j} \gt 0 \\& y_{i,j} \ge x_{a,i}+ x_{b,j} -1 && \forall i,j|\mathit{map}(i,j), d_{a,i}+d_{b,j} \lt 0\end{align}\] For the special case when \(d_{a,i}+d_{b,j}=0\) we can remove the corresponding \(y_{i,j}\) from the model. For very large data sets this can make a difference.
A second thing to look at is the type of variables \(y\) and \(\mathit{start}\). Instead of being a binary variable, we can relax them to be continuous between 0 and 1.
I did some limited tests, and it looks that we are not really helping Cplex with these tricks. For other solvers, things may be different.
References
Nice post. One correction: in the description of the implication, replace $a$ with $x$.
ReplyDeleteFixed. Thanks.
Delete