In [1], the following problem is stated:
Given a boolean matrix, with \(m\) rows and \(n\) columns, find the largest pattern of ones that is found in at least \(\color{darkblue}K\) rows. We can ignore cells where the pattern has a zero value: they don't count.
A small example [1] is given:
row 1:[0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1] row 2:[0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1] row 3:[0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1] row 4:[1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1] row 5:[1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1] row 6:[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
With \(K=3\), we can form a pattern with 10 nonzero elements:
row 1: [0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1] row 2: [0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1] row 3: [0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1] row 4: [1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1] row 5: [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1] row 6: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1] pattern:[1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
Mathematical Model version 1 |
---|
\[ \begin{align} \max & \sum_j \color{darkred}{\mathit{pattern}}_j \\ & \sum_j \color{darkblue}d_{i,j}\cdot \color{darkred}{\mathit{pattern}}_j + \color{darkred}{\mathit{diff}}_i = \sum_j \color{darkred}{\mathit{pattern}}_j && \forall i \\ & \color{darkred}{\mathit{diff}}_i \le\color{darkblue}M \cdot (1-\color{darkred}{\mathit{ok}}_i) && \forall i \\ & \sum_i \color{darkred}{\mathit{ok}}_i \ge \color{darkblue}K \\ & \color{darkred}{\mathit{pattern}}_j \in \{0,1\} \\ & \color{darkred}{\mathit{diff}}_i \in \{0,1,2,\dots\} \\ & \color{darkred}{\mathit{ok}}_i \in \{0,1\} \end{align}\] |
Mathematical Model version 2 |
---|
\[ \begin{align} \max & \sum_j \color{darkred}{\mathit{pattern}}_j \\ & \color{darkred}{\mathit{pattern}}_j +\color{darkred}{\mathit{ok}}_i\le 1 && \forall i,j|\color{darkblue}d_{i,j}=0 \\ & \sum_i \color{darkred}{\mathit{ok}}_i \ge \color{darkblue}K \\ & \color{darkred}{\mathit{pattern}}_j \in \{0,1\} \\ & \color{darkred}{\mathit{ok}}_i \in \{0,1\} \end{align}\] |
Mathematical Model version 3 |
---|
\[ \begin{align} \max & \sum_j \color{darkred}{\mathit{pattern}}_j \\ & \sum_{j|\color{darkblue}d_{i,j}=0}\color{darkred}{\mathit{pattern}}_j \le \color{darkblue}M_i \cdot (1-\color{darkred}{\mathit{ok}}_i) && \forall i\\ & \sum_i \color{darkred}{\mathit{ok}}_i \ge \color{darkblue}K \\ & \color{darkred}{\mathit{pattern}}_j \in \{0,1\} \\ & \color{darkred}{\mathit{ok}}_i \in \{0,1\} \end{align}\] |
random 50x50 matrix, K=5 (click to enlarge) |
Both versions are very fast. For the random 50x50 data set, we have:
---- 117 PARAMETER report performance statistics model 1 model 2 model 3 Variables 151.000 101.000 101.000 Equations 102.000 1288.000 52.000 Objective 9.000 9.000 9.000 Time 0.516 0.547 0.625 Nodes 25126.000 6355.000 28903.000 Iterations 61866.000 40431.000 77324.000
---- 117 PARAMETER report performance statistics model 1 model 2 model 3 Variables 301.000 201.000 201.000 Equations 202.000 5004.000 102.000 Objective 16.000 16.000 16.000 Time 5.296 44.078 5.266 Nodes 390382.000 389739.000 390382.000 Iterations 675128.000 3414636.000 675128.000
Conclusion
References
- Linear programming problem, searching for a group that has the most in common, https://stackoverflow.com/questions/76157767/linear-programming-problem-searching-for-a-group-that-has-the-most-in-common
Appendix: GAMS model
$onText
search for K rows with the most ones in common $offText
$set dataset 2
*---------------------------------------------------------------------------- * dataset 1 *----------------------------------------------------------------------------
$ifthen %dataset%==1
set i 'rows' /r1*r6/ j 'columns' /c1*c21/ ;
table d(i,j) 'data' c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 c21 r1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 0 0 0 1 0 1 r2 0 1 0 1 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 1 r3 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 1 1 0 0 1 r4 1 1 0 1 1 1 0 1 0 0 0 1 1 1 0 0 0 0 0 1 1 r5 1 1 0 1 1 1 0 0 0 0 0 1 1 1 0 0 0 1 0 1 1 r6 1 1 1 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 ;
scalar K 'number of rows to select' /3/;
$endif
*---------------------------------------------------------------------------- * dataset 2 *----------------------------------------------------------------------------
$ifthen %dataset%==2
set i 'rows' /r1*r50/ j 'columns' /c1*c50/ ;
parameter d(i,j) 'data'; d(i,j) = uniformint(0,1); option d:0; display d;
scalar K 'number of rows to select' /5/;
$endif
*---------------------------------------------------------------------------- * reporting macro *----------------------------------------------------------------------------
parameter report(*,*) 'performance statistics'; $macro statistics(m,name) \ report('Variables',name) = m.numvar; \ report('Equations',name) = m.numequ; \ report('Objective',name) = m.objval; \ report('Time',name) = m.resusd; \ report('Nodes',name) = m.nodusd; \ report('Iterations',name) = m.iterusd; \ display report;
option threads=0;
*---------------------------------------------------------------------------- * model 1 *----------------------------------------------------------------------------
binary Variables pattern(j) 'must be shared by k rows' ok(i) 'row contains pattern' ; integer variable diff(i) 'difference between pattern and row i'; variable z 'objective';
equations obj 'objective' difference(i) 'calculate difference between pattern and row i' isdiff(i) 'ok(i)=1 => diff(i)=0' krows 'match at least K rows' ;
obj.. z =e= sum(j, pattern(j)); difference(i).. sum(j,d(i,j)*pattern(j))+diff(i) =e= sum(j,pattern(j)) ; isdiff(i).. diff(i) =l= (1-ok(i))*card(j); krows.. sum(i,ok(i)) =g= k;
model m1 /all/; solve m1 maximizing z using mip; display pattern.l,diff.l,ok.l;
statistics(m1,'model 1')
* * reporting: form cube * parameter cube(*,*); cube(i,j) = d(i,j); cube('pattern',j) = round(pattern.l(j)); cube(i,'diff') = diff.l(i); cube(i,'ok') = ok.l(i); option cube:0; display cube;
*---------------------------------------------------------------------------- * model 2 *----------------------------------------------------------------------------
equation notBoth(i,j) 'for d(i,j)=0: not both pattern(j)=1 and ok(i)=1'; notBoth(i,j)$(d(i,j)=0).. ok(i) + pattern(j) =l= 1;
model m2 /obj,notBoth,krows/; solve m2 maximizing z using mip; display pattern.l,ok.l;
statistics(m2,'model 2')
*---------------------------------------------------------------------------- * model 3 *----------------------------------------------------------------------------
equation aggregated(i) 'aggregated version'; aggregated(i).. sum(j$(d(i,j)=0),pattern(j)) =l= sum(j$(d(i,j)=0),1) * (1-ok(i));
model m3 /obj,aggregated,krows/; solve m3 maximizing z using mip; display pattern.l,ok.l;
statistics(m3,'model 3') |
It might be interesting to compare to an aggregated version of model 2, with the first set of constraints replaced with sum {j: d_{ij} = 0} pattern_j <= |{j: d_{ij} = 0}| (1 - ok_i) for all i.
ReplyDeleteI added that formulation. Performance is very similar to model 1.
Delete