Sunday, May 7, 2023

Finding common patterns

In [1], the following problem is stated:

Given a boolean matrix, with $$m$$ rows and $$n$$ columns, find the largest pattern of ones that is found in at least $$\color{darkblue}K$$ rows. We can ignore cells where the pattern has a zero value: they don't count.

A small example [1] is given:

row 1:[0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1]
row 2:[0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1]
row 3:[0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
row 4:[1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
row 5:[1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
row 6:[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]


With $$K=3$$, we can form a pattern with 10 nonzero elements:

row 1:  [0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1]
row 2:  [0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1]
row 3:  [0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
row 4:  [1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
row 5:  [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
row 6:  [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
pattern:[1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]


The pattern is shared by rows 4, 5, and 6.

Can we formulate a MIP model for this? My first attempt is as follows.

Mathematical Model version 1
\begin{align} \max & \sum_j \color{darkred}{\mathit{pattern}}_j \\ & \sum_j \color{darkblue}d_{i,j}\cdot \color{darkred}{\mathit{pattern}}_j + \color{darkred}{\mathit{diff}}_i = \sum_j \color{darkred}{\mathit{pattern}}_j && \forall i \\ & \color{darkred}{\mathit{diff}}_i \le\color{darkblue}M \cdot (1-\color{darkred}{\mathit{ok}}_i) && \forall i \\ & \sum_i \color{darkred}{\mathit{ok}}_i \ge \color{darkblue}K \\ & \color{darkred}{\mathit{pattern}}_j \in \{0,1\} \\ & \color{darkred}{\mathit{diff}}_i \in \{0,1,2,\dots\} \\ & \color{darkred}{\mathit{ok}}_i \in \{0,1\} \end{align}

The constraint $\sum_j \color{darkblue}d_{i,j}\cdot \color{darkred}{\mathit{pattern}}_j + \color{darkred}{\mathit{diff}}_i = \sum_j \color{darkred}{\mathit{pattern}}_j$ measures the difference between the pattern and row $$i$$. If $$\color{darkred}{\mathit{diff}}_i=0$$ we have a match. The implication $$\color{darkred}{\mathit{diff}}_i\ne 0 \implies \color{darkred}{\mathit{ok}}_i=0$$ is implemented as: $$\color{darkred}{\mathit{diff}}_i \le (1-\color{darkred}{\mathit{ok}}_i)\cdot \color{darkblue}M$$ where $$\color{darkblue}M$$ can be chosen as $$M:={\bf{card}}(j)$$ (the number of columns). If your modeling tool allows indicator constraints, you can combine these two constraints into: $\color{darkred}{\mathit{ok}}_i=1 \implies \sum_j \color{darkblue}d_{i,j}\cdot \color{darkred}{\mathit{pattern}}_j = \sum_j \color{darkred}{\mathit{pattern}}_j$Finally, we make sure we have at least $$\color{darkblue}K$$ rows that are ok.

A different approach can be based on $\color{darkred}{\mathit{ok}}_i=1 \implies \color{darkred}{\mathit{pattern}}_j \le \color{darkblue}d_{i,j}\>\>\forall i,j$ This can be implemented as $\color{darkred}{\mathit{pattern}}_j \le \color{darkblue}d_{i,j} + (1-\color{darkred}{\mathit{ok}}_i) \>\> \forall i,j$ or better  $\color{darkred}{\mathit{pattern}}_j + \color{darkred}{\mathit{ok}}_i \le 1 \>\>\forall i,j|\color{darkblue}d_{i,j}=0$ Here we skip all cases where $$\color{darkblue}d_{i,j}=1$$. If  $$\color{darkblue}d_{i,j}=1$$, we know in advance it will always obey $$\color{darkred}{\mathit{ok}}_i=1 \implies \color{darkred}{\mathit{pattern}}_j \le \color{darkblue}d_{i,j}$$. Of course, this new constraint also has a direct interpretation: for any $$(i,j)$$ with $$\color{darkblue}d_{i,j} = 0$$,  we can't have both: "row $$i$$ is selected" and "column $$j$$ is part of the pattern".

Mathematical Model version 2
\begin{align} \max & \sum_j \color{darkred}{\mathit{pattern}}_j \\ & \color{darkred}{\mathit{pattern}}_j +\color{darkred}{\mathit{ok}}_i\le 1 && \forall i,j|\color{darkblue}d_{i,j}=0 \\ & \sum_i \color{darkred}{\mathit{ok}}_i \ge \color{darkblue}K \\ & \color{darkred}{\mathit{pattern}}_j \in \{0,1\} \\ & \color{darkred}{\mathit{ok}}_i \in \{0,1\} \end{align}

A third version, proposed in the comments below, is a variant of version 2: we aggregate over $$j$$. This yields:

Mathematical Model version 3
\begin{align} \max & \sum_j \color{darkred}{\mathit{pattern}}_j \\ & \sum_{j|\color{darkblue}d_{i,j}=0}\color{darkred}{\mathit{pattern}}_j \le \color{darkblue}M_i \cdot (1-\color{darkred}{\mathit{ok}}_i) && \forall i\\ & \sum_i \color{darkred}{\mathit{ok}}_i \ge \color{darkblue}K \\ & \color{darkred}{\mathit{pattern}}_j \in \{0,1\} \\ & \color{darkred}{\mathit{ok}}_i \in \{0,1\} \end{align}

where $\color{darkblue}M_i = \sum_j (1-\color{darkblue}d_{i,j})$
To be able to run some larger problems, I generated random data sets of any size. Here is a 50x50 matrix with $$\color{darkblue}K=5$$:

 random 50x50 matrix, K=5 (click to enlarge)

Both versions are very fast.  For the random 50x50 data set, we have:

----    117 PARAMETER report  performance statistics

model 1     model 2     model 3

Variables      151.000     101.000     101.000
Equations      102.000    1288.000      52.000
Objective        9.000       9.000       9.000
Time             0.516       0.547       0.625
Nodes        25126.000    6355.000   28903.000
Iterations   61866.000   40431.000   77324.000


All variants solve this quite fast. Let's also try a 100x100 matrix:

----    117 PARAMETER report  performance statistics

model 1     model 2     model 3

Variables      301.000     201.000     201.000
Equations      202.000    5004.000     102.000
Objective       16.000      16.000      16.000
Time             5.296      44.078       5.266
Nodes       390382.000  389739.000  390382.000
Iterations  675128.000 3414636.000  675128.000


Model 1 and 3 are basically the same: the same node and iteration counts! But only for the 100x100 data. The reason could be that the presolver generates almost the same presolved model for these two cases. Model 2 is slower, although it uses fewer nodes (but more simplex iterations).

Often we want to prevent big-M formulations. However, in this case the big-M models (model 1 and 3) are outperforming model 2.

For some of these problems, we could have used complete enumeration. The small data set has ${6 \choose 3} = 20$ possible combinations of three rows we need to check. For our 50x50 and 100x100 problems with $$K=5$$, this number increases to 2,118,760 and 75,287,520.

Conclusion

Three different MIP models for the same problem are presented. Many practical problems can be described using different mathematical models, sometimes very different ones. Version 1 vs. version 2 is a good example: they don't have much in common. It is not always easy to flip a switch in your mind and come up with totally different models. You need to take a different point of view, which can be difficult when in the thick of it.

Appendix: GAMS model

 $onText search for K rows with the most ones in common$offText  $set dataset 2 *----------------------------------------------------------------------------* dataset 1*----------------------------------------------------------------------------$ifthen %dataset%==1 set  i 'rows' /r1*r6/  j 'columns' /c1*c21/; table d(i,j) 'data'  c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 c21r1 0  1  0  1  1  0  1  0  1   0   1   1   1   0   1   0   0   0   1   0   1r2 0  1  0  1  1  0  0  1  0   1   0   1   0   0   0   1   1   1   0   1   1r3 0  0  0  1  1  0  0  1  0   0   0   1   0   0   1   0   1   1   0   0   1r4 1  1  0  1  1  1  0  1  0   0   0   1   1   1   0   0   0   0   0   1   1r5 1  1  0  1  1  1  0  0  0   0   0   1   1   1   0   0   0   1   0   1   1r6 1  1  1  1  1  1  0  0  0   0   0   1   1   1   0   0   0   0   0   1   1; scalar K 'number of rows to select' /3/; $endif *----------------------------------------------------------------------------* dataset 2*----------------------------------------------------------------------------$ifthen %dataset%==2 set  i 'rows' /r1*r50/  j 'columns' /c1*c50/; parameter d(i,j) 'data';d(i,j) = uniformint(0,1);option d:0; display d; scalar K 'number of rows to select' /5/; $endif *----------------------------------------------------------------------------* reporting macro*---------------------------------------------------------------------------- parameter report(*,*) 'performance statistics';$macro statistics(m,name) \report('Variables',name)  = m.numvar;  \report('Equations',name)  = m.numequ;  \report('Objective',name)  = m.objval;  \report('Time',name)       = m.resusd;  \report('Nodes',name)      = m.nodusd;  \report('Iterations',name) = m.iterusd; \display report; option threads=0; *----------------------------------------------------------------------------* model 1*----------------------------------------------------------------------------  binary Variables   pattern(j)  'must be shared by k rows'   ok(i) 'row contains pattern';integer variable diff(i) 'difference between pattern and row i';variable z 'objective'; equations   obj           'objective'   difference(i) 'calculate difference between pattern and row i'   isdiff(i)     'ok(i)=1 => diff(i)=0'   krows         'match at least K rows' ; obj..   z =e= sum(j, pattern(j));difference(i).. sum(j,d(i,j)*pattern(j))+diff(i) =e= sum(j,pattern(j)) ;isdiff(i).. diff(i) =l= (1-ok(i))*card(j);krows.. sum(i,ok(i)) =g= k; model m1 /all/;solve m1 maximizing z using mip;display pattern.l,diff.l,ok.l; statistics(m1,'model 1') ** reporting: form cube*parameter cube(*,*);cube(i,j) = d(i,j);cube('pattern',j) = round(pattern.l(j));cube(i,'diff') = diff.l(i);cube(i,'ok') = ok.l(i);option cube:0; display cube;  *----------------------------------------------------------------------------* model 2*---------------------------------------------------------------------------- equation notBoth(i,j) 'for d(i,j)=0: not both pattern(j)=1 and ok(i)=1';notBoth(i,j)$(d(i,j)=0).. ok(i) + pattern(j) =l= 1; model m2 /obj,notBoth,krows/;solve m2 maximizing z using mip;display pattern.l,ok.l; statistics(m2,'model 2') *----------------------------------------------------------------------------* model 3*---------------------------------------------------------------------------- equation aggregated(i) 'aggregated version';aggregated(i).. sum(j$(d(i,j)=0),pattern(j)) =l= sum(j\$(d(i,j)=0),1) * (1-ok(i)); model m3 /obj,aggregated,krows/;solve m3 maximizing z using mip;display pattern.l,ok.l; statistics(m3,'model 3')