Monday, October 3, 2022

Select points

In [1], the following problem is posted:


I have multiple sets of data points. For example, set 1 contains 5 data points, set 2 contains 1 data point, set 3 contains 10, etc. I need to select one data point from each set so that distances between these selected points is minimal. Any Python based functions to be used will be very helpful


This can be stated as an optimization problem. Writing down the model is a useful exercise, not only to solve it and get solutions but also to define the problem a bit more precisely than a typical "word problem". 

A super simple non-convex MIQP model is:


Non-convex MIQP Model
\[\begin{align}\min&\sum_{i,j|\color{darkblue}{\mathit{ok}}_{i,j}} \color{darkblue}{\mathit{dist}}_{i,j}\cdot\color{darkred}x_i \cdot\color{darkred}x_j \\ & \sum_{i|\color{darkblue}{\mathit{group}}_{i,g}} \color{darkred}x_i = 1 && \forall g\\ & \color{darkred}x_i \in \{0,1\} \end{align}\]

Notes:

  • The objective simply adds up all distances between the selected points.
  • The constraint says: we want to select exactly one point in each group.
  • Here the set \(\color{darkblue}{\mathit{ok}}_{i,j}\) indicates if points \(i\) and \(j\) are in different groups and if \(i\lt j\). These are the allowed pairs. The purpose of restriction \(i\lt j\) is to prevent double counting.
  • \(\color{darkblue}{\mathit{group}}_{i,g}\) indicates if point \(i\) is in group \(g\).
  • These boolean indicators can be implemented as sets or as boolean parameters. When used as sets, we restrict the summations. When used as boolean parameters, we can just use multiplication, i.e., the objective would look like: \(\min\>\sum_{i,j} \color{darkblue}{\mathit{ok}}_{i,j} \cdot \color{darkblue}{\mathit{dist}}_{i,j}\cdot\color{darkred}x_i \cdot\color{darkred}x_j \). The mathematical models and the GAMS model below use sets (see the small sketch after this list).
  • Some solvers, such as Cplex and Gurobi, may accept this model directly. They will convexify or linearize the problem. We can also do such a linearization ourselves at the model level (this is demonstrated below). 
  • An alternative objective could be: minimize the size of the rectangle that contains all selected points. However, that would make the model non-linear (a non-convex quadratic objective), without an easy way to repair it.
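To make the set-versus-parameter remark concrete, here is a small GAMS sketch. It assumes the sets i, g, ig(i,g) and the parameter dist and variable x from the appendix model below; okp is a hypothetical 0/1 parameter introduced only for this illustration.

alias(i,j);

* option 1: ok as a set; it restricts the summation domain
set ok(i,j) 'allowed pairs';
ok(i,j) = ord(i) < ord(j);
ok(i,j)$sum(g$(ig(i,g) and ig(j,g)), 1) = no;
* objective term: sum(ok(i,j), dist(i,j)*x(i)*x(j))

* option 2: ok as a 0/1 parameter; multiply inside an unrestricted sum
parameter okp(i,j) 'allowed pairs as 0/1 values';
okp(i,j) = 1$ok(i,j);
* objective term: sum((i,j), okp(i,j)*dist(i,j)*x(i)*x(j))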
As promised, here is a linearization of the above quadratic model:


MIP Model
\[\begin{align}\min&\sum_{i,j|\color{darkblue}{\mathit{ok}}_{i,j}}  \color{darkblue}{\mathit{dist}}_{i,j}\cdot\color{darkred}{\mathit{pair}}_{i,j} \\ & \sum_{i|\color{darkblue}{\mathit{group}}_{i,g}}  \color{darkred}x_i = 1 && \forall g\\ & \color{darkred}{\mathit{pair}}_{i,j} \ge \color{darkred}x_i + \color{darkred}x_j - 1 && \forall i,j|\color{darkblue}{\mathit{ok}}_{i,j}\\ & \color{darkred}x_i \in \{0,1\}\\ & \color{darkred}{\mathit{pair}}_{i,j} \in \{0,1\} \end{align}\]


Notes:
  • The second constraint implements the implication: \[\color{darkred}x_i = \color{darkred}x_j = 1 \Rightarrow \color{darkred}{\mathit{pair}}_{i,j}=1\]
  • We can safely ignore the cases when either \(\color{darkred}x_i=0\) or \(\color{darkred}x_j=0\). As we are minimizing, the corresponding  \(\color{darkred}{\mathit{pair}}_{i,j}\) will be zero automatically.
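As a small worked check of that constraint, enumerate the possible values of \(\color{darkred}x_i\) and \(\color{darkred}x_j\): \[\begin{align} & \color{darkred}x_i=\color{darkred}x_j=1: && \color{darkred}{\mathit{pair}}_{i,j} \ge 1+1-1=1\\ & \color{darkred}x_i+\color{darkred}x_j=1: && \color{darkred}{\mathit{pair}}_{i,j} \ge 0\\ & \color{darkred}x_i=\color{darkred}x_j=0: && \color{darkred}{\mathit{pair}}_{i,j} \ge -1 \end{align}\] Only the first case forces \(\color{darkred}{\mathit{pair}}_{i,j}=1\); in the other two the bound is vacuous, and the minimization keeps \(\color{darkred}{\mathit{pair}}_{i,j}\) at zero.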

A more complex linearization is based on RLT (Reformulation-Linearization Technique) [2] and was suggested by Rob Pratt in the comments. The basic idea is as follows. Suppose we have a model of the form (using notation from [2]): \[\begin{align}\min&\sum_{i,j}\color{darkblue}c_{i,j} \cdot \color{darkred}x_i\cdot \color{darkred}x_j\\ & \sum_{i\in \color{darkblue}I_k} \color{darkred}x_i = 1 && \forall k \\ & \color{darkred}x_i \in \{0,1\}\end{align}\] This is basically our model, except it uses a little bit more indexing magic.

The standard linearization for this model is: \[\begin{align}\min&\sum_{i,j}\color{darkblue}c_{i,j} \cdot \color{darkred}w_{i,j}\\ & \sum_{i\in \color{darkblue}I_k} \color{darkred}x_i = 1 && \forall k \\ & \color{darkred}w_{i,j} \le \color{darkred}x_i && \forall i,j\\ & \color{darkred}w_{i,j} \le \color{darkred}x_j && \forall i,j\\ & \color{darkred}w_{i,j} \ge \color{darkred}x_i+\color{darkred}x_j-1 && \forall i,j\\ & \color{darkred}x_i \in \{0,1\} \\ & \color{darkred}w_{i,j} \ge 0 \end{align}\] This is just our MIP model, except that there we could drop some of the inequalities: we exploited knowledge of the signs of the objective coefficients.

Using the RLT formulation, we arrive at: \[\begin{align}\min&\sum_{i,j}\color{darkblue}c_{i,j} \cdot \color{darkred}w_{i,j}\\ & \sum_{i\in \color{darkblue}I_k} \color{darkred}x_i = 1 && \forall k \\ & \sum_{i\in \color{darkblue}I_k} \color{darkred}w_{i,j} = \color{darkred}x_j && \forall k, \forall j \in J_k \\  & \color{darkred}x_i \in \{0,1\} \\ & \color{darkred}w_{i,j} \ge 0 \end{align}\] using a bit of simplified indexing. We can basically use this, apart from the fact that our variables \(\color{darkred}w_{i,j}\) are only defined over some \(i,j\). In the end our RLT model becomes:


RLT Model
\[\begin{align}\min&\sum_{i,j|\color{darkblue}{\mathit{ok}}_{i,j}}\color{darkblue}{\mathit{dist}}_{i,j}\cdot\color{darkred}{\mathit{pair}}_{i,j} \\ & \sum_{i|\color{darkblue}{\mathit{group}}_{i,g}}  \color{darkred}x_i = 1 && \forall g \\ & \sum_{i| \color{darkblue}{\mathit{group}}_{i,g}} (\color{darkred}{\mathit{pair}}_{i,j}|\color{darkblue}{\mathit{ok}}_{i,j}) + (\color{darkred}{\mathit{pair}}_{j,i}|\color{darkblue}{\mathit{ok}}_{j,i}) = \color{darkred}x_j && \forall j,g|{\bf{not}}\>\color{darkblue}{\mathit{group}}_{j,g}   \\ & \color{darkred}x_i \in \{0,1\}\\ & \color{darkred}{\mathit{pair}}_{i,j} \ge 0 \end{align}\]


The notation \( (\color{darkred}{\mathit{pair}}_{i,j}|\color{darkblue}{\mathit{ok}}_{i,j})\) means that the term \(\color{darkred}{\mathit{pair}}_{i,j}\) is only used if \(\color{darkblue}{\mathit{ok}}_{i,j}\) exists.
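To see where the RLT constraint comes from: take the group constraint for group \(g\), multiply both sides by \(\color{darkred}x_j\) for a point \(j\) outside group \(g\), and use that the \(\color{darkred}x\) variables are binary: \[\color{darkred}x_j \cdot \sum_{i|\color{darkblue}{\mathit{group}}_{i,g}} \color{darkred}x_i = \color{darkred}x_j \quad\Longrightarrow\quad \sum_{i|\color{darkblue}{\mathit{group}}_{i,g}} \color{darkred}x_i\cdot\color{darkred}x_j = \color{darkred}x_j\] Replacing each product \(\color{darkred}x_i\cdot\color{darkred}x_j\) by \(\color{darkred}{\mathit{pair}}_{i,j}\) (or \(\color{darkred}{\mathit{pair}}_{j,i}\), whichever ordering is in \(\color{darkblue}{\mathit{ok}}\)) gives the equality constraint in the model above.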

Example data


To see if this all makes sense, it is a good idea to use a small random data set, and visualize the result. Here is my random data set with \(N=50\) points, and \(G=5\) groups.

----     29 SET i  points

i1 ,    i2 ,    i3 ,    i4 ,    i5 ,    i6 ,    i7 ,    i8 ,    i9 ,    i10,    i11,    i12,    i13,    i14,    i15
i16,    i17,    i18,    i19,    i20,    i21,    i22,    i23,    i24,    i25,    i26,    i27,    i28,    i29,    i30
i31,    i32,    i33,    i34,    i35,    i36,    i37,    i38,    i39,    i40,    i41,    i42,    i43,    i44,    i45
i46,    i47,    i48,    i49,    i50


----     29 SET g  groups

g1,    g2,    g3,    g4,    g5


----     29 SET ig  group membership

             g1          g2          g3          g4          g5

i1          YES
i2                                                          YES
i3                                  YES
i4                      YES
i5                      YES
i6                      YES
i7                      YES
i8                                                          YES
i9          YES
i10                                 YES
i11                                                         YES
i12                                 YES
i13                                                         YES
i14                                             YES
i15         YES
i16                                             YES
i17         YES
i18                     YES
i19                                             YES
i20                                 YES
i21                     YES
i22                     YES
i23         YES
i24         YES
i25                                 YES
i26                                                         YES
i27                     YES
i28                                             YES
i29                                             YES
i30                     YES
i31         YES
i32                                 YES
i33         YES
i34                                                         YES
i35                     YES
i36                     YES
i37                                 YES
i38                                             YES
i39                                             YES
i40                                 YES
i41                                 YES
i42         YES
i43                     YES
i44         YES
i45                     YES
i46         YES
i47                                             YES
i48                                 YES
i49                                             YES
i50                     YES


----     50 PARAMETER loc  coordinates of locations

              x           y

i1       66.111      75.582
i2       62.745      28.386
i3        8.642      10.251
i4       64.125      54.531
i5        3.152      79.236
i6        7.277      17.566
i7       52.563      75.021
i8       17.812       3.414
i9       58.513      62.123
i10      38.936      35.871
i11      24.303      24.642
i12      13.050      93.345
i13      37.994      78.340
i14      30.003      12.548
i15      74.887       6.923
i16      20.202       0.507
i17      26.961      49.985
i18      15.129      17.417
i19      33.064      31.691
i20      32.209      96.398
i21      99.360      36.990
i22      37.289      77.198
i23      39.668      91.310
i24      11.958      73.548
i25       5.542      57.630
i26       5.141       0.601
i27      40.123      51.988
i28      62.888      22.575
i29      39.612      27.601
i30      15.237      93.632
i31      42.266      13.466
i32      38.606      37.463
i33      26.848      94.837
i34      18.894      29.751
i35       7.455      40.135
i36      10.169      38.389
i37      32.409      19.213
i38      11.237      59.656
i39      51.145       4.507
i40      78.310      94.575
i41      59.646      60.734
i42      36.251      59.407
i43      67.985      50.659
i44      15.925      65.689
i45      52.388      12.440
i46      98.672      22.812
i47      67.565      77.678
i48      93.245      20.124
i49      29.714      19.723
i50      24.635      64.648


The model solves very fast given this small data set, and the solution looks like this:


----     83 VARIABLE x.L  select points

i11 1.000,    i18 1.000,    i31 1.000,    i37 1.000,    i49 1.000


This solution is a bit difficult to understand, so a visualization is useful.



This looks good to me.


Results


For the 50-point, 5-group data set, all formulations arrive at the optimal solution very quickly:

----    146 PARAMETER results  

                  MIQP  LINEARIZED         RLT

variables       51.000    1034.000    1034.000
 discrete       50.000      50.000      50.000
equations        6.000     989.000     206.000
status         Optimal     Optimal     Optimal
obj            137.644     137.644     137.644
time             0.609       0.062       0.016
nodes        40501.000
iterations   50953.000      24.000      72.000

When we increase the size to 100 points and 10 groups, things are a bit different:

----    146 PARAMETER results  

                  MIQP  LINEARIZED         RLT

variables      101.000    4504.000    4504.000
 discrete      100.000     100.000     100.000
equations       11.000    4414.000     911.000
status       TimeLimit     Optimal     Optimal
obj            686.463     686.463     686.463
time          1000.187       0.672       0.047
nodes      1.026863E+7    1149.000
iterations 1.865805E+7   33934.000     332.000


We see that the MIQP formulation is unable to prove optimality in 1000 seconds. The reason is that Cplex decides not to linearize the model, but instead just to convexify the objective. We see this in the message:

Classifier predicts products in MIQP should not be linearized.

Of course, this is the wrong decision. Cplex should have linearized this. This confirms that Machine Learning really means: Machine needs some more Learning.

We could have instructed Cplex explicitly to linearize using the QToLin option.
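A minimal sketch of what that could look like, assuming the usual GAMS solver option file mechanism (the option name follows the Cplex QToLin setting mentioned above; check the GAMS/Cplex documentation for the exact spelling and values):

$onEcho > cplex.opt
* ask Cplex to linearize the quadratic products
qtolin 1
$offEcho
selection1.optfile = 1;
solve selection1 minimizing td using miqcp;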

We also see that this RLT formulation is indeed better than the traditional linearization.

I always feel good when I see something like this:

The original problem has:
    29859 rows, 30084 columns and 89752 non-zero elements
    250 binaries

Presolving the problem

The presolved problem has:
    29859 rows, 30084 columns and 89752 non-zero elements
    250 binaries

This means that all the obvious reductions have been done at the model level, which probably means I had some level of understanding of the model.

References


  1. Find common data point in multiple sets, https://stackoverflow.com/questions/73904007/find-common-data-point-in-multiple-sets
  2. Liberti, L., Compact linearization for binary quadratic problems, 4OR 5, 231–245 (2007)

Appendix: GAMS model

$onText

Given n points in g groups, select g points (one in each group)
such that total distance between the selected points is minimized.

$offText

*------------------------------------------------------
* options
*------------------------------------------------------

option mip=cplex, miqcp=cplex, threads=0, reslim=1000;

*------------------------------------------------------
* size of the problem
*------------------------------------------------------

sets
  i 'points' /i1*i50/
  g 'groups' /g1*g5/
;

*------------------------------------------------------
* group membership: random groups
*------------------------------------------------------

set ig(i,g) 'group membership';
scalar ng 'random group number';
loop(i,
  ng = uniformint(1,card(g));
  ig(i,g) = ord(g) = ng;
);

display i,g,ig;


*------------------------------------------------------
* derived data: allowed pairs ok(i,j)
*------------------------------------------------------

alias(i,j);
set ok(i,j) 'allowed pairs';
* prevent double counting
ok(i,j) = ord(i)<ord(j);
* exclude points in same group
ok(i,j)$sum(g$(ig(i,g) and ig(j,g)),1) = no;

*------------------------------------------------------
* distances from random locations
*------------------------------------------------------

set coord /x,y/;
parameter loc(i,coord) 'locations';
loc(i,coord) = uniform(0,100);
display loc;

parameter dist(i,j) 'distances';
dist(ok(i,j)) = sqrt(sum(coord, sqr(loc(i,coord)-loc(j,coord))));
display dist;

*------------------------------------------------------------------
* reporting macros
*------------------------------------------------------------------

parameter results(*,*);

acronym Optimal;
acronym TimeLimit;

* macros for reporting
$macro report(m,label)  \
    results('variables',label)  = m.numvar;  \
    results(' discrete',label)  = m.numdvar;  \
    results('equations',label)  = m.numequ;  \
    results('status',label) = m.solvestat; \
    results('status',label)$(m.solvestat=1) = Optimal; \
    results('status',label)$(m.solvestat=3) = TimeLimit; \
    results('obj',label) = td.l; \
    results('time',label) = m.resusd; \
    results('nodes',label) = m.nodusd; \
    results('iterations',label) = m.iterusd; \
    display results;


*------------------------------------------------------
* optimization model 1: non-convex MIQP
*------------------------------------------------------

binary variables x(i) 'select points';
variable td 'total distance';

equations
    totalDist1 'quadratic objective'
    onePoint(g) 'one point per group'
;

totalDist1.. td =e= sum(ok(i,j), dist(i,j)*x(i)*x(j));

onePoint(g).. sum(ig(i,g), x(i)) =e= 1;

model selection1 /all/;
solve selection1 minimizing td using miqcp;

display x.l;
report(selection1,"MIQP")

*------------------------------------------------------
* optimization model 2: MIP (linearized version)
*------------------------------------------------------

* relax variable pair
binary variables pair(i,j) 'both i and j are selected';
pair.prior(i,j) = INF;

equations
    totalDist2 'linear objective'
    bothSelected(i,j) 'points i and j are selected'
;

totalDist2.. td =e= sum(ok, pair(ok)*dist(ok));

* x(i)=x(j)=1 ==> pair(i,j)=1
bothSelected(ok(i,j)).. pair(i,j) =g= x(i) + x(j) - 1;

model selection2 /totalDist2, onePoint, bothSelected/;
solve selection2 minimizing td using mip;

display x.l;
report(selection2,"LINEARIZED")


*------------------------------------------------------
* optimization model 3: compact RLT
*    (Reformulation-Linearization Technique)
*------------------------------------------------------

equation RLT(j,g) 'Reformulation-Linearization Technique';
RLT(j,g)$(not ig(j,g))..
   sum(ig(i,g), pair(i,j)$ok(i,j) + pair(j,i)$ok(j,i)) =e= x(j);

model selection3 /totalDist2, onePoint, RLT/;
solve selection3 minimizing td using mip;

display x.l;
report(selection3,"RLT")


*------------------------------------------------------
* data for plotting
*------------------------------------------------------

parameter points(i,g,*);
points(ig(i,g),coord) = loc(i,coord);
display points;

parameter lines(i,j,*);
loop((i,j)$(pair.l(i,j)>0.5),
   lines(i,j,'x1') = loc(i,'x');
   lines(i,j,'y1') = loc(i,'y');
   lines(i,j,'x2') = loc(j,'x');
   lines(i,j,'y2') = loc(j,'y');
);
display lines;

5 comments:

  1. This solves quickly, but for a larger instance you might do better with a compact (RLT-based) linearization obtained by omitting the bothSelected constraints and multiplying both sides of the onePoint(g) constraints by x(j) for all points j that are not in group g, yielding something like this:

    con Linearize {g in GROUPS, j in POINTS: group[j] ne g}:
      sum {<i,(j)> in PAIRS: group[i] = g} pair[i,j] + sum {<(j),i> in PAIRS: group[i] = g} pair[j,i] = x[j];

    This new formulation solved at the root node for me, whereas the original formulation has a root LP bound of 0.

    Replies
    1. Yes, that also worked better for me. Even for Gurobi it is better, although it generates RLT cuts for the traditional linearization.

    2. Thank you for the updates. It looks like you accidentally omitted the onePoint constraints from the RLT write-up but correctly included them in the code.

  2. Really cool linearization formulation! I tested using HiGHS and the linearization solves pretty quick (~0.6 seconds), the RLT formulation solves faster (~0.16 seconds). This was for a problem with n=50, and groups=5.
