In [1], the following problem is posed:

> I have multiple sets of data points. For example, set 1 contains 5 data points, set 2 contains 1 data point, set 3 contains 10, etc. I need to select one data point from each set so that the distances between these selected points are minimal. Any Python based functions to be used will be very helpful.

This can be stated as an optimization problem. Writing down the model is a useful exercise: not only do we get solutions, but we also define the problem a bit more precisely than a typical "word problem" does.

A super simple non-convex MIQP model is:

Non-convex MIQP Model |
---|

\[\begin{align}\min&\sum_{i,j|\color{darkblue}{\mathit{ok}}_{i,j}} \color{darkblue}{\mathit{dist}}_{i,j}\cdot\color{darkred}x_i \cdot\color{darkred}x_j \\ & \sum_{i|\color{darkblue}{\mathit{group}}_{i,g}} \color{darkred}x_i = 1 && \forall g\\ & \color{darkred}x_i \in \{0,1\} \end{align}\] |

Notes:

- The objective simply adds up all distances between the selected points.
- The constraint says: we want to select exactly one point in each group.
- Here the set \(\color{darkblue}{\mathit{ok}}_{i,j}\) indicates if points \(i\) and \(j\) are in different groups and if \(i\lt j\). These are the allowed pairs. The purpose of restriction \(i\lt j\) is to prevent double counting.
- \(\color{darkblue}{\mathit{group}}_{i,g}\) indicates if point \(i\) is in group \(g\).
- These boolean indicators can be implemented as sets or as boolean parameters. When used as sets, we restrict the summations. When used as boolean parameters, we can just use multiplication. I.e., the objective would look like: \(\min\>\sum_{i,j} \color{darkblue}{\mathit{ok}}_{i,j} \cdot \color{darkblue}{\mathit{dist}}_{i,j}\cdot\color{darkred}x_i \cdot\color{darkred}x_j \). The mathematical models and the GAMS model below use sets.
- Some solvers, such as Cplex and Gurobi, may accept this model directly. They will **convexify** or **linearize** the problem. We can also do such a linearization ourselves at the model level (this is demonstrated below).
- An alternative objective could be: *minimize the size of the rectangle that contains all selected points*. However, that would make the model non-convex (a non-convex quadratic objective), with no easy way to repair it.
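To make the problem statement concrete, here is a small pure-Python brute-force version: enumerate every way of picking one point per group and keep the selection with the smallest total pairwise distance. The data is a made-up toy instance (three groups of two points each); only the standard library is used.

```python
# Brute-force version of the model: pick one point per group so that the
# sum of pairwise distances between the selected points is minimal.
# Hypothetical toy data; dist() is plain Euclidean distance.
import itertools, math

groups = {
    "g1": [(0.0, 0.0), (10.0, 10.0)],
    "g2": [(1.0, 0.0), (9.0, 9.0)],
    "g3": [(0.0, 1.0), (8.0, 10.0)],
}

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def total_dist(selection):
    # objective: sum of distances over all unordered pairs (combinations
    # gives i < j, which avoids double counting, like the set ok(i,j))
    return sum(dist(p, q) for p, q in itertools.combinations(selection, 2))

best = min(itertools.product(*groups.values()), key=total_dist)
print(best, round(total_dist(best), 3))
# prints ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0)) 3.414
```

This enumerates \(\prod_g |I_g|\) combinations, so it is only viable for small instances; the MIP formulations below scale much better.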

MIP Model |
---|

\[\begin{align}\min&\sum_{i,j|\color{darkblue}{\mathit{ok}}_{i,j}} \color{darkblue}{\mathit{dist}}_{i,j}\cdot\color{darkred}{\mathit{pair}}_{i,j} \\ & \sum_{i|\color{darkblue}{\mathit{group}}_{i,g}} \color{darkred}x_i = 1 && \forall g\\ & \color{darkred}{\mathit{pair}}_{i,j} \ge \color{darkred}x_i + \color{darkred}x_j - 1 && \forall i,j|\color{darkblue}{\mathit{ok}}_{i,j}\\ & \color{darkred}x_i \in \{0,1\}\\ & \color{darkred}{\mathit{pair}}_{i,j} \in \{0,1\} \end{align}\] |

- The second constraint implements the implication: \[\color{darkred}x_i = \color{darkred}x_j = 1 \Rightarrow \color{darkred}{\mathit{pair}}_{i,j}=1\]
- We can safely ignore the cases when either \(\color{darkred}x_i=0\) or \(\color{darkred}x_j=0\). As we are minimizing, the corresponding \(\color{darkred}{\mathit{pair}}_{i,j}\) will be zero automatically.
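The correctness of this linearization is easy to confirm by enumerating the four 0/1 cases: under minimization, the smallest feasible value of \(\mathit{pair}_{i,j}\) is \(\max(0, x_i + x_j - 1)\), which coincides with the product \(x_i x_j\).

```python
# Check the linearization pair(i,j) >= x(i) + x(j) - 1, pair(i,j) >= 0:
# when minimizing, pair takes its smallest feasible value max(0, xi + xj - 1),
# which equals the product xi*xj for all four binary combinations.
import itertools

for xi, xj in itertools.product([0, 1], repeat=2):
    pair = max(0, xi + xj - 1)   # smallest value allowed by the constraints
    assert pair == xi * xj
print("linearization reproduces the product on all 0/1 cases")
```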

A more complex linearization is based on RLT (Reformulation-Linearization Technique) [2] and was suggested by Rob Pratt in the comments. The basic idea is as follows. Suppose we have a model like this (using notation from [2]):

\[\begin{align}\min&\sum_{i,j}\color{darkblue}c_{i,j} \cdot \color{darkred}x_i\cdot \color{darkred}x_j\\ & \sum_{i\in \color{darkblue}I_k} \color{darkred}x_i = 1 && \forall k \\ & \color{darkred}x_i \in \{0,1\}\end{align}\]

This is basically our model, except with a bit more indexing magic. The standard linearization for this model is:

\[\begin{align}\min&\sum_{i,j}\color{darkblue}c_{i,j} \cdot \color{darkred}w_{i,j}\\ & \sum_{i\in \color{darkblue}I_k} \color{darkred}x_i = 1 && \forall k \\ & \color{darkred}w_{i,j} \le \color{darkred}x_i && \forall i,j\\ & \color{darkred}w_{i,j} \le \color{darkred}x_j && \forall i,j\\ & \color{darkred}w_{i,j} \ge \color{darkred}x_i+\color{darkred}x_j-1 && \forall i,j\\ & \color{darkred}x_i \in \{0,1\} \\ & \color{darkred}w_{i,j} \ge 0 \end{align}\]

This is just our MIP model, except that we could drop some of the inequalities: we exploited knowledge of the signs of the objective coefficients. Using the RLT formulation, we arrive at:

\[\begin{align}\min&\sum_{i,j}\color{darkblue}c_{i,j} \cdot \color{darkred}w_{i,j}\\ & \sum_{i\in \color{darkblue}I_k} \color{darkred}x_i = 1 && \forall k \\ & \sum_{i\in \color{darkblue}I_k} \color{darkred}w_{i,j} = \color{darkred}x_j && \forall k, \forall j \in J_k \\ & \color{darkred}x_i \in \{0,1\} \\ & \color{darkred}w_{i,j} \ge 0 \end{align}\]

using a bit of simplified indexing. We can basically use this, apart from the fact that our variables \(\color{darkred}w_{i,j}\) are only defined over some pairs \(i,j\). In the end our RLT model becomes:
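The key RLT step, multiplying the assignment constraint \(\sum_{i\in I_k} x_i = 1\) by \(x_j\) and substituting \(w_{i,j} = x_i x_j\), can be sanity-checked numerically: for every 0-1 vector that selects exactly one point per group, \(\sum_{i\in I_k} x_i x_j = x_j\). A sketch on a hypothetical 6-point, 3-group instance:

```python
# Sanity check of the RLT identity: if sum_{i in I_k} x_i = 1 for each group k,
# then sum_{i in I_k} x_i * x_j = x_j for every point j.
# Hypothetical instance: 6 points in 3 groups of 2.
import itertools

groups = [(0, 1), (2, 3), (4, 5)]   # I_k: point indices per group

# enumerate all feasible selections (exactly one point per group)
for choice in itertools.product(*groups):
    x = [1 if i in choice else 0 for i in range(6)]
    for Ik in groups:
        for j in range(6):
            # w_ij = x_i * x_j; the RLT constraint says this sum equals x_j
            assert sum(x[i] * x[j] for i in Ik) == x[j]
print("RLT identity holds on all feasible selections")
```

The identity holds for every \(j\); the model only needs to impose it for \(j\) outside group \(k\), since the \(\mathit{pair}\) variables exist only for cross-group pairs.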

RLT Model |
---|

\[\begin{align}\min&\sum_{i,j|\color{darkblue}{\mathit{ok}}_{i,j}}\color{darkblue}{\mathit{dist}}_{i,j}\cdot\color{darkred}{\mathit{pair}}_{i,j} \\ & \sum_{i|\color{darkblue}{\mathit{group}}_{i,g}} \color{darkred}x_i = 1 && \forall g \\ & \sum_{i| \color{darkblue}{\mathit{group}}_{i,g}} (\color{darkred}{\mathit{pair}}_{i,j}|\color{darkblue}{\mathit{ok}}_{i,j}) + (\color{darkred}{\mathit{pair}}_{j,i}|\color{darkblue}{\mathit{ok}}_{j,i}) = \color{darkred}x_j && \forall j,g|{\bf{not}}\>\color{darkblue}{\mathit{group}}_{j,g} \\ & \color{darkred}x_i \in \{0,1\}\\ & \color{darkred}{\mathit{pair}}_{i,j} \ge 0 \end{align}\] |

### Example data

The data is randomly generated: 50 points assigned to 5 random groups, with coordinates drawn from uniform(0,100):

```
----     29 SET i  points

i1 ,  i2 ,  i3 ,  i4 ,  i5 ,  i6 ,  i7 ,  i8 ,  i9 ,  i10, i11, i12, i13, i14, i15
i16, i17, i18, i19, i20, i21, i22, i23, i24, i25, i26, i27, i28, i29, i30
i31, i32, i33, i34, i35, i36, i37, i38, i39, i40, i41, i42, i43, i44, i45
i46, i47, i48, i49, i50

----     29 SET g  groups

g1,  g2,  g3,  g4,  g5

----     29 SET ig  group membership

(one YES entry per point in columns g1..g5; the assignment is random)

----     50 PARAMETER loc  coordinates of locations

               x           y

i1        66.111      75.582
i2        62.745      28.386
i3         8.642      10.251
i4        64.125      54.531
i5         3.152      79.236
i6         7.277      17.566
i7        52.563      75.021
i8        17.812       3.414
i9        58.513      62.123
i10       38.936      35.871
i11       24.303      24.642
i12       13.050      93.345
i13       37.994      78.340
i14       30.003      12.548
i15       74.887       6.923
i16       20.202       0.507
i17       26.961      49.985
i18       15.129      17.417
i19       33.064      31.691
i20       32.209      96.398
i21       99.360      36.990
i22       37.289      77.198
i23       39.668      91.310
i24       11.958      73.548
i25        5.542      57.630
i26        5.141       0.601
i27       40.123      51.988
i28       62.888      22.575
i29       39.612      27.601
i30       15.237      93.632
i31       42.266      13.466
i32       38.606      37.463
i33       26.848      94.837
i34       18.894      29.751
i35        7.455      40.135
i36       10.169      38.389
i37       32.409      19.213
i38       11.237      59.656
i39       51.145       4.507
i40       78.310      94.575
i41       59.646      60.734
i42       36.251      59.407
i43       67.985      50.659
i44       15.925      65.689
i45       52.388      12.440
i46       98.672      22.812
i47       67.565      77.678
i48       93.245      20.124
i49       29.714      19.723
i50       24.635      64.648
```
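A comparable random instance is easy to generate in Python. This sketch uses Python's `random` module, so the numbers will not match the GAMS stream above:

```python
# Generate an analogous random instance: 50 points, random membership in
# 5 groups, coordinates uniform on [0,100]. (Python's RNG, so the numbers
# differ from the GAMS listing above.)
import itertools, math, random

random.seed(123)
n, ngroups = 50, 5
group = {i: random.randint(0, ngroups - 1) for i in range(n)}
loc = {i: (random.uniform(0, 100), random.uniform(0, 100)) for i in range(n)}

# allowed pairs ok(i,j): different groups, i < j to avoid double counting
ok = [(i, j) for i, j in itertools.combinations(range(n), 2)
      if group[i] != group[j]]
dist = {(i, j): math.dist(loc[i], loc[j]) for i, j in ok}
print(len(ok), "allowed pairs")
```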

```
----     83 VARIABLE x.L  select points

i11 1.000,    i18 1.000,    i31 1.000,    i37 1.000,    i49 1.000
```

### Results

For the 50-point, 5-group instance above, all three formulations find the same optimal objective:

```
----    146 PARAMETER results

                   MIQP   LINEARIZED          RLT

variables        51.000     1034.000     1034.000
  discrete       50.000       50.000       50.000
equations         6.000      989.000      206.000
status          Optimal      Optimal      Optimal
obj             137.644      137.644      137.644
time              0.609        0.062        0.016
nodes         40501.000
iterations    50953.000       24.000       72.000
```

For a larger instance with 100 points and 10 groups, the differences become dramatic:

```
----    146 PARAMETER results

                     MIQP   LINEARIZED          RLT

variables         101.000     4504.000     4504.000
  discrete        100.000      100.000      100.000
equations          11.000     4414.000      911.000
status          TimeLimit      Optimal      Optimal
obj               686.463      686.463      686.463
time             1000.187        0.672        0.047
nodes         1.026863E+7     1149.000
iterations    1.865805E+7    33934.000      332.000
```

The non-convex MIQP hits the 1000-second time limit here, while both linearized models solve in under a second.


When Cplex is handed the non-convex MIQP directly, it linearizes the problem itself; this behavior is controlled by the **QToLin** option. The linearization shows up in the Cplex log, where the "original problem" statistics already include the rows and columns added by the reformulation:

```
The original problem has:
   29859 rows, 30084 columns and 89752 non-zero elements
   250 binaries
Presolving the problem
The presolved problem has:
   29859 rows, 30084 columns and 89752 non-zero elements
   250 binaries
```

### References

1. Find common data point in multiple sets, https://stackoverflow.com/questions/73904007/find-common-data-point-in-multiple-sets
2. Liberti, L., Compact linearization for binary quadratic problems, *4OR* **5**, 231–245 (2007).

### Appendix: GAMS model

```
$onText
$offText

*------------------------------------------------------
* options
*------------------------------------------------------
option mip=cplex, miqcp=cplex, threads=0, reslim=1000;

*------------------------------------------------------
* size of the problem
*------------------------------------------------------
sets
   i 'points' /i1*i50/
   g 'groups' /g1*g5/
;

*------------------------------------------------------
* group membership: random groups
*------------------------------------------------------
set ig(i,g) 'group membership';
scalar ng 'random group number';
loop(i,
   ng = uniformint(1, card(g));
   ig(i,g) = ord(g) = ng;
);
display i,g,ig;

*------------------------------------------------------
* derived data: allowed pairs ok(i,j)
*------------------------------------------------------
alias(i,j);
set ok(i,j) 'allowed pairs';
* prevent double counting
ok(i,j) = ord(i)<ord(j);
* exclude points in same group
ok(i,j)$sum(g$(ig(i,g) and ig(j,g)),1) = no;

*------------------------------------------------------
* distances from random locations
*------------------------------------------------------
set coord /x,y/;
parameter loc(i,coord) 'locations';
loc(i,coord) = uniform(0,100);
display loc;

parameter dist(i,j) 'distances';
dist(ok(i,j)) = sqrt(sum(coord, sqr(loc(i,coord)-loc(j,coord))));
display dist;

*------------------------------------------------------------------
* reporting macros
*------------------------------------------------------------------
parameter results(*,*);
acronym Optimal;
acronym TimeLimit;

* macros for reporting
$macro report(m,label) \
   results('variables',label) = m.numvar; \
   results(' discrete',label) = m.numdvar; \
   results('equations',label) = m.numequ; \
   results('status',label) = m.solvestat; \
   results('status',label)$(m.solvestat=1) = Optimal; \
   results('status',label)$(m.solvestat=3) = TimeLimit; \
   results('obj',label) = td.l; \
   results('time',label) = m.resusd; \
   results('nodes',label) = m.nodusd; \
   results('iterations',label) = m.iterusd; \
   display results;

*------------------------------------------------------
* optimization model 1: non-convex MIQP
*------------------------------------------------------
binary variables x(i) 'select points';
variable td 'total distance';

equations
   totalDist1  'quadratic objective'
   onePoint(g) 'one point per group'
;

totalDist1..  td =e= sum(ok(i,j), dist(i,j)*x(i)*x(j));
onePoint(g).. sum(ig(i,g), x(i)) =e= 1;

model selection1 /all/;
solve selection1 minimizing td using miqcp;
display x.l;
report(selection1,"MIQP")

*------------------------------------------------------
* optimization model 2: MIP (linearized version)
*------------------------------------------------------
* relax variable pair
binary variables pair(i,j) 'both i and j are selected';
pair.prior(i,j) = INF;

equations
   totalDist2        'linear objective'
   bothSelected(i,j) 'points i and j are selected'
;

totalDist2.. td =e= sum(ok, pair(ok)*dist(ok));
* x(i)=x(j)=1 ==> pair(i,j)=1
bothSelected(ok(i,j)).. pair(i,j) =g= x(i) + x(j) - 1;

model selection2 /totalDist2, onePoint, bothSelected/;
solve selection2 minimizing td using mip;
display x.l;
report(selection2,"LINEARIZED")

*------------------------------------------------------
* optimization model 3: compact RLT
* (Reformulation-Linearization Technique)
*------------------------------------------------------
equation RLT(j,g) 'Reformulation-Linearization Technique';
RLT(j,g)$(not ig(j,g))..
   sum(ig(i,g), pair(i,j)$ok(i,j) + pair(j,i)$ok(j,i)) =e= x(j);

model selection3 /totalDist2, onePoint, RLT/;
solve selection3 minimizing td using mip;
display x.l;
report(selection3,"RLT")

*------------------------------------------------------
* data for plotting
*------------------------------------------------------
parameter points(i,g,*);
points(ig(i,g),coord) = loc(i,coord);
display points;

parameter lines(i,j,*);
loop((i,j)$(pair.l(i,j)>0.5),
   lines(i,j,'x1') = loc(i,'x');
   lines(i,j,'y1') = loc(i,'y');
   lines(i,j,'x2') = loc(j,'x');
   lines(i,j,'y2') = loc(j,'y');
);
display lines;
```

### Comments

This solves quickly, but for a larger instance you might do better with a compact (RLT-based) linearization, obtained by omitting the bothSelected constraints and multiplying both sides of the onePoint(g) constraints by x[j] for all points j that are not in group g, yielding something like this:

```
con Linearize {g in GROUPS, j in POINTS: group[j] ne g}:
   sum {<i,(j)> in PAIRS: group[i] = g} pair[i,j]
 + sum {<(j),i> in PAIRS: group[i] = g} pair[j,i] = x[j];
```

This new formulation solved at the root node for me, whereas the original formulation has a root LP bound of 0.

Yes, that also worked better for me. It is even better for Gurobi, although Gurobi generates RLT cuts for the traditional linearization on its own.

Thank you for the updates. It looks like you accidentally omitted the onePoint constraints from the RLT write-up but correctly included them in the code.

That was sloppy.
