Saturday, August 13, 2022

k out of n with smallest bandwidth

The problem from [1], slightly generalized, is easy to describe:

Given an \(n\)-vector \(\color{darkblue}v_i\) with data, select the \(k\lt n\) elements that are closest to each other.


Of course, we can state the problem as a formal optimization model:  


Optimization Model
\[\begin{align}\min\>&\color{darkred}U-\color{darkred}L \\ & \color{darkred}\delta_i=1 \Rightarrow \color{darkblue}v_i \in [\color{darkred}L,\color{darkred}U] \\ & \sum_i \color{darkred}\delta_i=\color{darkblue}k \\ & \color{darkred}\delta_i \in \{0,1\} \end{align}\]


We can model this as a standard Mixed-Integer Programming (MIP) problem as follows:


MIP Model
\[\begin{align}\min\>& \color{darkred}U-\color{darkred}L\\ & \color{darkblue}v_i \ge \color{darkred}L - (1-\color{darkred}\delta_i)\cdot\color{darkblue}M \\ & \color{darkblue}v_i \le \color{darkred}U + (1-\color{darkred}\delta_i)\cdot \color{darkblue}M \\ & \sum_i \color{darkred}\delta_i=\color{darkblue}k \\ & \color{darkred}\delta_i \in \{0,1\} \\ & \color{darkred}L,\color{darkred}U \in [ \min_i \color{darkblue}v_i, \max_i \color{darkblue}v_i] \end{align}\]


Here \(\color{darkblue}M\) is a constant that must be large enough to make the constraints non-binding when \(\color{darkred}\delta_i=0\), but that should otherwise be as small as possible. A reasonable value is: \[\color{darkblue}M := \max_i \color{darkblue}v_i - \min_i \color{darkblue}v_i\]
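The big-M model above can also be sketched with SciPy's MILP interface. This is my own illustration, not the GAMS model from the appendix; I use a smaller instance (\(n=50\), \(k=10\)) just to keep the example quick. The variable vector is \([L, U, \delta_1,\dots,\delta_n]\), and the two implication constraints are written in the "\(\le\)" form that `LinearConstraint` expects.

```python
# Sketch of the big-M MIP model using SciPy's milp solver
# (my own illustration; the post itself uses GAMS).
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

rng = np.random.default_rng(42)
n, k = 50, 10                       # smaller than the post's n=1000, k=100
v = rng.uniform(-100, 100, size=n)

vmin, vmax = v.min(), v.max()
M = vmax - vmin                     # the big-M value suggested above

# variable vector x = [L, U, delta_1..delta_n]; minimize U - L
c = np.concatenate(([-1.0, 1.0], np.zeros(n)))

# v_i >= L - (1-delta_i)*M   <=>   L + M*delta_i <= v_i + M
A_lo = np.zeros((n, n + 2))
A_lo[:, 0] = 1.0
A_lo[np.arange(n), 2 + np.arange(n)] = M

# v_i <= U + (1-delta_i)*M   <=>   -U + M*delta_i <= M - v_i
A_up = np.zeros((n, n + 2))
A_up[:, 1] = -1.0
A_up[np.arange(n), 2 + np.arange(n)] = M

cons = [
    LinearConstraint(A_lo, ub=v + M),
    LinearConstraint(A_up, ub=M - v),
    # sum_i delta_i = k
    LinearConstraint(np.concatenate(([0.0, 0.0], np.ones(n)))[None, :], lb=k, ub=k),
]
integrality = np.concatenate(([0, 0], np.ones(n)))   # only delta is integer
bounds = Bounds(lb=np.concatenate(([vmin, vmin], np.zeros(n))),
                ub=np.concatenate(([vmax, vmax], np.ones(n))))

res = milp(c=c, constraints=cons, integrality=integrality, bounds=bounds)
print("objective (U-L):", res.fun)
```

For this small instance the solver returns the same value as scanning windows of \(k\) consecutive sorted values, which is a useful sanity check on the formulation.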


Data


To experiment, I chose \(n=1000, k=100\) and generated random data \[\color{darkblue}v_i \sim U(-100,100)\] With 1,000 uniform draws on an interval of width 200, a window of 100 consecutive sorted values spans about \(200\cdot(k-1)/(n+1) \approx 19.8\) on average, so we would expect the optimal range to be a bit better than 20. Indeed, the optimal solution has an objective of 15.923.


Performance


"k out of n" models can be difficult. Indeed the performance of this default model is not great.

We can improve the performance significantly by changing the model just slightly:

Improved MIP Model
\[\begin{align}\min\>& \color{darkred}R \\ & \color{darkred}R = \color{darkred}U-\color{darkred}L\\ & \color{darkblue}v_i \ge \color{darkred}L - (1-\color{darkred}\delta_i)\cdot\color{darkblue}M \\ & \color{darkblue}v_i \le \color{darkred}U + (1-\color{darkred}\delta_i)\cdot \color{darkblue}M \\ & \sum_i \color{darkred}\delta_i=\color{darkblue}k \\ & \color{darkred}\delta_i \in \{0,1\} \\ & \color{darkred}L,\color{darkred}U \in [ \min_i \color{darkblue}v_i, \max_i \color{darkblue}v_i] \\ & \color{darkred}R \ge 0\end{align}\]


By imposing \(\color{darkred}R \ge 0\), we implicitly added the constraint \(\color{darkred}U\ge\color{darkred}L\). This bound is always valid for our problem, and apparently it tightens the formulation enough to matter.


----    114 PARAMETER results  

                   MIP    IMPROVED

n             1000.000    1000.000
k              100.000     100.000
vars          1003.000    1003.000
  discr       1000.000    1000.000
equs          2002.000    2002.000
status         Optimal     Optimal
obj             15.923      15.923
time            77.203       2.468
nodes        53387.000    6445.000
iterations  355004.000   88794.000


The effect of our change is rather dramatic. I have seen this before, so it is not just an accidental result. I had hoped the presolver would be smart enough to deduce this, but alas, we need to worry about it ourselves.


Sorting-based Algorithm


A totally different approach is as follows:

  1. Sort the vector \(\color{darkblue}v\).
  2. Calculate the ranges \(r_j := U_j-L_j = \color{darkblue}v_{j+k-1} - \color{darkblue}v_j \) for \(j=1,\dots,n-k+1\). 
  3. Pick the best \(r_j\).
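The steps above rely on the observation that an optimal selection must consist of \(k\) consecutive values in sorted order. A minimal standalone sketch in plain Python (the function name `smallest_bandwidth` is my own; the appendix runs essentially the same logic inside GAMS):

```python
# Standalone sketch of the sorting-based algorithm.
import random

def smallest_bandwidth(v, k):
    """Return (range, selected values) of the k closest-together elements."""
    sv = sorted(v)
    # an optimal selection is k consecutive sorted values,
    # so we only need to scan n-k+1 windows
    bestj = min(range(len(sv) - k + 1), key=lambda j: sv[j + k - 1] - sv[j])
    return sv[bestj + k - 1] - sv[bestj], sv[bestj:bestj + k]

random.seed(123)
v = [random.uniform(-100, 100) for _ in range(1000)]
r, sel = smallest_bandwidth(v, 100)
print(f"best range: {r:.3f}")
```

The whole algorithm is dominated by the sort, so it runs in \(O(n \log n)\), which explains the large speedup over the MIP models.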



Sorting-based algorithm (implementation in the appendix below)




When I implement this, I see:

----    171 PARAMETER results  

                   MIP    IMPROVED     SORTING

n             1000.000    1000.000    1000.000
k              100.000     100.000     100.000
vars          1003.000    1003.000
  discr       1000.000    1000.000
equs          2002.000    2002.000
status         Optimal     Optimal
obj             15.923      15.923      15.923
time            77.203       2.468       0.090
nodes        53387.000    6445.000
iterations  355004.000   88794.000



Conclusion


The MIP models for this problem are interesting: minor changes in the model can improve the solution time by a large amount. However, an algorithm based on sorting is much faster than solving a discrete optimization problem. On our random data set, the performance differences span orders of magnitude.

References



Appendix: GAMS code

$ontext

  Given a data vector v(i) of length n, find k elements (k<n)
  that are closest to each other.

$offtext


*-------------------------------------------------------
* problem size
*-------------------------------------------------------

set i 'length of data vector' /i1*i1000/;

scalar k 'number of selected items' /100/;

*-------------------------------------------------------
* random data
*-------------------------------------------------------

parameter v(i) 'random data';
v(i) = uniform(-100,100);
display v,k;

*-------------------------------------------------------
* derived data
*-------------------------------------------------------

scalars
   vmin
   vmax
   M 'big-M'
;
vmin = smin(i,v(i));
vmax = smax(i,v(i));
M = vmax-vmin;


*-------------------------------------------------------
* reporting macros
*-------------------------------------------------------

acronym TimeLimit;
acronym Optimal;
acronym Error;

parameter results(*,*);
$macro report(m,label)  \
    results('n',label) = card(i); \
    results('k',label) = k; \
    results('vars',label) = m.numVar; \
    results('  discr',label) = m.numDVar; \
    results('equs',label) = m.numEqu; \
    results('status',label) = Error; \
    results('status',label)$(m.solvestat=1) = Optimal; \
    results('status',label)$(m.solvestat=3) = TimeLimit; \
    results('obj',label) = m.objval; \
    results('time',label) = m.resusd; \
    results('nodes',label) = m.nodusd; \
    results('iterations',label) = m.iterusd; \
    results('gap%',label)$(m.solvestat=3) = 100*abs(m.objest - m.objval)/abs(m.objest); \
    display results;

*-------------------------------------------------------
* MIP Model
*
* Note: R will be substituted out. So the solver
* will see U-L as objective function.
*-------------------------------------------------------


variables
   L    'lower bound on selected values'
   U    'upper bound on selected values'
   R    'U-L'
;

U.lo = vmin; U.up = vmax;
L.lo = vmin; L.up = vmax;

binary variables delta(i) 'selection of data points';

equations
    obj          'objective'
    vminbound(i) 'delta(i)=1 ==> v(i)>=L'
    vmaxbound(i) 'delta(i)=1 ==> v(i)<=U'
    select       'k to be selected'
;

obj.. R =e= U-L;

vmaxbound(i).. v(i) =l= U + (1-delta(i))*M;
vminbound(i).. v(i) =g= L - (1-delta(i))*M;

select.. sum(i, delta(i)) =e= k;

model m1 /all/;
options threads=0;
solve m1 minimizing R using mip;

display delta.l,U.l,L.l,R.l;
report(m1,"MIP")

*-------------------------------------------------------
* Add bound R>=0
*
* R will no longer be substituted out.
*-------------------------------------------------------

* the next one is important for performance
R.lo = 0;

solve m1 minimizing R using mip;
report(m1,"IMPROVED")

*-------------------------------------------------------
* Algorithm
*-------------------------------------------------------

* in case we have zeros
v(i) = EPS+v(i);

set psol(i) 'Python solution';

scalar time;
time = timeElapsed;

embeddedCode Python:

#
# get GAMS data
#
k = int(list(gams.get("k"))[0])
v = list(gams.get("v"))

#
# sort v
#
v.sort(key = lambda x: x[1])

#
# find best range
#
n = len(v)
bestr = 1e10
bestj = 1e10
for j in range(0,n-k+1):
    r = v[j+k-1][1]-v[j][1]
    if r < bestr:
       bestr = r
       bestj = j
print(f"Best range:{bestr} at j={bestj}")

#
# report solution back as subset
#
sol = []
for j in range(bestj,bestj+k):
    sol.append(v[j][0])
gams.set("psol",sol)

endEmbeddedCode psol

time = timeElapsed-time;


display psol;
results('obj','SORTING') = smax(psol,v[psol])-smin(psol,v[psol]);
results('time','SORTING') = time;
results('n','SORTING') = card(i);
results('k','SORTING') = k;
display results;

2 comments:

  1. What solver did you use for this? I would imagine that Gurobi's presolve can infer R>=0.

    1. Gurobi is not doing so great on this model. First: no, it will not deduce R>=0; these two models are different for Gurobi. The timings are 3520 and 2368 seconds, and the node counts are 1,339,948 and 9,024,449 (default settings).
