## Saturday, August 13, 2022

### k out of n with smallest bandwidth

The problem from [1], slightly generalized, is easy to describe:

Given an $$n$$-vector $$\color{darkblue}v_i$$ of data, select the $$k\lt n$$ elements that are closest to each other.

Of course, we can state the problem as a formal optimization model:

Optimization Model
\begin{align}\min\>&\color{darkred}U-\color{darkred}L \\ & \color{darkred}\delta_i=1 \Rightarrow \color{darkblue}v_i \in [\color{darkred}L,\color{darkred}U] \\ & \sum_i \color{darkred}\delta_i=\color{darkblue}k \\ & \color{darkred}\delta_i \in \{0,1\} \end{align}
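For tiny instances, this problem statement can be sanity-checked by brute force over all $$k$$-subsets. A minimal Python sketch (the function name is mine, purely illustrative):

```python
from itertools import combinations

def best_subset_bruteforce(v, k):
    """Enumerate all k-subsets and return the one with the smallest range U-L."""
    best = None
    for subset in combinations(v, k):
        r = max(subset) - min(subset)  # range of this selection
        if best is None or r < best[0]:
            best = (r, subset)
    return best

# tiny example: pick the 3 values (out of 6) closest to each other
r, sel = best_subset_bruteforce([4.0, 9.0, 1.0, 5.0, 12.0, 6.0], 3)
# r == 2.0, sel == (4.0, 5.0, 6.0)
```

This is only usable for very small $$n$$: the number of subsets $$\binom{n}{k}$$ explodes quickly, which is why a MIP model (or the sorting algorithm below in this post) is needed for realistic sizes.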

We can model this as a standard Mixed-Integer Programming (MIP) problem as follows:

MIP Model
\begin{align}\min\>& \color{darkred}U-\color{darkred}L\\ & \color{darkblue}v_i \ge \color{darkred}L - (1-\color{darkred}\delta_i)\cdot\color{darkblue}M \\ & \color{darkblue}v_i \le \color{darkred}U + (1-\color{darkred}\delta_i)\cdot \color{darkblue}M \\ & \sum_i \color{darkred}\delta_i=\color{darkblue}k \\ & \color{darkred}\delta_i \in \{0,1\} \\ & \color{darkred}L,\color{darkred}U \in [ \min_i \color{darkblue}v_i, \max_i \color{darkblue}v_i] \end{align}

Here $$\color{darkblue}M$$ is a constant that must be large enough, but should be as small as possible. A reasonable value is $$\color{darkblue}M := \max_i \color{darkblue}v_i - \min_i \color{darkblue}v_i$$.
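In code this value is just the spread of the data. A small Python check (illustrative, not from the post) confirms that with this $$\color{darkblue}M$$ and $$\color{darkred}\delta_i=0$$ the big-M constraints become non-binding:

```python
def big_m_value(v):
    """Smallest generic big-M for this model: the spread of the data."""
    return max(v) - min(v)

v = [3.0, -7.5, 12.25, 0.5]
M = big_m_value(v)   # 19.75 for this toy vector

# With delta_i = 0, the constraint v_i >= L - M must hold for every
# feasible L in [min(v), max(v)]; the worst case is L = max(v).
L = max(v)
assert all(vi >= L - M for vi in v)
```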

### Data

To experiment, I chose $$n=1000, k=100$$ and generated random data $$\color{darkblue}v_i \sim U(-100,100)$$. We would expect the optimal range to be a bit better than 20. Indeed, the optimal solution has an objective of 15.923.
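The "a bit better than 20" guess follows from order statistics: $$k$$ consecutive order statistics of $$n$$ uniform draws on $$[-100,100]$$ span about $$200(k-1)/(n+1)$$ in expectation, and the best window does a little better than that. A small Python sketch of the data generation (the seed is arbitrary, not the one used for the experiments in this post):

```python
import random

random.seed(123)  # arbitrary seed, just for reproducibility of this sketch
n, k = 1000, 100
v = [random.uniform(-100, 100) for _ in range(n)]

# Expected span of k consecutive order statistics of n uniforms on [-100,100]:
expected_window = 200 * (k - 1) / (n + 1)   # roughly 19.78
```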

### Performance

"k out of n" models can be difficult, and indeed the performance of this straightforward model is not great.

We can improve the performance significantly by changing the model just slightly:

Improved MIP Model
\begin{align}\min\>& \color{darkred}R \\ & \color{darkred}R = \color{darkred}U-\color{darkred}L\\ & \color{darkblue}v_i \ge \color{darkred}L - (1-\color{darkred}\delta_i)\cdot\color{darkblue}M \\ & \color{darkblue}v_i \le \color{darkred}U + (1-\color{darkred}\delta_i)\cdot \color{darkblue}M \\ & \sum_i \color{darkred}\delta_i=\color{darkblue}k \\ & \color{darkred}\delta_i \in \{0,1\} \\ & \color{darkred}L,\color{darkred}U \in [ \min_i \color{darkblue}v_i, \max_i \color{darkblue}v_i] \\ & \color{darkred}R \ge 0\end{align}

By imposing $$\color{darkred}R \ge 0$$, we implicitly added the constraint $$\color{darkred}U\ge\color{darkred}L$$.

```
----    114 PARAMETER results

                   MIP    IMPROVED

n             1000.000    1000.000
k              100.000     100.000
vars          1003.000    1003.000
discr         1000.000    1000.000
equs          2002.000    2002.000
status         Optimal     Optimal
obj             15.923      15.923
time            77.203       2.468
nodes        53387.000    6445.000
iterations  355004.000   88794.000
```


The effect of our change is rather dramatic. I have seen this before, so this is not just an accidental result. I hoped the presolver would be smart enough to deduce this, but alas, we need to worry about this ourselves.

### Sorting-based Algorithm

A totally different approach exploits the observation that an optimal selection always consists of consecutive elements of the sorted vector:

1. Sort the vector $$\color{darkblue}v$$.
2. Calculate the ranges $$r_j := U_j-L_j = \color{darkblue}v_{j+k-1} - \color{darkblue}v_j$$ for $$j=1,\dots,n-k+1$$.
3. Pick the best $$r_j$$.

*Figure: sorting-based algorithm*
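The three steps translate almost directly into Python (an illustrative sketch; the post's actual implementation is the embedded Python code in the appendix):

```python
def smallest_bandwidth(v, k):
    """Sort, then scan all windows of k consecutive sorted values and
    keep the tightest one. Returns (best range, selected values)."""
    s = sorted(v)                                            # step 1: sort
    ranges = [(s[j + k - 1] - s[j], j)                       # step 2: window ranges
              for j in range(len(s) - k + 1)]
    best_r, best_j = min(ranges)                             # step 3: best window
    return best_r, s[best_j:best_j + k]

r, sel = smallest_bandwidth([4.0, 9.0, 1.0, 5.0, 12.0, 6.0], 3)
# r == 2.0, sel == [4.0, 5.0, 6.0]
```

The cost is dominated by the sort, so this runs in $$O(n \log n)$$, compared to branch-and-bound for the MIP models.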

When I implement this, I see:

```
----    171 PARAMETER results

                   MIP    IMPROVED     SORTING

n             1000.000    1000.000    1000.000
k              100.000     100.000     100.000
vars          1003.000    1003.000
discr         1000.000    1000.000
equs          2002.000    2002.000
status         Optimal     Optimal
obj             15.923      15.923      15.923
time            77.203       2.468       0.090
nodes        53387.000    6445.000
iterations  355004.000   88794.000
```


### Conclusion

The MIP models for this problem are interesting: minor changes in the model can improve the solution time by a large amount. However, an algorithm based on sorting is much faster than solving a discrete optimization problem. On our random data set, the performance differences are orders of magnitude.

### Appendix: GAMS code

```gams
$ontext

   Given a data vector v(i) of length n, find k elements (k<n)
   that are closest to each other.

$offtext

set i 'data points' /i1*i1000/;
scalar k 'number of elements to select' /100/;

parameter v(i) 'data';
v(i) = uniform(-100,100);

scalar M 'big-M constant';
M = smax(i,v(i)) - smin(i,v(i));

variables
   L 'lower bound of selected elements'
   U 'upper bound of selected elements'
   R 'range (objective)'
;
binary variable delta(i) 'selection';

L.lo = smin(i,v(i)); L.up = smax(i,v(i));
U.lo = smin(i,v(i)); U.up = smax(i,v(i));

acronym Optimal;
parameter results(*,*);
$macro report(m,label) \
   results('n',label) = card(i); \
   results('k',label) = k; \
   results('vars',label) = m.numvar; \
   results('discr',label) = m.numdvar; \
   results('equs',label) = m.numequ; \
   results('status',label)$(m.modelstat=1 or m.modelstat=8) = Optimal; \
   results('obj',label) = m.objval; \
   results('time',label) = m.resusd; \
   results('nodes',label) = m.nodusd; \
   results('iterations',label) = m.iterusd;

equations
   obj          'objective'
   vminbound(i) 'delta(i)=1 ==> v(i)>=L'
   vmaxbound(i) 'delta(i)=1 ==> v(i)<=U'
   select       'k to be selected'
;

obj.. R =e= U-L;
vmaxbound(i).. v(i) =l= U + (1-delta(i))*M;
vminbound(i).. v(i) =g= L - (1-delta(i))*M;
select.. sum(i, delta(i)) =e= k;

model m1 /all/;
options threads=0;
solve m1 minimizing R using mip;
display delta.l,U.l,L.l,R.l;
report(m1,"MIP")

*-------------------------------------------------------
* Add bound R>=0
*
* R will no longer be substituted out.
*-------------------------------------------------------

* the next one is important for performance
R.lo = 0;

solve m1 minimizing R using mip;
report(m1,"IMPROVED")

*-------------------------------------------------------
* Algorithm
*-------------------------------------------------------

* in case we have zeros
v(i) = EPS+v(i);

set psol(i) 'Python solution';
scalar time;
time = timeElapsed;

embeddedCode Python:
#
# get GAMS data
#
k = int(list(gams.get("k"))[0])
v = list(gams.get("v"))
#
# sort v
#
v.sort(key = lambda x: x[1])
#
# find best range
#
n = len(v)
bestr = 1e10
bestj = -1
for j in range(0,n-k+1):
    r = v[j+k-1][1]-v[j][1]
    if r < bestr:
       bestr = r
       bestj = j
print(f"Best range:{bestr} at j={bestj}")
#
# report solution back as subset
#
sol = []
for j in range(bestj,bestj+k):
    sol.append(v[j][0])
gams.set("psol",sol)
endEmbeddedCode psol

time = timeElapsed-time;
display psol;

results('obj','SORTING') = smax(psol,v[psol])-smin(psol,v[psol]);
results('time','SORTING') = time;
results('n','SORTING') = card(i);
results('k','SORTING') = k;
display results;
```