The problem from [1], slightly generalized, is easy to describe:
Given an \(n\)-vector \(\color{darkblue}v_i\) of data, select \(k\lt n\) elements that are closest to each other.
Of course, we can state the problem as a formal optimization model:
Optimization Model
\[\begin{align}\min\>&\color{darkred}U-\color{darkred}L \\ & \color{darkred}\delta_i=1 \Rightarrow \color{darkblue}v_i \in [\color{darkred}L,\color{darkred}U] \\ & \sum_i \color{darkred}\delta_i=\color{darkblue}k \\ & \color{darkred}\delta_i \in \{0,1\} \end{align}\]
We can model this as a standard Mixed-Integer Programming (MIP) problem as follows:
MIP Model
\[\begin{align}\min\>& \color{darkred}U-\color{darkred}L\\ & \color{darkblue}v_i \ge \color{darkred}L - (1-\color{darkred}\delta_i)\cdot\color{darkblue}M \\ & \color{darkblue}v_i \le \color{darkred}U + (1-\color{darkred}\delta_i)\cdot \color{darkblue}M \\ & \sum_i \color{darkred}\delta_i=\color{darkblue}k \\ & \color{darkred}\delta_i \in \{0,1\} \\ & \color{darkred}L,\color{darkred}U \in [ \min_i \color{darkblue}v_i, \max_i \color{darkblue}v_i] \end{align}\]
Here \(\color{darkblue}M\) is a constant that must be large enough to make the constraints redundant when \(\color{darkred}\delta_i=0\), but should otherwise be as small as possible. A reasonable value is: \[\color{darkblue}M := \max_i \color{darkblue}v_i - \min_i \color{darkblue}v_i\]
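To make the big-M model concrete, here is a minimal sketch of the same formulation in Python using PuLP. This is my own illustration (solver defaults, seed, and variable names are assumptions), not the GAMS code from the appendix:

import random
import pulp

# random data mirroring the post's setup (the seed is my own choice)
random.seed(123)
n, k = 1000, 100
v = [random.uniform(-100, 100) for _ in range(n)]

M = max(v) - min(v)   # big-M: the full data range is always large enough

prob = pulp.LpProblem("k_closest", pulp.LpMinimize)
L = pulp.LpVariable("L", lowBound=min(v), upBound=max(v))
U = pulp.LpVariable("U", lowBound=min(v), upBound=max(v))
delta = [pulp.LpVariable(f"delta_{i}", cat="Binary") for i in range(n)]

prob += U - L                            # objective: minimize the range
for i in range(n):
    # delta[i]=1 forces v[i] into [L,U]; delta[i]=0 deactivates both constraints
    prob += v[i] >= L - (1 - delta[i]) * M
    prob += v[i] <= U + (1 - delta[i]) * M
prob += pulp.lpSum(delta) == k           # select exactly k elements

prob.solve()
print("range:", pulp.value(U) - pulp.value(L))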
Data
To experiment, I chose \(n=1000, k=100\) and generated random data \[\color{darkblue}v_i \sim U(-100,100)\] We would expect the optimal range to be a bit better than 20. Indeed the optimal solution has an objective of 15.923.
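Where does the estimate of 20 come from? For \(n\) uniform draws on an interval of length 200, order statistics give an expected span of \(k\) consecutive sorted values of \[\mathbb{E}\left[\color{darkblue}v_{(j+k-1)}-\color{darkblue}v_{(j)}\right] = 200\cdot\frac{k-1}{n+1}=200\cdot\frac{99}{1001}\approx 19.8\] The optimal range is the minimum over all \(n-k+1\) such windows, so it should come in somewhat below this average.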
Performance
"k out of n" models can be difficult. Indeed the performance of this default model is not great.
We can improve the performance significantly by changing the model just slightly:
Improved MIP Model
\[\begin{align}\min\>& \color{darkred}R \\ & \color{darkred}R = \color{darkred}U-\color{darkred}L\\ & \color{darkblue}v_i \ge \color{darkred}L - (1-\color{darkred}\delta_i)\cdot\color{darkblue}M \\ & \color{darkblue}v_i \le \color{darkred}U + (1-\color{darkred}\delta_i)\cdot \color{darkblue}M \\ & \sum_i \color{darkred}\delta_i=\color{darkblue}k \\ & \color{darkred}\delta_i \in \{0,1\} \\ & \color{darkred}L,\color{darkred}U \in [ \min_i \color{darkblue}v_i, \max_i \color{darkblue}v_i] \\ & \color{darkred}R \ge 0\end{align}\]
By imposing \(\color{darkred}R \ge 0\), we implicitly added the constraint \(\color{darkred}U\ge\color{darkred}L\).
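In terms of the PuLP sketch above, the same improvement is a small change (again, just an illustration):

# continuing the earlier PuLP sketch: minimize an explicit,
# nonnegative range variable instead of U - L directly
R = pulp.LpVariable("R", lowBound=0)   # R >= 0 implicitly adds U >= L
prob += R == U - L                     # defining constraint
prob.setObjective(R)                   # new objective
prob.solve()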
----    114 PARAMETER results

                    MIP    IMPROVED

n              1000.000    1000.000
k               100.000     100.000
vars           1003.000    1003.000
discr          1000.000    1000.000
equs           2002.000    2002.000
status          Optimal     Optimal
obj              15.923      15.923
time             77.203       2.468
nodes         53387.000    6445.000
iterations   355004.000   88794.000
The effect of our change is rather dramatic. I have seen this before, so this is not just an accidental result. I hoped the presolver would be smart enough to deduce this, but alas, we need to worry about this ourselves.
Sorting-based Algorithm
A totally different approach exploits the fact that an optimal selection always consists of \(k\) consecutive values of the sorted vector:
- Sort the vector \(\color{darkblue}v\).
- Calculate the ranges \(r_j := U_j-L_j = \color{darkblue}v_{j+k-1} - \color{darkblue}v_j \) for \(j=1,\dots,n-k+1\).
- Pick the best \(r_j\).
Sorting-based algorithm
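Here is a self-contained Python version of this algorithm (my own sketch; the appendix implements the same idea as embedded Python inside GAMS):

import random

def k_closest_range(v, k):
    """Return (range, indices) of the k values in v closest together."""
    order = sorted(range(len(v)), key=lambda i: v[i])   # sort once: O(n log n)
    s = [v[i] for i in order]
    # an optimal selection is always a window of k consecutive sorted values,
    # so only the n-k+1 candidate windows need to be checked
    j = min(range(len(s) - k + 1), key=lambda j: s[j + k - 1] - s[j])
    return s[j + k - 1] - s[j], order[j:j + k]

random.seed(123)
v = [random.uniform(-100, 100) for _ in range(1000)]
best, selected = k_closest_range(v, 100)
print("range:", best)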
When I implement this, I see:
----    171 PARAMETER results

                    MIP    IMPROVED     SORTING

n              1000.000    1000.000    1000.000
k               100.000     100.000     100.000
vars           1003.000    1003.000
discr          1000.000    1000.000
equs           2002.000    2002.000
status          Optimal     Optimal
obj              15.923      15.923      15.923
time             77.203       2.468       0.090
nodes         53387.000    6445.000
iterations   355004.000   88794.000
Conclusion
The MIP models for this problem are interesting: minor changes in the model can improve the solution time by a large amount. However, an algorithm based on sorting is much faster than solving a discrete optimization problem. On our random data set, the performance differences are orders of magnitude.
References
Appendix: GAMS code
$ontext

   Given a data vector v(i) of length n, find k elements (k<n)
   that are closest to each other.

$offtext
*-------------------------------------------------------
* problem size
*-------------------------------------------------------
set i 'length of data vector' /i1*i1000/;
scalar k 'number of selected items' /100/;
*-------------------------------------------------------
* random data
*-------------------------------------------------------
parameter v(i) 'random data';
v(i) = uniform(-100,100);
display v,k;
*-------------------------------------------------------
* derived data
*-------------------------------------------------------
scalars
   vmin 'minimum of data vector'
   vmax 'maximum of data vector'
   M    'big-M'
;
vmin = smin(i,v(i));
vmax = smax(i,v(i));
M = vmax-vmin;
*-------------------------------------------------------
* reporting macros
*-------------------------------------------------------
acronym TimeLimit;
acronym Optimal;
acronym Error;
parameter results(*,*);

$macro report(m,label)  \
    results('n',label) = card(i); \
    results('k',label) = k; \
    results('vars',label) = m.numVar; \
    results('discr',label) = m.numDVar; \
    results('equs',label) = m.numEqu; \
    results('status',label) = Error; \
    results('status',label)$(m.solvestat=1) = Optimal; \
    results('status',label)$(m.solvestat=3) = TimeLimit; \
    results('obj',label) = m.objval; \
    results('time',label) = m.resusd; \
    results('nodes',label) = m.nodusd; \
    results('iterations',label) = m.iterusd; \
    results('gap%',label)$(m.solvestat=3) = 100*abs(m.objest - m.objval)/abs(m.objest); \
    display results;
*-------------------------------------------------------
* MIP Model
*
* Note: R will be substituted out. So the solver
* will see U-L as objective function.
*-------------------------------------------------------
variables
   L 'lower bound on selected values'
   U 'upper bound on selected values'
   R 'U-L'
;
U.lo = vmin; U.up = vmax;
L.lo = vmin; L.up = vmax;
binary variables delta(i) 'selection of data points';
equations
   obj          'objective'
   vminbound(i) 'delta(i)=1 ==> v(i)>=L'
   vmaxbound(i) 'delta(i)=1 ==> v(i)<=U'
   select       'k to be selected'
;
obj.. R =e= U-L;
vmaxbound(i).. v(i) =l= U + (1-delta(i))*M;
vminbound(i).. v(i) =g= L - (1-delta(i))*M;
select.. sum(i, delta(i)) =e= k;
model m1 /all/;
options threads=0;
solve m1 minimizing R using mip;
display delta.l,U.l,L.l,R.l;
report(m1,"MIP")
*-------------------------------------------------------
* Add bound R>=0
*
* R will no longer be substituted out.
*-------------------------------------------------------
* the next one is important for performance
R.lo = 0;
solve m1 minimizing R using mip;
report(m1,"IMPROVED")
*-------------------------------------------------------
* Algorithm
*-------------------------------------------------------
* add EPS so zero values are stored and exported (GAMS drops zeros)
v(i) = EPS+v(i);
set psol(i) 'Python solution';
scalar time;
time = timeElapsed;
embeddedCode Python:
#
# get GAMS data
#
k = int(list(gams.get("k"))[0])
v = list(gams.get("v"))
#
# sort v
#
v.sort(key = lambda x: x[1])
#
# find best range
#
n = len(v)
bestr = 1e10   # smallest range found so far
bestj = -1     # start index of the best window
for j in range(0,n-k+1):
    r = v[j+k-1][1]-v[j][1]
    if r < bestr:
        bestr = r
        bestj = j
print(f"Best range:{bestr} at j={bestj}")
#
# report solution back as subset
#
sol = []
for j in range(bestj,bestj+k):
    sol.append(v[j][0])
gams.set("psol",sol)
endEmbeddedCode psol
time = timeElapsed-time;
display psol;
results('obj','SORTING') = smax(psol,v[psol])-smin(psol,v[psol]);
results('time','SORTING') = time;
results('n','SORTING') = card(i);
results('k','SORTING') = k;
display results;
Comments

Q: What solver did you use for this? I would imagine that Gurobi's presolve can infer R>=0.

A: Gurobi is not doing so great on this model. First: no, it will not deduce R>=0; these two models are different for Gurobi. The timings are 3520 and 2368 seconds, and the node counts are 1,339,948 and 9,024,449 (default settings).