Often I see people who have read or heard about NP-completeness dismissing a MIP formulation even before trying. This is certainly a misunderstanding of complexity theory. To be clear:

Complexity theory says (almost)nothingabout how fast a particular MIP solver can solve your particular problem on your particular dataset.

I think we can even drop the "*almost*." So, let's not fall into this trap, and just try things out!

In [1] a simple problem is stated:

Given \(n=40,000\) numbers, select a subset that sums up to 100,000 or as closely as possible.

A small MIP model can do the job:

MIP model |
---|

\[\begin{align}\min\>&\left|\mathit{\color{darkred}{Deviation}}\right| \\ & \sum_i \color{darkblue}v_i \cdot \color{darkred}x_i = \mathit{\color{darkblue}{Target}} + \mathit{\color{darkred}{Deviation}}\\ & \color{darkred}x_i \in \{0,1\}\end{align} \] |

The absolute value can be linearized using textbook methods. The data in [1] seems integer-valued. So I generated 40,000 random integers between 0 and 10000. If we solve this with the open-source MIP solver CBC we see:

------------------------------------------------------------------------------- Summary ------------------------------------------------------------------------------- (CVXPY) Dec 27 02:04:49 PM: Problem status: optimal (CVXPY) Dec 27 02:04:49 PM: Optimal value: 1.438e-11 (CVXPY) Dec 27 02:04:49 PM: Compilation took 1.025e-01 seconds (CVXPY) Dec 27 02:04:49 PM: Solver (including time spent in interface) took 1.137e+00 seconds i:6 value:9273 i:9 value:6112 i:11 value:7093 i:13 value:9209 i:15 value:9063 i:3857 value:9209 i:9341 value:2035 i:12413 value:9209 i:15914 value:2035 i:15989 value:2126 i:16558 value:2035 i:16965 value:2035 i:24271 value:63 i:24290 value:7093 i:26791 value:624 i:30330 value:6752 i:32612 value:6752 i:34308 value:219 i:36798 value:9063 sum=100000

This is not bad at all!

A major reason the solver had an easy time here is that the data is **integer-valued**. The solver quickly finds a combination that adds up to the target and can stop then. If the data is** real-valued**, CBC has a much more difficult time. After 1,000 seconds, it still is not closer than 0.00312158 from the target. However, a commercial solver like Cplex can solve this more difficult problem to proven optimality in less than 5 seconds (final deviation is 9.094947E-13, i.e. noise). If you are happy with a good solution, CBC, after just 100 seconds, has found a solution with a deviation of 0.22514964. Here we see performance can heavily depend on the (structure of the) data and on the solver.

Here are some other reasons why it is a good idea to try a MIP solver first:

- It is a good tool to verify we understand the problem. We can think of it as an executable form of a mathematical model. This will check if our model is correct. MIP solvers are quite picky and tend to fail miserably if the model or data are incorrect. We can use this to our advantage.
- Who knows, maybe the MIP model solves the problem very quickly.
- Even if the MIP solver can not find proven optimal solutions in an acceptable time, it can often (but not always) find good solutions very quickly.

### References

- Find the exact or the most approximate sum by several numbers in a column, https://stackoverflow.com/questions/74883420/find-exact-or-the-most-aproximate-sum-by-several-numbers-in-a-column

### Appendix: CVXPY model

import cvxpy as cp import random # # size of the problem # N = 40000 I = range(N) # # data # target = 100000 random.seed(12345) v = [random.randrange(0,10000) for i in I] # # optimization model # x = cp.Variable(N,boolean=True) deviation = cp.Variable(1) prob = cp.Problem(cp.Minimize(cp.abs(deviation)), [v@x==target+deviation]) # # solve and reporting # prob.solve(solver=cp.CBC,verbose=True) xsum = 0 for i in I: if x[i].value > 0.5: print(f"i:{i} value:{v[i]}") xsum += v[i] print(f"sum={xsum}")

=============================================================================== CVXPY v1.2.1 =============================================================================== (CVXPY) Dec 27 02:04:48 PM: Your problem has 40001 variables, 1 constraints, and 0 parameters. (CVXPY) Dec 27 02:04:48 PM: It is compliant with the following grammars: DCP, DQCP (CVXPY) Dec 27 02:04:48 PM: (If you need to solve this problem multiple times, but with different data, consider using parameters.) (CVXPY) Dec 27 02:04:48 PM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution. ------------------------------------------------------------------------------- Compilation ------------------------------------------------------------------------------- (CVXPY) Dec 27 02:04:48 PM: Compiling problem (target solver=CBC). (CVXPY) Dec 27 02:04:48 PM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> CBC (CVXPY) Dec 27 02:04:48 PM: Applying reduction Dcp2Cone (CVXPY) Dec 27 02:04:48 PM: Applying reduction CvxAttr2Constr (CVXPY) Dec 27 02:04:48 PM: Applying reduction ConeMatrixStuffing (CVXPY) Dec 27 02:04:48 PM: Applying reduction CBC (CVXPY) Dec 27 02:04:48 PM: Finished problem compilation (took 1.025e-01 seconds). ------------------------------------------------------------------------------- Numerical solver ------------------------------------------------------------------------------- (CVXPY) Dec 27 02:04:48 PM: Invoking solver CBC to obtain a solution. Welcome to the CBC MILP Solver Version: 2.10.5 Build Date: Feb 9 2021 command line - ICbcModel -solve -quit (default strategy 1) Continuous objective value is 0 - 0.12 seconds Cgl0004I processed model has 2 rows, 9835 columns (9834 integer (722 of which binary)) and 19670 elements Cbc0012I Integer solution of 722 found by DiveCoefficient after 0 iterations and 0 nodes (0.29 seconds) Cbc0038I Full problem 2 rows 9835 columns, reduced to 2 rows 3 columns Cbc0012I Integer solution of 65 found by DiveCoefficient after 0 iterations and 0 nodes (0.33 seconds) Cbc0013I At root node, 0 cuts changed objective from 0 to 0 in 1 passes Cbc0014I Cut generator 0 (Probing) - 0 row cuts average 0.0 elements, 1 column cuts (1 active) in 0.004 seconds - new frequency is 1 Cbc0014I Cut generator 1 (Gomory) - 0 row cuts average 0.0 elements, 0 column cuts (0 active) in 0.001 seconds - new frequency is -100 Cbc0014I Cut generator 2 (Knapsack) - 0 row cuts average 0.0 elements, 0 column cuts (0 active) in 0.000 seconds - new frequency is -100 Cbc0014I Cut generator 3 (Clique) - 0 row cuts average 0.0 elements, 0 column cuts (0 active) in 0.000 seconds - new frequency is -100 Cbc0014I Cut generator 4 (MixedIntegerRounding2) - 0 row cuts average 0.0 elements, 0 column cuts (0 active) in 0.001 seconds - new frequency is -100 Cbc0014I Cut generator 5 (FlowCover) - 0 row cuts average 0.0 elements, 0 column cuts (0 active) in 0.000 seconds - new frequency is -100 Cbc0010I After 0 nodes, 1 on tree, 65 best solution, best possible 0 (0.35 seconds) Cbc0012I Integer solution of 49 found by DiveCoefficient after 1 iterations and 1 nodes (0.37 seconds) Cbc0012I Integer solution of 13 found by DiveCoefficient after 2 iterations and 2 nodes (0.46 seconds) Cbc0012I Integer solution of 7 found by DiveCoefficient after 3 iterations and 3 nodes (0.79 seconds) Cbc0012I Integer solution of 6 found by DiveCoefficient after 4 iterations and 4 nodes (1.00 seconds) Cbc0012I Integer solution of 0 found by DiveCoefficient after 5 iterations and 5 nodes (1.02 seconds) Cbc0001I Search completed - best objective 0, took 5 iterations and 5 nodes (1.02 seconds) Cbc0032I Strong branching done 10 times (10 iterations), fathomed 0 nodes and fixed 0 variables Cbc0035I Maximum depth 4, 0 variables fixed on reduced cost Cuts at root node changed objective from 0 to 0 Probing was tried 6 times and created 1 cuts of which 0 were active after adding rounds of cuts (0.004 seconds) Gomory was tried 1 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.001 seconds) Knapsack was tried 1 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) Clique was tried 1 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) MixedIntegerRounding2 was tried 1 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.001 seconds) FlowCover was tried 1 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) TwoMirCuts was tried 1 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.001 seconds) ZeroHalf was tried 1 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds) Result - Optimal solution found Objective value: 0.00000000 Enumerated nodes: 5 Total iterations: 5 Time (CPU seconds): 1.08 Time (Wallclock seconds): 1.08 Total time (CPU seconds): 1.09 (Wallclock seconds): 1.09 ------------------------------------------------------------------------------- Summary ------------------------------------------------------------------------------- (CVXPY) Dec 27 02:04:49 PM: Problem status: optimal (CVXPY) Dec 27 02:04:49 PM: Optimal value: 1.438e-11 (CVXPY) Dec 27 02:04:49 PM: Compilation took 1.025e-01 seconds (CVXPY) Dec 27 02:04:49 PM: Solver (including time spent in interface) took 1.137e+00 seconds i:6 value:9273 i:9 value:6112 i:11 value:7093 i:13 value:9209 i:15 value:9063 i:3857 value:9209 i:9341 value:2035 i:12413 value:9209 i:15914 value:2035 i:15989 value:2126 i:16558 value:2035 i:16965 value:2035 i:24271 value:63 i:24290 value:7093 i:26791 value:624 i:30330 value:6752 i:32612 value:6752 i:34308 value:219 i:36798 value:9063 sum=100000

Important is to choose right solver for your problem. With free solvers I was many times disappointed since they found no solutions in hours compared to commercial solvers finding solutions in seconds or a few minutes.

ReplyDeleteYes, the difference between open source solvers and commercial ones is very large (and it is increasing). In many other mathematical fields this is not the case. So this comes sometimes as a surprise to users from other fields.

Delete