In this post an interesting problem is posted. Unfortunately, little or no background is given on the problem. I am trying to suggest an optimization model as opposed to some heuristic algorithm in Matlab. Judging from the zero points I received for my answer, without much success. I still think the MIP model is rather elegant. Here is the problem:
We have a large matrix with 10 columns and around 250k rows. The values are –1 and +1. We need to select N rows such that if we add up the columns of the selected rows, the average of the absolute values of those sums is minimized.
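To make the objective concrete, here is a small Python sketch (the function name is mine) that evaluates it for a given selection of rows:

```python
import numpy as np

def objective(A, selected):
    """Average of the absolute column sums over the selected rows.

    A        : (rows, cols) matrix with -1/+1 entries
    selected : indices of the N chosen rows
    """
    col_sums = A[selected].sum(axis=0)   # one sum per column
    return np.abs(col_sums).mean()       # average of absolute values

# tiny illustration: two rows that cancel each other give objective 0
A = np.array([[ 1, -1,  1],
              [-1,  1, -1]])
print(objective(A, [0, 1]))   # 0.0
```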
We can formulate this as a very large MIP model with about 250k binary variables indicating whether a row is selected.
There is however a data trick we can use. Observe that each cell has only two possible values, so there are just 2^10 = 1024 distinct possible rows. Our data is now essentially a fixed 1024 x 10 matrix plus a vector of length 1024 of counts telling us how many rows of each type we observed.
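This counting step can be sketched in Python (the 250k-row data here is randomly generated as a stand-in for the real data); `np.unique` with `axis=0` collapses duplicate rows and returns their multiplicities:

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the real 250k x 10 data: random -1/+1 entries
data = rng.choice([-1, 1], size=(250_000, 10))

# patterns: the distinct rows; counts: how often each one occurs
patterns, counts = np.unique(data, axis=0, return_counts=True)
print(patterns.shape)   # at most (1024, 10)
print(counts.sum())     # 250000
```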
Now we only need 1024 integer variables. The upper bound of each integer variable is its count. Actually we can tighten that a little bit: it is the minimum of the count and N. The complete MIP model can look like:
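A sketch of the formulation just described (in my notation: \(x_i\) is the number of selected rows with pattern \(i\), \(c_i\) its count, and \(a_{ij}\) the 1024 x 10 pattern matrix):

```latex
\begin{align}
\min\ & \sum_j \left( y^+_j + y^-_j \right) \\
      & y^+_j - y^-_j = \sum_i a_{ij}\, x_i && \forall j \\
      & \sum_i x_i = N \\
      & x_i \in \{0,1,\dots,\min(c_i,N)\}, \quad y^+_j,\, y^-_j \ge 0
\end{align}
```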
We use a traditional variable splitting technique to model the absolute value. Note also that instead of the average or mean we just minimize the sum: the two differ only by the constant factor 1/10, so the optimal selection is the same.
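As a sketch of this formulation in Python (using `scipy.optimize.milp` rather than GAMS; the tiny pattern data is made up for illustration):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# toy data: 4 distinct +-1 patterns over 2 columns, with counts
A = np.array([[ 1,  1],
              [ 1, -1],
              [-1,  1],
              [-1, -1]])
counts = np.array([3, 2, 2, 1])
N = 4
m, n = A.shape                        # number of patterns, columns

# variables: x_i (integer, how many rows of pattern i to pick),
# then y+_j, y-_j (the variable split of the column sums)
nvar = m + 2 * n
c = np.r_[np.zeros(m), np.ones(2 * n)]   # minimize sum of y+ and y-

# sum_i a[i,j]*x_i - y+_j + y-_j = 0  for every column j
rows = np.hstack([A.T, -np.eye(n), np.eye(n)])
eq_cols = LinearConstraint(rows, np.zeros(n), np.zeros(n))
# sum_i x_i = N
eq_n = LinearConstraint(np.r_[np.ones(m), np.zeros(2 * n)], N, N)

bounds = Bounds(np.zeros(nvar),
                np.r_[np.minimum(counts, N), np.full(2 * n, np.inf)])
integrality = np.r_[np.ones(m), np.zeros(2 * n)]

res = milp(c, constraints=[eq_cols, eq_n],
           bounds=bounds, integrality=integrality)
print(res.fun)   # minimal sum of |column sums|; 0 for this toy data
```

Picking one row of each pattern makes every column sum to zero, so the optimum here is 0.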
To test this formulation, we first need to generate this 1,024-row matrix. In GAMS you can do this as follows:
set i /i1*i1024/, j /j1*j10/
    val / '-1','+1' /
    ps(i,j,val) / system.PowerSetRight /;
The set ps is populated with all 1024 sign patterns: for each row i, the element ps(i,j,val) records whether column j carries '-1' or '+1'.
Using this set, we form the parameter a by assigning a(i,j) = -1 or +1 according to which val element is present in ps(i,j,val).
This is exactly the format we are looking for.
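For readers without GAMS, an equivalent 1024 x 10 matrix can be built in a few lines of Python (my construction; the row ordering may differ from what PowerSetRight produces):

```python
import itertools
import numpy as np

# enumerate every possible +-1 row of length 10: 2**10 = 1024 rows
a = np.array(list(itertools.product([-1, 1], repeat=10)))
print(a.shape)                      # (1024, 10)
print(len(np.unique(a, axis=0)))    # 1024: all rows are distinct
```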
Even for large values of N, a MIP solver will solve this problem surprisingly fast. Here is a log from the open-source solver CBC:
COIN-OR Branch and Cut (CBC Library 2.8)
Calling CBC main solution routine...
Solved to optimality.
Very impressive results (other good MIP solvers perform similarly).