The R package leaps (http://cran.r-project.org/web/packages/leaps/leaps.pdf) provided a convenient way to perform feature selection. The methods are based on work by Alan Miller, described in:
Alan Miller, “Subset Selection in Regression”, Chapman and Hall/CRC; 2 edition (April 15, 2002)
In a related paper: Alan Miller, “Subsets of Regression Variables”, J. R. Statist. Soc. A (1984), 147, Part 3, pp. 389-425 (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.474.3223&rep=rep1&type=pdf) we even see some comments from Beale.
We should be able to reproduce the “exhaustive” method from the leaps package using an MIQP formulation. Something like:
Given a value for n we want to minimize SSQ. The binary variables δ indicate whether a feature j can appear in the model (we exclude the constant term: that one remains in the model). We need to solve this model for n=1,2,3,… To be complete, the other symbols are β: the quantities we want to estimate, data X,y and our big M constant.
I would guess that say Gurobi would do a better job than leaps. As it turns out, it was not easy to beat leaps.
To generate a random model we use a very simplistic scheme. Not sure if this has much in common with real world data sets, but a similar approach was used in Gatu, Kontoghiorghes, “Branch-and-Bound Algorithms for Computing the Best-Subset Regression Models”, Journal of Computational and Graphical Statistics, Volume 15, Number 1, Pages 139–156:
We use N=50 variables in total of which we want to select n=1,2,3,..,10 variables.
Here are the results:
---- 87 PARAMETER t = 112.897 elapsed time of solve loop
---- 87 SET selected selected variables
v5 v6 v8 v11 v13 v19 v31 v32 v35
+ v40 v43
---- 87 PARAMETER rss residual sum of squares
n1 836.651, n2 836.342, n3 836.060, n4 835.869, n5 835.677, n6 835.495, n7 835.330
---- 87 PARAMETER solvetime time for each solve
n1 0.080, n2 0.135, n3 0.609, n4 1.951, n5 3.876, n6 6.878, n7 15.971, n8 21.016
A few notes:
- The model is very sensitive to the size of M. I used M=1000. With M=1e5 we get into trouble. We can see messages like: Warning: max constraint violation (1.2977e-02) exceeds tolerance (possibly due to large matrix coefficient range).This QP formulation is not so easy to solve for Gurobi from a numerical standpoint.
- The RSS is going down when we add variables. As expected.
- Also as expected: the time to solve each sub-problem goes up when we have more variables in the model. “pick n out of N” constructs often cause performance problems.
- I used a small optcr (gap) and 4 threads to make Gurobi fast enough to beat Leaps. (Leaps takes 3 minutes and a loop of 10 solves with Gurobi takes 2 minutes).
- I often like to formulate a separate fit equation as in:
This turns out to be very bad w.r.t. performance. Of course the fit equation forms a completely dense block in the coefficient matrix, something Gurobi was not designed for (it likes sparse structures much better).
The R code is simply:
d<-dbGetQuery(db,"select * from data") %>% dcast(i ~ v) %>% select(-i)
system.time(r <- regsubsets(y~.,data=d,nvmax=10,really.big=T))
The first line gets the data from the database and applies a pivot operation (from a long, skinny data frame with three columns i,v,value to a wide one with 10 columns v1,v2,v3,…,v10). This trick is useful to remember is this happens a lot when reading data from a database (such data is typically organized in long, skinny tables opposed to a “matrix” organization). The second line calls the regsubsets function from the Leaps package.
The output is:
v1 v10 v11 v12 v13 v14 v15 v16 v17 v18 v19 v2 v20 v21 v22 v23 v24 v25 v26 v27 v28
user system elapsed
The sum of squares start out the same but they diverge just a little bit.
The selected variables are much the same as using Gurobi (but there are a few differences here and there). The time is actually close to that of Gurobi. This performance is much better than I expected.