## Tuesday, June 7, 2016

### Optimizing some R code

I have a somewhat largish data set with $$n=5291$$ rows:

 > str(df) 'data.frame': 5291 obs. of 5 variables: $ID : int 1 2 3 4 5 6 7 8 9 10 ...$ Performance: num 0.705 0.707 0.709 0.711 0.713 ... $Cost : int 4365766 4366262 4367526 4368147 4369411 4372412 4372908 4374172 4374793 4376057 ...$ Weight : int 49595 50335 51091 50242 50998 53125 53865 54621 53772 54528 ... $Distance : num 0.997 0.991 0.987 0.957 0.953 ... Three of the columns indicate objectives of an underlying optimization problem: Cost, Weight and Performance. The rows correspond to the Pareto optimal set. In (1) an algorithm is shown to find a subset, called k-optimal solutions. For this we need to calculate the following. First define a function $$\Gamma(v)$$ which compares one objective function value $$k$$ between two different points $$i$$ and $$j$$. Let’s assume minimization for all objectives, then we have: $\Gamma(v)=\begin{cases} 0 & \>\text{if v>0}\\ 1 & \>\text{if v\le 0} \end{cases}$ Basically it says that $$v=f_i^k-f_j^k$$, $$\Gamma(v)=1$$ if objective $$k$$ in point $$i$$ is better (or the same) than the corresponding objective in point $$j$$ and $$\Gamma(v)=0$$ otherwise. The goal of this exercise is to calculate a matrix $K_{i,j} = \sum_k \Gamma(f_i^k-f_j^k) – 1$ Obviously this is an operation that is quadratic in $$n$$. We have another small vector objs that as the objective names and the direction ($$1$$=minimize, $$-1$$=maximize):  > objs Performance Cost Weight -1 1 1  ##### Attempt 1: use the data frame directly The first try, just operates on the data frame df directly.  # allow to be called with smaller ncalcK1 <- function(n=nrow(df)) { gamma <- function(i,j,k) { v <- objs[k]*(df[i,k] - df[j,k]) if (v>0) 0L else 1L } K<-matrix(0L,n,n) for (i in 1:n) { for (j in 1:n) if (i!=j) K[i,j] <- sum(sapply(names(objs),function(k) gamma(i,j,k))) – 1L } K} We try for different subsets of $$n$$:  size $$n$$ time (sec) 50 0.9 100 3.7 150 8.3 200 14.9 5291 2.9 hours (estimated) That is really terrible. ##### Attempt 2: use a matrix One bottleneck could be the use of a data frame. The most inner loop jumps from one column to the next, and that is a relatively expensive operation. Let’s try to convert the data first to a matrix and then operate on the matrix:  calcK2 <- function(n=nrow(df)) { # calculate matrix A m <- length(objs) A <- matrix(0,n,m) k <- 0L for (c in names(objs)) { k <- k + 1L A[,k] <- df[1:n,c]*objs[c] } gamma <- function(i,j,k) { if ((A[i,k] - A[j,k])>0) 0L else 1L } sumgamma <- function(i,j) { s <- 0L for (k in 1:m) { s <- s + gamma(i,j,k) } s-1L } K<-matrix(0L,n,n) for (i in 1:n) { for (j in 1:n) if (i!=j) K[i,j] <- sumgamma(i,j) } K} This is much better:  size $$n$$ time (sec) 50 0.06 100 0.25 150 0.58 200 1.03 5291 0.2 hours (estimated) To get better performance we should get rid of these loops somehow. In R loops are very slow. In C++ loops don’t have this disadvantage, so instead of trying to vectorize this (besides I don’t see immediately how), let’s use some C++ code. ##### Attempt 3: use some C++ code The Rcpp package makes it very easy to develop C++ functions for use inside R. We assume again we first make a matrix $$A$$. The C++ code looks like:  #include using namespace Rcpp;// [[Rcpp::export]]IntegerVector calcKcpp(NumericVector A, int n, int nk) { IntegerVector K(n*n); // will be initialized to zero for (int i=0; i Note that R matrices are stored column-wise (the old Fortran way). We use this function as follows:  Rcpp::sourceCpp('c:/tmp/kepsilon/kepsc.cpp')calcK3 <- function(n=nrow(df)) { # calculate matrix A m <- length(objs) A <- matrix(0,n,m) k <- 0L for (c in names(objs)) { k <- k + 1L A[,k] <- df[1:n,c]*objs[c] } # call C++ code K <- calcKcpp(A,n,m) dim(K) = c(n,n) K} The timings are now:  size $$n$$ time (sec) 50 0 100 0 150 0 200 0 5291 0.9 seconds This is more than fast enough to allow for some experimentation with this data set. ##### Comparison: computation in GAMS Here we do the same computation in GAMS.  sets i /1*5291/ obj /performance,weight,cost/;alias (i,j);parameter d(i,obj) 'data';$gdxin testsolution.gdx$loaddc dparameter objmin(obj) /performance -1, weight 1, cost 1/;* make all objectives minimizationd(i,obj) = objmin(obj)*d(i,obj);parameter K(i,j);K(i,j)$(ord(i)<>ord(j)) = sum(obj\$(d(i,obj)<=d(j,obj)),1)-1;

The time needed for the full problem is:

 size $$n$$ time (sec) 5291 34 seconds

GAMS is actually doing very well on this.
##### References

(1) Gobbi, Massimiliano. 2013. “A K, K-$$\varepsilon$$ Optimality Selection Based Multi Objective Genetic Algorithm with Applications to Vehicle Engineering.” Optimization and Engineering 14 (2). Springer: 345–60. doi:10.1007/s11081-011-9185-8.