Tuesday, June 7, 2016

Optimizing some R code

I have a somewhat largish data set with \(n=5291\) rows:

> str(df)
'data.frame':	5291 obs. of  5 variables:
 $ ID         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Performance: num  0.705 0.707 0.709 0.711 0.713 ...
 $ Cost       : int  4365766 4366262 4367526 4368147 4369411 4372412 4372908 4374172 4374793 4376057 ...
 $ Weight     : int  49595 50335 51091 50242 50998 53125 53865 54621 53772 54528 ...
 $ Distance   : num  0.997 0.991 0.987 0.957 0.953 ...

Three of the columns indicate objectives of an underlying optimization problem: Cost, Weight and Performance. The rows correspond to the Pareto optimal set.

In (1) an algorithm is shown to find a subset, called k-optimal solutions. For this we need to calculate the following. First define a function \(\Gamma(v)\) which compares one objective function value \(k\) between two different points \(i\) and \(j\). Let’s assume minimization for all objectives, then we have:

\[\Gamma(v)=\begin{cases} 0 & \>\text{if $v>0$}\\ 1 & \>\text{if $v\le 0$} \end{cases} \]

Basically it says that \(v=f_i^k-f_j^k\), \(\Gamma(v)=1\) if objective \(k\) in point \(i\) is better (or the same) than the corresponding objective in point \(j\) and \(\Gamma(v)=0\) otherwise. The goal of this exercise is to calculate a matrix

\[K_{i,j} =  \sum_k \Gamma(f_i^k-f_j^k) – 1\]

Obviously this is an operation that is quadratic in \(n\).

We have another small vector objs that as the objective names and the direction (\(1\)=minimize, \(-1\)=maximize):

> objs
Performance        Cost      Weight 
         -1           1           1 

Attempt 1: use the data frame directly

The first try, just operates on the data frame df directly.

# allow to be called with smaller n
calcK1 <- function(n=nrow(df)) {
 
  gamma <- function(i,j,k) {
    v <- objs[k]*(df[i,k] - df[j,k])
    if (v>0)
      0L
    else
      1L
  }
 
  K<-matrix(0L,n,n)
  for (i in 1:n) {
    for (j in 1:n)
      if (i!=j) K[i,j] <- sum(sapply(names(objs),function(k) gamma(i,j,k))) – 1L
  }
  K
}

We try for different subsets of \(n\):

size \(n\) time (sec)
50 0.9
100 3.7
150 8.3
200 14.9
5291 2.9 hours (estimated)

That is really terrible.

Attempt 2: use a matrix

One bottleneck could be the use of a data frame. The most inner loop jumps from one column to the next, and that is a relatively expensive operation. Let’s try to convert the data first to a matrix and then operate on the matrix:

calcK2 <- function(n=nrow(df)) {
 
  # calculate matrix A
  m <- length(objs)
  A <- matrix(0,n,m)
  k <- 0L
  for (c in names(objs)) {
    k <- k + 1L
    A[,k] <- df[1:n,c]*objs[c]
  }

  gamma <- function(i,j,k) {
    if ((A[i,k] - A[j,k])>0)
      0L
    else
      1L
  }
 
  sumgamma <- function(i,j) {
    s <- 0L
    for (k in 1:m) {
      s <- s + gamma(i,j,k)
    }
    s-1L
  }
 
  K<-matrix(0L,n,n)
  for (i in 1:n) {
    for (j in 1:n)
      if (i!=j) K[i,j] <- sumgamma(i,j)
  }
  K
}

This is much better:

size \(n\) time (sec)
50 0.06
100 0.25
150 0.58
200 1.03
5291 0.2 hours (estimated)

To get better performance we should get rid of these loops somehow. In R loops are very slow. In C++ loops don’t have this disadvantage, so instead of trying to vectorize this (besides I don’t see immediately how), let’s use some C++ code.

Attempt 3: use some C++ code

The Rcpp package makes it very easy to develop C++ functions for use inside R. We assume again we first make a matrix \(A\). The C++ code looks like:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector calcKcpp(NumericVector A, int n, int nk) {

  IntegerVector K(n*n); // will be initialized to zero
  for (int i=0; i<n; ++i)
    for (int j=0; j<n; ++j)
      if (i!=j) {
        int s = 0;
        for (int k=0; k<nk; ++k)
          if (A[k*n+i]<=A[k*n+j])
            ++s;

        K[j*n+i] = s-1;
      } 
  return K;
}

Note that R matrices are stored column-wise (the old Fortran way). We use this function as follows:

Rcpp::sourceCpp('c:/tmp/kepsilon/kepsc.cpp')

calcK3 <- function(n=nrow(df)) {

  # calculate matrix A
  m <- length(objs)
  A <- matrix(0,n,m)
  k <- 0L
  for (c in names(objs)) {
    k <- k + 1L
    A[,k] <- df[1:n,c]*objs[c]
  }
 
  # call C++ code
  K <- calcKcpp(A,n,m)
  dim(K) = c(n,n)
  K
}

The timings are now:

size \(n\) time (sec)
50 0
100 0
150 0
200 0
5291 0.9 seconds

This is more than fast enough to allow for some experimentation with this data set.

Comparison: computation in GAMS

Here we do the same computation in GAMS.

sets
  i
/1*5291/
  obj
/performance,weight,cost/
;
alias (i,j);

parameter d(i,obj) 'data'
;
$gdxin testsolution.gdx
$loaddc d

parameter objmin(obj) /performance -1, weight 1, cost 1/
;

* make all objectives minimization

d(i,obj) = objmin(obj)*d(i,obj);

parameter
K(i,j);
K(i,j)$(
ord(i)<>ord(j)) = sum
(obj$(d(i,obj)<=d(j,obj)),1)-1;

The time needed for the full problem is:

size \(n\) time (sec)
5291 34 seconds

GAMS is actually doing very well on this.
References

(1) Gobbi, Massimiliano. 2013. “A K, K-\(\varepsilon\) Optimality Selection Based Multi Objective Genetic Algorithm with Applications to Vehicle Engineering.” Optimization and Engineering 14 (2). Springer: 345–60. doi:10.1007/s11081-011-9185-8.