Yet Another Math Programming Consultant: Optimizing some R code

I have a somewhat largish data set with $n=5291$ rows:

> str(df)
'data.frame':	5291 obs. of  5 variables:
 $ ID         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Performance: num  0.705 0.707 0.709 0.711 0.713 ...
 $ Cost       : int  4365766 4366262 4367526 4368147 4369411 4372412 4372908 4374172 4374793 4376057 ...
 $ Weight     : int  49595 50335 51091 50242 50998 53125 53865 54621 53772 54528 ...
 $ Distance   : num  0.997 0.991 0.987 0.957 0.953 ...

Three of the columns indicate objectives of an underlying optimization problem: Cost, Weight and Performance. The rows correspond to the Pareto optimal set.

In (1) an algorithm is shown to find a subset, called k-optimal solutions. For this we need to calculate the following. First define a function $\Gamma(v)$ which compares one objective function value $k$ between two different points $i$ and $j$. Let’s assume minimization for all objectives, then we have:

\[\Gamma(v)=\begin{cases} 0 & \>\text{if $v>0$}\\ 1 & \>\text{if $v\le 0$} \end{cases} \]

Basically it says that $v=f_i^k-f_j^k$, $\Gamma(v)=1$ if objective $k$ in point $i$ is better (or the same) than the corresponding objective in point $j$ and $\Gamma(v)=0$ otherwise. The goal of this exercise is to calculate a matrix

\[K_{i,j} = \sum_k \Gamma(f_i^k-f_j^k) – 1\]

Obviously this is an operation that is quadratic in $n$.

We have another small vector objs that as the objective names and the direction ($1$=minimize, $-1$=maximize):

> objs
Performance        Cost      Weight 
         -1           1           1

Attempt 1: use the data frame directly

The first try, just operates on the data frame df directly.

# allow to be called with smaller n
calcK1 <- function(n=nrow(df)) {

gamma <- function(i,j,k) {
    v <- objs[k]*(df[i,k] - df[j,k])
    if (v>0)
      0L
    else
      1L
}

K<-matrix(0L,n,n)
for (i in 1:n) {
    for (j in 1:n)
      if (i!=j) K[i,j] <- sum(sapply(names(objs),function(k) gamma(i,j,k))) – 1L
}
K
}

We try for different subsets of $n$:

size $n$	time (sec)
50	0.9
100	3.7
150	8.3
200	14.9
5291	2.9 hours (estimated)

That is really terrible.

Attempt 2: use a matrix

One bottleneck could be the use of a data frame. The most inner loop jumps from one column to the next, and that is a relatively expensive operation. Let’s try to convert the data first to a matrix and then operate on the matrix:

calcK2 <- function(n=nrow(df)) {

# calculate matrix A
m <- length(objs)
A <- matrix(0,n,m)
k <- 0L
for (c in names(objs)) {
    k <- k + 1L
    A[,k] <- df[1:n,c]*objs[c]
}

gamma <- function(i,j,k) {
    if ((A[i,k] - A[j,k])>0)
      0L
    else
      1L
}

sumgamma <- function(i,j) {
    s <- 0L
    for (k in 1:m) {
      s <- s + gamma(i,j,k)
    }
    s-1L
}

K<-matrix(0L,n,n)
for (i in 1:n) {
    for (j in 1:n)
      if (i!=j) K[i,j] <- sumgamma(i,j)
}
K
}

This is much better:

size $n$	time (sec)
50	0.06
100	0.25
150	0.58
200	1.03
5291	0.2 hours (estimated)

To get better performance we should get rid of these loops somehow. In R loops are very slow. In C++ loops don’t have this disadvantage, so instead of trying to vectorize this (besides I don’t see immediately how), let’s use some C++ code.

Attempt 3: use some C++ code

The Rcpp package makes it very easy to develop C++ functions for use inside R. We assume again we first make a matrix $A$. The C++ code looks like:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector calcKcpp(NumericVector A, int n, int nk) {

IntegerVector K(n*n); // will be initialized to zero
  for (int i=0; i<n; ++i)
    for (int j=0; j<n; ++j)
      if (i!=j) {
        int s = 0;
        for (int k=0; k<nk; ++k)
          if (A[k*n+i]<=A[k*n+j])
            ++s;
        K[j*n+i] = s-1;
      }
return K;
}

Note that R matrices are stored column-wise (the old Fortran way). We use this function as follows:

Rcpp::sourceCpp('c:/tmp/kepsilon/kepsc.cpp')

calcK3 <- function(n=nrow(df)) {

# calculate matrix A
m <- length(objs)
A <- matrix(0,n,m)
k <- 0L
for (c in names(objs)) {
k <- k + 1L
A[,k] <- df[1:n,c]*objs[c]
}

# call C++ code
K <- calcKcpp(A,n,m)
dim(K) = c(n,n)
K
}

The timings are now:

size $n$	time (sec)
50	0
100	0
150	0
200	0
5291	0.9 seconds

This is more than fast enough to allow for some experimentation with this data set.

Comparison: computation in GAMS

Here we do the same computation in GAMS.

sets
i /1*5291/
obj /performance,weight,cost/
;
alias (i,j);

parameter d(i,obj) 'data';
$gdxin testsolution.gdx
$loaddc d

parameter objmin(obj) /performance -1, weight 1, cost 1/;

* make all objectives minimization
d(i,obj) = objmin(obj)*d(i,obj);

parameter K(i,j);
K(i,j)$(ord(i)<>ord(j)) = sum(obj$(d(i,obj)<=d(j,obj)),1)-1;

The time needed for the full problem is:

size $n$	time (sec)
5291	34 seconds

GAMS is actually doing very well on this.

References

(1) Gobbi, Massimiliano. 2013. “A K, K-$\varepsilon$ Optimality Selection Based Multi Objective Genetic Algorithm with Applications to Vehicle Engineering.” Optimization and Engineering 14 (2). Springer: 345–60. doi:10.1007/s11081-011-9185-8.

Yet Another Math Programming Consultant

Tuesday, June 7, 2016

Optimizing some R code

Attempt 1: use the data frame directly

Attempt 2: use a matrix

Attempt 3: use some C++ code

Comparison: computation in GAMS

References

No comments:

Post a Comment

size \(n\)	time (sec)
50	0.9
100	3.7
150	8.3
200	14.9
5291	2.9 hours (estimated)