I have a somewhat largish data set with \(n=5291\) rows:
> str(df) 'data.frame': 5291 obs. of 5 variables: $ ID : int 1 2 3 4 5 6 7 8 9 10 ... $ Performance: num 0.705 0.707 0.709 0.711 0.713 ... $ Cost : int 4365766 4366262 4367526 4368147 4369411 4372412 4372908 4374172 4374793 4376057 ... $ Weight : int 49595 50335 51091 50242 50998 53125 53865 54621 53772 54528 ... $ Distance : num 0.997 0.991 0.987 0.957 0.953 ... |
Three of the columns indicate objectives of an underlying optimization problem: Cost, Weight and Performance. The rows correspond to the Pareto optimal set.
In (1) an algorithm is shown to find a subset, called k-optimal solutions. For this we need to calculate the following. First define a function \(\Gamma(v)\) which compares one objective function value \(k\) between two different points \(i\) and \(j\). Let’s assume minimization for all objectives, then we have:
\[\Gamma(v)=\begin{cases} 0 & \>\text{if $v>0$}\\ 1 & \>\text{if $v\le 0$} \end{cases} \]
Basically it says that \(v=f_i^k-f_j^k\), \(\Gamma(v)=1\) if objective \(k\) in point \(i\) is better (or the same) than the corresponding objective in point \(j\) and \(\Gamma(v)=0\) otherwise. The goal of this exercise is to calculate a matrix
\[K_{i,j} = \sum_k \Gamma(f_i^k-f_j^k) – 1\]
Obviously this is an operation that is quadratic in \(n\).
We have another small vector objs that as the objective names and the direction (\(1\)=minimize, \(-1\)=maximize):
> objs Performance Cost Weight -1 1 1 |
Attempt 1: use the data frame directly
The first try, just operates on the data frame df directly.
# allow to be called with smaller n calcK1 <- function(n=nrow(df)) { gamma <- function(i,j,k) { v <- objs[k]*(df[i,k] - df[j,k]) if (v>0) 0L else 1L } K<-matrix(0L,n,n) for (i in 1:n) { for (j in 1:n) if (i!=j) K[i,j] <- sum(sapply(names(objs),function(k) gamma(i,j,k))) – 1L } K } |
We try for different subsets of \(n\):
size \(n\) | time (sec) |
50 | 0.9 |
100 | 3.7 |
150 | 8.3 |
200 | 14.9 |
5291 | 2.9 hours (estimated) |
That is really terrible.
Attempt 2: use a matrix
One bottleneck could be the use of a data frame. The most inner loop jumps from one column to the next, and that is a relatively expensive operation. Let’s try to convert the data first to a matrix and then operate on the matrix:
calcK2 <- function(n=nrow(df)) { # calculate matrix A m <- length(objs) A <- matrix(0,n,m) k <- 0L for (c in names(objs)) { k <- k + 1L A[,k] <- df[1:n,c]*objs[c] } gamma <- function(i,j,k) { if ((A[i,k] - A[j,k])>0) 0L else 1L } sumgamma <- function(i,j) { s <- 0L for (k in 1:m) { s <- s + gamma(i,j,k) } s-1L } K<-matrix(0L,n,n) for (i in 1:n) { for (j in 1:n) if (i!=j) K[i,j] <- sumgamma(i,j) } K } |
This is much better:
size \(n\) | time (sec) |
50 | 0.06 |
100 | 0.25 |
150 | 0.58 |
200 | 1.03 |
5291 | 0.2 hours (estimated) |
To get better performance we should get rid of these loops somehow. In R loops are very slow. In C++ loops don’t have this disadvantage, so instead of trying to vectorize this (besides I don’t see immediately how), let’s use some C++ code.
Attempt 3: use some C++ code
The Rcpp package makes it very easy to develop C++ functions for use inside R. We assume again we first make a matrix \(A\). The C++ code looks like:
#include <Rcpp.h> using namespace Rcpp; // [[Rcpp::export]] IntegerVector calcKcpp(NumericVector A, int n, int nk) { IntegerVector K(n*n); // will be initialized to zero for (int i=0; i<n; ++i) for (int j=0; j<n; ++j) if (i!=j) { int s = 0; for (int k=0; k<nk; ++k) if (A[k*n+i]<=A[k*n+j]) ++s; K[j*n+i] = s-1; } return K; } |
Note that R matrices are stored column-wise (the old Fortran way). We use this function as follows:
Rcpp::sourceCpp('c:/tmp/kepsilon/kepsc.cpp') calcK3 <- function(n=nrow(df)) { # calculate matrix A m <- length(objs) A <- matrix(0,n,m) k <- 0L for (c in names(objs)) { k <- k + 1L A[,k] <- df[1:n,c]*objs[c] } # call C++ code K <- calcKcpp(A,n,m) dim(K) = c(n,n) K } |
The timings are now:
size \(n\) | time (sec) |
50 | 0 |
100 | 0 |
150 | 0 |
200 | 0 |
5291 | 0.9 seconds |
This is more than fast enough to allow for some experimentation with this data set.
Comparison: computation in GAMS
Here we do the same computation in GAMS.
sets i /1*5291/ obj /performance,weight,cost/ ; alias (i,j); parameter d(i,obj) 'data'; $gdxin testsolution.gdx $loaddc d parameter objmin(obj) /performance -1, weight 1, cost 1/; * make all objectives minimization d(i,obj) = objmin(obj)*d(i,obj); parameter K(i,j); K(i,j)$(ord(i)<>ord(j)) = sum(obj$(d(i,obj)<=d(j,obj)),1)-1; |
The time needed for the full problem is:
size \(n\) | time (sec) |
5291 | 34 seconds |
GAMS is actually doing very well on this.
References
(1) Gobbi, Massimiliano. 2013. “A K, K-\(\varepsilon\) Optimality Selection Based Multi Objective Genetic Algorithm with Applications to Vehicle Engineering.” Optimization and Engineering 14 (2). Springer: 345–60. doi:10.1007/s11081-011-9185-8.
No comments:
Post a Comment