I have a somewhat largish data set with \(n=5291\) rows:

> str(df) 'data.frame': 5291 obs. of 5 variables: $ ID : int 1 2 3 4 5 6 7 8 9 10 ... $ Performance: num 0.705 0.707 0.709 0.711 0.713 ... $ Cost : int 4365766 4366262 4367526 4368147 4369411 4372412 4372908 4374172 4374793 4376057 ... $ Weight : int 49595 50335 51091 50242 50998 53125 53865 54621 53772 54528 ... $ Distance : num 0.997 0.991 0.987 0.957 0.953 ... |

Three of the columns indicate objectives of an underlying optimization problem: Cost, Weight and Performance. The rows correspond to the Pareto optimal set.

In (1) an algorithm is shown to find a subset, called k-optimal solutions. For this we need to calculate the following. First define a function \(\Gamma(v)\) which compares one objective function value \(k\) between two different points \(i\) and \(j\). Let’s assume minimization for all objectives, then we have:

\[\Gamma(v)=\begin{cases} 0 & \>\text{if $v>0$}\\ 1 & \>\text{if $v\le 0$} \end{cases} \]

Basically it says that \(v=f_i^k-f_j^k\), \(\Gamma(v)=1\) if objective \(k\) in point \(i\) is better (or the same) than the corresponding objective in point \(j\) and \(\Gamma(v)=0\) otherwise. The goal of this exercise is to calculate a matrix

\[K_{i,j} = \sum_k \Gamma(f_i^k-f_j^k) – 1\]

Obviously this is an operation that is quadratic in \(n\).

We have another small vector objs that as the objective names and the direction (\(1\)=minimize, \(-1\)=maximize):

> objs Performance Cost Weight -1 1 1 |

##### Attempt 1: use the data frame directly

The first try, just operates on the data frame **df** directly.

# allow to be called with smaller ncalcK1 <- function(n= nrow(df)) {gamma <- function(i,j,k) { v <- objs[k]*(df[i,k] - df[j,k]) if (v>0) 0L else 1L } K<- matrix(0L,n,n)for (i in 1:n) { for (j in 1:n) if (i!=j) K[i,j] <- sum(sapply(names(objs),function(k) gamma(i,j,k))) – 1L} K } |

We try for different subsets of \(n\):

size \(n\) | time (sec) |

50 | 0.9 |

100 | 3.7 |

150 | 8.3 |

200 | 14.9 |

5291 | 2.9 hours (estimated) |

That is really terrible.

##### Attempt 2: use a matrix

One bottleneck could be the use of a data frame. The most inner loop jumps from one column to the next, and that is a relatively expensive operation. Let’s try to convert the data first to a matrix and then operate on the matrix:

calcK2 <- function(n=nrow(df)) {# calculate matrix Am <- length(objs)A <- matrix(0,n,m)k <- 0L for (c in names(objs)) {k <- k + 1L A[,k] <- df[1:n,c]*objs[c] } gamma <- function(i,j,k) { if ((A[i,k] - A[j,k])>0) 0L else 1L } sumgamma <- function(i,j) { s <- 0L for (k in 1:m) { s <- s + gamma(i,j,k)} s-1L } K<- matrix(0L,n,n)for (i in 1:n) { for (j in 1:n) if (i!=j) K[i,j] <- sumgamma(i,j)} K } |

This is much better:

size \(n\) | time (sec) |

50 | 0.06 |

100 | 0.25 |

150 | 0.58 |

200 | 1.03 |

5291 | 0.2 hours (estimated) |

To get better performance we should get rid of these loops somehow. In R loops are very slow. In C++ loops don’t have this disadvantage, so instead of trying to vectorize this (besides I don’t see immediately how), let’s use some C++ code.

##### Attempt 3: use some C++ code

The Rcpp package makes it very easy to develop C++ functions for use inside R. We assume again we first make a matrix \(A\). The C++ code looks like:

#include <Rcpp.h>using namespace Rcpp;// [[Rcpp::export]]IntegerVector calcKcpp(NumericVector A, int n, int nk) { IntegerVector K(n*n); // will be initialized to zerofor (int i=0; i<n; ++i)for (int j=0; j<n; ++j)if (i!=j) {int s = 0; for (int k=0; k<nk; ++k) if (A[k*n+i]<=A[k*n+j])++s; K[j*n+i] = s-1; } return K;} |

Note that R matrices are stored column-wise (the old Fortran way). We use this function as follows:

Rcpp::sourceCpp('c:/tmp/kepsilon/kepsc.cpp')calcK3 <- function(n= nrow(df)) {# calculate matrix Am <- length(objs)A <- matrix(0,n,m)k <- 0L for (c in names(objs)) {k <- k + 1L A[,k] <- df[1:n,c]*objs[c] } # call C++ codeK <- calcKcpp(A,n,m)dim(K) = c(n,n)K } |

The timings are now:

size \(n\) | time (sec) |

50 | 0 |

100 | 0 |

150 | 0 |

200 | 0 |

5291 | 0.9 seconds |

This is more than fast enough to allow for some experimentation with this data set.

##### Comparison: computation in GAMS

Here we do the same computation in GAMS.

setsi /1*5291/ obj /performance,weight,cost/ ; alias (i,j);parameter d(i,obj) 'data';$gdxin testsolution.gdx $loaddc d parameter objmin(obj) /performance -1, weight 1, cost 1/;* make all objectives minimizationd(i,obj) = objmin(obj)*d(i,obj); parameter K(i,j);K(i,j)$( ord(i)<>ord(j)) = sum(obj$(d(i,obj)<=d(j,obj)),1)-1; |

The time needed for the full problem is:

size \(n\) | time (sec) |

5291 | 34 seconds |

GAMS is actually doing very well on this.

##### References

(1) Gobbi, Massimiliano. 2013. “A K, K-\(\varepsilon\) Optimality Selection Based Multi Objective Genetic Algorithm with Applications to Vehicle Engineering.” *Optimization and Engineering* 14 (2). Springer: 345–60. doi:10.1007/s11081-011-9185-8.