Yet Another Math Programming Consultant: R factoid: matrices are faster than data frames (when accessed element by element)

Friday, April 28, 2017

Matrix vs Data Frame

In R there are two obvious ways to store a rectangular grid of numbers: a matrix or a data frame. A data frame can hold different types in different columns, so it is richer than a matrix, which has a single type. Internally a matrix is just a vector with a dimension attribute.
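A small sketch of the storage difference described above. The objects x, mat, and d are made up for illustration and are not part of the experiment in this post:

```r
# A matrix is an atomic vector plus a "dim" attribute
x <- c(1, 2, 3, 4, 5, 6)
mat <- x
dim(mat) <- c(2, 3)          # now a 2 x 3 matrix, filled column by column
is.matrix(mat)               # TRUE
attributes(mat)              # only $dim

# A data frame is a list of equal-length columns, which may differ in type
d <- data.frame(id = 1:2, label = c("a", "b"), stringsAsFactors = FALSE)
is.list(d)                   # TRUE
sapply(d, class)             # one class per column: integer, character
```

Because every element of a matrix has the same type, a[i,j] only needs to compute an offset into one vector. A data frame first has to pick a column from its list.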

A data frame can be accessed as if it were a matrix, using the notation df[i,j].

However, when we access a large number of elements a[i,j] one at a time (unvectorized), there is a considerable difference in timing.
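To make the df[i,j] notation concrete, here is a minimal sketch on a tiny example. The names m_small and d_small are hypothetical and used only here:

```r
# Small illustrative objects
m_small <- matrix(1:6, nrow = 2)
d_small <- as.data.frame(m_small)

m_small[2, 3]        # element in row 2, column 3 of the matrix
d_small[2, 3]        # same element, matrix-style indexing on the data frame
d_small[[3]][2]      # same element again: take column 3 as a vector, then row 2
```

All three expressions return the same value. Only the route R takes to get there differs.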

Let's first create a 100 x 100 matrix.

# create a 100 x 100 matrix with random numbers
m <- 100
n <- 100
a <- matrix(runif(m*n,0,100),nrow=m,ncol=n)
str(a)

## num [1:100, 1:100] 16.6 43.1 86.7 44.4 11.1 ...

Now copy the data into a data frame using as.data.frame(a). I could have used data.frame(a) instead (that would yield different column names X1,X2,...).

df <- as.data.frame(a)
df[1:5,1:5]

##         V1       V2       V3          V4       V5
## 1 16.63825 37.27615 91.70009 31.16568013 60.35773
## 2 43.07365 26.23530 45.98191 27.83072027 94.46351
## 3 86.72209 14.70443 55.62293 51.58801596 83.38001
## 4 44.37759 93.33136 41.71945  0.06141574 75.64287
## 5 11.13246 14.10785 75.97597 51.34233858 54.39535
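The remark about column names can be checked on a tiny matrix. The name tiny is made up for this sketch:

```r
# Column names produced by the two conversions
tiny <- matrix(1:4, nrow = 2)
names(as.data.frame(tiny))   # "V1" "V2"
names(data.frame(tiny))      # "X1" "X2"
```

The stored values are the same either way; only the default column names differ.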

Here we compare the memory use. The two are roughly the same: the data frame is only slightly larger.

pryr::object_size(a)
pryr::object_size(df)

## 80.2 kB
## 91 kB

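Let's create a function that randomly accesses one million elements. To check that we do the same thing in both cases, we sum these elements.

K <- 1e6
# random (row, column) index pairs, drawn with replacement
rowi <- sample(m,size=K,replace=TRUE)
colj <- sample(n,size=K,replace=TRUE)
# sum K elements of a, accessed one at a time; a can be a matrix or a data frame
f <- function(a) {
   s <- 0
   for(k in seq_len(K)) {
      s <- s + a[rowi[k],colj[k]]
   }
   return(s)
}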
# same result: we compute the same thing
f(a)
f(df)

## [1] 49990017
## [1] 49990017

The results of calling f(a) and f(df) are the same, but the data frame version takes much more time (here roughly 18 times as long):

system.time(f(a))
system.time(f(df))

##    user  system elapsed
##    3.18    0.00    3.20
##    user  system elapsed
##   58.39    0.07   59.31
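For comparison, the same sum can be computed without an explicit loop. The sketch below is not part of the original experiment: it uses a smaller K so it runs quickly, all names ending in _demo (and idx) are made up, and it relies on base R's ability to index a matrix with a two-column matrix of (row, column) pairs.

```r
set.seed(1)                                  # only to make the sketch reproducible
a_demo  <- matrix(runif(100 * 100, 0, 100), nrow = 100, ncol = 100)
df_demo <- as.data.frame(a_demo)
K_demo  <- 1e4
rowi_demo <- sample(100, size = K_demo, replace = TRUE)
colj_demo <- sample(100, size = K_demo, replace = TRUE)

# Loop version, same access pattern as f()
s_loop <- 0
for (k in seq_len(K_demo)) {
   s_loop <- s_loop + a_demo[rowi_demo[k], colj_demo[k]]
}

# Vectorized: each row of idx is one (i, j) pair
idx <- cbind(rowi_demo, colj_demo)
s_vec <- sum(a_demo[idx])

# For a data frame, convert to a matrix once, then index the same way
s_df <- sum(as.matrix(df_demo)[idx])

all.equal(s_loop, s_vec)
all.equal(s_vec, s_df)
```

If element-by-element access in a loop cannot be avoided, converting the data frame with as.matrix() once before the loop is a possible workaround, provided all columns have the same type.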

Note: don't try this function on a data.table. The behavior and rules of indexing on a data.table are slightly different. Although we can use:

library(data.table)
dt <- data.table(df)
dt[1,2]

##          V2
## 1: 37.27615

the indexing as used in function f() does not behave the way we are used to with data frames and matrices.
