Friday, April 28, 2017

R factoid: matrices are faster than data frames (when accessed element by element)

Matrix vs Data Frame

In R there are obvious ways to store a rectangular grid of numbers: a matrix of a data frame. A data frame can handle different types in different columns, so it is richer than a matrix which has a single type. Internally a matrix is just a vector with a dimension atribute.

A dataframe can be accessed as if it is a matrix using the notation df[i,j].

When we want to access a large number of elements a[i,j] (unvectorized), there is a difference in timing however.

Lets first create a 100x100 matrix.

# create a 100 x 100 matrix with random numbers
n <- 100
m <- 100
a <- matrix(runif(n*m,0,100),nrow=m,ncol=n)
str(a)
##  num [1:100, 1:100] 85.1 17.7 34.9 54.2 51 ...

Now copy the data into a data frame:

df <- as.data.frame(a)
tibble::glimpse(df[,1:10])
## Observations: 100
## Variables: 10
## $ V1  <dbl> 85.065933, 17.728731, 34.860332, 54.248920, 50.977486, 99....
## $ V2  <dbl> 5.813586, 56.404924, 41.151549, 32.029510, 40.857083, 81.8...
## $ V3  <dbl> 4.452198, 34.878083, 34.398440, 90.276525, 26.451508, 48.9...
## $ V4  <dbl> 43.496058, 95.162642, 65.250058, 83.941005, 55.507790, 17....
## $ V5  <dbl> 58.70423852, 39.08486762, 73.75686767, 51.61834429, 72.739...
## $ V6  <dbl> 34.55152, 36.22242, 12.87432, 74.17734, 32.98368, 93.05312...
## $ V7  <dbl> 96.112254, 55.291468, 8.653270, 28.452267, 18.926653, 80.6...
## $ V8  <dbl> 45.4440507, 20.2528389, 98.3861469, 90.8775842, 14.8820597...
## $ V9  <dbl> 3.2462605, 10.2364068, 90.3974374, 5.3393112, 41.1585281, ...
## $ V10 <dbl> 39.686012, 74.177476, 67.095580, 96.879985, 16.446060, 97....

Here we compare the memory use. They are roughly the same.

pryr::object_size(a)
pryr::object_size(df)
## 80.2 kB
## 91 kB

Let’s create a function that randomly access one million elements. To be able to check that we do the same thing we sum these elements.

K <- 1e6
rowi <- sample(n,size=K,replace=T)
colj <- sample(m,size=K,replace=T)
f <- function(a) {
   s <- 0
   for(k in 1:K) {
      s <- s + a[rowi[k],colj[k]]
   }
   return(s)
}
# same result: we compute the same thing
f(a)
f(df)
## [1] 49894303
## [1] 49894303

The results of calling f(a) and f(df) are the same, but the data frame version takes much more time:

system.time(f(a))
system.time(f(df))
##    user  system elapsed 
##    2.74    0.00    2.75 
##    user  system elapsed 
##   50.71    0.02   50.81

Matrix vs Data Frame

In R there are obvious ways to store a rectangular grid of numbers: a matrix of a data frame. A data frame can handle different types in different columns, so it is richer than a matrix which has a single type. Internally a matrix is just a vector with a dimension atribute.

A data frame can be accessed as if it is a matrix using the notation df[i,j].

When we want to access a large number of elements a[i,j] (unvectorized), there is a difference in timing however.

Let's first create a 100 x 100 matrix.

# create a 100 x 100 matrix with random numbers
m <- 100
n <- 100
a <- matrix(runif(m*n,0,100),nrow=m,ncol=n)
str
(a)

##  num [1:100, 1:100] 16.6 43.1 86.7 44.4 11.1 ...

Now copy the data into a data frame using as.data.frame(a). I could have used data.frame(a) instead (that would yield different column names X1,X2,...).

df <- as.data.frame(a)
df[1:5,1:5
]

##         V1       V2       V3          V4       V5
## 1 16.63825 37.27615 91.70009 31.16568013 60.35773
## 2 43.07365 26.23530 45.98191 27.83072027 94.46351
## 3 86.72209 14.70443 55.62293 51.58801596 83.38001
## 4 44.37759 93.33136 41.71945  0.06141574 75.64287
## 5 11.13246 14.10785 75.97597 51.34233858 54.39535

Here we compare the memory use. They are roughly the same.

pryr::object_size(a)
pryr::object_size
(df)

## 80.2 kB
## 91 kB

Let's create a function that randomly accesses one million elements. To be able to check that we do the same thing we sum these elements.

K <- 1e6
rowi <- sample(m,size=K,replace=T)
colj <- sample(n,size=K,replace=T)
f <- function(a) {
   s <- 0
   for(k in 1:K) {
      s <- s + a[rowi[k],colj[k]]
   }
   return(s)
}
# same result: we compute the same thing
f(a)
f
(df)

## [1] 49990017
## [1] 49990017

The results of calling f(a) and f(df) are the same, but the data frame version takes much more time:

system.time(f(a))
system.time(f
(df))

##    user  system elapsed
##    3.18    0.00    3.20
##    user  system elapsed
##   58.39    0.07   59.31

Note: don't try this function on a Data Table. The behavior and rules of indexing on a Data Table are slightly different. Although we can use:

dt <- data.table(df)
dt[1,2
]

##          V2
## 1: 37.27615

the indexing as used in function f() is not identical to what we are used to when working with data frames and matrices.