Matrix vs Data Frame
In R there are two obvious ways to store a rectangular grid of numbers: a matrix or a data frame. A data frame can hold a different type in each column, so it is richer than a matrix, which has a single type for all its elements. Internally a matrix is just a vector with a dimension attribute.
A data frame can be accessed as if it were a matrix using the notation df[i,j].
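The difference in structure can be seen in a small sketch. The objects x and d below are made up for illustration and are not part of the benchmark that follows.

```r
# A matrix is an atomic vector plus a "dim" attribute.
x <- 1:6
dim(x) <- c(2, 3)      # x is now a 2 x 3 matrix, filled column by column
is.matrix(x)           # TRUE
attributes(x)          # the only attribute is $dim

# A data frame is a list of equal-length columns, so column types can differ.
d <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
is.list(d)             # TRUE
sapply(d, class)       # one class per column

# Both accept the [i, j] notation.
x[2, 3]
d[2, "name"]
```

Because a data frame is a list underneath, d[i, j] has to go through a method that dispatches on the data frame class, whereas x[i, j] is a plain offset computation into one vector.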
When we access a large number of individual elements a[i,j] one at a time (that is, unvectorized), however, there is a substantial difference in timing.
Let's first create a 100 x 100 matrix.
# create a 100 x 100 matrix with random numbers
m <- 100
n <- 100
a <- matrix(runif(m*n,0,100),nrow=m,ncol=n)
str(a)
## num [1:100, 1:100] 16.6 43.1 86.7 44.4 11.1 ...
Now copy the data into a data frame using as.data.frame(a). I could have used data.frame(a) instead (that would yield the column names X1,X2,... rather than V1,V2,...).
df <- as.data.frame(a)
df[1:5,1:5]
## V1 V2 V3 V4 V5
## 1 16.63825 37.27615 91.70009 31.16568013 60.35773
## 2 43.07365 26.23530 45.98191 27.83072027 94.46351
## 3 86.72209 14.70443 55.62293 51.58801596 83.38001
## 4 44.37759 93.33136 41.71945 0.06141574 75.64287
## 5 11.13246 14.10785 75.97597 51.34233858 54.39535
Here we compare the memory use. The two are roughly the same size, with the data frame slightly larger.
pryr::object_size(a)
pryr::object_size(df)
## 80.2 kB
## 91 kB
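Let's create a function that accesses one million randomly chosen elements. To check that both versions do the same work, we sum the accessed elements.
# number of random accesses
K <- 1e6
# random row and column indices (sampled with replacement)
rowi <- sample(m,size=K,replace=TRUE)
colj <- sample(n,size=K,replace=TRUE)
# sum the K elements a[rowi[k],colj[k]], one access per iteration
# (uses the global variables K, rowi and colj)
f <- function(a) {
  s <- 0
  for(k in 1:K) {
    s <- s + a[rowi[k],colj[k]]
  }
  return(s)
}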
# same result: we compute the same thing
f(a)
f(df)
## [1] 49990017
## [1] 49990017
The results of calling f(a) and f(df) are the same, but the data frame version takes much more time (about 18 times as long in this run):
system.time(f(a))
system.time(f(df))
## user system elapsed
## 3.18 0.00 3.20
## user system elapsed
## 58.39 0.07 59.31
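If the data already lives in a data frame, there are a few ways to avoid paying the df[i,j] cost on every access. The sketch below is an illustration, not part of the original measurements: it uses a smaller K so that it runs quickly, and the helper g() is a hypothetical name. No timings are claimed here; wrap the calls in system.time() to measure them on your own machine.

```r
# Small self-contained setup (K is deliberately smaller than in the post)
set.seed(1)
m <- 100; n <- 100; K <- 1e4
a  <- matrix(runif(m * n, 0, 100), nrow = m, ncol = n)
df <- as.data.frame(a)
rowi <- sample(m, size = K, replace = TRUE)
colj <- sample(n, size = K, replace = TRUE)

# Baseline: the loop from the post, one [i, j] access per iteration
f <- function(a) {
  s <- 0
  for (k in seq_len(K)) s <- s + a[rowi[k], colj[k]]
  s
}

# Option 1: convert the data frame to a matrix once, then loop over the matrix
s1 <- f(as.matrix(df))

# Option 2: extract the column as a plain vector with [[ ]], then pick the row
g <- function(df) {
  s <- 0
  for (k in seq_len(K)) s <- s + df[[colj[k]]][rowi[k]]
  s
}
s2 <- g(df)

# Option 3: vectorized, indexing the matrix with a two-column (row, col) matrix
s3 <- sum(a[cbind(rowi, colj)])

# All variants should agree with the data frame loop (up to rounding)
all.equal(s1, f(df))
all.equal(s1, s2)
all.equal(s1, s3)
```

Option 1 only works when all columns share a type, since as.matrix() coerces everything to a single type. Option 3 removes the loop altogether and is the natural choice whenever the indices are known up front.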
Note: don't try this function on a data.table. The behavior and rules of indexing on a data.table are slightly different. Although we can use:
library(data.table)
dt <- data.table(df)
dt[1,2]
## V2
## 1: 37.27615
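the indexing used in function f() does not behave the way we are used to with data frames and matrices. As far as I can tell, a data.table evaluates the second index as an expression, so a computed index such as colj[k] is not treated as a column number.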