## Matrix vs Data Frame

In R there are two obvious ways to store a rectangular grid of numbers: a matrix or a data frame. A data frame can hold different types in different columns, so it is richer than a matrix, which has a single type for all its elements. Internally a matrix is just a vector with a dimension attribute.
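To make the last point concrete, here is a minimal sketch in base R (the names x and d are made up for this illustration): attaching a dim attribute to a plain vector turns it into a matrix, while a data frame is a list with one element per column.

```r
# a plain numeric vector
x <- c(1, 2, 3, 4, 5, 6)
is.matrix(x)        # FALSE

# adding a dim attribute is all it takes to make it a matrix
dim(x) <- c(2, 3)
is.matrix(x)        # TRUE
attributes(x)       # only $dim

# elements are stored column by column, so x[2, 3] is the 6th element
x[2, 3]             # 6

# a data frame is a list of columns, and the columns may differ in type
d <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
is.list(d)          # TRUE
is.matrix(d)        # FALSE
sapply(d, class)    # id: "integer", name: "character"
```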

A data frame can be accessed as if it is a matrix using the notation df[i,j].
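As a small illustration (a sketch on a made-up 2 x 3 example, not the data used below), matrix-style indexing on a data frame returns the same element as the list-style forms that are specific to data frames:

```r
# small example matrix and its data frame copy
a0 <- matrix(1:6, nrow = 2, ncol = 3)
df0 <- as.data.frame(a0)

a0[2, 3]       # matrix indexing
df0[2, 3]      # the same notation on the data frame
df0[[3]][2]    # list-style: take column 3, then element 2
df0$V3[2]      # by column name
```

All four expressions pick out the same value, 6.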

When we access a large number of individual elements a[i,j] one at a time (that is, unvectorized), however, there is a large difference in timing.

Let's first create a 100 x 100 matrix.

```r
# create a 100 x 100 matrix with random numbers
m <- 100
n <- 100
a <- matrix(runif(m*n,0,100),nrow=m,ncol=n)
str(a)
##  num [1:100, 1:100] 16.6 43.1 86.7 44.4 11.1 ...
```

Now copy the data into a data frame using as.data.frame(a). I could have used data.frame(a) instead (that would yield different column names X1,X2,...).

```r
df <- as.data.frame(a)
df[1:5,1:5]
##         V1       V2       V3          V4       V5
## 1 16.63825 37.27615 91.70009 31.16568013 60.35773
## 2 43.07365 26.23530 45.98191 27.83072027 94.46351
## 3 86.72209 14.70443 55.62293 51.58801596 83.38001
## 4 44.37759 93.33136 41.71945  0.06141574 75.64287
## 5 11.13246 14.10785 75.97597 51.34233858 54.39535
```
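The difference in column names mentioned above is easy to check on a small matrix without column names (b is a hypothetical example, not part of the timing experiment):

```r
b <- matrix(1:4, nrow = 2)
names(as.data.frame(b))   # "V1" "V2"
names(data.frame(b))      # "X1" "X2"
```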

Here we compare the memory use. They are roughly the same.

```r
pryr::object_size(a)
## 80.2 kB
pryr::object_size(df)
## 91 kB
```
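If the pryr package is not installed, base R's object.size() gives a comparable estimate. The sketch below builds its own matrix of the same shape (a2 and df2 are stand-ins for a and df, filled with different random numbers), and the byte counts it reports may differ slightly from the pryr figures above.

```r
a2 <- matrix(runif(100 * 100, 0, 100), nrow = 100, ncol = 100)
df2 <- as.data.frame(a2)

# 10,000 doubles at 8 bytes each, plus a small header
print(object.size(a2), units = "Kb")

# the data frame adds per-column overhead plus row and column names
print(object.size(df2), units = "Kb")
```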

Let's create a function that randomly accesses one million elements. So that we can check that both versions do the same thing, we sum these elements.

```r
K <- 1e6
rowi <- sample(m,size=K,replace=TRUE)
colj <- sample(n,size=K,replace=TRUE)

f <- function(a) {
  s <- 0
  for(k in 1:K) {
    s <- s + a[rowi[k],colj[k]]
  }
  return(s)
}

# same result: we compute the same thing
f(a)
## [1] 49990017
f(df)
## [1] 49990017
```

The results of calling f(a) and f(df) are the same, but the data frame version takes much more time:

```r
system.time(f(a))
##    user  system elapsed
##    3.18    0.00    3.20
system.time(f(df))
##    user  system elapsed
##   58.39    0.07   59.31
```
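The gap matters mainly because the loop performs one million separate indexing calls. As a hedged sketch (not part of the timing experiment above; the smaller sizes and the names a3, df3, ri and cj are chosen here for illustration), the same sum can be computed without a loop by indexing the matrix with a two-column matrix of (row, column) pairs. For a data frame, one option is to convert it once with as.matrix() and then index the same way.

```r
set.seed(1)
m3 <- 100
n3 <- 100
K3 <- 1e4
a3 <- matrix(runif(m3 * n3, 0, 100), nrow = m3, ncol = n3)
df3 <- as.data.frame(a3)
ri <- sample(m3, size = K3, replace = TRUE)
cj <- sample(n3, size = K3, replace = TRUE)

# loop version, element by element as in f()
s_loop <- 0
for (k in 1:K3) {
  s_loop <- s_loop + a3[ri[k], cj[k]]
}

# vectorized: each row of cbind(ri, cj) selects one element
s_mat <- sum(a3[cbind(ri, cj)])

# data frame: convert once, then use the same matrix indexing
s_df <- sum(as.matrix(df3)[cbind(ri, cj)])

# the three sums agree up to floating-point rounding
all.equal(s_loop, s_mat)
all.equal(s_mat, s_df)
```

Whether the one-off as.matrix() conversion pays off depends on how many elements are accessed afterwards. It also assumes that all columns have the same type, as they do here.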

Note: don't try this function on a data.table. The behavior and rules of indexing on a data.table are slightly different. Although we can use:

```r
library(data.table)
dt <- data.table(df)
dt[1,2]
##          V2
## 1: 37.27615
```

the indexing as used in function f() does not behave the way we are used to from data frames and matrices.
