Tuesday, February 23, 2016

R: lazy load DB files

R can read data very fast and conveniently from .Rdata files. E.g.

> load("indus89.rdata")
> length(ls())
[1] 275

This data set is large; 275 objects are loaded:

image

If we just want to inspect a few of these objects, it may be more convenient to use a lazy load DB format. Such a database consists of two files: an .rdb file with the data and an .rdx file with an index indicating where each object is located in the .rdb data file. The .rdb data file is slightly larger than the corresponding .Rdata file:

image

This is because each object is stored and compressed individually.
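For readers who want to try this, here is a minimal sketch of how such a pair of files could be produced. It assumes the internal, unexported function tools:::makeLazyLoadDB (the post does not say how the files were created, and internal functions may change between R versions). The environment e, its stand-in objects and the base name "demo" are made up for illustration.

```r
# Sketch: write a lazy load DB from the objects in an environment.
# tools:::makeLazyLoadDB is internal to the 'tools' package (an assumption
# that it is the right tool here; it is not a documented, stable API).
e <- new.env()
e$alpha <- data.frame(cq = c("basmati", "irri"), value = c(6, 21))  # stand-in data
e$beta  <- 1:10                                                     # stand-in data

# With real data one could fill 'e' via load("indus89.rdata", envir = e).
filebase <- file.path(tempdir(), "demo")   # hypothetical base name, no extension
tools:::makeLazyLoadDB(e, filebase)        # writes demo.rdb and demo.rdx

# Check that both files were written.
files_ok <- file.exists(paste0(filebase, c(".rdb", ".rdx")))
print(files_ok)
```

The resulting base name can then be passed to lazyLoad() as in the examples in this post.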

Loading is super fast as we don’t really load the data:

> lazyLoad("indus89")
NULL
> length(ls())
[1] 275

This will only load the index of the data. The RStudio environment shows:

image

Now, as soon as we do anything with the data, it will be loaded. This is the “lazy” loading concept. E.g. let's just print the first rows of alpha:

> head(alpha)
       cq   z1 value
1 basmati nwfp     6
2 basmati  pmw     6
3 basmati  pcw    21
4 basmati  psw    21
5 basmati  prw    21
6 basmati scwn     6

Now suddenly the RStudio environment shows:

image
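The behavior above comes from promises: the variable exists, but its value is only produced on first use. The same idea can be sketched with delayedAssign from base R. This is an illustration of the concept, not the code lazyLoad runs internally; the names alpha_demo and loaded are made up for this example.

```r
# Sketch: a promise whose value is only computed on first access.
loaded <- FALSE
delayedAssign("alpha_demo", {
  loaded <- TRUE                          # side effect: marks when evaluation happens
  data.frame(cq = "basmati", value = 6)   # stand-in for the real data
})

print(loaded)      # FALSE: nothing has been evaluated yet
head(alpha_demo)   # first access forces the promise
print(loaded)      # TRUE: the value has now been computed
```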

We can also lazy load a subset of the objects. E.g. if we want to load all symbols starting with the letter ‘a’ we can do:

> lazyLoad("indus89",filter=function(s){grepl("^a",s)})
NULL
> length(ls())
[1] 8

image

All symbols containing ‘water’ can be loaded as follows:

> lazyLoad("indus89",filter=function(s){grepl("water",s)})
NULL

image
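A filter can be combined with the envir argument of lazyLoad to keep the selected objects out of the global workspace. The sketch below builds a small stand-in database first so that it is self-contained; the object names, the base name "demo_filter" and the use of the internal tools:::makeLazyLoadDB are assumptions for illustration only.

```r
# Build a small stand-in database (makeLazyLoadDB is internal to 'tools').
src <- new.env()
src$alpha       <- 1:3
src$water_use   <- c(10, 20)
src$groundwater <- c(5, 7)
filebase <- file.path(tempdir(), "demo_filter")   # hypothetical base name
tools:::makeLazyLoadDB(src, filebase)

# Lazy load only the names containing "water" into a separate environment.
db <- new.env()
lazyLoad(filebase, envir = db, filter = function(s) grepl("water", s))

print(sort(ls(db)))   # names that passed the filter
print(db$water_use)   # the value is fetched from the .rdb file on first access
```

Keeping the promises in their own environment also makes it easy to drop them all at once with rm(db).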

This format seems an interesting alternative for storing larger data sets, allowing more selective and lazy loading.

Update

As indicated in the comments below, the lazyLoad function is described as being for internal use. So usage may require a certain bravery. I am sure renaming the function to myLazyLoad does not count as a solution for this.

Alternatives mentioned in the comments (of course if the venerable Bill Venables makes a comment I had better pay attention):

Also note that the auto-completion facility in RStudio can lead to spurious loading of data.