Friday, February 5, 2016

R: The RData File Format

R has an efficient way to store data in a file: its own RData file format. You save objects to a file with save() and read them back with load(). The format is largely undocumented, so it is rarely used to exchange data with other software; in many cases CSV files are used instead. Here I make the argument for using a SQLite database for this purpose.

So what is this RData file format? It is a binary format and not so easy to inspect, but there is an option to save a file in ASCII:

> ivec <- 1:3
> str(ivec)
 int [1:3] 1 2 3
> save(ivec,file="ivec.ascii",ascii=T)

So what does this file look like? Here is an annotated listing:

RDA2        Header: file type
A           ASCII format
2           Format version 2
197123      R version that wrote the file (3.2.3, packed as 3*65536 + 2*256 + 3)
131840      Minimum R version that can read the file (2.3.0)
1026        LISTSXP object (with tag): whole thing is packaged in a dotted pair list
1           SYMSXP object: symbol
262153      CHARSXP object: string
4           Length of string
ivec        String: symbol name
13          INTSXP: integer vector
3           Length of integer vector
1           First element
2           Second element
3           Third element
254         NILVALUE_SXP: end of information
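
The numbers in the listing are packed values, and unpacking them makes the annotations easier to check. The Python sketch below assumes the version words are packed as major*65536 + minor*256 + patch and that an object's flag word keeps the type code in its low 8 bits, the object/attribute/tag bits at positions 8 to 10, and the "levels" field from bit 12 upward. That layout is my reading of R's serialization code, not something stated in this post. The function names are made up for this illustration.

```python
def decode_version(packed):
    """Split a packed R version word into (major, minor, patch)."""
    return packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF


def decode_flags(flags):
    """Split an object's flag word into its assumed components."""
    return {
        "type": flags & 0xFF,                 # SEXP type code (2 = LISTSXP, 9 = CHARSXP, ...)
        "is_object": bool(flags & (1 << 8)),  # object bit
        "has_attr": bool(flags & (1 << 9)),   # attributes follow
        "has_tag": bool(flags & (1 << 10)),   # a tag (here: the symbol name) follows
        "levels": flags >> 12,                # general-purpose bits, e.g. string encoding
    }


for word in (197123, 131840):
    print(word, decode_version(word))
for word in (1026, 262153, 13):
    print(word, decode_flags(word))
```

Under these assumptions 1026 is type 2 (LISTSXP) with the tag bit set, and 262153 is type 9 (CHARSXP) with a levels value of 64.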

Using this information we can reimplement writing R objects to an RData file. For example, writing a string vector looks like:

[Image: code listing for writing a string vector]

(The name tRDataBase reflects that this is a base class; tRDataAscii, tRDataBinary and tRDataNetwork are derived from it.)
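
As a rough illustration of writing objects directly, here is a minimal Python sketch that emits the ASCII layout from the annotated listing. It is not the Delphi/C code referred to above. The function names are hypothetical, and the version words are simply copied from the listing. It assumes a string vector is written as type code 16 (STRSXP), a length, and then one CHARSXP record per element. It only handles plain printable ASCII strings without characters that need escaping, and integers without NA values.

```python
WRITER_VERSION = 197123      # R 3.2.3, copied from the listing
MIN_READER_VERSION = 131840  # R 2.3.0, copied from the listing
CHARSXP_ASCII = 262153       # CHARSXP flag word as it appears in the listing
INTSXP, STRSXP = 13, 16      # type codes; STRSXP = 16 is an assumption of this sketch


def char_item(text):
    """Lines for one CHARSXP: flag word, length, then the string itself."""
    safe = text and all(33 <= ord(c) <= 126 and c not in "\\'\"?" for c in text)
    if not safe:
        raise ValueError("sketch only supports plain printable ASCII: %r" % text)
    return [str(CHARSXP_ASCII), str(len(text)), text]


def ascii_rdata(objects):
    """Return ASCII RData text for a dict mapping names to int or str lists."""
    lines = ["RDA2", "A", "2", str(WRITER_VERSION), str(MIN_READER_VERSION)]
    for name, values in objects.items():
        lines += ["1026", "1"] + char_item(name)  # pairlist node tagged with the symbol
        if all(isinstance(v, int) for v in values):
            lines += [str(INTSXP), str(len(values))] + [str(v) for v in values]
        elif all(isinstance(v, str) for v in values):
            lines += [str(STRSXP), str(len(values))]
            for v in values:
                lines += char_item(v)
        else:
            raise TypeError("sketch only supports all-int or all-str vectors")
    lines.append("254")  # end of the pairlist
    return "\n".join(lines) + "\n"


print(ascii_rdata({"ivec": [1, 2, 3]}), end="")
```

For ivec this reproduces the listing line by line. Writing the text to a file with Unix line endings should give something load() accepts, although I have only checked it against the listing, not across R versions.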

When we save objects without the “ascii=TRUE” flag, a compressed binary network format is used. The idea behind a network format is to write all binary data in a standardized big-endian (network) byte order, so that a binary file written on one machine (e.g. one with a little-endian Intel architecture) can be read on a machine with a different byte order (although there are not that many big-endian computer architectures left). The result is then compressed using gzip.
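
To make the byte ordering concrete, the following Python sketch builds the binary counterpart of the ivec example and gzips it. It assumes the binary layout mirrors the ASCII listing item for item, with an "RDX2" magic line, an "X" format marker, every number written as a 4-byte big-endian integer, and strings written as raw bytes after their length. This is an assumption for illustration and has not been verified against every R version. The function names are hypothetical.

```python
import gzip
import struct


def be_int(value):
    """Pack a signed 32-bit integer in big-endian (network) byte order."""
    return struct.pack(">i", value)


def xdr_rdata_int_vector(name, values):
    """Uncompressed binary RData bytes for one named integer vector (sketch)."""
    encoded = name.encode("ascii")
    body = b"RDX2\n" + b"X\n"                  # magic line and format marker
    for word in (2, 197123, 131840):           # format version, R versions from the listing
        body += be_int(word)
    body += be_int(1026) + be_int(1)           # tagged pairlist node, symbol
    body += be_int(262153) + be_int(len(encoded)) + encoded
    body += be_int(13) + be_int(len(values))   # INTSXP and its length
    for v in values:
        body += be_int(v)
    return body + be_int(254)                  # end marker


raw = xdr_rdata_int_vector("ivec", [1, 2, 3])
compressed = gzip.compress(raw)                # what would be written to disk

print(be_int(1).hex())               # big-endian: 00000001
print(struct.pack("<i", 1).hex())    # little-endian, for comparison: 01000000
print(len(raw), "bytes before compression")
```

The only difference between network order and the native order of an Intel machine is the position of the bytes within each 4-byte word, which is why a reader on any architecture can decode the file once the convention is fixed.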

Using an RDB2 header I can write a pure native binary format (that is, without reordering bytes to network byte order). It looks like R has decided not to support this format anymore:

> load("test.bin")
Warning message:
file ‘test.bin’ has magic number 'RDB2'
  Use of save versions prior to 2 is deprecated 

So binary files always use the network byte ordering and have an RDX2 header.
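
The three headers mentioned so far (RDA2, RDB2, RDX2) make it easy to tell the variants apart. Here is a small Python sketch that does so; the function name and the labels it returns are made up for this illustration. It only undoes gzip compression, so files saved with other compression options (save() also offers bzip2 and xz) would be reported as unknown.

```python
import gzip

# Magic line at the start of the (uncompressed) file -> description
MAGIC = {
    b"RDA2\n": "ascii",
    b"RDB2\n": "native binary",
    b"RDX2\n": "network byte order binary",
}


def rdata_variant(data):
    """Classify the leading bytes of an RData file (sketch, gzip only)."""
    if data[:2] == b"\x1f\x8b":        # gzip signature
        data = gzip.decompress(data)
    return MAGIC.get(data[:5], "unknown")


print(rdata_variant(b"RDA2\nA\n2\n"))
print(rdata_variant(gzip.compress(b"RDX2\nX\n")))
```

To use it on a real file, pass the result of open(path, "rb").read().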

Notes
  1. The load() function works perfectly fine with remote Rdata files:
    > load(url("http://www.amsterdamoptimization.com/downloads/rvec.rdata"),verbose=T)
    Loading objects:
      x
  2. The goal of this exercise is to be able to generate .Rdata data sets from other environments. We don’t use R itself for this but rather write .Rdata files directly. Another approach would be to launch R, import the data set (e.g. from a CSV file) and then call save() to generate the .Rdata file. From a different programming language, this can be automated using the R.dll. This is in fact how this interface in F# works (and likewise the rpy2 Python interface). In my setup I don’t need an R DLL and write .Rdata files directly from Delphi and C, which is a little more lightweight and does not require an installed R system.
  3. It is time for RData files to become the standard for Data Transfer.
  4. R Internals.
  5. Experiments with some small data sets are shown here.