Thursday, February 11, 2016

R: Factors vs Strings on large data sets

When exporting data sets in R’s .Rdata format, one of the things to consider is how string vectors are exported. I can now write factors in my .Rdata writer, so we can do some experiments on exporting a string column just as strings or as a factor.

image

image

Here is an illustration of how things are stored in an .Rdata file:

image

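Each string carries 8 bytes of overhead on top of its characters, and this adds up when a column has millions of entries. An integer vector, which is what a factor stores for each row, takes much less space.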
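The effect can be checked from within R itself. The following is a minimal sketch, assuming that serialize() produces essentially the same binary encoding that save() writes before compression (this is my understanding, not something measured with the .Rdata writer discussed here). The vector length and the labels are arbitrary choices for illustration.

```r
# Compare the serialized size of a string column and the equivalent factor.
set.seed(1)
n      <- 100000
labels <- paste0("i", 1:200)                 # labels as in a small GAMS set
s      <- sample(labels, n, replace = TRUE)  # character vector
f      <- factor(s, levels = labels)         # integer codes plus 200 level strings

# serialize(x, NULL) returns the uncompressed binary form as a raw vector
size_chr <- length(serialize(s, NULL))
size_fct <- length(serialize(f, NULL))

cat("character vector:", size_chr, "bytes\n")
cat("factor          :", size_fct, "bytes\n")
cat("bytes per element (character):", round(size_chr / n, 1), "\n")
cat("bytes per element (factor)   :", round(size_fct / n, 1), "\n")
```

The per-element figure for the character vector includes the per-string overhead plus the characters themselves, while the factor needs little more than one integer per element.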

When we generate a large dataframe using gdx2r we see the following.

image

The “short” data set is generated by the following GAMS model:

set i /i1*i200/;
alias (i,j,k);
parameter p(i,j,k);
p(i,j,k) = uniform(0,1);

After exporting this to an .Rdata file (with StringsAsFactors=F) and loading it in R, the data looks like this:

> load("p.rdata")
> head(p)
   i  j  k     value
1 i1 i1 i1 0.1717471
2 i1 i1 i2 0.8432667
3 i1 i1 i3 0.5503754
4 i1 i1 i4 0.3011379
5 i1 i1 i5 0.2922121
6 i1 i1 i6 0.2240529
> str(p)
'data.frame':	8000000 obs. of  4 variables:
 $ i    : chr  "i1" "i1" "i1" "i1" ...
 $ j    : chr  "i1" "i1" "i1" "i1" ...
 $ k    : chr  "i1" "i2" "i3" "i4" ...
 $ value: num  0.172 0.843 0.55 0.301 0.292 ...
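For readers without GAMS, a data frame of the same shape can be built directly in R. This is only a sketch for experimenting: it uses a smaller set (20 labels instead of 200, an arbitrary choice to keep it quick), and runif() stands in for the GAMS uniform() call, so the values will not match the output above.

```r
# Build a data frame shaped like the "short" data set, with string columns.
set.seed(123)
m      <- 20                                  # the post uses 200
labels <- paste0("i", seq_len(m))

# expand.grid varies its first argument fastest, so k is listed first
# to get the same row order as above (k changes fastest, i slowest)
p <- expand.grid(k = labels, j = labels, i = labels, stringsAsFactors = FALSE)
p <- p[, c("i", "j", "k")]
p$value <- runif(nrow(p))

head(p)
sapply(p, class)
```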

The “large” data sets use much longer labels:

set i /amuchlongernamefortesting1*amuchlongernamefortesting200/;
alias (i,j,k);
parameter p(i,j,k);
p(i,j,k) = uniform(0,1);

Here we export to an .Rdata file with StringsAsFactors=T:

> load("p2.rdata")
> head(p)
                           i                          j                          k     value
1 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting1 0.1717471
2 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting2 0.8432667
3 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting3 0.5503754
4 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting4 0.3011379
5 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting5 0.2922121
6 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting6 0.2240529
> str(p)
'data.frame':	8000000 obs. of  4 variables:
 $ i    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ j    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ k    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ value: num  0.172 0.843 0.55 0.301 0.292 ...
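Note that in the output above the factor levels follow the order of the set elements (the codes for k run 1, 2, 3, ...), not alphabetical order. If a data frame was loaded with string columns, a similar result can be obtained on the R side. The helper below is a hypothetical illustration, not what the .Rdata writer does internally: it converts every character column to a factor whose levels are in order of first appearance.

```r
# Convert all character columns of a data frame to factors,
# keeping the levels in order of first appearance.
to_factors <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) factor(col, levels = unique(col)))
  df
}

p_chr <- data.frame(
  i     = c("i1", "i1", "i2", "i10"),
  k     = c("i2", "i10", "i1", "i2"),
  value = c(0.17, 0.84, 0.55, 0.30),
  stringsAsFactors = FALSE
)

p_fct <- to_factors(p_chr)
str(p_fct)

levels(p_fct$i)          # order of first appearance
levels(factor(p_chr$i))  # default: sorted alphabetically
```

With the default factor() call the levels are sorted as strings, so "i10" comes before "i2"; passing the levels explicitly avoids that.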

The file sizes and timings are indeed what we expected:

  • Without compression, the .Rdata files are much smaller when using factors. Longer strings make this effect more pronounced.
  • With compression, the .Rdata files are about the same size whether we use strings or factors. But with factors the compression is faster (fewer bytes to compress).
  • Data frames with StringsAsFactors=T use less memory inside R. They also load faster.
  • Conclusion: making StringsAsFactors=T the default makes sense (just as with R’s read.csv).
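A rough way to repeat this kind of comparison with R's own save() is sketched below. It is not the experiment reported above: the data frame has a single label column with randomly sampled labels, the size is an arbitrary choice, and rdata_size is a hypothetical helper, so the numbers will differ from those in the post.

```r
# Compare .Rdata file sizes and in-memory sizes for strings versus factors.
set.seed(42)
n      <- 200000
labels <- paste0("amuchlongernamefortesting", 1:200)
idx    <- sample(200, n, replace = TRUE)
value  <- runif(n)

df_chr <- data.frame(i = labels[idx], value = value, stringsAsFactors = FALSE)
df_fct <- data.frame(i = factor(labels[idx], levels = labels), value = value)

# Write an object with save() and return the resulting file size in bytes
rdata_size <- function(obj, compress) {
  f <- tempfile(fileext = ".rdata")
  on.exit(unlink(f))
  save(obj, file = f, compress = compress)
  file.size(f)
}

sizes <- data.frame(
  column       = c("strings", "factor"),
  uncompressed = c(rdata_size(df_chr, FALSE), rdata_size(df_fct, FALSE)),
  compressed   = c(rdata_size(df_chr, TRUE),  rdata_size(df_fct, TRUE)),
  in_memory    = c(as.numeric(object.size(df_chr)), as.numeric(object.size(df_fct)))
)
print(sizes)
```

Wrapping the save() calls in system.time() gives a similar comparison for the timings.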

I updated the defaults:

image