Yet Another Math Programming Consultant: R: Factors vs Strings on large data sets

Thursday, February 11, 2016

R: Factors vs Strings on large data sets

When exporting data sets in R’s .Rdata format, one of the things to consider is how string vectors are exported. I can now write factors in my .Rdata writer, so we can do some experiments on exporting a string column just as strings or as a factor.

Here is an illustration of how things are stored in an .Rdata file:

Each string carries 8 bytes of overhead on top of its characters, and over millions of rows this adds up. An integer vector, which is how a factor stores its values, takes much less space.
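To get a rough feel for the per-element cost, here is a small sketch in plain R. It does not use my .Rdata writer; it relies on R's own serialize(), on the assumption that its byte stream is close enough to what ends up in an uncompressed .Rdata file for a ballpark comparison. The vector length and the labels are made up for illustration.

```r
# Illustrative sizes only: n and the labels are arbitrary choices.
n <- 100000
s <- rep(c("i1", "i2", "i3", "i4"), length.out = n)  # plain character vector
f <- factor(s)                                        # integer codes + 4 levels

# A factor is an integer vector with a "levels" attribute attached.
print(typeof(f))
print(levels(f))

# Serialize both objects to raw vectors in memory and count the bytes.
bytes_chr <- length(serialize(s, NULL))
bytes_fac <- length(serialize(f, NULL))

cat("character:", bytes_chr / n, "bytes per element\n")
cat("factor:   ", bytes_fac / n, "bytes per element\n")
```

With labels this short the per-string overhead is larger than the string itself, which is why the integer codes of the factor come out well ahead.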

When we generate a large dataframe using gdx2r we see the following.

The “short” data set is generated by the following GAMS code:

set i /i1*i200/;
alias (i,j,k);
parameter p(i,j,k);
p(i,j,k) = uniform(0,1);

When exported to an .Rdata file (with StringsAsFactors=F) and loaded into R, this looks like:

> load("p.rdata")
> head(p)
   i  j  k     value
1 i1 i1 i1 0.1717471
2 i1 i1 i2 0.8432667
3 i1 i1 i3 0.5503754
4 i1 i1 i4 0.3011379
5 i1 i1 i5 0.2922121
6 i1 i1 i6 0.2240529
> str(p)
'data.frame':	8000000 obs. of  4 variables:
 $ i    : chr  "i1" "i1" "i1" "i1" ...
 $ j    : chr  "i1" "i1" "i1" "i1" ...
 $ k    : chr  "i1" "i2" "i3" "i4" ...
 $ value: num  0.172 0.843 0.55 0.301 0.292 ...

The “large” data set has the same structure but uses much longer element names:

set i /amuchlongernamefortesting1*amuchlongernamefortesting200/;
alias (i,j,k);
parameter p(i,j,k);
p(i,j,k) = uniform(0,1);

Here we export to an .Rdata file with StringsAsFactors=T:

> load("p2.rdata")
> head(p)
                           i                          j                          k     value
1 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting1 0.1717471
2 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting2 0.8432667
3 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting3 0.5503754
4 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting4 0.3011379
5 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting5 0.2922121
6 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting6 0.2240529
> str(p)
'data.frame':	8000000 obs. of  4 variables:
 $ i    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ j    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ k    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ value: num  0.172 0.843 0.55 0.301 0.292 ...

The file sizes and timings are indeed what we expected:

If not compressed then the .Rdata files are much smaller when using factors. Longer strings make this effect more pronounced.
If compressed the .Rdata files are about the same size whether using strings or factors. But with factors we can do the compression faster (fewer bytes to compress).
Dataframes with StringsAsFactors=T use up less memory inside R. They also load faster.
Conclusion: making StringsAsFactors=T the default makes sense (just as with R’s read.csv).
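For readers who want to try a similar comparison without GAMS or gdx2r, the sketch below builds a scaled-down stand-in for p(i,j,k) directly in R and writes it with R's own save(). This is not the benchmark reported above: the set size (50 instead of 200), the helper rdata_size and the variable names are all assumptions made to keep the example quick, so the absolute numbers will differ.

```r
# Scaled-down stand-in for p(i,j,k): 50^3 = 125,000 rows (the post uses 200^3).
n <- 50
labels <- paste0("amuchlongernamefortesting", 1:n)

# expand.grid varies its first argument fastest, so list k first and reorder.
p_chr <- expand.grid(k = labels, j = labels, i = labels,
                     stringsAsFactors = FALSE)[, c("i", "j", "k")]
p_chr$value <- runif(nrow(p_chr))

# Same data with the three index columns stored as factors.
p_fac <- p_chr
p_fac[c("i", "j", "k")] <- lapply(p_chr[c("i", "j", "k")], factor, levels = labels)

# Hypothetical helper: save a data frame to a temporary .rdata file and
# return the file size in bytes.
rdata_size <- function(df, compress) {
  f <- tempfile(fileext = ".rdata")
  on.exit(unlink(f))
  save(df, file = f, compress = compress)
  file.size(f)
}

sizes <- c(
  strings_uncompressed = rdata_size(p_chr, FALSE),
  factors_uncompressed = rdata_size(p_fac, FALSE),
  strings_compressed   = rdata_size(p_chr, TRUE),
  factors_compressed   = rdata_size(p_fac, TRUE)
)
print(sizes)

# In-memory footprint of the two variants, in bytes.
mem <- c(strings = as.numeric(object.size(p_chr)),
         factors = as.numeric(object.size(p_fac)))
print(mem)
```

Wrapping the save() calls in system.time() would give a rough view of the timing side as well.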

I updated the defaults:
