When exporting data sets in R’s .Rdata format, one of the things to consider is how string vectors are exported. I can now write factors in my .Rdata writer, so we can do some experiments on exporting a string column just as strings or as a factor.
Here is an illustration of how things are stored in an .Rdata file:
There is some overhead with each string: 8 bytes. This can add up. An integer vector takes much less space.
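The per-string overhead can be checked from within R itself. The following is a minimal sketch, not the code of my .Rdata writer: it assumes that base R's serialize() in its default binary (XDR) mode writes the same per-element layout as an uncompressed .Rdata file, and the helper name bytes_per_element is made up for this illustration. It measures how many bytes each extra element adds to the stream, so the fixed headers cancel out.

```r
# Bytes added to the serialized stream per extra element.
# Serializing at two lengths and differencing cancels the fixed headers.
# bytes_per_element is a hypothetical helper, not part of any package.
bytes_per_element <- function(make, n1 = 1000L, n2 = 2000L) {
  s1 <- length(serialize(make(n1), NULL))
  s2 <- length(serialize(make(n2), NULL))
  (s2 - s1) / (n2 - n1)
}

# A character vector of two-character strings: 8 bytes overhead + 2 bytes text each
per_string <- bytes_per_element(function(n) rep("i1", n))

# An integer vector (what a factor stores per element): 4 bytes each
per_integer <- bytes_per_element(function(n) rep(1L, n))

print(c(per_string = per_string, per_integer = per_integer))
```

With longer labels the per-string cost grows by one byte per character, while the integer cost stays the same.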
When we generate a large data frame using gdx2r we see the following.
The “short” data set is generated as follows:
set i /i1*i200/;
alias (i,j,k);
parameter p(i,j,k);
p(i,j,k) = uniform(0,1);
When this is exported to an .Rdata file (with StringsAsFactors=F) and loaded in R, it looks like:
> load("p.rdata")
> head(p)
   i  j  k     value
1 i1 i1 i1 0.1717471
2 i1 i1 i2 0.8432667
3 i1 i1 i3 0.5503754
4 i1 i1 i4 0.3011379
5 i1 i1 i5 0.2922121
6 i1 i1 i6 0.2240529
> str(p)
'data.frame':	8000000 obs. of  4 variables:
 $ i    : chr  "i1" "i1" "i1" "i1" ...
 $ j    : chr  "i1" "i1" "i1" "i1" ...
 $ k    : chr  "i1" "i2" "i3" "i4" ...
 $ value: num  0.172 0.843 0.55 0.301 0.292 ...
The “large” data set uses much longer element names:
set i /amuchlongernamefortesting1*amuchlongernamefortesting200/;
alias (i,j,k);
parameter p(i,j,k);
p(i,j,k) = uniform(0,1);
Here we export to an .Rdata file with StringsAsFactors=T:
> load("p2.rdata")
> head(p)
                           i                          j                          k     value
1 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting1 0.1717471
2 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting2 0.8432667
3 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting3 0.5503754
4 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting4 0.3011379
5 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting5 0.2922121
6 amuchlongernamefortesting1 amuchlongernamefortesting1 amuchlongernamefortesting6 0.2240529
> str(p)
'data.frame':	8000000 obs. of  4 variables:
 $ i    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ j    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ k    : Factor w/ 200 levels "amuchlongernamefortesting1",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ value: num  0.172 0.843 0.55 0.301 0.292 ...
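To make the str() output above concrete: a factor keeps each distinct label once (the levels) and stores one integer code per row. A small sketch with a made-up four-element vector (the variable names are for illustration only):

```r
# A tiny string column, converted to a factor
x <- c("i1", "i2", "i3", "i1")
f <- factor(x)

lev   <- levels(f)      # the distinct labels, stored once
codes <- as.integer(f)  # one integer code per element, indexing into lev

print(lev)
print(codes)

# The original strings can always be recovered from codes + levels
roundtrip <- lev[codes]
```

Nothing is lost by exporting a factor: as.character(f) gives back the original strings if a plain character column is needed later.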
The timings are indeed what we expected:
- If not compressed then the .Rdata files are much smaller when using factors. Longer strings make this effect more pronounced.
- If compressed the .Rdata files are about the same size whether using strings or factors. But with factors we can do the compression faster (fewer bytes to compress).
- Dataframes with StringsAsFactors=T use up less memory inside R. They also load faster.
- Conclusion: making StringsAsFactors=T the default makes sense (this matches R’s read.csv, which defaulted to stringsAsFactors=TRUE before R 4.0.0).
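The file-size part of these observations can be reproduced on a small scale in plain R. This is only a sketch under assumptions: it uses R's own save() rather than gdx2r, a 20 x 20 x 20 stand-in instead of the 200 x 200 x 200 set, and a made-up helper saved_size. The row order produced by expand.grid also differs from the GAMS export, which does not matter for the sizes.

```r
# Small stand-in for the "large" test set: 20^3 = 8000 rows with long labels
labels <- paste0("amuchlongernamefortesting", 1:20)
p_chr <- expand.grid(i = labels, j = labels, k = labels,
                     stringsAsFactors = FALSE, KEEP.OUT.ATTRS = FALSE)
p_chr$value <- runif(nrow(p_chr))

# Same data with the three index columns stored as factors
p_fac <- p_chr
idx <- c("i", "j", "k")
p_fac[idx] <- lapply(p_fac[idx], factor, levels = labels)

# Hypothetical helper: size in bytes of the .Rdata file written by save()
saved_size <- function(df, compress) {
  f <- tempfile(fileext = ".rdata")
  on.exit(unlink(f))
  save(df, file = f, compress = compress)
  file.size(f)
}

sizes <- c(
  chr_raw = saved_size(p_chr, FALSE),  # strings, uncompressed
  fac_raw = saved_size(p_fac, FALSE),  # factors, uncompressed
  chr_gz  = saved_size(p_chr, TRUE),   # strings, gzip-compressed
  fac_gz  = saved_size(p_fac, TRUE)    # factors, gzip-compressed
)
print(sizes)

# In-memory footprint of both variants
mem <- c(chr = as.numeric(object.size(p_chr)),
         fac = as.numeric(object.size(p_fac)))
print(mem)
```

The exact numbers depend on the R version and platform, so none are quoted here; the uncompressed factor file should come out much smaller than the string version.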
I updated the defaults accordingly.