Wednesday, January 20, 2016

Dataframe Pivoting: some timings

In Pivot a table: GAMS, SQLite, R and Python we described a pivoting operation on a small example. In the comments it was mentioned that for larger tables or dataframes, especially tydir is very fast. Here we try to confirm this using an artificial example:

image

Below we time each step (see here for more information).

GAMS Step Code Time (seconds)
GAMS populate a 2D parameter with 1 million entries set i /i1*i100000/
    j
/j1*j10/
;
parameter
p(i,j);
p(i,j) = uniform(0,1);
0.4
Write a GDX file execute_unload 'x.gdx'; 0.3
Store in SQLite database

execute 'gdx2sqlite -i x.gdx -o x.db -fast';

5
R Step Code Time (seconds)
Read database table
df1<-dbGetQuery(db,"select * from p")
2
Pivot using SQL (no indices)
df2<- dbGetQuery(db,"select tj1.i as i, tj1.value as j1, tj2.value as j2, 
+ tj3.value as j3, tj4.value as j4,
tj5.value as j5, + tj6.value as j6, tj7.value as j7, tj8.value as j8, + tj9.value as j9, tj10.value as j10 + from (select i,value from p where j='j1') as tj1, + (select i,value from p where j='j2') as tj2, + (select i,value from p where j='j3') as tj3, + (select i,value from p where j='j4') as tj4, + (select i,value from p where j='j5') as tj5, + (select i,value from p where j='j6') as tj6, + (select i,value from p where j='j7') as tj7, + (select i,value from p where j='j8') as tj8, + (select i,value from p where j='j9') as tj9, + (select i,value from p where j='j10') as tj10 + where tj1.i=tj2.i and tj1.i=tj3.i and tj1.i=tj4.i and tj1.i=tj5.i + and tj1.i=tj6.i and tj1.i=tj7.i and tj1.i=tj8.i + and tj1.i=tj9.i and tj1.i=tj10.i")
50
Pivot using SQL after creating indices on columns i and j
id (same query as above)
15
Pivot using SQL (no indices, alternative query)
> df1 <- dbGetQuery(db,"
+    select i,
+           sum(case when j='j1' then value else null end) as j1,
+           sum(case when j='j2' then value else null end) as j2,
+           sum(case when j='j3' then value else null end) as j3,
+           sum(case when j='j4' then value else null end) as j4,
+           sum(case when j='j5' then value else null end) as j5,
+           sum(case when j='j6' then value else null end) as j6,
+           sum(case when j='j7' then value else null end) as j7,
+           sum(case when j='j8' then value else null end) as j8,
+           sum(case when j='j9' then value else null end) as j9,
+           sum(case when j='j10' then value else null end) as j10
+    from p
+    group by i")
5
Pivot using reshape2/dcast
df2<-dcast(df1,i~j,value.var="value")
2
Pivot using tidyr/spread
df2<-spread(df1,j,value)
2
Python Step Code Time (seconds)
pandas/pivot df2=df1.pivot(index='i',columns='j',values='value') 2
Conclusion

The methods that work on in-memory datastructures are faster than SQL but depending on the SQL formulation we can come close.  The big advantage of the pivot methods in R and Python is that they require much less typing (and less thought).