Yet Another Math Programming Consultant: Dataframe Pivoting: some timings

Wednesday, January 20, 2016

Dataframe Pivoting: some timings

In Pivot a table: GAMS, SQLite, R and Python we described a pivoting operation on a small example. In the comments it was mentioned that for larger tables or dataframes, especially tydir is very fast. Here we try to confirm this using an artificial example:

Below we time each step (see here for more information).

GAMS Step	Code	Time (seconds)
GAMS populate a 2D parameter with 1 million entries	set i /i1i100000/ j /j1j10/ ; parameter p(i,j); p(i,j) = uniform(0,1);	0.4
Write a GDX file	execute_unload 'x.gdx';	0.3
Store in SQLite database	execute 'gdx2sqlite -i x.gdx -o x.db -fast';	5
R Step	Code	Time (seconds)
Read database table	df1<-dbGetQuery(db,"select * from p")	2
Pivot using SQL (no indices)	df2<- dbGetQuery(db,"select tj1.i as i, tj1.value as j1, tj2.value as j2, + tj3.value as j3, tj4.value as j4, tj5.value as j5, + tj6.value as j6, tj7.value as j7, tj8.value as j8, + tj9.value as j9, tj10.value as j10 + from (select i,value from p where j='j1') as tj1, + (select i,value from p where j='j2') as tj2, + (select i,value from p where j='j3') as tj3, + (select i,value from p where j='j4') as tj4, + (select i,value from p where j='j5') as tj5, + (select i,value from p where j='j6') as tj6, + (select i,value from p where j='j7') as tj7, + (select i,value from p where j='j8') as tj8, + (select i,value from p where j='j9') as tj9, + (select i,value from p where j='j10') as tj10 + where tj1.i=tj2.i and tj1.i=tj3.i and tj1.i=tj4.i and tj1.i=tj5.i + and tj1.i=tj6.i and tj1.i=tj7.i and tj1.i=tj8.i + and tj1.i=tj9.i and tj1.i=tj10.i")	50
Pivot using SQL after creating indices on columns i and j	id (same query as above)	15
Pivot using SQL (no indices, alternative query)	> df1 <- dbGetQuery(db," + select i, + sum(case when j='j1' then value else null end) as j1, + sum(case when j='j2' then value else null end) as j2, + sum(case when j='j3' then value else null end) as j3, + sum(case when j='j4' then value else null end) as j4, + sum(case when j='j5' then value else null end) as j5, + sum(case when j='j6' then value else null end) as j6, + sum(case when j='j7' then value else null end) as j7, + sum(case when j='j8' then value else null end) as j8, + sum(case when j='j9' then value else null end) as j9, + sum(case when j='j10' then value else null end) as j10 + from p + group by i")	5
Pivot using reshape2/dcast	df2<-dcast(df1,i~j,value.var="value")	2
Pivot using tidyr/spread	df2<-spread(df1,j,value)	2
Python Step	Code	Time (seconds)
pandas/pivot	df2=df1.pivot(index='i',columns='j',values='value')	2

Conclusion

The methods that work on in-memory datastructures are faster than SQL but depending on the SQL formulation we can come close. The big advantage of the pivot methods in R and Python is that they require much less typing (and less thought).

Yet Another Math Programming Consultant

Wednesday, January 20, 2016

Dataframe Pivoting: some timings

Conclusion

No comments:

Post a Comment