Calculating quantiles is a standard way to summarize data. In the example below we have some small random data p(i,j,k) where we want to calculate quantiles at the 0, 25, 50, 75, and 100% level for each (i,j):
---- 9 PARAMETER p k1 k2 k3 k4 k5 k6 k7 k8 i1.j1 18.003 84.483 55.487 30.813 29.929 23.181 35.633 85.771 ---- 143 PARAMETER quantiles 0% 25% 50% 75% 100% i1.j1 18.003 28.242 33.223 62.736 85.771 |
In GAMS this is not so easy. Here is one way:
set i /i1*i2/ j /j1*j3/ k /k1*k8/ q /'0%', '25%', '50%', '75%', '100%'/ ; parameter p(i,j,k); p(i,j,k) = uniform(1,100); display p; parameter v(k) 'vector to be sorted' r(k) 'sorting rank' perc0(q) /'0%' EPS, '25%' 25, '50%' 50, '75%' 75, '100%' 100 / perc(q) 'percentiles' quantiles(i,j,q) ; $libinclude rank loop((i,j), v(k) = p(i,j,k); perc(q) = perc0(q); $ libinclude rank v k r perc quantiles(i,j,q) = perc(q); ); display quantiles; |
We use here the rank utility by Tom Rutherford. A disadvantage is that it is designed to work on vectors. That means we need to loop to extract a k-vector. Under the hood a GDX file is written for each vector, a program called gdxrank is called, and another GDX file is written with the ranks of each vector element. For very large data sets this turns out to be very slow.
Note that in the data statement for perc0 we use EPS instead of 0. This on purpose: EPS will survive the trip to gdxrank, while a pure zero is not stored and not transmitted. If we would have used 0 instead of EPS we would have seen:
---- 142 PARAMETER quantiles 25% 50% 75% 100% i1.j1 28.242 33.223 62.736 85.771 |
How would this look in R?
R has the quantile function to calculate percentiles of a vector. We can use the by function to apply this on a slice of the data. This can look like:
|
We can compute all quantiles in a oneliner script. Although the results are correct, it would be nice if we can organize the results in a proper data frame for further processing. Here is my first attempt:
> cbind(unique(df[,1:2]), + matrix(unlist(q),ncol=5,byrow=T,dimnames=list(NULL,names(q[[1,1]]))) + ) i j 0% 25% 50% 75% 100% 1 i1 j1 18.002966 28.24206 33.22294 62.73622 85.77076 9 i1 j2 11.938737 29.25933 55.02918 69.63326 83.25839 17 i1 j3 7.644259 41.37528 61.31338 82.12764 99.81365 25 i2 j1 16.857103 28.78330 53.35881 65.53476 87.37377 33 i2 j2 14.017667 16.55922 30.77533 38.48282 67.22393 41 i2 j3 5.608600 17.43385 33.31175 45.56650 64.92699 |
The unique function populates the leftmost columns i and j. The unlist operation will make one long vector of the quantiles. We make the matrix of the correct size by using ncol=5. Finally, as the column names are lost in this operation, we re-establish them by dimnames.
Unfortunately this is wrong as unique populates the columns i and j in a different order than by does. The easy fix is as follows:
> q <- by(df$value,list(df$j,df$i),quantile) |
The change is in the order of list(df$j,df$i) passed on to by. Now unique and q have the same order.
Performance
If we change the data to:
set i /i1*i50/ |
we see the following timings:
- GAMS rank loop: 161 seconds
- R calculate quantiles: 1.5 seconds
- R convert results into a data frame: 0.08 seconds
No comments:
Post a Comment