When we want to compare different (economic) data, an often used approach is indexing. We choose one year (often the beginning of the time series) as the base year. We then normalize each time series such that the value at the base year is 100. Or:
\[\hat{x}_t = 100 \frac{x_t}{x_0}\] |
When doing this in R it is interesting to see the implementation can depend much on the chosen data structure.
Below I consider three different data structures: the time series are stored (1) row-wise in a matrix, (2) column-wise in a matrix, and (3) unordered in a long-format data frame.
Matrix: row wise
Below is some (artificial) data organized as a matrix with the rows being the series. We have three series: a,b and c. The columns represent the years. Row and column names are used to make this visible:
str(A)
## num [1:3, 1:20] 67.235 4743.289 0.871 64.69 5006.955 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:3] "a" "b" "c"
## ..$ : chr [1:20] "2011" "2012" "2013" "2014" ...
A
## 2011 2012 2013 2014 2015
## a 67.2353402 64.6901151 63.6518902 69.1449240 71.8494470
## b 4743.2887884 5006.9547226 5623.0654046 5912.6498320 5736.7239960
## c 0.8710604 0.9193449 0.8711697 0.8451556 0.8440797
## 2016 2017 2018 2019 2020
## a 77.2782169 75.5709375 83.1405320 82.0417121 85.6433929
## b 6302.6091905 6241.5527226 6140.4721640 5542.9254530 5627.5502851
## c 0.8496936 0.8739366 0.8955205 0.8681451 0.8202612
## 2021 2022 2023 2024 2025
## a 89.1381161 89.8803998 90.5211813 91.411429 94.4488615
## b 6006.0691966 5998.6549344 6121.5326208 6378.963851 6628.1064720
## c 0.8639629 0.8477383 0.8235255 0.887063 0.8998234
## 2026 2027 2028 2029 2030
## a 94.2000739 89.9766167 90.4627291 87.2632337 86.2154671
## b 7115.7111566 6410.3639302 7038.3387976 6765.7867813 7076.7998102
## c 0.8547652 0.9033014 0.9224435 0.9127362 0.9541839
To index this we can use the the following R code:
Aindx <- 100*A/A[,1]
Aindx
## 2011 2012 2013 2014 2015 2016 2017 2018
## a 100 96.21445 94.67029 102.84015 106.86262 114.93690 112.3976 123.6560
## b 100 105.55872 118.54782 124.65296 120.94402 132.87425 131.5870 129.4560
## c 100 105.54318 100.01254 97.02606 96.90254 97.54703 100.3302 102.8081
## 2019 2020 2021 2022 2023 2024 2025
## a 122.02171 127.37854 132.57628 133.68029 134.63334 135.9574 140.4750
## b 116.85827 118.64237 126.62247 126.46615 129.05671 134.4840 139.7365
## c 99.66531 94.16812 99.18518 97.32255 94.54286 101.8371 103.3021
## 2026 2027 2028 2029 2030
## a 140.10500 133.8234 134.5464 129.7877 128.2294
## b 150.01640 135.1460 148.3852 142.6391 149.1961
## c 98.12926 103.7013 105.8989 104.7845 109.5428
Note that the expression 100*A/A[,1] is not as trivial as it seems. We divide a \(3 \times 20\) matrix by a vector of length 3. The division is done element-wise and column-by-column. We sometimes say the elements of A[,1] are recycled.
The recycling mechanism can be illustrated with a small example:
c(1,2,3,4)+c(1,2)
## [1] 2 4 4 6
I try to have a picture in each post, so here we go:
Matrix column wise
If the matrix is organized column-wise (e.g. by taking the transpose), we have:
A
## a b c
## 2011 67.23534 4743.289 0.8710604
## 2012 64.69012 5006.955 0.9193449
## 2013 63.65189 5623.065 0.8711697
## 2014 69.14492 5912.650 0.8451556
## 2015 71.84945 5736.724 0.8440797
## 2016 77.27822 6302.609 0.8496936
## 2017 75.57094 6241.553 0.8739366
## 2018 83.14053 6140.472 0.8955205
## 2019 82.04171 5542.925 0.8681451
## 2020 85.64339 5627.550 0.8202612
## 2021 89.13812 6006.069 0.8639629
## 2022 89.88040 5998.655 0.8477383
## 2023 90.52118 6121.533 0.8235255
## 2024 91.41143 6378.964 0.8870630
## 2025 94.44886 6628.106 0.8998234
## 2026 94.20007 7115.711 0.8547652
## 2027 89.97662 6410.364 0.9033014
## 2028 90.46273 7038.339 0.9224435
## 2029 87.26323 6765.787 0.9127362
## 2030 86.21547 7076.800 0.9541839
The expression to index the series becomes now much more complicated:
Aindx <- 100*A/rep(A[1,],each=nrow(A))
Aindx
## a b c
## 2011 100.00000 100.0000 100.00000
## 2012 96.21445 105.5587 105.54318
## 2013 94.67029 118.5478 100.01254
## 2014 102.84015 124.6530 97.02606
## 2015 106.86262 120.9440 96.90254
## 2016 114.93690 132.8742 97.54703
## 2017 112.39764 131.5870 100.33019
## 2018 123.65600 129.4560 102.80808
## 2019 122.02171 116.8583 99.66531
## 2020 127.37854 118.6424 94.16812
## 2021 132.57628 126.6225 99.18518
## 2022 133.68029 126.4662 97.32255
## 2023 134.63334 129.0567 94.54286
## 2024 135.95741 134.4840 101.83714
## 2025 140.47503 139.7365 103.30206
## 2026 140.10500 150.0164 98.12926
## 2027 133.82340 135.1460 103.70135
## 2028 134.54640 148.3852 105.89891
## 2029 129.78775 142.6391 104.78448
## 2030 128.22939 149.1961 109.54279
In this case the automatic recycling is not working the way we want, and we have to do this by hand. Basically, in terms of our little example, before we were happy with c(1,2) being extended automatically to c(1,2,1,2) while we need now something like c(1,1,2,2).
Data frame long format
Often data comes in a “long” format. Here is a picture to illustrate the difference between a “wide” and a “long” format:
Often wide format data comes from spreadsheets while long format is often used in databases. Sometimes the operation to convert from long to wide is called “pivot” (and the reverse “unpivot”).
A long format data frame with the above data can look like (I show the first part only):
head(df)
## series year value
## 1 a 2011 67.2353402
## 2 b 2011 4743.2887884
## 3 c 2011 0.8710604
## 4 a 2012 64.6901151
## 5 b 2012 5006.9547226
## 6 c 2012 0.9193449
How can we index this?
Here is my solution:
# get first year
y0 <- min(df$year)
y0
## [1] 2011
# get values at first year
x0 <- df[df$year==y0,"value"]
x0
## [1] 67.2353402 4743.2887884 0.8710604
# allow x0 to be indexed by series name
names(x0) <- df[df$year==y0,"series"]
x0
## a b c
## 67.2353402 4743.2887884 0.8710604
# indexing of the series
df$indexedvalue <- 100*df$value/x0[df$series]
head(df)
## series year value indexedvalue
## 1 a 2011 67.2353402 100.00000
## 2 b 2011 4743.2887884 100.00000
## 3 c 2011 0.8710604 100.00000
## 4 a 2012 64.6901151 96.21445
## 5 b 2012 5006.9547226 105.55872
## 6 c 2012 0.9193449 105.54318
The trick I used was to make the vector of values of the first year addressable by the series name. E.g.:
x0["a"]
## a
## 67.23534
This allows us to calculate the column with indexed values in one vectorized operation.
dplyr
In the comments below, Ricardo Sanchez, offered another, rather clean, approach for the last operation:
library(dplyr)
df <- df %>%
group_by(series) %>%
arrange(year) %>%
mutate(indexedvalue = 100 * value / first(value))
df
## Source: local data frame [60 x 4]
## Groups: series [3]
##
## series year value indexedvalue
## <fctr> <int> <dbl> <dbl>
## 1 a 2011 67.2353402 100.00000
## 2 b 2011 4743.2887884 100.00000
## 3 c 2011 0.8710604 100.00000
## 4 a 2012 64.6901151 96.21445
## 5 b 2012 5006.9547226 105.55872
## 6 c 2012 0.9193449 105.54318
## 7 a 2013 63.6518902 94.67029
## 8 b 2013 5623.0654046 118.54782
## 9 c 2013 0.8711697 100.01254
## 10 a 2014 69.1449240 102.84015
## # ... with 50 more rows
sqldf
Of course if you are familiar with SQL we can also use that:
library(sqldf)
df <- sqldf("
select df.series,year,value,100*value/v0 as indexedvalue
from df
join (select min(year),value as v0, series
from df
group by series) df0
on df.series = df0.series
")
head(df)
## series year value indexedvalue
## 1 a 2011 67.2353402 100.00000
## 2 b 2011 4743.2887884 100.00000
## 3 c 2011 0.8710604 100.00000
## 4 a 2012 64.6901151 96.21445
## 5 b 2012 5006.9547226 105.55872
## 6 c 2012 0.9193449 105.54318
References
- Federal Reserve Bank of Dallas, Indexing to a Common Starting Point, https://www.dallasfed.org/research/basics/indexing.aspx
No comments:
Post a Comment