Sunday, May 21, 2017

Indexing economic time series in R

When we want to compare different (economic) data, an often used approach is indexing. We choose one year (often the beginning of the time series) as the base year. We then normalize each time series such that the value at the base year is 100. Or:

\[\hat{x}_t = 100 \frac{x_t}{x_0}\]

When doing this in R it is interesting to see the implementation can depend much on the chosen data structure.

Below I consider three different data structures: the time series are stored (1) row-wise in a matrix, (2) column-wise in a matrix, and (3) unordered in a long-format data frame.

Matrix: row wise

Below is some (artificial) data organized as a matrix with the rows being the series. We have three series: a,b and c. The columns represent the years. Row and column names are used to make this visible:

str(A)

##  num [1:3, 1:20] 67.235 4743.289 0.871 64.69 5006.955 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3] "a" "b" "c"
##   ..$ : chr [1:20] "2011" "2012" "2013" "2014" ...

A

##           2011         2012         2013         2014         2015
## a   67.2353402   64.6901151   63.6518902   69.1449240   71.8494470
## b 4743.2887884 5006.9547226 5623.0654046 5912.6498320 5736.7239960
## c    0.8710604    0.9193449    0.8711697    0.8451556    0.8440797
##           2016         2017         2018         2019         2020
## a   77.2782169   75.5709375   83.1405320   82.0417121   85.6433929
## b 6302.6091905 6241.5527226 6140.4721640 5542.9254530 5627.5502851
## c    0.8496936    0.8739366    0.8955205    0.8681451    0.8202612
##           2021         2022         2023         2024         2025
## a   89.1381161   89.8803998   90.5211813    91.411429   94.4488615
## b 6006.0691966 5998.6549344 6121.5326208  6378.963851 6628.1064720
## c    0.8639629    0.8477383    0.8235255     0.887063    0.8998234
##           2026         2027         2028         2029         2030
## a   94.2000739   89.9766167   90.4627291   87.2632337   86.2154671
## b 7115.7111566 6410.3639302 7038.3387976 6765.7867813 7076.7998102
## c    0.8547652    0.9033014    0.9224435    0.9127362    0.9541839

To index this we can use the the following R code:

Aindx <- 100*A/A[,1]
Aindx

##   2011      2012      2013      2014      2015      2016     2017     2018
## a  100  96.21445  94.67029 102.84015 106.86262 114.93690 112.3976 123.6560
## b  100 105.55872 118.54782 124.65296 120.94402 132.87425 131.5870 129.4560
## c  100 105.54318 100.01254  97.02606  96.90254  97.54703 100.3302 102.8081
##        2019      2020      2021      2022      2023     2024     2025
## a 122.02171 127.37854 132.57628 133.68029 134.63334 135.9574 140.4750
## b 116.85827 118.64237 126.62247 126.46615 129.05671 134.4840 139.7365
## c  99.66531  94.16812  99.18518  97.32255  94.54286 101.8371 103.3021
##        2026     2027     2028     2029     2030
## a 140.10500 133.8234 134.5464 129.7877 128.2294
## b 150.01640 135.1460 148.3852 142.6391 149.1961
## c  98.12926 103.7013 105.8989 104.7845 109.5428

Note that the expression 100*A/A[,1] is not as trivial as it seems. We divide a \(3 \times 20\) matrix by a vector of length 3. The division is done element-wise and column-by-column. We sometimes say the elements of A[,1] are recycled.

The recycling mechanism can be illustrated with a small example:

c(1,2,3,4)+c(1,2)

## [1] 2 4 4 6

I try to have a picture in each post, so here we go:

image

Matrix column wise

If the matrix is organized column-wise (e.g. by taking the transpose), we have:

A

##             a        b         c
## 2011 67.23534 4743.289 0.8710604
## 2012 64.69012 5006.955 0.9193449
## 2013 63.65189 5623.065 0.8711697
## 2014 69.14492 5912.650 0.8451556
## 2015 71.84945 5736.724 0.8440797
## 2016 77.27822 6302.609 0.8496936
## 2017 75.57094 6241.553 0.8739366
## 2018 83.14053 6140.472 0.8955205
## 2019 82.04171 5542.925 0.8681451
## 2020 85.64339 5627.550 0.8202612
## 2021 89.13812 6006.069 0.8639629
## 2022 89.88040 5998.655 0.8477383
## 2023 90.52118 6121.533 0.8235255
## 2024 91.41143 6378.964 0.8870630
## 2025 94.44886 6628.106 0.8998234
## 2026 94.20007 7115.711 0.8547652
## 2027 89.97662 6410.364 0.9033014
## 2028 90.46273 7038.339 0.9224435
## 2029 87.26323 6765.787 0.9127362
## 2030 86.21547 7076.800 0.9541839

The expression to index the series becomes now much more complicated:

Aindx <- 100*A/rep(A[1,],each=nrow(A))
Aindx

##              a        b         c
## 2011 100.00000 100.0000 100.00000
## 2012  96.21445 105.5587 105.54318
## 2013  94.67029 118.5478 100.01254
## 2014 102.84015 124.6530  97.02606
## 2015 106.86262 120.9440  96.90254
## 2016 114.93690 132.8742  97.54703
## 2017 112.39764 131.5870 100.33019
## 2018 123.65600 129.4560 102.80808
## 2019 122.02171 116.8583  99.66531
## 2020 127.37854 118.6424  94.16812
## 2021 132.57628 126.6225  99.18518
## 2022 133.68029 126.4662  97.32255
## 2023 134.63334 129.0567  94.54286
## 2024 135.95741 134.4840 101.83714
## 2025 140.47503 139.7365 103.30206
## 2026 140.10500 150.0164  98.12926
## 2027 133.82340 135.1460 103.70135
## 2028 134.54640 148.3852 105.89891
## 2029 129.78775 142.6391 104.78448
## 2030 128.22939 149.1961 109.54279

In this case the automatic recycling is not working the way we want, and we have to do this by hand. Basically, in terms of our little example, before we were happy with c(1,2) being extended automatically to c(1,2,1,2) while we need now something like c(1,1,2,2).

Data frame long format

Often data comes in a “long” format. Here is a picture to illustrate the difference between a “wide” and a “long” format:

widelong

Often wide format data comes from spreadsheets while long format is often used in databases. Sometimes the operation to convert from long to wide is called “pivot” (and the reverse “unpivot”).

A long format data frame with the above data can look like (I show the first part only):

head(df)

##   series year        value
## 1      a 2011   67.2353402
## 2      b 2011 4743.2887884
## 3      c 2011    0.8710604
## 4      a 2012   64.6901151
## 5      b 2012 5006.9547226
## 6      c 2012    0.9193449

How can we index this?

Here is my solution:

# get first year
y0 <- min(df$year)
y0

## [1] 2011

# get values at first year
x0 <- df[df$year==y0,"value"]
x0

## [1]   67.2353402 4743.2887884    0.8710604

# allow x0 to be indexed by series name
names(x0) <- df[df$year==y0,"series"]
x0

##            a            b            c
##   67.2353402 4743.2887884    0.8710604

# indexing of the series
df$indexedvalue <- 100*df$value/x0[df$series]
head
(df)

##   series year        value indexedvalue
## 1      a 2011   67.2353402    100.00000
## 2      b 2011 4743.2887884    100.00000
## 3      c 2011    0.8710604    100.00000
## 4      a 2012   64.6901151     96.21445
## 5      b 2012 5006.9547226    105.55872
## 6      c 2012    0.9193449    105.54318

The trick I used was to make the vector of values of the first year addressable by the series name. E.g.:

x0["a"]

##        a
## 67.23534

This allows us to calculate the column with indexed values in one vectorized operation.

dplyr

In the comments below, Ricardo Sanchez, offered another, rather clean, approach for the last operation:

library(dplyr)
df <- df %>%
      group_by(series) %>%
      arrange(year) %>%
      mutate(indexedvalue = 100 * value / first(value))
df

## Source: local data frame [60 x 4]
## Groups: series [3]
##
##    series  year        value indexedvalue
##    <fctr> <int>        <dbl>        <dbl>
## 1       a  2011   67.2353402    100.00000
## 2       b  2011 4743.2887884    100.00000
## 3       c  2011    0.8710604    100.00000
## 4       a  2012   64.6901151     96.21445
## 5       b  2012 5006.9547226    105.55872
## 6       c  2012    0.9193449    105.54318
## 7       a  2013   63.6518902     94.67029
## 8       b  2013 5623.0654046    118.54782
## 9       c  2013    0.8711697    100.01254
## 10      a  2014   69.1449240    102.84015
## # ... with 50 more rows

sqldf

Of course if you are familiar with SQL we can also use that:

library(sqldf)
df <-  sqldf("
         select df.series,year,value,100*value/v0 as indexedvalue
         from df
         join (select min(year),value as v0, series
             from df
             group by series) df0
             on df.series = df0.series
       "
)
head
(df)

##   series year        value indexedvalue
## 1      a 2011   67.2353402    100.00000
## 2      b 2011 4743.2887884    100.00000
## 3      c 2011    0.8710604    100.00000
## 4      a 2012   64.6901151     96.21445
## 5      b 2012 5006.9547226    105.55872
## 6      c 2012    0.9193449    105.54318

References
  1. Federal Reserve Bank of Dallas, Indexing to a Common Starting Point, https://www.dallasfed.org/research/basics/indexing.aspx

No comments:

Post a Comment