Friday, December 9, 2016

Reading CSV files in R: read.csv vs read_csv

There are a number of very fast CSV file readers available in R and Python. Let's run a quick test to see how they compare.

Generating CSV file

I generated a very simple but large CSV file with 100 million records using the following GAMS script:


* four sets of 100 elements give 100^4 = 100 million records
set i /i1*i100/, j /j1*j100/, k /k1*k100/, l /l1*l100/;

parameter d(i,j,k,l);
d(i,j,k,l) = uniform(0,1);

$setenv gdxcompress 1
execute_unload "d.gdx", d;
execute "gdxdump d.gdx output=d.csv symb=d delim=comma format=csv";

The generated CSV file starts with a header row naming the five columns:

D:\tmp\csv>head d.csv
"i","j","k","l","Val"
...

D:\tmp\csv>dir d.*
Volume in drive D is My Passport
Volume Serial Number is 74B7-6DCC

Directory of D:\tmp\csv

12/08/2016  03:42 PM     3,656,869,678 d.csv
12/08/2016  03:30 PM       806,199,476 d.gdx
               2 File(s)  4,463,069,154 bytes
               0 Dir(s)  1,099,214,491,648 bytes free


We also see the CSV file is about 4.5 times larger than the intermediate (compressed) GAMS GDX file.

R read.csv

This is the default CSV reader in base R. It needs about 24 minutes to read this file:

> system.time(d<-read.csv("d.csv"))
   user  system elapsed 
1361.61   50.56 1434.39 
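
Part of read.csv's overhead is guessing column types. Declaring the types up front should help somewhat; a minimal sketch, assuming the five columns of the file above:

# Sketch: skip type guessing by declaring the column classes.
system.time(d <- read.csv("d.csv",
    colClasses = c("character", "character", "character",
                   "character", "numeric")))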

R read_csv

read_csv is from the readr package [1], and it is much faster on large CSV files. Here the elapsed time drops from about 24 minutes to just over 3 minutes:

> system.time(d<-read_csv("d.csv"))
Parsed with column specification:
cols(
  i = col_character(),
  j = col_character(),
  k = col_character(),
  l = col_character(),
  Val = col_double()
)
|================================================================================| 100% 3487 MB
   user  system elapsed 
 186.23    5.66  196.20 
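
The column specification that readr prints can be passed back in via col_types, so read_csv skips the guessing step (and the message); a sketch:

library(readr)
# Sketch: reuse the column specification reported above.
spec <- cols(i = col_character(), j = col_character(),
             k = col_character(), l = col_character(),
             Val = col_double())
system.time(d <- read_csv("d.csv", col_types = spec, progress = FALSE))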

Would it help to read a compressed CSV file?
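
The compressed file d2.csv.gz is not created anywhere above; presumably it is just a gzipped copy of d.csv. One way to produce it from R, assuming the R.utils package is available:

library(R.utils)
# Assumption: d2.csv.gz is simply d.csv compressed with gzip.
gzip("d.csv", destname = "d2.csv.gz", remove = FALSE)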

> system.time(d<-read_csv("d2.csv.gz"))
Error in .Call("readr_read_connection_", PACKAGE = "readr", con, chunk_size) : 
  negative length vectors are not allowed
Timing stopped at: 57.53 4.43 62.29 

Bummer. Maybe we hit some size limit: the "negative length vectors" error usually indicates a 32-bit integer overflow, and the uncompressed CSV file is well over 2 GB (other compression formats gave the same result).
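
One possible workaround is to decompress to a temporary file first and let read_csv see a plain file on disk; a sketch, again assuming R.utils:

library(R.utils)
# Workaround sketch: decompress first, then read the plain CSV.
tmp <- tempfile(fileext = ".csv")
gunzip("d2.csv.gz", destname = tmp, remove = FALSE)
system.time(d <- read_csv(tmp))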

R fread

As mentioned in the comments, the data.table package provides the function fread. On this file it sits between read.csv and read_csv in speed:

> system.time(dt<-fread("d.csv"))
Read 100000000 rows and 5 (of 5) columns from 3.406 GB file in 00:04:40
   user  system elapsed 
 275.19    4.33  281.05 
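
Note that fread returns a data.table rather than a plain data frame. If downstream code expects a data.frame, data.table's setDF converts the result in place without copying:

library(data.table)
# fread yields a data.table; setDF() turns it into a data.frame in place.
dt <- fread("d.csv")
setDF(dt)
class(dt)   # now "data.frame"
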
Python pandas.read_csv

pandas' CSV reader, pandas.read_csv, is also quite fast on this file.
The ParaText library [2] should be even faster.

References

  1. readr 1.0.0
  2. Damian Eads, ParaText: CSV parsing at 2.5 GB per second