There are a number of very fast CSV file readers available in R and Python. Let's run a quick test to see how they compare.
Generating the CSV file
I generated a very simple, but large, CSV file with 100 million records using a GAMS script.
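The GAMS script itself is not reproduced here. As a rough, hypothetical sketch, a file of similar shape could be generated in Python; the column names and value ranges below are assumptions (the benchmark file has five columns, but their exact layout is not shown in the post):

```python
# Hypothetical sketch: generate a CSV file of random records.
# The real benchmark file was produced by a GAMS script; the column
# names (i1..i4, value) and value ranges here are assumptions.
import csv
import random

def write_csv(path, n_rows, seed=42):
    """Write n_rows records: four integer index columns plus one value column."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["i1", "i2", "i3", "i4", "value"])
        for _ in range(n_rows):
            w.writerow([rng.randint(1, 100),
                        rng.randint(1, 100),
                        rng.randint(1, 100),
                        rng.randint(1, 100),
                        round(rng.random(), 6)])

# The benchmark used 100 million rows; use a small count to try it out:
write_csv("d_small.csv", 1000)
```

Scaling `n_rows` up to 100 million would produce a file in the multi-gigabyte range, comparable to the one used below.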
A directory listing shows the size of the generated CSV file:
Directory of D:\tmp\csv
12/08/2016 03:42 PM 3,656,869,678 d.csv
We also see that the CSV file is much larger than the intermediate (compressed) GAMS GDX file.
read.csv is the default CSV reader in R:
> system.time(d <- read.csv("d.csv"))
   user  system elapsed 
1361.61   50.56 1434.39
read_csv is from the readr package, and it is much faster for large CSV files:
Would it help to read a compressed CSV file?
> system.time(d <- read_csv("d2.csv.gz"))
Error in .Call("readr_read_connection_", PACKAGE = "readr", con, chunk_size) : 
  negative length vectors are not allowed
Timing stopped at: 57.53 4.43 62.29
Bummer. It is not entirely clear what went wrong here, but we may have hit a size limit: "negative length vectors are not allowed" typically signals a 32-bit integer overflow somewhere, and the CSV file is larger than 2 GB (other compression formats gave the same result).
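For comparison, pandas reads gzip-compressed CSV files transparently (compression is inferred from the file extension). A small self-contained sketch, using a tiny demo file rather than the 3.4 GB benchmark file:

```python
# Sketch: pandas handles .csv.gz transparently.
# Tiny demo file, not the 3.4 GB benchmark data.
import gzip

import pandas as pd

# Write a minimal gzipped CSV file.
with gzip.open("tiny.csv.gz", "wt") as f:
    f.write("a,b\n1,2\n3,4\n")

# Compression is inferred from the .gz extension.
df = pd.read_csv("tiny.csv.gz")
print(df.shape)  # (2, 2)
```

Whether this helps performance-wise depends on the trade-off between disk I/O saved and decompression cost.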
As mentioned in the comments, the package data.table has a function fread.
> system.time(dt <- fread("d.csv"))
Read 100000000 rows and 5 (of 5) columns from 3.406 GB file in 00:04:40
   user  system elapsed 
 275.19    4.33  281.05
In Python, pandas has its own fast CSV reader, pd.read_csv. Timing it (with perf_counter imported as pc):

import pandas as pd
from time import perf_counter as pc

t0 = pc()
df = pd.read_csv("d.csv")
print(pc() - t0)
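A fully self-contained version of this timing test can be run on a small generated file (the numbers are of course not comparable to the benchmark timings above):

```python
# Self-contained micro-benchmark: generate a small CSV file, then time
# pandas.read_csv on it. Timings on this toy file are illustrative only.
import csv
import random
from time import perf_counter as pc

import pandas as pd

# Generate a small CSV file with a made-up column layout.
rng = random.Random(0)
with open("bench_small.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["i", "j", "value"])
    for _ in range(10_000):
        w.writerow([rng.randint(1, 100), rng.randint(1, 100), rng.random()])

# Time the read.
t0 = pc()
df = pd.read_csv("bench_small.csv")
elapsed = pc() - t0
print(f"read {len(df)} rows in {elapsed:.3f} s")
```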
The paratext library should be even faster.