GAMS has three easy ways to populate a SQLite database:
- Using the tool gdx2sqlite. This tool populates a SQLite database with data from a GDX file, which means we first have to export the GAMS data to a GDX file. As there is quite a bit of file I/O going on (writing the GDX file, reading the GDX file, writing the database), I would expect this to be slower than the next method.
- The new GAMS Connect facility. This does not use intermediate files and copies records directly from in-memory data. This should be the fastest.
- Old-fashioned CSV files. We first export the data as a GDX file and then use gdxdump to convert it to a CSV file. The sqlite3 command-line tool can then import the CSV file and populate the database. There is much file I/O here, so this should be slow.
Here are the measured timings for loading 6.25 million records (the full model is in the appendix):

----     86 PARAMETER timings  elapsed time (seconds)

gdx2sqlite   6.329,    GAMSConnect 39.480,    CSV         17.053
GDX2SQLITE v 0.7
GDX Library 41.4.0 caab8bc0 Dec 14, 2022 WEI x86 64bit/MS Window
Current directory: C:\Users\erwin\Downloads
InputFile: pdata.gdx
OutputFile: p1.sqlite
Symbols: 1
Uels: 50
Loading Uels
Processing Symbols
   p (6250000 records)
Inserts: 6250000
Elapsed: 5.96 seconds
Inserts/second: 1048130.14
Done
About a million inserts per second! Most other databases are way slower. That also means these timing differences are likely less pronounced when using other databases: slow inserts would dominate everything else. "Big" database systems typically offer special facilities for BULK INSERTs.
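For illustration, here is a plain one-row-at-a-time insert next to SQLite's multi-row form (a sketch with made-up values; SQLite has no dedicated bulk-load statement, but multi-row VALUES, available since SQLite 3.7.11, comes closest):

$onecho > bulk.sql
-- one row per statement (what a straightforward loader issues)
INSERT INTO p VALUES ('index1','index1','index1','index1', 0.17);
-- multi-row form: several rows in a single statement
INSERT INTO p VALUES
  ('index1','index1','index1','index2', 0.84),
  ('index1','index1','index1','index3', 0.55);
$offecho
execute "sqlite3 p1.sqlite < bulk.sql";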
Some notes
We can speed up gdx2sqlite further by using the -fast and -small options:
execute "gdx2sqlite -i pdata.gdx -o p1.sqlite -fast -small";
This improves the time for gdx2sqlite itself to just 4.45 seconds:
GDX2SQLITE v 0.7
GDX Library 41.4.0 caab8bc0 Dec 14, 2022 WEI x86 64bit/MS Window
Current directory: C:\Users\erwin\Downloads
InputFile: pdata.gdx
OutputFile: p1.sqlite
Symbols: 1
Uels: 50
Loading Uels
Processing Symbols
   p (6250000 records)
Inserts: 6250050
Elapsed: 4.45 seconds
Inserts/second: 1404821.31
Done
The speed is now 1.4 million inserts per second, and the total time for method 1 becomes 4.7 seconds. The -fast option turns off the rollback journal and uses asynchronous I/O (less syncing). The -small option uses a super-sparsity storage scheme: the unique strings are stored in a separate, small table, basically a normalization step in database terms.
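In SQLite terms, -fast plausibly corresponds to pragmas like the ones below, and -small to a normalized two-table layout. This is a sketch of the idea only (the database name demo.sqlite is just for illustration; gdx2sqlite's actual settings and schema may differ):

$onecho > tune.sql
-- roughly what -fast suggests: no rollback journal, no fsync after every write
PRAGMA journal_mode = OFF;
PRAGMA synchronous = OFF;
-- a super-sparsity layout in the spirit of -small (hypothetical DDL):
-- each unique string is stored once, the fact table holds integer ids
CREATE TABLE strings(id INTEGER PRIMARY KEY, s TEXT);
CREATE TABLE p(i1 INT, i2 INT, i3 INT, i4 INT, value FLOAT);
$offecho
execute "sqlite3 demo.sqlite < tune.sql";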
Another, smaller issue with GAMS Connect is that it generates slightly strange column names:
C:\Users\erwin\Downloads>sqlite3 p2.sqlite
SQLite version 3.8.5 2014-06-04 14:06:34
Enter ".help" for usage hints.
sqlite> .schema p
CREATE TABLE p (
        i1_0 TEXT,
        i2_1 TEXT,
        i3_2 TEXT,
        i4_3 TEXT,
        value FLOAT
);
sqlite>
As the names for the GAMS indices are already unique, there is no need to mangle them.
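If desired, the suffixes can be stripped afterwards with ALTER TABLE ... RENAME COLUMN (this statement requires SQLite 3.25 or newer, so the old 3.8.5 shell shown above would not accept it; the column names follow the schema printed above):

$onecho > rename.sql
-- drop the _0.._3 suffixes that GAMS Connect appended
ALTER TABLE p RENAME COLUMN i1_0 TO i1;
ALTER TABLE p RENAME COLUMN i2_1 TO i2;
ALTER TABLE p RENAME COLUMN i3_2 TO i3;
ALTER TABLE p RENAME COLUMN i4_3 TO i4;
$offecho
execute "sqlite3 p2.sqlite < rename.sql";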
Finally, I am not sure how to tell GAMS Connect to use a database in the current working directory. It seems we always need to specify a full path in URL notation (with a lot of / characters).
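If the connection string is handed to SQLAlchemy [1] unchanged, the standard three-slash form denotes a path relative to the current working directory, so something like the following might work (an untested sketch; whether Connect interprets the URL this way is an assumption):

embeddedCode Connect:
- GAMSReader:
    symbols:
      - name: p
- PandasSQLWriter:
    connection: "sqlite:///p2.sqlite"
    symbols:
      - name: p
        tableName: p
endEmbeddedCode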
The CSV method always uses TEXT as the column type:
C:\Users\erwin\Downloads>sqlite3 p3.sqlite
SQLite version 3.8.5 2014-06-04 14:06:34
Enter ".help" for usage hints.
sqlite> .schema p
CREATE TABLE p(
  "i1" TEXT,
  "i2" TEXT,
  "i3" TEXT,
  "i4" TEXT,
  "Val" TEXT
);
sqlite>
In SQLite itself, that is not much of a problem: declared column types are merely hints. SQLite uses dynamic typing, and each value carries its own type; putting it differently, types are associated with values instead of columns [2]. However, with an all-TEXT schema the numbers are stored as text, which may cause problems with other software. If we want proper column types (so values are converted on insert), we can do something like:
$onecho > import.sql
CREATE TABLE p(i1 TEXT, i2 TEXT, i3 TEXT, i4 TEXT, val FLOAT);
.import --csv --skip 1 p.csv p
$offecho
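To check what actually got stored, the typeof() function reports the storage class per value (a quick sanity check, assuming the table was created with the schema above):

$onecho > check.sql
-- typeof() is per value, not per column: expect 'text' for i1 and 'real' for val
SELECT typeof(i1), typeof(val) FROM p LIMIT 1;
$offecho
execute "sqlite3 p3.sqlite < check.sql";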
Conclusion
SQLite can be very fast: a million inserts per second is phenomenal performance. Only gdx2sqlite seems to achieve this; I would have guessed that the new GAMS Connect tool would be a little bit faster.
GAMS Connect is quite new, so it is understandable that things are not fully optimized yet. Future releases may be much faster.
References
1. SQLAlchemy, The Python SQL Toolkit and Object Relational Mapper, https://www.sqlalchemy.org/.
2. Data types in SQLite, https://www.sqlite.org/datatype3.html.
Appendix: GAMS code
$onText
Populate a SQLite database with a large parameter.
Number of records: 6.25m
$offText
*----------------------------------------------------
* clean up before we start
*----------------------------------------------------

$call rm -f p1.sqlite
$call rm -f p2.sqlite
$call rm -f p3.sqlite
$call rm -f p.csv
*----------------------------------------------------
* random data
*----------------------------------------------------

Set i /index1*index50/;
alias(i,i1,i2,i3,i4);

Parameter p(i1,i2,i3,i4);
p(i1,i2,i3,i4) = uniform(0,1);
*----------------------------------------------------
* timing
*----------------------------------------------------
Parameter timings(*) 'elapsed time (seconds)';
Scalar t 'current time';
*----------------------------------------------------
* method1: gdx2sqlite via GDX file
*----------------------------------------------------
t = timeElapsed;
execute_unload "pdata",p; execute "gdx2sqlite -i pdata.gdx -o p1.sqlite";
timings('gdx2sqlite') = timeElapsed - t;
*----------------------------------------------------
* method2: new GAMS Connect facility
*----------------------------------------------------
t = timeElapsed;
embeddedCode Connect:
- GAMSReader:
    symbols:
      - name: p
- PandasSQLWriter:
    connection: "sqlite:////tmp//p2.sqlite"
    symbols:
      - name: p
        tableName: p
endEmbeddedCode
timings('GAMSConnect') = timeElapsed - t;
*----------------------------------------------------
* method3: CSV files
*----------------------------------------------------
$onecho > import.sql
.mode csv
.import p.csv p
$offecho
t = timeElapsed;
execute_unload "pdata",p; execute "gdxdump pdata.gdx symb=p format=csv output=p.csv"; execute "sqlite3 p3.sqlite < import.sql";
timings('CSV') = timeElapsed - t;
display timings;
Just based on the inline YAML, I suspect GAMS Connect is going GAMS -> pandas -> sqlite. (I actually suspect the native sqlite3 Python library is faster than SQLAlchemy, but SQLAlchemy can connect to more endpoints.) It may be that the "GAMS -> pandas" part is where the time goes, not the write part.
Not a database expert, but very large bulk inserts from Python into different DBs (Postgres and Teradata at my workplace) tend to be slow. That is, I tend to get better results by manually chunking my uploads into smaller subsets. (At some point SQLAlchemy needs to do this under the hood as well, I believe, but I have never looked into the details.)
The reason that splitting bulk inserts into smaller chunks helps may be that otherwise the transaction log becomes large. In gdx2sqlite I turned off writing the rollback journal (I assume we are populating a new database), which should help. Interestingly, SQLite has no special facilities for bulk inserts, so I just issue standard inserts.
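The chunking idea in plain SQL: commit every so many rows, so that each transaction (and its journal) stays small. A minimal sketch with made-up rows, wrapped the way the post's other scripts are:

$onecho > chunked.sql
-- one transaction per chunk keeps the journal small
BEGIN;
INSERT INTO p VALUES ('index1','index1','index1','index1', 0.17);
INSERT INTO p VALUES ('index1','index1','index1','index2', 0.84);
-- ... more rows of this chunk ...
COMMIT;
$offecho
execute "sqlite3 p1.sqlite < chunked.sql";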