Monday, February 13, 2023

Populating SQLite databases

GAMS has three easy ways to populate a SQLite database (a condensed sketch of each method follows the list; the complete, timed model is in the appendix):

  1. The tool gdx2sqlite. This tool populates a SQLite database with data from a GDX file, so we first have to export the GAMS data to a GDX file. As there is quite a bit of file I/O going on (writing the GDX file, reading the GDX file, writing the database), I would expect this to be slower than the next method.
  2. The new GAMS Connect facility. This uses no intermediate files and copies records directly from in-memory data. This should be the fastest.
  3. Old-fashioned CSV files. We first export the data as a GDX file and then use gdxdump to convert it to a CSV file. The sqlite3 command-line tool can then import the CSV file and populate the database. There is a lot of file I/O here, so this should be slow.
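In condensed form, the core of each method looks like this:

* method 1: export to GDX, then run gdx2sqlite
execute_unload "pdata",p;
execute "gdx2sqlite -i pdata.gdx -o p1.sqlite";

* method 2: GAMS Connect, no intermediate files
embeddedCode Connect:
- GAMSReader:
    symbols:
      - name: p
- PandasSQLWriter:
    connection: "sqlite:////tmp//p2.sqlite"
    symbols:
      - name: p
        tableName: p
endEmbeddedCode

* method 3: export to GDX, gdxdump to CSV, sqlite3 .import
* (import.sql contains: .mode csv / .import p.csv p)
execute_unload "pdata",p;
execute "gdxdump pdata.gdx symb=p format=csv output=p.csv";
execute "sqlite3 p3.sqlite < import.sql";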

When trying these methods on a multidimensional parameter with 6,250,000 records, we actually see the following timings:

----     86 PARAMETER timings  elapsed time (seconds)

gdx2sqlite   6.329,    GAMSConnect 39.480,    CSV         17.053

Contrary to my predictions, GAMS Connect is much slower than gdx2sqlite and even slower than using CSV files. A factor of 6 compared to gdx2sqlite, and a factor of 2 compared to CSV files, is quite a large margin considering we are using the same underlying database technology. GAMS Connect uses the SQLAlchemy database toolkit, which is advertised as providing "high-performing database access" [1]. We certainly don't see that in these numbers.

The gdx2sqlite log shows some great statistics on the inserts:

GDX2SQLITE v 0.7
GDX Library      41.4.0 caab8bc0 Dec 14, 2022          WEI x86 64bit/MS Window
Current directory:C:\Users\erwin\Downloads

   InputFile:pdata.gdx
   OutputFile:p1.sqlite
   Symbols:1
   Uels:50
   Loading Uels
   Processing Symbols
     p(6250000 records)
   Inserts:6250000
   Elapsed:5.96 seconds
   Inserts/second:1048130.14
   Done

About a million inserts per second! Most other databases are way slower. That also suggests these timing differences would be less pronounced with other databases: there, the slow inserts themselves would dominate. "Big" databases typically offer special facilities for bulk inserts (BULK INSERT statements and the like).


Some notes


We can speed up gdx2sqlite further by using the -fast and -small options:

execute "gdx2sqlite -i pdata.gdx -o p1.sqlite -fast -small";

This improves the time for gdx2sqlite itself to just 4.45 seconds:


GDX2SQLITE v 0.7
GDX Library      41.4.0 caab8bc0 Dec 14, 2022          WEI x86 64bit/MS Window
Current directory:C:\Users\erwin\Downloads

   InputFile:pdata.gdx
   OutputFile:p1.sqlite
   Symbols:1
   Uels:50
   Loading Uels
   Processing Symbols
     p(6250000 records)
   Inserts:6250050
   Elapsed:4.45 seconds
   Inserts/second:1404821.31
   Done


The speed is now 1.4 million inserts per second, and the total time for method 1 becomes 4.7 seconds. The -fast option turns off the rollback journal and uses asynchronous I/O (less syncing). The -small option uses a super-sparsity storage scheme: the unique strings are stored in a separate, small table, essentially a normalization step in database terms.
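I would guess that, in SQLite terms, -fast corresponds roughly to the following pragmas and -small to a normalized two-table layout. This is only a sketch of the idea: the actual statements gdx2sqlite issues are not shown in the log, and the table and column names below are made up.

-- rough equivalent of -fast: no rollback journal, fewer syncs
PRAGMA journal_mode = OFF;
PRAGMA synchronous = OFF;

-- rough equivalent of -small: store each unique label (UEL) only once
CREATE TABLE uel(id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE p(i1 INTEGER, i2 INTEGER, i3 INTEGER, i4 INTEGER, value DOUBLE);
-- i1..i4 hold ids that reference uel instead of repeating the label strings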


Another, smaller issue with GAMS Connect is that it generates slightly strange column names:


C:\Users\erwin\Downloads>sqlite3 p2.sqlite
SQLite version 3.8.5 2014-06-04 14:06:34
Enter ".help" for usage hints.
sqlite> .schema p
CREATE TABLE p (
        i1_0 TEXT,
        i2_1 TEXT,
        i3_2 TEXT,
        i4_3 TEXT,
        value FLOAT
);
sqlite>


As the names for the GAMS indices are already unique, there is no need to mangle them. 
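If the mangled names are a nuisance, the columns can be renamed afterwards. Note that ALTER TABLE ... RENAME COLUMN requires SQLite 3.25 or newer, so it will not work with the old sqlite3.exe shown above:

ALTER TABLE p RENAME COLUMN i1_0 TO i1;
ALTER TABLE p RENAME COLUMN i2_1 TO i2;
ALTER TABLE p RENAME COLUMN i3_2 TO i3;
ALTER TABLE p RENAME COLUMN i4_3 TO i4;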

Finally, I am not sure how to tell GAMS Connect to use a database in the current working directory. I think we always need to specify a full path using URL notation (with a lot of / characters).
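For what it is worth: SQLAlchemy itself interprets a URL with three slashes, such as sqlite:///p2.sqlite, as a path relative to the current working directory (four slashes indicate an absolute path). If GAMS Connect passes the connection string on to SQLAlchemy unchanged, which I have not verified, something like the following might work:

embeddedCode Connect:
- GAMSReader:
    symbols:
      - name: p
- PandasSQLWriter:
    connection: "sqlite:///p2.sqlite"
    symbols:
      - name: p
        tableName: p
endEmbeddedCode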

The CSV method always uses TEXT as column type:

C:\Users\erwin\Downloads>sqlite3 p3.sqlite
SQLite version 3.8.5 2014-06-04 14:06:34
Enter ".help" for usage hints.
sqlite> .schema p
CREATE TABLE p(
  "i1" TEXT,
  "i2" TEXT,
  "i3" TEXT,
  "i4" TEXT,
  "Val" TEXT
);
sqlite>


In SQLite itself, that is not a problem: declared column types are merely hints (SQLite calls them type affinities), and each stored value carries its own storage class. Putting it differently: types are associated with values instead of columns [2]. However, it may cause problems with other software. If we want proper column types, we can do something like:


$onecho > import.sql
CREATE TABLE p(i1 TEXT, i2 TEXT, i3 TEXT, i4 TEXT, val FLOAT);
.import --csv --skip 1 p.csv p
$offecho


Note: this requires a newer version of sqlite3.exe than the one that is shipped with GAMS. 
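With the older sqlite3.exe, a possible workaround (a sketch, not tested here) is to create the table first, import the CSV file including its header row, and then delete the header row again:

$onecho > import.sql
CREATE TABLE p(i1 TEXT, i2 TEXT, i3 TEXT, i4 TEXT, val FLOAT);
.mode csv
.import p.csv p
-- old versions of .import have no --skip: the header line ends up as a data row
DELETE FROM p WHERE i1 = 'i1';
$offecho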

 

Conclusion

 

SQLite can be very fast: a million inserts per second is phenomenal performance. Of the three methods, only gdx2sqlite achieves this. I would have guessed that the new GAMS Connect tool would be at least a little faster.

GAMS Connect is quite new, so it is understandable that things are not fully optimized yet. Future releases may be much faster.


References


  1. SQLAlchemy, The Python SQL Toolkit and Object Relational Mapper, https://www.sqlalchemy.org/.
  2. Data types in SQLite, https://www.sqlite.org/datatype3.html.


Appendix: GAMS code

$onText
  Populate a SQLite database with a large parameter
  Number of records: 6.25m
$offText

*----------------------------------------------------
* clean up before we start
*----------------------------------------------------
$call rm -f p1.sqlite
$call rm -f p2.sqlite
$call rm -f p3.sqlite
$call rm -f p.csv

*----------------------------------------------------
* random data
*----------------------------------------------------
Set  i /index1*index50/;
alias(i,i1,i2,i3,i4);
Parameter p(i1,i2,i3,i4);
p(i1,i2,i3,i4) = uniform(0,1);

*----------------------------------------------------
* timing
*----------------------------------------------------
Parameter timings(*) 'elapsed time (seconds)';
Scalar t 'current time';

*----------------------------------------------------
* method1: gdx2sqlite via GDX file
*----------------------------------------------------
t = timeElapsed;

execute_unload "pdata",p;
execute "gdx2sqlite -i pdata.gdx -o p1.sqlite";

timings('gdx2sqlite') = timeElapsed - t;

*----------------------------------------------------
* method2: new GAMS Connect facility
*----------------------------------------------------
t = timeElapsed;

embeddedCode Connect:
- GAMSReader:
    symbols:
      - name: p
- PandasSQLWriter:
    connection: "sqlite:////tmp//p2.sqlite"
    symbols:
      - name: p
        tableName: p
endEmbeddedCode

timings('GAMSConnect') = timeElapsed - t;

*----------------------------------------------------
* method3: CSV files
*----------------------------------------------------
$onecho > import.sql
.mode csv
.import p.csv p
$offecho

t = timeElapsed;

execute_unload "pdata",p;
execute "gdxdump pdata.gdx symb=p format=csv output=p.csv";
execute "sqlite3 p3.sqlite < import.sql";

timings('CSV') = timeElapsed - t;

display timings;

 

2 comments:

  1. Just based on the inline YAML, I suspect GAMS Connect is going from GAMS -> pandas -> sqlite (I actually suspect the native sqlite Python library is faster than SQLAlchemy, but SQLAlchemy can connect to more endpoints). It may be that the "GAMS -> pandas" part is where the time goes, not the write part.

    Not a database expert, but very large bulk inserts from Python into different DBs (Postgres and Teradata at my workplace) tend to be slow. That is, I tend to get better results by manually chunking my uploads into smaller subsets. (I believe SQLAlchemy needs to do this under the hood at some point as well, but I have never looked into the details.)

    Reply:

    The reason that splitting bulk inserts into smaller chunks is beneficial could be that otherwise the transaction log becomes large. In gdx2sqlite I turned off writing the rollback journal (I assume we are populating a new database); that may help. Interestingly, SQLite has no special facilities for bulk inserts, so I just do standard INSERTs.