Thursday, May 12, 2016

gdx2sqlserver: more optimizations

The bulk import facility of SQL Server allows both ascii (text) files or binary files. The documentation states this can make some differences in performance. To test this I added a binary flag (-b) to the gdx2sqlserver tool so it can create a temporary file in native binary format.

To measure the difference we first execute the following GAMS model:

set i /i1*i100/;
alias
(i,j,k);
parameter
a(i,j,k);
a(i,j,k) = uniform(0,100);
execute_unload "test"
,a;
execute "gdx2sqlserver -i test.gdx -S .\SQL1 -db tmp -s tmp -v"
;

I.e. create a three dimensional parameter a with 100 x 100 x 100 = 1 million entries. The values are random. Then export this as gdx file test.gdx and finally import the gdx file into SQL Server using default settings (except for a verbose –v flag). We see:

a (1000000 records)
   select object_id('[tmp].[a]','U')
   drop table [tmp].[a]
   create table [tmp].[a]([i] varchar(4),[j] varchar(4),[k] varchar(4),[value] double precision)
   temp file: [C:\Users\Erwin\AppData\Local\Temp\tmp2B08.tmp]
   writing C:\Users\Erwin\AppData\Local\Temp\tmp2B08.tmp
   bulk insert [tmp].[a] from 'C:\Users\Erwin\AppData\Local\Temp\tmp2B08.tmp'
   rows affected: 1000000
   commit
   time: 16.0 seconds
   deleting [C:\Users\Erwin\AppData\Local\Temp\tmp2B08.tmp]

This was using a text file (actually a tab delimited file). Next we do the same thing but add the –b flag. The result is:

a (1000000 records)
   select object_id('[tmp].[a]','U')
   drop table [tmp].[a]
   create table [tmp].[a]([i] varchar(4),[j] varchar(4),[k] varchar(4),[value] double precision)
   temp file: [C:\Users\Erwin\AppData\Local\Temp\tmp9F9.tmp]
   writing C:\Users\Erwin\AppData\Local\Temp\tmp9F9.tmp
   bulk insert [tmp].[a] from 'C:\Users\Erwin\AppData\Local\Temp\tmp9F9.tmp'
with (datafiletype='native')
   rows affected: 1000000
   commit
  
time: 10.0 seconds
   deleting [C:\Users\Erwin\AppData\Local\Temp\tmp9F9.tmp]

The difference in performance is much more pronounced than I expected. A second advantage is that the floating point numbers are not suffering from round-off during the conversion from and to ascii.

A number of optimizations has enabled to move data quickly from a GAMS GDX file into SQL server and each optimization gave us some extra performance. Here is a short review of things I tried, with timings on transferring the same large parameter ‘a’, so the numbers are comparable:

  1. Multiple inserts. Done reasonably smart: no auto-commit after each insert, and a prepared statement to reduce some overhead: 258.3 seconds.
  2. Bulk insert with ascii file: 17.8 seconds.
  3. Improved buffering when writing ascii file: 16.0 seconds.
  4. Bulk insert with buffered binary file: 10.0 seconds.