Sunday, May 1, 2016

gdx2sqlserver: bulk inserts into SQL Server

In this post we describe how a bulk insert method can help get better performance when importing large data sets into an MySQL database. The same thing is true for SQL Server. In the tool gdx2sqlserver we dynamically switch to using bulk inserts when the number of records to insert is larger than a threshold. Using this we get good performance on GDX file with 100 element set and a 1 million element parameter:

set i /i1*i100/;
alias
(i,j,k);
parameter
a(i,j,k);
a(i,j,k) = uniform(0,100);
execute_unload "test"
,i,a;
execute "gdx2sqlserver -i test.gdx -S .\SQL1 -db tmp -s tmp"
;

The default output is:


--- Job Untitled_1.gms Start 05/01/16 12:46:29 24.6.1 r55820 WEX-WEI x86 64bit/MS Windows
GAMS 24.6.1   Copyright (C) 1987-2016 GAMS Development. All rights reserved
Licensee: Erwin Kalvelagen                               G150803/0001CV-GEN
          Amsterdam Optimization Modeling Group                     DC10455
--- Starting compilation
--- Untitled_1.gms(6) 3 Mb
--- Starting execution: elapsed 0:00:00.010
--- Untitled_1.gms(5) 36 Mb
--- GDX File C:\Users\Erwin\Documents\Embarcadero\Studio\Projects\gdx2sqlserver\Win32\Debug\test.gdx
--- Untitled_1.gms(6) 36 Mb
GDX2SQLServer v 0.1
Copyright (c) 2015-2016 Amsterdam Optimization Modeling Group LLC

   GDX Library      24.6.1 r55820 Released Jan 18, 2016 VS8 x86 32bit/MS Windows
   GDX:Input file: test.gdx
   GDX:Symbols: 2
   GDX:Uels: 100
   GDX:Loading Uels
   SQL:Selected driver: ODBC Driver 11 for SQL Server
   SQL:Connection string: Driver={ODBC Driver 11 for SQL Server};Server=.\SQL1;Trusted_Connection=Yes
   SQL:RDBMS version: Microsoft SQL Server 2014 - 12.0.2269.0 (X64)
   SQL:Database: tmp
   SQL:Schema: tmp
   -----------------------
   i (100 records)
   a (1000000 records)
   Done (17.2 seconds)

*** Status: Normal completion
--- Job Untitled_1.gms Stop 05/01/16 12:46:47 elapsed 0:00:17.862

With the –v (verbose) flag we see that the second, larger symbol is using BULK INSERT:

GDX2SQLServer v 0.1
Copyright (c) 2015-2016 Amsterdam Optimization Modeling Group LLC

   GDX Library      24.6.1 r55820 Released Jan 18, 2016 VS8 x86 32bit/MS Windows
   GDX:Input file: test.gdx
   GDX:Symbols: 2
   GDX:Uels: 100
   GDX:Loading Uels
   SQL:Selected driver: ODBC Driver 11 for SQL Server
   SQL:Connection string: Driver={ODBC Driver 11 for SQL Server};Server=.\SQL1;Trusted_Connection=Yes
   SQLInfo:[Microsoft][ODBC Driver 11 for SQL Server][SQL Server]Changed database context to 'master'.
   SQLInfo:[Microsoft][ODBC Driver 11 for SQL Server][SQL Server]Changed language setting to us_english.
      select @@VERSION
   SQL:RDBMS version: Microsoft SQL Server 2014 - 12.0.2269.0 (X64)
        Jun 10 2015 03:35:45
        Copyright (c) Microsoft Corporation
        Developer Edition (64-bit) on Windows NT 6.3 <X64> (Build 10586: )
      select db_id('tmp')
      use [tmp]
   SQLInfo:[Microsoft][ODBC Driver 11 for SQL Server][SQL Server]Changed database context to 'tmp'.
   SQL:Database: tmp
      select schema_id('tmp')
   SQL:Schema: tmp
   -----------------------
   i (100 records)
      select object_id('[tmp].[i]','U')
      drop table [tmp].[i]
      create table [tmp].[i]([i] varchar(4))
      insert into [tmp].[i] values (?)
      sqlexecute(100 times)
      commit
      Time: 0.1 seconds
   a (1000000 records)
      select object_id('[tmp].[a]','U')
      drop table [tmp].[a]
      create table [tmp].[a]([i] varchar(4),[j] varchar(4),[k] varchar(4),[value] double precision)
      temp file: [C:\Users\Erwin\AppData\Local\Temp\tmp908A.tmp]
      writing C:\Users\Erwin\AppData\Local\Temp\tmp908A.tmp
      bulk insert [tmp].[a] from 'C:\Users\Erwin\AppData\Local\Temp\tmp908A.tmp'
      rows affected: 1000000
      commit
      Time: 17.8 seconds
      deleting [C:\Users\Erwin\AppData\Local\Temp\tmp908A.tmp]
   Done (18.0 seconds)

Here we deem symbol ‘i’ to be small (100 records) and we use a prepared insert statement. Symbol ‘a’ is considered large and we use bulk insert method. Indeed when we would have used normal inserts for the large symbol ‘a', we would have seen:

a (1000000 records)
   select object_id('[tmp].[a]','U')
   drop table [tmp].[a]
   create table [tmp].[a]([i] varchar(4),[j] varchar(4),[k] varchar(4),[value] double precision)
   insert into [tmp].[a] values (?,?,?,?)
   sqlexecute(1000000 times)
   commit (100 times)
   Time: 258.3 seconds

which is much slower.

There are still some improvements possible:

  • Better buffering of the intermediate text files. This reduces the time from 17.8 to 16 seconds.
  • Use a binary data file (maybe slightly faster, and maintains precision) This has now been implemented (see here), and reduced the time to move symbol ‘a’ into SQL Server to 10 seconds.
  • Use bulk data API. This requires us to bypass ODBC and use some SQL Server Client DLL that may not be installed by the user.

No comments:

Post a Comment