Monday, March 6, 2023

Some approaches for moving data between MS Access and GAMS

Moving data between different environments is always more difficult than we hope. Here I list some approaches and actually try them out on a small data set. We hit some bugs along the way, as well as a few conceptual stumbling blocks (even for this stylized example). There were issues with both Access and GAMS Connect.

This question came up in an actual project. My idea was: "Let me show you how this can be done." I am afraid I got carried away a bit. But it demonstrates that we should not underestimate these seemingly menial tasks. When the data set becomes larger, the problems compound: we can no longer eyeball the data, and statistically it becomes more likely that we run into problems.

I also did a quick test on some larger data sets. GAMS Connect is extremely slow when inserting lots of data. It is much better to write a CSV file and import that into Access: this can be 50 times as fast as using GAMS Connect directly.
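To make the CSV route concrete, here is a minimal sketch of what it could look like end to end. Everything specific is an assumption: the embedded Connect step needs a recent GAMS, pyodbc and the "Microsoft Access Driver (*.mdb, *.accdb)" must be available to the Python that GAMS embeds, results.accdb must already exist, and a table p must not.

Set i 'rows' /i1*i1000/;
Parameter p(i) 'sample data';
p(i) = uniform(0,1);

* step 1: dump the parameter to a CSV file with embedded Connect code
embeddedCode Connect:
- GAMSReader:
    symbols:
      - name: p
- CSVWriter:
    file: p.csv
    name: p
endEmbeddedCode

* step 2: bulk-load the CSV into Access with a Jet/ACE make-table query
embeddedCode Python:
import os, pyodbc
con = pyodbc.connect(
    r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
    'DBQ=' + os.path.abspath('results.accdb') + ';')
cur = con.cursor()
# SELECT ... INTO reads p.csv directly and creates table p in one statement
cur.execute("SELECT * INTO p FROM [Text;FMT=Delimited;HDR=Yes;DATABASE="
            + os.getcwd() + "].[p#csv]")
con.commit()
con.close()
endEmbeddedCode

The make-table query here is the same mechanism mentioned in the comments below for the new Connect agents; it avoids issuing one INSERT per record.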


Word Document


This document describes different methods, based on GAMS scripts, for moving data between GAMS and MS Access. Although this is about Access, some of the concepts are also applicable to other databases.


PowerPoint Presentation


This is a presentation on the same subject.



My conclusions are:
  1. When writing to a database, standard row-by-row inserts are very expensive. For larger data sets we need to use specialized BULK INSERT methods.
  2. When reading, it can be very beneficial to support wide tables: exporting a wide table from a database is much more efficient than exporting the corresponding long table. (A small sketch contrasting the two layouts follows this list.)
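
To illustrate conclusion 2, here is a hedged sketch (sizes and file names invented) that writes the same parameter in both layouts. The long form has one row per (i,j) pair; the wide form has one row per i and one column per j, so far fewer records cross the interface. The pivot uses pandas; the records column names ('i', 'j', 'value') follow gams.transfer's convention for these domains.

Set i /i1*i1000/, j /j1*j20/;
Parameter p(i,j);
p(i,j) = uniform(0,1);

embeddedCode Python:
import gams.transfer as gt
m = gt.Container(gams._gmd)
long = m['p'].records                  # long layout: 20,000 rows (i, j, value)
wide = long.pivot(index='i', columns='j', values='value')
long.to_csv('p_long.csv', index=False) # 20,000 data rows
wide.to_csv('p_wide.csv')              # 1,000 data rows, 20 value columns
endEmbeddedCode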

4 comments:

  1. The GAMS/Connect facility was introduced with GAMS 39 about a year ago (May 2022). The software is still in beta mode and under heavy development. We had the idea to unify the zoo of utilities under one roof with a single interface (the YAML file). Since introducing it to the public, we have learned a few things:

    - Relying on pandas' IO tools (e.g. to_sql) is not a good idea. While pandas works well for many large-scale data analytics tasks, its IO tools lack the functionality and performance we are used to from the current GAMS data tools. (A tiny sketch of this generic pandas route appears at the end of this list.) We have started to move away from pandas IO tools for the database agents and will do the same for the Excel agents. Some of the performance gains of the new database agents (which will become available with GAMS 43 in May 2023) are reported below.

    - Even though YAML is widely accepted by computer geeks as human-friendly (we copied that introductory sentence from the YAML web site, https://yaml.org/), we also learned that regular users struggle with the format. In addition to the YAML syntax, we will introduce an interface to the data utilities (with a Connect backend) that uses a "command line argument" style similar to the existing tools. This will also make migration from the existing tools easier.

    - The heart and soul of Connect is a GAMS Transfer container (also still in beta). The Connect agents perform transactions against this container. We (mostly Adam Christensen) have been working on GAMS Transfer for Python since GAMS 37.1 (November 2021). Besides a natural integration of GAMS data in Python, the Transfer module focuses on performance: it utilizes the categorical data type and numpy's C interface for the actual transfer of data to and from GAMS and GDX. The next iteration of Transfer (for GAMS 43) will get another performance boost.

    - Connect is a framework with some generality (there is the big idea of an ETL facility), and with that in mind there will be some overhead compared to a highly specialized tool. Nevertheless, the loss in performance due to this generality needs to be smallish, not what you experience today when comparing Connect with some of the existing data utilities.

    - GAMS embedded code allows the exchange of data with Python in memory. We use our gmd library to organize this exchange. The data structure gmd uses to store GAMS data is similar to a C++ ordered map (std::map), often implemented as a red-black tree (https://en.wikipedia.org/wiki/Red%E2%80%93black_tree). While this data structure is great for many things (including random lookups and access in sorted order), it is way too expensive for streaming data, and streaming is what we do between GAMS and Python. This is why creating a GDX file and reading it back, with all the disk overhead involved, can still outperform the in-memory communication (i.e. Connect's GAMSReader/Writer). We have some experimental implementations that use alternative data structures in gmd and hope to get this ready for production before we remove the beta flag from Connect and Transfer.
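
    For concreteness, here is a tiny, hedged sketch of that generic pandas route (all names invented; to_sql is shown only to illustrate the row-wise path that is being replaced):

    Set i /i1*i1000/;
    Parameter p(i);
    p(i) = uniform(0,1);

    embeddedCode Python:
    import sqlite3
    import gams.transfer as gt
    m = gt.Container(gams._gmd)
    df = m['p'].records             # long records: columns i, value
    con = sqlite3.connect('p.db')
    # convenient one-liner, but it issues row-wise INSERTs under the hood
    df.to_sql('p', con, if_exists='replace', index=False)
    con.close()
    endEmbeddedCode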

  2. The following experiment, using the data from your previous post (http://yetanothermathprogrammingconsultant.blogspot.com/2023/02/populating-sqlite-database.html), shows the difference between gdx (disk) and gmd (in-memory):

    Set i /index1*index50/;
    alias(i,i1,i2,i3,i4);
    Parameter p(i1,i2,i3,i4);
    p(i1,i2,i3,i4) = uniform(0,1);


    * loading pandas the first time takes some time, so let's do that
    * before we time the operations we are interested in
    embeddedCode Python:
    import pandas
    endEmbeddedCode


    scalar tstart, tend;

    * variant 1: write a GDX file to disk, then read it back with gams.transfer
    tstart = jnow;
    execute_unload 'p', p;
    embeddedCode Python:
    import gams.transfer as gt
    m = gt.Container('p.gdx')
    endEmbeddedCode
    tend = (jnow - tstart)*24*3600;
    display 'gdx', tend;

    * variant 2: hand the data over in memory via the gmd object
    tstart = jnow;
    embeddedCode Python:
    import gams.transfer as gt
    m = gt.Container(gams._gmd)
    endEmbeddedCode
    tend = (jnow - tstart)*24*3600;
    display 'gmd', tend;


    $exit
    ---- 20 gdx
    PARAMETER tend = 2.692
    ---- 28 gmd
    PARAMETER tend = 4.019


  3. Triggered by your previous post on SQLite (http://yetanothermathprogrammingconsultant.blogspot.com/2023/02/populating-sqlite-database.html), we improved Connect's SQL reader and writer agents (available with GAMS 43). Here is some timing information for a few popular databases, including server-based systems. We used a single table with 6.25 million records, as in your post on SQLite.

    SQLite (Reading): Connect: 17.6s (old), 7.5s (new)
    SQLite (Writing): Connect: 34.4s (old), 15.0s (new), 6.2s (gdx2sqlite)
    MS Access (Reading): Connect: 39.2s (old), 28.2s (new)
    MS Access (Writing): Connect: 2640s (old), 48.4s (new), 37.6s (gdx2access)
    Postgres (Reading): localhost: 35.4s (old), 23.0s (new); remote: 62.4s (old), 57.7s (new)
    Postgres (Writing): localhost: 118.0s (old), 25.8s (new); remote: 325.0s (old), 118.0s (new)
    MySQL (Reading): localhost: 41.5s (old), 35.2s (new); remote: 51.7s (old), 46.3s (new)
    MySQL (Writing): localhost: 83.0s (old), 123.0s (new); remote: 639.0s (old), 85.4s (new)
    SQL Server (Reading): localhost: 83.0s (old), 81.0s (new); remote: 101.5s (old), 96.3s (new)
    SQL Server (Writing): localhost: 198.0s (old), 94.0s (new); remote: 556.0s (old), 364.0s (new)

    To get this write performance, we use database-specific bulk-import methods (e.g. copy_expert for Postgres, bcp for SQL Server, or a make-table query for MS Access). Some of these methods require explicitly writing a CSV file first, which already takes a good portion of the total time.
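
    As an illustration, here is a hedged sketch of what the Postgres bulk path might look like with psycopg2's copy_expert (connection details are invented, a parameter p is assumed to sit in GAMS memory, and the target table p must already exist with matching columns):

    embeddedCode Python:
    import io
    import psycopg2
    import gams.transfer as gt

    m = gt.Container(gams._gmd)
    df = m['p'].records

    # stream the records as CSV into COPY ... FROM STDIN:
    # one bulk operation instead of one INSERT per row
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)

    con = psycopg2.connect(host='localhost', dbname='test',
                           user='me', password='secret')
    with con, con.cursor() as cur:
        cur.copy_expert('COPY p FROM STDIN WITH (FORMAT csv)', buf)
    con.close()
    endEmbeddedCode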

    Replies
    1. Sorry, I mixed up the times for MySQL writing on localhost. The new agent performs better: 83s, while the old one took 123s.
