Wednesday, April 5, 2023

In-process, in-memory databases

There are a few database systems that are a bit different. They are libraries that can be linked directly to your application. Linking can be done statically (during the compilation/linking step) or dynamically (using a shared library or DLL). Here I want to show two cases:

  • SQLite [1] used from R on data frames
  • DuckDB [2] used from Python, again on data frames
So these databases don't only run inside R or Python but also can operate directly on data frames.

R and sqldf



I am loading a large CSV file here. The task at hand is to find a record with the largest value. I do this in two different ways:

  1. With standard R, using the which.max function,
  2. Using sqldf [3]  and an SQL query. 
The SQL query is a bit non-standard. Most databases will require a GROUP BY clause in a query with an aggregate function. SQLite does not and will return values for which the maximum is assumed. A more standard approach would be to use a subquery:

SELECT *
FROM df
WHERE Value = (SELECT MAX(Value) FROM df) 


Python and DuckDB 


DuckDB [1] is a relatively new column-oriented database. Just like SQLite, it can be used as an in-process, in-memory database.


Again, first, I use a "native" method, followed by an SQL approach. The SQL query uses sorting to find the largest value.

Running SQL queries against data frames is an alternative way to select data. In some cases, that can be more intuitive, especially if you are familiar with SQL.

Notes:
  • I assume we can't add SQL indices when working with data frames. That would mean the database engine will make full table scans. If indices are important, we can, of course, populate an in-memory database.
  • dfsql [3] supports databases other than SQLite.
  • I think this is pretty cool!

References


  1. DuckDB is an in-process SQL OLAP database management system, https://duckdb.org/
  2. SQLite, https://www.sqlite.org/
  3. sqldf, https://github.com/ggrothendieck/sqldf

No comments:

Post a Comment