Tuesday, April 4, 2023

Compression and large tables (Excel and CSV)




Here is an interesting little experiment: load a large CSV file into Excel. My original Powerpoint slide from a recent presentation was not as complete as it should be. Here is more size info:



I argue that a column store database can better compress data. Let's see if I can put some meat on the bone.
  • psd_alldata_csv.zip. The downloaded zipped CSV file from [1]. This file is 9.8 MB. 
  • psd_alldata.csv. The unzipped CSV file. The size of this file is 190 MB. It has 1.97 million records (rows).
  • Partial load to sheet.xlsx. When we try to load this into an Excel sheet, only 1 million rows are loaded. This incomplete xlsx file (zipped XML) is 50 MB. My guess is that Excel does not zip at a high compression level when saving spreadsheets to save time. 
  • psd_alldata.xlsx. When we load the CSV file into the Excel datamodel (PowerQuery data), the database engine will highly compress things. When this is saved to an xlsx file, it is just 5.4 MB. 
So indeed, the in-memory database in Excel is actually compressing things much better compared to just using zipped XML.


References


  1. US Department of Agriculture, Foreign Agricultural Service, FAS Home / Market and Trade Data / PSD Online / Reports and Data / PSD Data Sets, https://apps.fas.usda.gov/psdonline/app/index.html#/app/downloads

4 comments:

  1. Are you sure the whole data set is loaded into that psd_alldata.xlsx file? What happens if you delete the psd_alldata.csv file from your hard drive and reopen psd_alldata.xlsx in excel. Are all 1.97 million records present? - Steve

    ReplyDelete
    Replies
    1. Yes I am sure. The VertiPaq engine is very good in compression.

      Delete
    2. You can get some extra compression saving the datamodel file in Excel binary format *.xlsb.

      Delete
  2. Hi, Can you help in Run length optimization that optimize production processes by analyzing the length of production runs and the associated inventory levels. By calculating the optimal production run length and inventory level, RLA helps businesses to balance the costs of inventory holding with the benefits of reduced setup and production costs.

    ReplyDelete