Database Paper Browser

An Empirical Evaluation of Columnar Storage Formats

Summary: Revisits Parquet and ORC with a stress-test benchmark on modern hardware and workloads, pinpointing internal choices that favor today's analytics: default dictionary encoding, integer encodings optimized for decode speed, optional block compression, and finer-grained auxiliary structures. Shows format inefficiencies for common ML workflows and GPU decoding, and derives concrete guidelines for next-generation columnar formats. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 13390
Venue: VLDB
Year: 2024
Pagerank: 6.1204636e-05
Overall Rank: 4,514 | 68.60%
DOI: 10.14778/3626292.3626298

Incoming Citations (Sorted by Pagerank)

Showing 20 of 20 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
3,416 | LeCo: Lightweight Compression via Learning Serial Correlations | 2024 | SIGMOD | 7.1196234e-05
5,562 | A Deep Dive into Common Open Formats for Analytical DBMSs | 2023 | VLDB | 5.4331334e-05
7,469 | Bullion: A Column Store for Machine Learning | 2025 | CIDR | 4.7204398e-05
7,876 | Two Birds With One Stone: Designing a Hybrid Cloud Storage Engine for HTAP | 2024 | VLDB | 4.6298182e-05
9,128 | Apache TsFile: An IoT-native Time Series File Format | 2024 | VLDB | 4.3909921e-05
9,201 | F3: The Open-Source Data File Format for the Future | 2026 | SIGMOD | 4.3743539e-05
9,645 | The FastLanes File Format | 2025 | VLDB | 4.3109001e-05
9,701 | Towards Functional Decomposition of Storage Formats | 2025 | CIDR | 4.3008468e-05
10,105 | RABIT: Efficient Range Queries with Bitmap Indexing | 2026 | SIGMOD | 4.1945683e-05
10,175 | Improving LZ4 for Effective Compression and Efficient Query | 2026 | SIGMOD | 4.1945683e-05
10,193 | Predictive Translation: High-Performance Buffer Management Without the Trade-Offs | 2026 | SIGMOD | 4.1945683e-05
10,196 | PTO: A Workload-driven Predictive Table Optimizer for Lakehouse Systems | 2026 | SIGMOD | 4.1945683e-05
10,220 | FlatStor: An Efficient Embedded-Index Based Columnar Data Layout for Multimodal Data Workloads | 2026 | VLDB | 4.1945683e-05
10,241 | Robust Predicate Transfer with Dynamic Execution | 2026 | VLDB | 4.1945683e-05
10,372 | Data Chunk Compaction in Vectorized Execution | 2025 | SIGMOD | 4.1945683e-05
10,484 | Femur: A Flexible Framework for Fast and Secure Querying from Public Key-Value Store | 2025 | SIGMOD | 4.1945683e-05
10,494 | Nested Parquet Is Flat, Why Not Use It? How To Scan Nested Data With On-the-Fly Key Generation and Joins | 2025 | SIGMOD | 4.1945683e-05
10,767 | The HANA Native Query Engine for Lakehouse Systems | 2025 | VLDB | 4.1945683e-05
10,854 | LiquidCache: Efficient Pushdown Caching for Cloud-Native Data Analytics | 2025 | VLDB | 4.1945683e-05
10,856 | Analyzing Near-Network Hardware Acceleration with Co-Processing on DPUs | 2025 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 37 of 37 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
21 | C-Store: A Column-oriented DBMS | 2005 | VLDB | 0.00086087497
70 | Hive - A Warehousing Solution Over a Map-Reduce Framework | 2009 | VLDB | 0.00059533166
80 | Weaving Relations for Cache Performance | 2001 | VLDB | 0.00055721729
109 | Dremel: Interactive Analysis of Web-Scale Datasets | 2010 | VLDB | 0.00048186983
123 | A Decomposition Storage Model | 1985 | SIGMOD | 0.00045255007
131 | Integrating Compression and Execution in Column-Oriented Database Systems | 2006 | SIGMOD | 0.0004370331
167 | The Snowflake Elastic Data Warehouse | 2016 | SIGMOD | 0.00039180521
210 | Gorilla: A Fast, Scalable, In-Memory Time Series Database | 2015 | VLDB | 0.0003404384
426 | Amazon Redshift and the Case for Simpler Data Warehouses | 2015 | SIGMOD | 0.00023594359
495 | Milvus: A Purpose-Built Vector Data Management System | 2021 | SIGMOD | 0.00021767688
746 | Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores | 2020 | VLDB | 0.00017326979
1,169 | SuRF: Practical Range Query Filtering with Fast Succinct Tries | 2018 | SIGMOD | 0.00013536447
1,270 | BitWeaving: Fast Scans for Main Memory Data Processing | 2013 | SIGMOD | 0.00012926086
1,377 | Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics | 2021 | CIDR | 0.00012296941
1,943 | Procella: Unifying serving and analytical data at YouTube | 2019 | VLDB | 0.00010012569
1,989 | Column Imprints: A Secondary Index Structure | 2013 | SIGMOD | 9.8478437e-05
2,040 | A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics | 2020 | SIGMOD | 9.7057698e-05
2,062 | Dremel: A Decade of Interactive SQL Analysis at Web Scale | 2020 | VLDB | 9.6481955e-05
2,064 | Chimp: Efficient Lossless Floating Point Compression for Time Series Databases | 2022 | VLDB | 9.6418929e-05
2,127 | SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures | 2014 | VLDB | 9.4863172e-05
2,985 | DSB: A Decision Support Benchmark for Workload-Driven and Traditional Database Systems | 2021 | VLDB | 7.7795847e-05
2,998 | Major Technical Advancements in Apache Hive | 2014 | SIGMOD | 7.753765e-05
3,416 | LeCo: Lightweight Compression via Learning Serial Correlations | 2024 | SIGMOD | 7.1196234e-05
3,608 | Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation | 2018 | SIGMOD | 6.924272e-05
3,611 | SNARF: A Learning-Enhanced Range Filter | 2022 | VLDB | 6.9191399e-05
3,644 | BtrBlocks: Efficient Columnar Compression for Data Lakes | 2023 | SIGMOD | 6.8854928e-05
4,518 | The FastLanes Compression Layout: Decoding >100 Billion Integers per Second with Scalar Code | 2023 | VLDB | 6.117844e-05
4,670 | Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google | 2021 | VLDB | 6.0104466e-05
5,019 | Orchestrating Data Placement and Query Execution in Heterogeneous CPU-GPU DBMS | 2022 | VLDB | 5.7559197e-05
5,040 | Tile-based Lightweight Integer Compression in GPU | 2022 | SIGMOD | 5.7425187e-05
5,835 | Order-Preserving Key Compression for In-Memory Search Trees | 2020 | SIGMOD | 5.30905e-05
6,279 | Self-Organizing Data Containers | 2022 | CIDR | 5.1295282e-05
6,367 | Good to the Last Bit: Data-Driven Encoding with CodecDB | 2021 | SIGMOD | 5.0941072e-05
6,715 | Shared Foundations: Modernizing Meta's Data Lakehouse | 2023 | CIDR | 4.9509939e-05
7,112 | Wide Table Layout Optimization based on Column Ordering and Duplication | 2017 | SIGMOD | 4.8275068e-05
7,427 | Selection Pushdown in Column Stores using Bit Manipulation Instructions | 2023 | SIGMOD | 4.7327406e-05
8,731 | Columnar Formats for Schemaless LSM-based Document Stores | 2022 | VLDB | 4.4577278e-05

Semantically Similar Papers