A Deep Dive into Common Open Formats for Analytical DBMSs
Summary: Systematic evaluation of Arrow, Parquet, and ORC against OLAP DBMS requirements, showing how layout, vectorization, compression, metadata, and mmap trade-offs affect query efficiency and integration. Identifies co‑design opportunities for unified in‑memory/on‑disk representation and practical guidance for implementers.
(summarized by gpt-5-mini on Feb 09 2026)
- Paper ID: 13144
- Venue: VLDB
- Year: 2023
- Pagerank: 5.4331334e-05
- Overall Rank: 5,562 | 61.31%
- DOI: 10.14778/3611479.3611507
Incoming Citations (Sorted by Pagerank)
Showing 11 of 11 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 4,495 | ClickHouse - Lightning Fast Analytics for Everyone | 2024 | VLDB | 6.1410277e-05 |
| 7,876 | Two Birds With One Stone: Designing a Hybrid Cloud Storage Engine for HTAP | 2024 | VLDB | 4.6298182e-05 |
| 9,128 | Apache TsFile: An IoT-native Time Series File Format | 2024 | VLDB | 4.3909921e-05 |
| 9,201 | F3: The Open-Source Data File Format for the Future | 2026 | SIGMOD | 4.3743539e-05 |
| 9,645 | The FastLanes File Format | 2025 | VLDB | 4.3109001e-05 |
| 9,701 | Towards Functional Decomposition of Storage Formats | 2025 | CIDR | 4.3008468e-05 |
| 10,220 | FlatStor: An Efficient Embedded-Index Based Columnar Data Layout for Multimodal Data Workloads | 2026 | VLDB | 4.1945683e-05 |
| 10,494 | Nested Parquet Is Flat, Why Not Use It? How To Scan Nested Data With On-the-Fly Key Generation and Joins | 2025 | SIGMOD | 4.1945683e-05 |
| 10,741 | Beyond Compression: A Comprehensive Evaluation of Lossless Floating-Point Compression | 2025 | VLDB | 4.1945683e-05 |
| 10,854 | LiquidCache: Efficient Pushdown Caching for Cloud-Native Data Analytics | 2025 | VLDB | 4.1945683e-05 |
| 10,856 | Analyzing Near-Network Hardware Acceleration with Co-Processing on DPUs | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 26 of 26 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 109 | Dremel: Interactive Analysis of Web-Scale Datasets | 2010 | VLDB | 0.00048186983 |
| 131 | Integrating Compression and Execution in Column-Oriented Database Systems | 2006 | SIGMOD | 0.0004370331 |
| 426 | Amazon Redshift and the Case for Simpler Data Warehouses | 2015 | SIGMOD | 0.00023594359 |
| 497 | Column-Stores vs. Row-Stores: How Different Are They Really? | 2008 | SIGMOD | 0.00021716559 |
| 659 | The Making of TPC-DS | 2006 | VLDB | 0.00018500853 |
| 1,270 | BitWeaving: Fast Scans for Main Memory Data Processing | 2013 | SIGMOD | 0.00012926086 |
| 1,377 | Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics | 2021 | CIDR | 0.00012296941 |
| 1,611 | Qd-tree: Learning Data Layouts for Big Data Analytics | 2020 | SIGMOD | 0.00011147324 |
| 2,127 | SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures | 2014 | VLDB | 9.4863172e-05 |
| 2,258 | SQL Server Column Store Indexes | 2011 | SIGMOD | 9.1678883e-05 |
| 2,528 | Velox: Meta’s Unified Execution Engine | 2022 | VLDB | 8.59454e-05 |
| 2,613 | Decomposed Bounded Floats for Fast Compression and Queries | 2021 | VLDB | 8.4503824e-05 |
| 3,038 | Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics | 2017 | SIGMOD | 7.6717218e-05 |
| 3,608 | Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation | 2018 | SIGMOD | 6.924272e-05 |
| 4,514 | An Empirical Evaluation of Columnar Storage Formats | 2024 | VLDB | 6.1204636e-05 |
| 4,667 | FlexPushdownDB: Hybrid Pushdown and Caching in a Cloud DBMS | 2021 | VLDB | 6.0116919e-05 |
| 5,123 | Accelerating Generalized Linear Models with MLWeaving: A One-Size-Fits-All System for Any-Precision Learning | 2019 | VLDB | 5.6796998e-05 |
| 5,318 | Analyzing and Comparing Lakehouse Storage Systems | 2023 | CIDR | 5.5715872e-05 |
| 5,898 | Column Partition and Permutation for Run Length Encoding in Columnar Databases | 2020 | SIGMOD | 5.2839046e-05 |
| 6,279 | Self-Organizing Data Containers | 2022 | CIDR | 5.1295282e-05 |
| 6,311 | VergeDB: A Database for IoT Analytics on Edge Devices | 2021 | CIDR | 5.1161316e-05 |
| 6,367 | Good to the Last Bit: Data-Driven Encoding with CodecDB | 2021 | SIGMOD | 5.0941072e-05 |
| 6,666 | Mainlining Databases: Supporting Fast Transactional Workloads on Universal Columnar Data File Formats | 2021 | VLDB | 4.9691571e-05 |
| 7,128 | Jigsaw: A Data Storage and Query Processing Engine for Irregular Table Partitioning | 2021 | SIGMOD | 4.8230171e-05 |
| 7,429 | CompressDB: Enabling Efficient Compressed Data Direct Processing for Various Databases | 2022 | SIGMOD | 4.7320139e-05 |
| 8,088 | PIDS: Attribute Decomposition for Improved Compression and Query Performance in Columnar Storage | 2020 | VLDB | 4.5897316e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 658 | Towards a Unified Architecture for in-RDBMS Analytics | 2012 | SIGMOD | 0.00018506577 |
| 9,701 | Towards Functional Decomposition of Storage Formats | 2025 | CIDR | 4.3008468e-05 |
| 2,998 | Major Technical Advancements in Apache Hive | 2014 | SIGMOD | 7.753765e-05 |
| 6,340 | Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine | 2024 | SIGMOD | 5.1051018e-05 |
| 1,053 | A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database | 2009 | SIGMOD | 0.00014429683 |
| 10,248 | Active Data Lakes: Regaining Physical Data Independence Without Losing Interoperability | 2026 | VLDB | 4.1945683e-05 |
| 7,866 | Operational Analytics Data Management Systems | 2016 | VLDB | 4.6321795e-05 |
| 6,666 | Mainlining Databases: Supporting Fast Transactional Workloads on Universal Columnar Data File Formats | 2021 | VLDB | 4.9691571e-05 |
| 3,753 | Choosing A Cloud DBMS: Architectures and Tradeoffs | 2019 | VLDB | 6.7871241e-05 |
| 4,514 | An Empirical Evaluation of Columnar Storage Formats | 2024 | VLDB | 6.1204636e-05 |