Partition, Don’t Sort! Compression Boosters for Cloud Data Ingestion Pipelines

Summary: Instead of running an expensive global sort, cluster nested, Dremel-encoded records with similar structure at ingestion time to create compressible partitions. A decision-tree–inspired clustering is up to 17.44× faster than partition-then-sort and yields up to 2× compression, while sorting within each bucket matches the compression of increasing-cardinality sorting at lower ingestion cost. (summarized by gpt-5-mini on Feb 09 2026)
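
To make the idea in the summary concrete, here is a minimal sketch of "partition, then sort only within each bucket". It is not the paper's decision-tree–inspired clustering and it does no Dremel encoding. As a simplifying assumption, records are grouped by an exact structural signature (the set of field paths they contain). The names `structure_signature`, `partition_by_structure`, and `sort_within_buckets` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def structure_signature(record, prefix=""):
    """Return the sorted tuple of field paths present in a nested record.

    Assumption: exact-match signatures stand in for the paper's clustering.
    """
    paths = []
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects and collect their leaf paths.
            paths.extend(structure_signature(value, path))
        elif isinstance(value, list):
            # Mark repeated fields without inspecting their elements.
            paths.append(path + "[]")
        else:
            paths.append(path)
    return tuple(sorted(paths))


def partition_by_structure(records):
    """Group records with identical structure in one pass, with no global sort."""
    buckets = defaultdict(list)
    for record in records:
        buckets[structure_signature(record)].append(record)
    return dict(buckets)


def sort_within_buckets(buckets, key):
    """Sort each bucket independently; buckets are much smaller than the input."""
    return {sig: sorted(rows, key=key) for sig, rows in buckets.items()}


# Hypothetical sample data: two structural shapes interleaved on arrival.
records = [
    {"id": 3, "user": {"name": "c"}},
    {"id": 1, "tags": ["x"]},
    {"id": 2, "user": {"name": "a"}},
    {"id": 4, "tags": []},
]

buckets = partition_by_structure(records)
sorted_buckets = sort_within_buckets(buckets, key=lambda r: r["id"])
```

Each bucket now holds records with the same field layout, so a columnar writer would see uniform structure per partition. Any sorting cost is paid per bucket instead of over the whole input.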

Paper ID: 13557
Venue: VLDB
Year: 2024
Pagerank: 4.1905499e-05
Overall Rank: 11,070 | 23.07%
DOI: 10.14778/3681954.3682013

Incoming Non-self Citations Over Time

No non-self incoming citations found for this paper in this database.

Authors

1. Patrick Hansert
2. Sebastian Michel

Incoming Citations (Sorted by Pagerank)

Showing 0 of 0 citing papers.

Outgoing Citations (Sorted by Pagerank)

Showing 23 of 23 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
34	Similarity Search in High Dimensions via Hashing	1999	VLDB	0.00076824554
109	Dremel: Interactive Analysis of Web-Scale Datasets	2010	VLDB	0.00048217028
132	Integrating Compression and Execution in Column-Oriented Database Systems	2006	SIGMOD	0.00043697853
290	Linear Clustering of Objects with Multiple Attributes	1990	SIGMOD	0.00028845557
407	Database Cracking	2007	CIDR	0.00023941779
659	The Making of TPC-DS	2006	VLDB	0.00018514913
739	Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores	2020	VLDB	0.00017365933
1,109	Sybase IQ Multiplex – Designed For Analytics	2004	VLDB	0.00013927106
2,060	Dremel: A Decade of Interactive SQL Analysis at Web Scale	2020	VLDB	9.6585115e-05
2,681	NET-FLi: On-the-fly Compression, Archiving and Indexing of Streaming Network Traffic	2010	VLDB	8.319573e-05
3,731	Skipping-oriented Partitioning for Columnar Layouts	2017	VLDB	6.8074069e-05
3,777	Instance-Optimized Data Layouts for Cloud Analytics Workloads	2021	SIGMOD	6.7713324e-05
4,702	JSON Tiles: Fast Analytics on Semi-Structured Data	2021	SIGMOD	5.9796907e-05
5,879	Column Partition and Permutation for Run Length Encoding in Columnar Databases	2020	SIGMOD	5.2875479e-05
6,340	Rearranging Data to Maximize the Efficiency of Compression	1986	PODS	5.0985531e-05
6,461	Pando: Enhanced Data Skipping with Logical Data Partitioning	2023	VLDB	5.0479786e-05
6,675	Exploiting Common Patterns for Tree-Structured Data	2017	SIGMOD	4.9615691e-05
6,788	Proteus: Autonomous Adaptive Storage for Mixed Workloads	2022	SIGMOD	4.9207259e-05
6,799	Understanding Insights into the Basic Structure and Essential Issues of Table Placement Methods in Clusters	2013	VLDB	4.9181295e-05
7,110	Wide Table Layout Optimization based on Column Ordering and Duplication	2017	SIGMOD	4.8228761e-05
7,127	Jigsaw: A Data Storage and Query Processing Engine for Irregular Table Partitioning	2021	SIGMOD	4.8184276e-05
7,576	Reducing Ambiguity in Json Schema Discovery	2021	SIGMOD	4.7030704e-05
8,223	Automated Multidimensional Data Layouts in Amazon Redshift	2024	SIGMOD	4.5509217e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
10,384	Data Chunk Compaction in Vectorized Execution	2025	SIGMOD	4.1905499e-05
6,237	Self-Organizing Data Containers	2022	CIDR	5.1371094e-05
5,677	Joins on Encoded and Partitioned Data	2014	VLDB	5.376511e-05
9,595	High-Ratio Compression for Machine-Generated Data	2023	SIGMOD	4.3153078e-05
3,066	Learning a Partitioning Advisor for Cloud Databases	2020	SIGMOD	7.6255556e-05
8,575	Robust and Budget-Constrained Encoding Configurations for In-Memory Database Systems	2022	VLDB	4.4880409e-05
7,431	CompressDB: Enabling Efficient Compressed Data Direct Processing for Various Databases	2022	SIGMOD	4.7274757e-05
3,642	BtrBlocks: Efficient Columnar Compression for Data Lakes	2023	SIGMOD	6.8876984e-05
5,879	Column Partition and Permutation for Run Length Encoding in Columnar Databases	2020	SIGMOD	5.2875479e-05
9,700	Towards Functional Decomposition of Storage Formats	2025	CIDR	4.2967256e-05