Partition, Don’t Sort! Compression Boosters for Cloud Data Ingestion Pipelines
Summary: Rather than expensive global sorting, cluster similarly structured nested Dremel-encoded records at ingestion to create compressible partitions. A decision-tree–inspired clustering is up to 17.44× faster than partition-then-sort and yields up to 2× compression, while per-bucket sorting matches increasing-cardinality compression at lower ingestion cost.
(summarized by gpt-5-mini on Feb 09 2026)
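The idea in the summary, grouping records that share a nested structure so each partition compresses well on its own, can be illustrated with a small sketch. This is not the paper's decision-tree-inspired algorithm and it does not use Dremel encoding; it is a simplified stand-in that buckets records by the exact set of field paths they contain. The names `structure_signature`, `partition_by_structure`, and `compressed_size` are hypothetical helpers introduced only for this illustration, and JSON plus zlib are assumed here purely as a convenient way to measure size.

```python
import json
import zlib
from collections import defaultdict


def structure_signature(value, prefix=""):
    """Return the sorted tuple of distinct field paths present in a nested record."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            paths.update(structure_signature(child, child_prefix))
    elif isinstance(value, list):
        # Repeated field: mark it with "[]" and merge the paths of all elements.
        if not value:
            paths.add(prefix + "[]")
        for child in value:
            paths.update(structure_signature(child, prefix + "[]"))
    else:
        # Scalar leaf: the path itself identifies the field.
        paths.add(prefix)
    return tuple(sorted(paths))


def partition_by_structure(records):
    """Bucket records by structure signature, keeping arrival order (no sorting)."""
    partitions = defaultdict(list)
    for record in records:
        partitions[structure_signature(record)].append(record)
    return dict(partitions)


def compressed_size(records):
    """Size in bytes of the zlib-compressed, newline-delimited JSON for the records."""
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return len(zlib.compress(payload.encode("utf-8")))


if __name__ == "__main__":
    # Illustrative records only; not data from the paper.
    sample = [
        {"a": 1, "b": {"c": 2}},
        {"x": [1, 2]},
        {"a": 5, "b": {"c": 7}},
    ]
    for signature, bucket in partition_by_structure(sample).items():
        print(signature, len(bucket), compressed_size(bucket))
```

Each record is visited once and no global order is ever computed, which is the intuition behind partitioning being cheaper than sorting at ingestion time. Whether the resulting buckets actually compress better depends on the data and the codec, so the sketch reports sizes rather than asserting an improvement.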
- Paper ID: 13556
- Venue: VLDB
- Year: 2024
- Pagerank: 4.1945683e-05
- Overall Rank: 11,067 | 23.01%
- DOI: 10.14778/3681954.3682013
Incoming Non-self Citations Over Time
No non-self incoming citations found for this paper in this database.
Incoming Citations (Sorted by Pagerank)
Showing 0 of 0 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
Outgoing Citations (Sorted by Pagerank)
Showing 23 of 23 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 34 | Similarity Search in High Dimensions via Hashing | 1999 | VLDB | 0.00076637636 |
| 109 | Dremel: Interactive Analysis of Web-Scale Datasets | 2010 | VLDB | 0.00048186983 |
| 131 | Integrating Compression and Execution in Column-Oriented Database Systems | 2006 | SIGMOD | 0.0004370331 |
| 290 | Linear Clustering of Objects with Multiple Attributes | 1990 | SIGMOD | 0.00028919734 |
| 408 | Database Cracking | 2007 | CIDR | 0.00023953844 |
| 659 | The Making of TPC-DS | 2006 | VLDB | 0.00018500853 |
| 746 | Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores | 2020 | VLDB | 0.00017326979 |
| 1,111 | Sybase IQ Multiplex – Designed For Analytics | 2004 | VLDB | 0.00013936696 |
| 2,062 | Dremel: A Decade of Interactive SQL Analysis at Web Scale | 2020 | VLDB | 9.6481955e-05 |
| 2,681 | NET-FLi: On-the-fly Compression, Archiving and Indexing of Streaming Network Traffic | 2010 | VLDB | 8.3232427e-05 |
| 3,737 | Skipping-oriented Partitioning for Columnar Layouts | 2017 | VLDB | 6.8033227e-05 |
| 3,779 | Instance-Optimized Data Layouts for Cloud Analytics Workloads | 2021 | SIGMOD | 6.7747205e-05 |
| 4,704 | JSON Tiles: Fast Analytics on Semi-Structured Data | 2021 | SIGMOD | 5.9853687e-05 |
| 5,898 | Column Partition and Permutation for Run Length Encoding in Columnar Databases | 2020 | SIGMOD | 5.2839046e-05 |
| 6,343 | Rearranging Data to Maximize the Efficiency of Compression | 1986 | PODS | 5.1026755e-05 |
| 6,466 | Pando: Enhanced Data Skipping with Logical Data Partitioning | 2023 | VLDB | 5.0528281e-05 |
| 6,674 | Exploiting Common Patterns for Tree-Structured Data | 2017 | SIGMOD | 4.9663344e-05 |
| 6,802 | Understanding Insights into the Basic Structure and Essential Issues of Table Placement Methods in Clusters | 2013 | VLDB | 4.9226626e-05 |
| 6,803 | Proteus: Autonomous Adaptive Storage for Mixed Workloads | 2022 | SIGMOD | 4.9224958e-05 |
| 7,112 | Wide Table Layout Optimization based on Column Ordering and Duplication | 2017 | SIGMOD | 4.8275068e-05 |
| 7,128 | Jigsaw: A Data Storage and Query Processing Engine for Irregular Table Partitioning | 2021 | SIGMOD | 4.8230171e-05 |
| 7,571 | Reducing Ambiguity in Json Schema Discovery | 2021 | SIGMOD | 4.7075853e-05 |
| 8,225 | Automated Multidimensional Data Layouts in Amazon Redshift | 2024 | SIGMOD | 4.555289e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 10,372 | Data Chunk Compaction in Vectorized Execution | 2025 | SIGMOD | 4.1945683e-05 |
| 6,279 | Self-Organizing Data Containers | 2022 | CIDR | 5.1295282e-05 |
| 5,670 | Joins on Encoded and Partitioned Data | 2014 | VLDB | 5.3804618e-05 |
| 9,595 | High-Ratio Compression for Machine-Generated Data | 2023 | SIGMOD | 4.3194469e-05 |
| 3,076 | Learning a Partitioning Advisor for Cloud Databases | 2020 | SIGMOD | 7.6107677e-05 |
| 8,578 | Robust and Budget-Constrained Encoding Configurations for In-Memory Database Systems | 2022 | VLDB | 4.4923477e-05 |
| 7,429 | CompressDB: Enabling Efficient Compressed Data Direct Processing for Various Databases | 2022 | SIGMOD | 4.7320139e-05 |
| 3,644 | BtrBlocks: Efficient Columnar Compression for Data Lakes | 2023 | SIGMOD | 6.8854928e-05 |
| 5,898 | Column Partition and Permutation for Run Length Encoding in Columnar Databases | 2020 | SIGMOD | 5.2839046e-05 |
| 9,701 | Towards Functional Decomposition of Storage Formats | 2025 | CIDR | 4.3008468e-05 |