Database Paper Browser

Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff

Summary: The paper takes a principled approach to dataset versioning, analyzing the trade-off between storage cost and recreation cost across six problem settings. It demonstrates intractability for many of these settings and offers heuristics drawn from delay-constrained scheduling and spanning-tree methods, plus a DATAHUB prototype. (summarized by gpt-5-nano on Feb 09 2026)
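
To make the storage/recreation trade-off concrete, the sketch below compares two storage plans for three versions of a dataset. It is only an illustration, not the paper's algorithm: the cost numbers, the dummy `ROOT` node (an edge from `ROOT` means "store this version in full"), and the helper names `total_storage` and `recreation_cost` are all assumptions made for this example.

```python
from typing import Dict, Tuple

ROOT = 0  # hypothetical dummy node: an edge from ROOT means "store the version in full"

# (parent, child) -> (storage_cost, recreation_cost); all numbers are made up.
EDGES: Dict[Tuple[int, int], Tuple[int, int]] = {
    (ROOT, 1): (100, 100),
    (ROOT, 2): (105, 105),
    (ROOT, 3): (110, 110),
    (1, 2): (10, 12),  # delta from version 1 to version 2
    (2, 3): (8, 9),    # delta from version 2 to version 3
}


def total_storage(plan: Dict[int, int]) -> int:
    """Sum the storage cost of the one edge chosen for each version."""
    return sum(EDGES[(parent, child)][0] for child, parent in plan.items())


def recreation_cost(plan: Dict[int, int], version: int) -> int:
    """Sum recreation costs along the chain from `version` back to ROOT."""
    cost = 0
    while version != ROOT:
        parent = plan[version]
        cost += EDGES[(parent, version)][1]
        version = parent
    return cost


# A plan maps each version to the parent it is stored against.
materialize_all = {1: ROOT, 2: ROOT, 3: ROOT}  # every version stored in full
delta_chain = {1: ROOT, 2: 1, 3: 2}            # only version 1 stored in full

for name, plan in [("materialize_all", materialize_all), ("delta_chain", delta_chain)]:
    worst = max(recreation_cost(plan, v) for v in plan)
    print(name, "storage:", total_storage(plan), "max recreation:", worst)
```

Under these made-up costs, the delta chain uses far less storage but makes the newest version more expensive to recreate, which is the tension the summary describes.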

Paper ID: 11013
Venue: VLDB
Year: 2015
Pagerank: 0.00011345567
Overall Rank: 1,565 | 89.12%
DOI: -

Incoming Citations (Sorted by Pagerank)

Showing 26 of 26 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674
939 | Data Lake Management: Challenges and Opportunities | 2019 | VLDB | 0.00015187344
1,463 | ARDA: Automatic Relational Data Augmentation for Machine Learning | 2020 | VLDB | 0.00011869295
2,037 | OrpheusDB: Bolt-on Versioning for Relational Databases | 2017 | VLDB | 9.7120139e-05
2,152 | MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis | 2018 | SIGMOD | 9.4239787e-05
2,430 | Decibel: The Relational Dataset Branching System | 2016 | VLDB | 8.8330417e-05
2,972 | ForkBase: An Efficient Storage Engine for Blockchain and Forkable Applications | 2018 | VLDB | 7.79259e-05
3,347 | Collaborative Data Analytics with DataHub | 2015 | VLDB | 7.1921364e-05
4,047 | Orca: Scalable Temporal Graph Neural Network Training with Theoretical Guarantees | 2023 | SIGMOD | 6.4972105e-05
5,271 | ORPHEUSDB: A Lightweight Approach to Relational Dataset Versioning | 2017 | SIGMOD | 5.5941385e-05
5,280 | Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V | 2023 | VLDB | 5.5896735e-05
5,411 | Beyond Relations: A Case for Elevating to the Entity-Relationship Abstraction | 2025 | CIDR | 5.5207515e-05
5,684 | Dagger: A Data (not code) Debugger | 2020 | CIDR | 5.3720749e-05
6,053 | Optimizing Machine Learning Workloads in Collaborative Environments | 2020 | SIGMOD | 5.2326838e-05
6,295 | Your notebook is not crumby enough, REPLace it | 2020 | CIDR | 5.1249204e-05
6,409 | Fine-Grained Lineage for Safer Notebook Interactions | 2021 | VLDB | 5.0756653e-05
6,469 | Materialization and Reuse Optimizations for Production Data Science Pipelines | 2022 | SIGMOD | 5.0519488e-05
6,981 | Dataset Relationship Management | 2019 | CIDR | 4.8743957e-05
7,254 | DEX: Query Execution in a Delta-based Storage System | 2017 | SIGMOD | 4.7885915e-05
7,756 | LETUS: A Log-Structured Efficient Trusted Universal BlockChain Storage | 2024 | SIGMOD | 4.6598957e-05
8,729 | OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance From Database Query Event Logs | 2023 | VLDB | 4.4582221e-05
8,910 | R2D2: Reducing Redundancy and Duplication in Data Lakes | 2023 | SIGMOD | 4.427232e-05
9,378 | CHEX: Multiversion Replay with Ordered Checkpoints | 2022 | VLDB | 4.3463396e-05
9,754 | Pensieve: Skewness-Aware Version Switching for Efficient Graph Processing | 2020 | SIGMOD | 4.2897489e-05
10,469 | Alsatian: Optimizing Model Search for Deep Transfer Learning | 2025 | SIGMOD | 4.1945683e-05
13,280 | Effective Data Versioning for Collaborative Data Analytics | 2020 | SIGMOD | -

Outgoing Citations (Sorted by Pagerank)

Showing 4 of 4 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
293 | A Taxonomy of Time in Databases | 1985 | SIGMOD | 0.00028676087
676 | Archiving Scientific Data | 2002 | SIGMOD | 0.00018281665
1,281 | DataHub: Collaborative Data Science & Dataset Version Management at Scale | 2015 | CIDR | 0.00012854744
4,558 | Managing Structured Collections of Community Data | 2011 | CIDR | 6.0869516e-05

Semantically Similar Papers