Database Paper Browser


DataHub: Collaborative Data Science & Dataset Version Management at Scale

Summary: Proposes dataset version control with branch/merge/diff/search for large, divergent datasets, bringing Git-like semantics to data management. Presents DataHub, a collaborative-analysis platform built on this version control system, and outlines scalability, provenance, storage, and merge challenges. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 260
Venue: CIDR
Year: 2015
Pagerank: 0.00012854744
Overall Rank: 1,281 | 91.10%
DOI: -

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 31 of 31 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674
939 | Data Lake Management: Challenges and Opportunities | 2019 | VLDB | 0.00015187344
1,463 | ARDA: Automatic Relational Data Augmentation for Machine Learning | 2020 | VLDB | 0.00011869295
1,565 | Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff | 2015 | VLDB | 0.00011345567
2,037 | OrpheusDB: Bolt-on Versioning for Relational Databases | 2017 | VLDB | 9.7120139e-05
2,269 | Ground: A Data Context Service | 2017 | CIDR | 9.147379e-05
2,430 | Decibel: The Relational Dataset Branching System | 2016 | VLDB | 8.8330417e-05
2,965 | SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment | 2016 | SIGMOD | 7.8059273e-05
2,972 | ForkBase: An Efficient Storage Engine for Blockchain and Forkable Applications | 2018 | VLDB | 7.79259e-05
3,347 | Collaborative Data Analytics with DataHub | 2015 | VLDB | 7.1921364e-05
3,942 | Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins | 2022 | VLDB | 6.6114622e-05
4,003 | Data Platform for Machine Learning | 2019 | SIGMOD | 6.54347e-05
4,047 | Orca: Scalable Temporal Graph Neural Network Training with Theoretical Guarantees | 2023 | SIGMOD | 6.4972105e-05
4,774 | LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems | 2021 | SIGMOD | 5.9316087e-05
4,863 | Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia | 2023 | SIGMOD | 5.8697471e-05
5,271 | ORPHEUSDB: A Lightweight Approach to Relational Dataset Versioning | 2017 | SIGMOD | 5.5941385e-05
5,280 | Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V | 2023 | VLDB | 5.5896735e-05
6,053 | Optimizing Machine Learning Workloads in Collaborative Environments | 2020 | SIGMOD | 5.2326838e-05
6,891 | Analysis of Indexing Structures for Immutable Data | 2020 | SIGMOD | 4.8927093e-05
7,254 | DEX: Query Execution in a Delta-based Storage System | 2017 | SIGMOD | 4.7885915e-05
7,311 | The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development | 2020 | SIGMOD | 4.7656884e-05
7,833 | Dependency-Driven Analytics: a Compass for Uncharted Data Oceans | 2017 | CIDR | 4.6382648e-05
8,729 | OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance From Database Query Event Logs | 2023 | VLDB | 4.4582221e-05
8,849 | SourceSight: Enabling Effective Source Selection | 2016 | SIGMOD | 4.4369118e-05
9,076 | DataDiff: User-Interpretable Data Transformation Summaries for Collaborative Data Analysis | 2018 | SIGMOD | 4.401804e-05
9,316 | READY: Completeness is in the Eye of the Beholder | 2017 | CIDR | 4.3559005e-05
11,020 | Accelerating Merkle Patricia Trie with GPU | 2024 | VLDB | 4.1945683e-05
11,149 | Git is for Data | 2023 | CIDR | 4.1945683e-05
11,216 | Demystifying the QoS and QoE of Edge-hosted Video Streaming Applications in the Wild with SNESet | 2023 | SIGMOD | 4.1945683e-05
11,518 | A Demonstration of RELIC: A System for REtrospective Lineage InferenCe of Data Workflows | 2021 | VLDB | 4.1945683e-05
11,667 | Peering through the Dark: An Owl's View of Inter-job Dependencies and Jobs' Impact in Shared Clusters | 2019 | SIGMOD | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 5 of 5 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
293 | A Taxonomy of Time in Databases | 1985 | SIGMOD | 0.00028676087
676 | Archiving Scientific Data | 2002 | SIGMOD | 0.00018281665
711 | A Case for A Collaborative Query Management System | 2009 | CIDR | 0.00017751589
1,767 | ORCHESTRA: Rapid, Collaborative Sharing of Dynamic Data | 2005 | CIDR | 0.00010623574
2,173 | Querying Data Provenance | 2010 | SIGMOD | 9.3676609e-05

Semantically Similar Papers