DataHub: Collaborative Data Science & Dataset Version Management at Scale
Summary: Proposes dataset version control with branch/merge/diff/search for large, divergent datasets—bringing git-like semantics to data management. Presents DataHub, a collaborative-analysis platform built on this VCS, and outlines scalability, provenance, storage, and merge challenges. (summarized by gpt-5-mini on Feb 09 2026)
Authors
1. Anant Bhardwaj
2. Souvik Bhattacherjee
3. Amit Chavan
4. Amol Deshpande
5. Aaron J. Elmore
6. Samuel Madden
7. Aditya Parameswaran
Incoming Citations (Sorted by Pagerank)
Showing 31 of 31 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 5 of 5 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 293 | A Taxonomy of Time in Databases | 1985 | SIGMOD | 0.00028676087 |
| 676 | Archiving Scientific Data | 2002 | SIGMOD | 0.00018281665 |
| 711 | A Case for A Collaborative Query Management System | 2009 | CIDR | 0.00017751589 |
| 1,767 | ORCHESTRA: Rapid, Collaborative Sharing of Dynamic Data | 2005 | CIDR | 0.00010623574 |
| 2,173 | Querying Data Provenance | 2010 | SIGMOD | 9.3676609e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 5,280 | Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V | 2023 | VLDB | 5.5896735e-05 |
| 4,003 | Data Platform for Machine Learning | 2019 | SIGMOD | 6.54347e-05 |
| 11,319 | Building a Shared Conceptual Model of Complex, Heterogeneous Data Systems: A Demonstration | 2022 | CIDR | 4.1945683e-05 |
| 6,981 | Dataset Relationship Management | 2019 | CIDR | 4.8743957e-05 |
| 9,076 | DataDiff: User-Interpretable Data Transformation Summaries for Collaborative Data Analysis | 2018 | SIGMOD | 4.401804e-05 |
| 2,430 | Decibel: The Relational Dataset Branching System | 2016 | VLDB | 8.8330417e-05 |
| 13,280 | Effective Data Versioning for Collaborative Data Analytics | 2020 | SIGMOD | - |
| 11,149 | Git is for Data | 2023 | CIDR | 4.1945683e-05 |
| 1,565 | Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff | 2015 | VLDB | 0.00011345567 |
| 3,347 | Collaborative Data Analytics with DataHub | 2015 | VLDB | 7.1921364e-05 |