Git is for Data
Summary: Argues Git's UX/ecosystem is ideal for ML dataset management but vanilla Git fails at scale; introduces XetHub, an extension that preserves Git semantics while enabling TB+ repositories. Demonstrates scalable, low‑friction reproducibility and integration with DevOps pipelines. (summarized by gpt-5-mini on Feb 09 2026)
Incoming Non-self Citations Over Time
No non-self incoming citations found for this paper in this database.
Authors
1. Yucheng Low
2. Rajat Arya
3. Ajit Banerjee
4. Ann Huang
5. Brian Ronan
6. Hoyt Koepke
7. Joseph Godlewski
8. Zach Nation
Incoming Citations (Sorted by Pagerank)
Showing 0 of 0 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
Outgoing Citations (Sorted by Pagerank)
Showing 4 of 4 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,281 | DataHub: Collaborative Data Science & Dataset Version Management at Scale | 2015 | CIDR | 1.2854744e-04 |
| 2,443 | Data Management for Data Science: Towards Embedded Analytics | 2020 | CIDR | 8.8078476e-05 |
| 3,875 | Cloudy with High Chance of DBMS: A 10-year Prediction for Enterprise-Grade ML | 2020 | CIDR | 6.675257e-05 |
| 4,003 | Data Platform for Machine Learning | 2019 | SIGMOD | 6.54347e-05 |