Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science
Summary: The paper addresses fine-grained, element-level provenance for ML preprocessing pipelines. It formalizes a core set of preprocessing operators and their provenance patterns, introduces an application-level Python library that captures this provenance, and evaluates the library's scalability and overhead on real pipelines. The captured provenance is intended to support debugging queries. (summarized by gpt-5-nano on Feb 09 2026)
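To make "element-level provenance" concrete, the following is a minimal sketch of the general idea and not the paper's library or its API. It assumes a toy table of dictionaries and tracks, for every output cell, the set of input cells it was derived from. The names `TrackedTable`, `impute_mean`, and `why` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedTable:
    """A list of row dicts plus, per cell, the input cells it derives from.

    Hypothetical structure for illustration only.
    """
    rows: list
    # (row_index, column) -> set of (row_index, column) in the original input
    prov: dict = field(default_factory=dict)

    @classmethod
    def from_rows(cls, rows):
        # Every raw cell initially derives only from itself.
        prov = {(i, c): {(i, c)} for i, r in enumerate(rows) for c in r}
        return cls([dict(r) for r in rows], prov)


def impute_mean(table, column):
    """Fill missing values in `column` with the column mean.

    Each imputed cell is recorded as derived from all cells that
    contributed to the mean.
    """
    known = [(i, r[column]) for i, r in enumerate(table.rows)
             if r[column] is not None]
    mean = sum(v for _, v in known) / len(known)

    # Union of the provenance of every contributing cell.
    sources = set()
    for i, _ in known:
        sources |= table.prov[(i, column)]

    rows = [dict(r) for r in table.rows]
    prov = {k: set(v) for k, v in table.prov.items()}
    for i, r in enumerate(rows):
        if r[column] is None:
            r[column] = mean
            prov[(i, column)] = set(sources)
    return TrackedTable(rows, prov)


def why(table, row, column):
    """Debugging query: which input cells produced this output cell?"""
    return sorted(table.prov[(row, column)])


raw = TrackedTable.from_rows([
    {"age": 30, "city": "Rome"},
    {"age": None, "city": "Leeds"},
    {"age": 50, "city": "Oslo"},
])
clean = impute_mean(raw, "age")
print(why(clean, 1, "age"))
```

In this sketch, asking `why(clean, 1, "age")` returns the two non-missing `age` cells that fed the imputed value, which is the kind of question a debugging query over element-level provenance would answer. A real system would also need to handle operators that add, drop, or reorder rows and columns.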
Incoming Non-self Citations Over Time
Authors
1. Adriane Chapman
2. Paolo Missier
3. Giulia Simonelli
4. Riccardo Torlone
Incoming Citations (Sorted by Pagerank)
Showing 4 of 4 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 9,118 | Towards Observability for Production Machine Learning Pipelines | 2022 | VLDB | 4.3928288e-05 |
| 9,231 | Modyn: Data-Centric Machine Learning Pipeline Orchestration | 2025 | SIGMOD | 4.3690661e-05 |
| 10,419 | Unified Lineage System: Tracking Data Provenance at Scale | 2025 | SIGMOD | 4.1945683e-05 |
| 11,396 | DPDS: Assisting Data Science with Data Provenance | 2022 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 13 of 13 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,394 | Hypothetical Reasoning via Provenance Abstraction | 2019 | SIGMOD | 4.527807e-05 |
| 11,665 | Ursprung: Provenance for Large-Scale Analytics Environments | 2019 | SIGMOD | 4.1945683e-05 |
| 1,765 | Efficient Lineage Tracking For Scientific Workflows | 2008 | SIGMOD | 0.00010630348 |
| 2,173 | Querying Data Provenance | 2010 | SIGMOD | 9.3676609e-05 |
| 11,471 | On Optimizing the Trade-off between Privacy and Utility in Data Provenance | 2021 | SIGMOD | 4.1945683e-05 |
| 5,843 | Tracing Lineage Beyond Relational Operators | 2007 | VLDB | 5.3032967e-05 |
| 8,729 | OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance From Database Query Event Logs | 2023 | VLDB | 4.4582221e-05 |
| 2,456 | Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities | 2021 | SIGMOD | 8.7733773e-05 |
| 11,396 | DPDS: Assisting Data Science with Data Provenance | 2022 | VLDB | 4.1945683e-05 |
| 5,086 | Improving Reproducibility of Data Science Pipelines through Transparent Provenance Capture | 2020 | VLDB | 5.7078462e-05 |