Debugging Large-Scale Data Science Pipelines using Dagger
Summary: Dagger is an end-to-end debugger for data pipelines that isolates data-centric errors anywhere from data transformations to model inputs. It supports inter-module debugging (treating pipeline blocks as black boxes) and intra-module debugging (inspecting DataFrames and other Python objects), and is demonstrated on Intel BI workloads. (summarized by gpt-5-nano on Feb 09 2026)
Authors
1. El Kindi Rezig
2. Ashrita Brahmaroutu
3. Nesime Tatbul
4. Mourad Ouzzani
5. Nan Tang
6. Timothy Mattson
7. Samuel Madden
8. Michael Stonebraker
Incoming Citations (Sorted by Pagerank)
Showing 3 of 3 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 6,944 | DataPrism: Exposing Disconnect between Data and Systems | 2022 | SIGMOD | 4.8912787e-05 |
| 11,409 | Machine Programming: Turning Data into Programmer Productivity | 2022 | VLDB | 4.1945683e-05 |
| 13,232 | Data Cleaning in the Era of Data Science: Challenges and Opportunities | 2021 | CIDR | - |
Outgoing Citations (Sorted by Pagerank)
Showing 8 of 8 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,277 | The Data Civilizer System | 2017 | CIDR | 0.00012879695 |
| 1,350 | Northstar: An Interactive Data Science System | 2018 | VLDB | 0.00012431059 |
| 1,612 | Detecting Data Errors: Where are we and what needs to be done? | 2016 | VLDB | 0.00011142794 |
| 2,152 | MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis | 2018 | SIGMOD | 9.4239787e-05 |
| 3,023 | Helix: Accelerating Human-in-the-loop Machine Learning | 2018 | VLDB | 7.6929986e-05 |
| 4,426 | Data Debugging and Exploration with Vizier | 2019 | SIGMOD | 6.1969994e-05 |
| 5,684 | Dagger: A Data (not code) Debugger | 2020 | CIDR | 5.3720749e-05 |
| 8,000 | Data Civilizer 2.0: A Holistic Framework for Data Preparation and Analytics | 2019 | VLDB | 4.6092803e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 9,118 | Towards Observability for Production Machine Learning Pipelines | 2022 | VLDB | 4.3928288e-05 |
| 10,816 | mlidea: Interactively Improving ML Data Preparation Code via "Shadow Pipelines" | 2025 | VLDB | 4.1945683e-05 |
| 6,817 | Error Diagnosis and Data Profiling with Data X-Ray | 2015 | VLDB | 4.9171711e-05 |
| 13,143 | Bridging Disciplines in Data Management Research to Solve Complex Data Problems | 2025 | VLDB | - |
| 4,734 | MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines | 2021 | SIGMOD | 5.9615384e-05 |
| 8,341 | BugDoc: Algorithms to Debug Computational Processes | 2020 | SIGMOD | 4.5433282e-05 |
| 11,147 | Reconstructing and Querying ML Pipeline Intermediates | 2023 | CIDR | 4.1945683e-05 |
| 4,426 | Data Debugging and Exploration with Vizier | 2019 | SIGMOD | 6.1969994e-05 |
| 9,220 | BugDoc: A System for Debugging Computational Pipelines | 2020 | SIGMOD | 4.3702188e-05 |
| 5,684 | Dagger: A Data (not code) Debugger | 2020 | CIDR | 5.3720749e-05 |