Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines
Summary: mlwhatif declaratively specifies data-centric what-if analyses over ML pipelines and auto-generates variants via patches. A 4-rule optimizer executes variants; instrumented dataflow plans enable linear speedups (up to 13x) and data-size independence.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 6631
- Venue: SIGMOD
- Year: 2023
- Pagerank: 4.5487511e-05
- Overall Rank: 8,257 (42.56%)
- DOI: 10.1145/3589273
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 6 of 6 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 23 of 23 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 179 | Efficient and Extensible Algorithms for Multi Query Optimization | 2000 | SIGMOD | 0.00037672155 |
| 185 | DuckDB: an Embeddable Analytical Database | 2019 | SIGMOD | 0.00036538405 |
| 517 | Can Foundation Models Wrangle Your Data? | 2023 | VLDB | 0.00021169035 |
| 791 | ActiveClean: Interactive Data Cleaning For Statistical Modeling | 2016 | VLDB | 0.00016629664 |
| 1,298 | Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms | 2019 | VLDB | 0.00012758104 |
| 1,337 | HoloDetect: Few-Shot Learning for Error Detection | 2019 | SIGMOD | 0.00012497164 |
| 1,404 | Responsible Data Management | 2020 | VLDB | 0.00012174977 |
| 1,427 | Towards Scalable Dataframe Systems | 2020 | VLDB | 0.0001204248 |
| 1,646 | Caravan: Provisioning for What-If Analysis | 2013 | CIDR | 0.00011036992 |
| 1,666 | HELIX: Holistic Optimization for Accelerating Iterative Machine Learning | 2019 | VLDB | 0.0001096361 |
| 1,867 | Interpretable Data-Based Explanations for Fairness Debugging | 2022 | SIGMOD | 0.00010272055 |
| 2,122 | SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle | 2020 | CIDR | 9.4989076e-05 |
| 2,284 | Cost-Based Optimization of Decision Support Queries using Transient-Views | 1998 | SIGMOD | 9.1053836e-05 |
| 2,456 | Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities | 2021 | SIGMOD | 8.7733773e-05 |
| 2,896 | Evaluating End-to-End Optimization for Data Analytics Applications in Weld | 2018 | VLDB | 7.9452051e-05 |
| 3,407 | End-to-end Optimization of Machine Learning Prediction Queries | 2022 | SIGMOD | 7.1295646e-05 |
| 4,664 | Efficient Answering of Historical What-if Queries | 2022 | SIGMOD | 6.0127053e-05 |
| 4,734 | MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines | 2021 | SIGMOD | 5.9615384e-05 |
| 5,607 | HYPER: Hypothetical Reasoning With What-If and How-To Queries Using a Probabilistic Causal Approach | 2022 | SIGMOD | 5.4137872e-05 |
| 6,469 | Materialization and Reuse Optimizations for Production Data Science Pipelines | 2022 | SIGMOD | 5.0519488e-05 |
| 8,514 | UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads | 2022 | VLDB | 4.4944285e-05 |
| 8,853 | Complaint-Driven Training Data Debugging at Interactive Speeds | 2022 | SIGMOD | 4.4350727e-05 |
| 11,310 | Screening Native ML Pipelines with “ArgusEyes” | 2022 | CIDR | 4.1945683e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 5,567 | Optimizing Data Pipelines for Machine Learning in Feature Stores | 2023 | VLDB | 5.4305348e-05 |
| 10,816 | mlidea: Interactively Improving ML Data Preparation Code via "Shadow Pipelines" | 2025 | VLDB | 4.1945683e-05 |
| 4,734 | MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines | 2021 | SIGMOD | 5.9615384e-05 |
| 11,313 | Towards Observability for Machine Learning Pipelines | 2022 | CIDR | 4.1945683e-05 |
| 11,147 | Reconstructing and Querying ML Pipeline Intermediates | 2023 | CIDR | 4.1945683e-05 |
| 2,456 | Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities | 2021 | SIGMOD | 8.7733773e-05 |
| 9,118 | Towards Observability for Production Machine Learning Pipelines | 2022 | VLDB | 4.3928288e-05 |
| 6,469 | Materialization and Reuse Optimizations for Production Data Science Pipelines | 2022 | SIGMOD | 5.0519488e-05 |
| 6,291 | Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines | 2021 | CIDR | 5.1269764e-05 |
| 8,114 | mlwhatif: What If You Could Stop Re-Implementing Your Machine Learning Pipeline Analyses Over and Over? | 2023 | VLDB | 4.5823351e-05 |