Democratizing Data Science through Interactive Curation of ML Pipelines
Summary: Interactive AutoML for scientists via curated ML pipelines. Uses query optimization techniques, cost-based bandits, and Bayesian optimization to achieve interactive latency and outperform expert solutions on unseen data across 300+ datasets.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 5675
- Venue: SIGMOD
- Year: 2019
- Pagerank: 0.00015337438
- Overall Rank: 921 | 93.60%
- DOI: 10.1145/3299869.3319863
Incoming Citations (Sorted by Pagerank)
Showing 27 of 27 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,463 | ARDA: Automatic Relational Data Augmentation for Machine Learning | 2020 | VLDB | 0.00011869295 |
| 1,751 | Auctus: A Dataset Search Engine for Data Discovery and Augmentation | 2021 | VLDB | 0.00010683295 |
| 2,122 | SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle | 2020 | CIDR | 9.4989076e-05 |
| 2,321 | DBPal: A Fully Pluggable NL2SQL Training Pipeline | 2020 | SIGMOD | 9.03609e-05 |
| 3,934 | SimpleTS: An Efficient and Universal Model Selection Framework for Time Series Forecasting | 2023 | VLDB | 6.6175631e-05 |
| 4,456 | AutoOD: Automatic Outlier Detection | 2023 | SIGMOD | 6.1704203e-05 |
| 4,554 | A Demonstration of AutoOD: A Self-Tuning Anomaly Detection System | 2022 | VLDB | 6.0911296e-05 |
| 4,557 | Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches | 2021 | VLDB | 6.087611e-05 |
| 4,774 | LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems | 2021 | SIGMOD | 5.9316087e-05 |
| 4,957 | Doing More with Less: Characterizing Dataset Downsampling for AutoML | 2021 | VLDB | 5.8035715e-05 |
| 5,429 | DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data | 2023 | SIGMOD | 5.5087325e-05 |
| 6,053 | Optimizing Machine Learning Workloads in Collaborative Environments | 2020 | SIGMOD | 5.2326838e-05 |
| 7,311 | The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development | 2020 | SIGMOD | 4.7656884e-05 |
| 7,704 | ExDRa: Exploratory Data Science on Federated Raw Data | 2021 | SIGMOD | 4.6733838e-05 |
| 8,092 | Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications | 2023 | SIGMOD | 4.587921e-05 |
| 8,163 | Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science | 2021 | VLDB | 4.5723431e-05 |
| 8,177 | DORIAN in action: Assisted Design of Data Science Pipelines | 2022 | VLDB | 4.5673266e-05 |
| 8,743 | CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning | 2024 | SIGMOD | 4.456315e-05 |
| 8,828 | HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation | 2023 | SIGMOD | 4.4407488e-05 |
| 9,192 | Hyper-Tune: Towards Efficient Hyper-parameter Tuning at Scale | 2022 | VLDB | 4.3765131e-05 |
| 10,252 | CAPS: Cost-Aware ML Pipeline Selection | 2026 | VLDB | 4.1945683e-05 |
| 10,560 | A Systematic Study on Early Stopping Metrics in HPO and the Implications of Uncertainty | 2025 | VLDB | 4.1945683e-05 |
| 10,628 | CatDB: Data-catalog-guided, LLM-based Generation of Data-centric ML Pipelines | 2025 | VLDB | 4.1945683e-05 |
| 10,682 | AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework | 2025 | VLDB | 4.1945683e-05 |
| 11,216 | Demystifying the QoS and QoE of Edge-hosted Video Streaming Applications in the Wild with SNESet | 2023 | SIGMOD | 4.1945683e-05 |
| 11,476 | Enforcing Constraints for Machine Learning Systems via Declarative Feature Selection: An Experimental Study | 2021 | SIGMOD | 4.1945683e-05 |
| 11,549 | Active Reinforcement Learning for Data Preparation: Learn2Clean with Human-In-The-Loop | 2020 | CIDR | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 10 of 10 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,828 | HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation | 2023 | SIGMOD | 4.4407488e-05 |
| 13,184 | ML2DAC: Meta-learning to Democratize AutoML for Clustering Analyses | 2023 | SIGMOD | - |
| 13,098 | Demonstrating CatDB: LLM-based Generation of Data-centric ML Pipelines | 2025 | SIGMOD | - |
| 10,816 | mlidea: Interactively Improving ML Data Preparation Code via "Shadow Pipelines" | 2025 | VLDB | 4.1945683e-05 |
| 3,070 | Explore-by-Example: An Automatic Query Steering Framework for Interactive Data Exploration | 2014 | SIGMOD | 7.6137064e-05 |
| 8,177 | DORIAN in action: Assisted Design of Data Science Pipelines | 2022 | VLDB | 4.5673266e-05 |
| 5,304 | A Scalable AutoML Approach Based on Graph Neural Networks | 2022 | VLDB | 5.5779335e-05 |
| 4,758 | Optimization for Active Learning-based Interactive Database Exploration | 2019 | VLDB | 5.9422515e-05 |
| 7,311 | The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development | 2020 | SIGMOD | 4.7656884e-05 |
| 11,549 | Active Reinforcement Learning for Data Preparation: Learn2Clean with Human-In-The-Loop | 2020 | CIDR | 4.1945683e-05 |