Database Paper Browser

Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications

Summary: Saga automatically searches for the top-performing data-cleaning pipelines for ML applications, combining ideas from AutoML, feature selection, and hyper-parameter tuning. It generates hybrid local/distributed runtime plans, is extensible to new cleaning primitives, prunes the search space using monotonicity, and yields accuracy gains. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 6721
Venue: SIGMOD
Year: 2023
Pagerank: 4.587921e-05
Overall Rank: 8,092 | 43.71%
DOI: 10.1145/3617338

Incoming Citations (Sorted by Pagerank)

Showing 5 of 5 citing papers.

Outgoing Citations (Sorted by Pagerank)

Showing 43 of 43 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 112 | Potter's Wheel: An Interactive Data Cleaning System | 2001 | VLDB | 0.00047045036 |
| 192 | HoloClean: Holistic Data Repairs with Probabilistic Inference | 2017 | VLDB | 0.00035728858 |
| 489 | Data Curation at Scale: The Data Tamer System | 2013 | CIDR | 0.00022030728 |
| 656 | ERACER: A Database Approach for Statistical Inference and Data Cleaning | 2010 | SIGMOD | 0.00018588729 |
| 683 | Cerebro: A Data System for Optimized Deep Learning Model Selection | 2020 | VLDB | 0.00018195476 |
| 791 | ActiveClean: Interactive Data Cleaning For Statistical Modeling | 2016 | VLDB | 0.00016629664 |
| 833 | Guided Data Repair | 2011 | VLDB | 0.00016138432 |
| 881 | Don’t be SCAREd: Use SCalable Automatic REpairing with Maximal Likelihood and Bounded Changes | 2013 | SIGMOD | 0.00015661103 |
| 921 | Democratizing Data Science through Interactive Curation of ML Pipelines | 2019 | SIGMOD | 0.00015337438 |
| 1,012 | NADEEF: A Commodity Data Cleaning System | 2013 | SIGMOD | 0.0001464733 |
| 1,078 | Model Management 2.0: Manipulating Richer Mappings | 2007 | SIGMOD | 0.00014245848 |
| 1,277 | The Data Civilizer System | 2017 | CIDR | 0.00012879695 |
| 1,337 | HoloDetect: Few-Shot Learning for Error Detection | 2019 | SIGMOD | 0.00012497164 |
| 1,391 | Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads | 2018 | VLDB | 0.0001223506 |
| 1,402 | Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML | 2014 | VLDB | 0.00012180605 |
| 1,420 | Data Management Challenges in Production Machine Learning | 2017 | SIGMOD | 0.00012057956 |
| 1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533 |
| 1,527 | Generic Schema Matching, Ten Years Later | 2011 | VLDB | 0.00011499442 |
| 1,666 | HELIX: Holistic Optimization for Accelerating Iterative Machine Learning | 2019 | VLDB | 0.0001096361 |
| 1,894 | Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning | 2020 | VLDB | 0.0001018378 |
| 1,940 | SliceLine: Fast, Linear-Algebra-based Slice Finding for ML Model Debugging | 2021 | SIGMOD | 0.00010020173 |
| 2,122 | SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle | 2020 | CIDR | 9.4989076e-05 |
| 2,302 | Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions | 2021 | VLDB | 9.0668832e-05 |
| 2,573 | Query Optimization for Dynamic Imputation | 2017 | VLDB | 8.518235e-05 |
| 2,946 | BigDansing: A System for Big Data Cleansing | 2015 | SIGMOD | 7.8372441e-05 |
| 2,968 | Raha: A Configuration-Free Error Detection System | 2019 | SIGMOD | 7.7985097e-05 |
| 3,133 | Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing | 2017 | VLDB | 7.4978041e-05 |
| 3,491 | TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines | 2020 | SIGMOD | 7.0451276e-05 |
| 3,528 | Distributed Data Deduplication | 2016 | VLDB | 7.0066139e-05 |
| 4,110 | Learning to Validate the Predictions of Black Box Classifiers on Unseen Data | 2020 | SIGMOD | 6.4428544e-05 |
| 4,464 | Magellan: Toward Building Entity Matching Management Systems over Data Science Stacks | 2016 | VLDB | 6.1606042e-05 |
| 4,749 | Slice Tuner: A Selective Data Acquisition Framework for Accurate and Fair Machine Learning Models | 2021 | SIGMOD | 5.9503689e-05 |
| 4,769 | Automated Feature Engineering for Algorithmic Fairness | 2021 | VLDB | 5.934329e-05 |
| 4,774 | LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems | 2021 | SIGMOD | 5.9316087e-05 |
| 4,989 | BEER: Blocking for Effective Entity Resolution | 2021 | SIGMOD | 5.7827362e-05 |
| 5,050 | xPAD: A Platform for Analytic Data Flows | 2013 | SIGMOD | 5.7340229e-05 |
| 5,729 | KATARA: Reliable Data Cleaning with Knowledge Bases and Crowdsourcing | 2015 | VLDB | 5.3506368e-05 |
| 5,806 | BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees | 2019 | SIGMOD | 5.3200643e-05 |
| 6,102 | QoX-Driven ETL Design: Reducing the Cost of ETL Consulting Engagements | 2009 | SIGMOD | 5.2087887e-05 |
| 6,993 | Unit Testing Data with Deequ | 2019 | SIGMOD | 4.8693227e-05 |
| 7,450 | SystemER: A Human-in-the-loop System for Explainable Entity Resolution | 2019 | VLDB | 4.7265276e-05 |
| 9,001 | The Power of Nested Parallelism in Big Data Processing – Hitting Three Flies with One Slap – | 2021 | SIGMOD | 4.4107627e-05 |
| 9,927 | AlphaEvolve: A Learning Framework to Discover Novel Alphas in Quantitative Investment | 2021 | SIGMOD | 4.2532819e-05 |

Semantically Similar Papers