Database Paper Browser

ActiveClean: Interactive Data Cleaning For Statistical Modeling

Summary: ActiveClean enables iterative, interactive data cleaning during model training. It targets models with convex loss functions, preserves convergence, and prioritizes influential records for cleaning. It achieves up to 2.5x better accuracy for a given cleaning budget, outperforming uniform sampling. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 11381
Venue: VLDB
Year: 2016
Pagerank: 0.00016629664
Overall Rank: 791 | 94.50%
DOI: -

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 47 of 47 citing papers.

| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,420 | Data Management Challenges in Production Machine Learning | 2017 | SIGMOD | 0.00012057956 |
| 1,532 | Data Management in Machine Learning: Challenges, Techniques, and Systems | 2017 | SIGMOD | 0.00011472681 |
| 1,894 | Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning | 2020 | VLDB | 0.0001018378 |
| 2,302 | Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions | 2021 | VLDB | 9.0668832e-05 |
| 2,506 | Auto-Detect: Data-Driven Error Detection in Tables | 2018 | SIGMOD | 8.6335464e-05 |
| 2,753 | Complaint-driven Training Data Debugging for Query 2.0 | 2020 | SIGMOD | 8.1724339e-05 |
| 2,839 | VolcanoML: Speeding up End-to-End AutoML via Scalable Search Space Decomposition | 2021 | VLDB | 8.0378978e-05 |
| 2,968 | Raha: A Configuration-Free Error Detection System | 2019 | SIGMOD | 7.7985097e-05 |
| 3,396 | Automatic Data Repair: Are We Ready to Deploy? | 2024 | VLDB | 7.1455126e-05 |
| 3,473 | AI Meets Database: AI4DB and DB4AI | 2021 | SIGMOD | 7.062864e-05 |
| 3,773 | Cleaning Crowdsourced Labels Using Oracles for Statistical Classification | 2019 | VLDB | 6.7758649e-05 |
| 4,102 | GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data | 2023 | SIGMOD | 6.4522929e-05 |
| 4,273 | Cleaning Denial Constraint Violations through Relaxation | 2020 | SIGMOD | 6.3003864e-05 |
| 4,424 | PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models | 2020 | SIGMOD | 6.198474e-05 |
| 4,607 | Data Integration and Machine Learning: A Natural Synergy | 2018 | SIGMOD | 6.0538827e-05 |
| 4,935 | OmniFair: A Declarative System for Model-Agnostic Group Fairness in Machine Learning | 2021 | SIGMOD | 5.8198727e-05 |
| 5,222 | Enabling SQL-based Training Data Debugging for Federated Learning | 2022 | VLDB | 5.6210545e-05 |
| 5,429 | DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data | 2023 | SIGMOD | 5.5087325e-05 |
| 5,978 | Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond | 2021 | SIGMOD | 5.2453012e-05 |
| 6,263 | Equitable Data Valuation Meets the Right to Be Forgotten in Model Markets | 2023 | VLDB | 5.1349507e-05 |
| 7,796 | CHEF: A Cheap and Fast Pipeline for Iteratively Cleaning Label Uncertainties | 2021 | VLDB | 4.6482625e-05 |
| 7,867 | Learning Over Dirty Data Without Cleaning | 2020 | SIGMOD | 4.6320452e-05 |
| 8,092 | Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications | 2023 | SIGMOD | 4.587921e-05 |
| 8,182 | SHiFT: An Efficient, Flexible Search Engine for Transfer Learning | 2023 | VLDB | 4.5659133e-05 |
| 8,257 | Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines | 2023 | SIGMOD | 4.5487511e-05 |
| 8,590 | Exploratory Training: When Annotators Learn About Data | 2023 | SIGMOD | 4.4896282e-05 |
| 8,743 | CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning | 2024 | SIGMOD | 4.456315e-05 |
| 8,840 | The Cost of Representation by Subset Repairs | 2025 | VLDB | 4.4388652e-05 |
| 9,043 | Query-Guided Resolution in Uncertain Databases | 2023 | SIGMOD | 4.4039656e-05 |
| 9,054 | Selecting Data to Clean for Fact Checking: Minimizing Uncertainty vs. Maximizing Surprise | 2019 | VLDB | 4.4039656e-05 |
| 9,118 | Towards Observability for Production Machine Learning Pipelines | 2022 | VLDB | 4.3928288e-05 |
| 9,348 | GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models | 2024 | SIGMOD | 4.3526427e-05 |
| 9,389 | DataVinci: Learning Syntactic and Semantic String Repairs | 2025 | SIGMOD | 4.3441378e-05 |
| 10,026 | Minimum Change ≠ Best Cleaning: Parallel and Incremental Error Detection under Integrity Constraints | 2026 | SIGMOD | 4.1945683e-05 |
| 10,029 | Outliers: The Good, the Bad and the Ugly | 2026 | SIGMOD | 4.1945683e-05 |
| 10,478 | Data Enhancement for Binary Classification of Relational Data | 2025 | SIGMOD | 4.1945683e-05 |
| 10,528 | Two Birds with One Stone: Efficient Deep Learning over Mislabeled Data through Subset Selection | 2025 | SIGMOD | 4.1945683e-05 |
| 10,617 | Deduplicated Sampling On-Demand | 2025 | VLDB | 4.1945683e-05 |
| 10,628 | CatDB: Data-catalog-guided, LLM-based Generation of Data-centric ML Pipelines | 2025 | VLDB | 4.1945683e-05 |
| 10,644 | Still More Shades of Null: An Evaluation Suite for Responsible Missing Value Imputation | 2025 | VLDB | 4.1945683e-05 |
| 10,816 | mlidea: Interactively Improving ML Data Preparation Code via "Shadow Pipelines" | 2025 | VLDB | 4.1945683e-05 |
| 10,953 | Certain and Approximately Certain Models for Statistical Learning | 2024 | SIGMOD | 4.1945683e-05 |
| 11,052 | Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines | 2024 | VLDB | 4.1945683e-05 |
| 11,137 | Generalizable Data Cleaning of Tabular Data in Latent Space | 2024 | VLDB | 4.1945683e-05 |
| 11,178 | LinCQA: Faster Consistent Query Answering with Linear Time Guarantees | 2023 | SIGMOD | 4.1945683e-05 |
| 11,431 | Ease.ML: A Lifecycle Management System for MLDev and MLOps | 2021 | CIDR | 4.1945683e-05 |
| 11,682 | IHCS: An Integrated Hybrid Cleaning System | 2019 | VLDB | 4.1945683e-05 |

Outgoing Citations (Sorted by Pagerank)

Showing 9 of 9 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.


Semantically Similar Papers