ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning
Summary: ActiveClean is a progressive data-cleaning framework that interleaves cleaning with ML training, updating models as analysts clean small data batches. Key ideas include importance weighting, dirty-data detection, and a visual interface, enabling robust learning in high-dimensional pipelines, demonstrated on video classification and topic modeling. (summarized by gpt-5-nano on Feb 09 2026)
Authors
1. Sanjay Krishnan
2. Michael J. Franklin
3. Ken Goldberg
4. Jiannan Wang
5. Eugene Wu
Incoming Citations (Sorted by Pagerank)
Showing 6 of 6 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 791 | ActiveClean: Interactive Data Cleaning For Statistical Modeling | 2016 | VLDB | 0.00016629664 |
| 1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533 |
| 3,299 | SCODED: Statistical Constraint Oriented Data Error Detection | 2020 | SIGMOD | 7.2546659e-05 |
| 5,242 | Towards Benchmarking Feature Type Inference for AutoML Platforms | 2021 | SIGMOD | 5.6074743e-05 |
| 6,187 | Semi-Supervised Data Cleaning with Raha and Baran | 2021 | CIDR | 5.1656857e-05 |
| 6,553 | How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses | 2024 | VLDB | 5.0157344e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 6 of 6 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 643 | Corleone: Hands-Off Crowdsourcing for Entity Matching | 2014 | SIGMOD | 0.00018754451 |
| 833 | Guided Data Repair | 2011 | VLDB | 0.00016138432 |
| 881 | Don’t be SCAREd: Use SCalable Automatic REpairing with Maximal Likelihood and Bounded Changes | 2013 | SIGMOD | 0.00015661103 |
| 2,184 | A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data | 2014 | SIGMOD | 9.3429789e-05 |
| 8,593 | Wisteria: Nurturing Scalable Data Cleaning Infrastructure | 2015 | VLDB | 4.4891474e-05 |
| 8,728 | Stale View Cleaning: Getting Fresh Answers from Stale Materialized Views | 2015 | VLDB | 4.4589711e-05 |