Database Paper Browser

Data Cleaning: Overview and Emerging Challenges

Summary: Presents a taxonomy of data cleaning, focusing on constraint- and pattern-based detection and repair for data quality. Links qualitative cleaning to ML and statistics, addressing scalability for big data and its impact on analytics, with a statistical view on inference. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 5217
Venue: SIGMOD
Year: 2016
Pagerank: 0.00011086905
Overall Rank: 1,627 | 88.69%
DOI: 10.1145/2882903.2912574

Incoming Citations (Sorted by Pagerank)

Showing 32 of 32 citing papers.

Rank Citing Paper Year Venue Pagerank
1,482 Automating Large-Scale Data Quality Verification 2018 VLDB 0.00011725533
2,587 Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks 2024 SIGMOD 8.4924618e-05
3,299 SCODED: Statistical Constraint Oriented Data Error Detection 2020 SIGMOD 7.2546659e-05
3,396 Automatic Data Repair: Are We Ready to Deploy? 2024 VLDB 7.1455126e-05
3,491 TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines 2020 SIGMOD 7.0451276e-05
3,773 Cleaning Crowdsourced Labels Using Oracles for Statistical Classification 2019 VLDB 6.7758649e-05
4,424 PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models 2020 SIGMOD 6.198474e-05
4,607 Data Integration and Machine Learning: A Natural Synergy 2018 SIGMOD 6.0538827e-05
5,429 DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data 2023 SIGMOD 5.5087325e-05
6,134 Finding Label and Model Errors in Perception Data With Learned Observation Assertions 2022 SIGMOD 5.1943414e-05
6,295 Your notebook is not crumby enough, REPLace it 2020 CIDR 5.1249204e-05
6,449 Causal Data Integration 2023 VLDB 5.0587746e-05
6,553 How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses 2024 VLDB 5.0157344e-05
6,689 Efficient Knowledge Graph Accuracy Evaluation 2019 VLDB 4.9623586e-05
7,564 PIClean: A Probabilistic and Interactive Data Cleaning System 2019 SIGMOD 4.7093702e-05
7,605 The Computation of Optimal Subset Repairs 2020 VLDB 4.697534e-05
7,634 ReStore - Neural Data Completion for Relational Databases 2021 SIGMOD 4.6911382e-05
7,667 Fast Detection of Denial Constraint Violations 2022 VLDB 4.683767e-05
7,867 Learning Over Dirty Data Without Cleaning 2020 SIGMOD 4.6320452e-05
8,138 Fast and Reliable Missing Data Contingency Analysis with Predicate-Constraints 2020 SIGMOD 4.5771031e-05
8,743 CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning 2024 SIGMOD 4.456315e-05
9,056 A Data Quality Metric (DQM): How to Estimate the Number of Undetected Errors in Data Sets 2017 VLDB 4.4039656e-05
9,389 DataVinci: Learning Syntactic and Semantic String Repairs 2025 SIGMOD 4.3441378e-05
9,849 Reptile: Aggregation-level Explanations for Hierarchical Data 2022 SIGMOD 4.2721228e-05
9,856 In-Database Data Imputation 2024 SIGMOD 4.269353e-05
10,128 WaveStitch: Flexible and Fast Conditional Time Series Generation With Diffusion Models 2026 SIGMOD 4.1945683e-05
10,235 Repairing Property Graphs under PG-Constraints 2026 VLDB 4.1945683e-05
11,029 Efficient and Reliable Estimation of Knowledge Graph Accuracy 2024 VLDB 4.1945683e-05
11,178 LinCQA: Faster Consistent Query Answering with Linear Time Guarantees 2023 SIGMOD 4.1945683e-05
11,216 Demystifying the QoS and QoE of Edge-hosted Video Streaming Applications in the Wild with SNESet 2023 SIGMOD 4.1945683e-05
11,536 LOCATER: Cleaning WiFi Connectivity Datasets for Semantic Localization 2021 VLDB 4.1945683e-05
11,682 IHCS: An Integrated Hybrid Cleaning System 2019 VLDB 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 45 of 45 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank Cited Paper Year Venue Pagerank
112 Potter's Wheel: An Interactive Data Cleaning System 2001 VLDB 0.00047045036
119 Answering Queries using Humans, Algorithms and Databases 2011 CIDR 0.0004564788
214 Scorpion: Explaining Away Outliers in Aggregate Queries 2013 VLDB 0.0003363692
263 CrowdER: Crowdsourcing Entity Resolution 2012 VLDB 0.00029862413
265 A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification 2005 SIGMOD 0.00029763412
280 Eliminating Fuzzy Duplicates in Data Warehouses 2002 VLDB 0.00029113044
477 Model-Driven Data Acquisition in Sensor Networks 2004 VLDB 0.00022221803
489 Data Curation at Scale: The Data Tamer System 2013 CIDR 0.00022030728
507 Data Quality and Data Cleaning: An Overview 2003 SIGMOD 0.00021473263
555 Discovering Denial Constraints 2013 VLDB 0.00020254908
623 Improving Data Quality: Consistency and Accuracy 2007 VLDB 0.00018996374
643 Corleone: Hands-Off Crowdsourcing for Entity Matching 2014 SIGMOD 0.00018754451
656 ERACER: A Database Approach for Statistical Inference and Data Cleaning 2010 SIGMOD 0.00018588729
833 Guided Data Repair 2011 VLDB 0.00016138432
866 Leveraging Transitive Relations for Crowdsourced Joins 2013 SIGMOD 0.00015801196
881 Don’t be SCAREd: Use SCalable Automatic REpairing with Maximal Likelihood and Bounded Changes 2013 SIGMOD 0.00015661103
1,012 NADEEF: A Commodity Data Cleaning System 2013 SIGMOD 0.0001464733
1,159 Towards Certain Fixes with Editing Rules and Master Data 2010 VLDB 0.00013592813
1,164 CrowdScreen: Algorithms for Filtering Data with Humans 2012 SIGMOD 0.00013564823
1,188 On Generating Near-Optimal Tableaux for Conditional Functional Dependencies 2008 VLDB 0.00013441729
1,197 The LLUNATIC Data-Cleaning Framework 2013 VLDB 0.00013390321
1,242 Question Selection for Crowd Entity Resolution 2013 VLDB 0.00013096655
1,546 KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing 2015 SIGMOD 0.00011446851
1,594 Adaptive Cleaning for RFID Data Streams 2006 VLDB 0.00011222484
1,624 Sampling the Repairs of Functional Dependency Violations under Hard Constraints 2010 VLDB 0.00011099222
2,184 A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data 2014 SIGMOD 9.3429789e-05
2,231 Dedoop: Efficient Deduplication with Hadoop 2012 VLDB 9.2304499e-05
2,602 Tracing Data Errors with View-Conditioned Causality 2011 SIGMOD 8.4667197e-05
2,629 Online Outlier Detection in Sensor Data Using Non-Parametric Models 2006 VLDB 8.4160309e-05
2,722 Progressive Approach to Relational Entity Resolution 2014 VLDB 8.2338356e-05
2,797 Query-Oriented Data Cleaning with Oracles 2015 SIGMOD 8.1108589e-05
2,823 Interaction between Record Matching and Data Repairing 2011 SIGMOD 8.0593894e-05
2,946 BigDansing: A System for Big Data Cleansing 2015 SIGMOD 7.8372441e-05
3,067 CrowdFill: Collecting Structured Data from the Crowd 2014 SIGMOD 7.6180371e-05
3,118 Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning 2015 VLDB 7.5379338e-05
3,192 Towards Dependable Data Repairing with Fixing Rules 2014 SIGMOD 7.4095761e-05
3,360 Modeling and Querying Possible Repairs in Duplicate Detection 2009 VLDB 7.1742067e-05
3,920 Continuous Outlier Detection in Data Streams: An Extensible Framework and State-Of-The-Art Algorithms 2013 SIGMOD 6.6309693e-05
4,451 CLAMShell: Speeding up Crowds for Low-latency Data Labeling 2016 VLDB 6.1738675e-05
5,586 QuERy: A Framework for Integrating Entity Resolution with Query Processing 2016 VLDB 5.4219548e-05
5,660 Descriptive and Prescriptive Data Cleaning 2014 SIGMOD 5.3847321e-05
6,941 Estimating the Impact of Unknown Unknowns on Aggregate Query Results 2016 SIGMOD 4.8924e-05
8,148 When Speed Has a Price: Fast Information Extraction Using Approximate Algorithms 2013 VLDB 4.5754467e-05
8,593 Wisteria: Nurturing Scalable Data Cleaning Infrastructure 2015 VLDB 4.4891474e-05
8,728 Stale View Cleaning: Getting Fresh Answers from Stale Materialized Views 2015 VLDB 4.4589711e-05