Automatic Data Repair: Are We Ready to Deploy?

Summary: A driver-information taxonomy, plus an empirical evaluation of 12 repair methods on 12 datasets across error rates, error types, and 4 downstream tasks, using a new practical error-reduction metric. A unified repair-optimization strategy improves state-of-the-art methods; the study finds that repair consistently benefits downstream analyses and provides deployment guidelines. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 13486
Venue: VLDB
Year: 2024
Pagerank: 7.1386386e-05
Overall Rank: 3,397 | 76.40%
DOI: 10.14778/3675034.3675051

[Figure: Incoming non-self citations over time]

Incoming Citations (Sorted by Pagerank)

Showing 8 of 8 citing papers.

Rank	Citing Paper	Year	Venue	Pagerank
9,558	Clean4TSDB: A Data Cleaning Tool for Time Series Databases	2024	VLDB	4.3212967e-05
9,983	Towards Scalable Visual Data Wrangling via Direct Manipulation	2026	CIDR	4.1905499e-05
10,026	Minimum Change ≠ Best Cleaning: Parallel and Incremental Error Detection under Integrity Constraints	2026	SIGMOD	4.1905499e-05
10,318	Fault Lines: Benchmarking the Impact of Label Data Quality on ML Robustness and Fairness	2026	VLDB	4.1905499e-05
10,692	Federated Incomplete Tabular Data Prediction with Missing Complementarity	2025	VLDB	4.1905499e-05
10,730	UniClean: A Scalable Data Cleaning Solution for Mixed Errors based on Unified Cleaners and Optimized Cleaning Workflow	2025	VLDB	4.1905499e-05
10,816	DemandClean: A Multi-Objective Learning Framework for Balancing Model Tolerance to Data Authenticity and Diversity	2025	VLDB	4.1905499e-05
11,140	Generalizable Data Cleaning of Tabular Data in Latent Space	2024	VLDB	4.1905499e-05

Outgoing Citations (Sorted by Pagerank)

Showing 36 of 36 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
192	HoloClean: Holistic Data Repairs with Probabilistic Inference	2017	VLDB	0.00035692958
198	Declarative Data Cleaning: Language, Model, and Algorithms	2001	VLDB	0.0003505869
219	Deep Entity Matching with Pre-Trained Language Models	2021	VLDB	0.00033354456
293	Deep Learning for Entity Matching: A Design Space Exploration	2018	SIGMOD	0.00028661817
668	Incremental Knowledge Base Construction Using DeepDive	2015	VLDB	0.00018428925
700	Reasoning about Record Matching Rules	2009	VLDB	0.00017927576
740	Distributed Representations of Tuples for Entity Resolution	2018	VLDB	0.00017358024
788	ActiveClean: Interactive Data Cleaning For Statistical Modeling	2016	VLDB	0.00016618698
879	Don’t be SCAREd: Use SCalable Automatic REpairing with Maximal Likelihood and Bounded Changes	2013	SIGMOD	0.00015649604
942	Framework for Evaluating Clustering Algorithms in Duplicate Detection	2009	VLDB	0.00015143877
1,047	Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms	2015	VLDB	0.00014442219
1,160	Towards Certain Fixes with Editing Rules and Master Data	2010	VLDB	0.0001358129
1,197	The LLUNATIC Data-Cleaning Framework	2013	VLDB	0.00013373177
1,340	HoloDetect: Few-Shot Learning for Error Detection	2019	SIGMOD	0.00012492795
1,403	Detecting Data Errors: Where are we and what needs to be done?	2016	VLDB	0.00012180046
1,544	KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing	2015	SIGMOD	0.00011438274
1,629	Data Cleaning: Overview and Emerging Challenges	2016	SIGMOD	0.00011073148
1,895	Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning	2020	VLDB	0.00010174634
1,928	A Data- and Workload-Aware Algorithm for Range Queries Under Differential Privacy	2014	VLDB	0.00010062105
2,161	Uni-Detect: A Unified Approach to Automated Error Detection in Tables	2019	SIGMOD	9.4029915e-05
2,258	Efficient Denial Constraint Discovery with Hydra	2018	VLDB	9.1804145e-05
2,484	Discovery of Approximate (and Exact) Denial Constraints	2020	VLDB	8.6737275e-05
2,507	Auto-Detect: Data-Driven Error Detection in Tables	2018	SIGMOD	8.6254741e-05
2,642	Messing Up with BART: Error Generation for Evaluating Data-Cleaning Algorithms	2016	VLDB	8.3925184e-05
2,968	Raha: A Configuration-Free Error Detection System	2019	SIGMOD	7.7964476e-05
3,000	BigDansing: A System for Big Data Cleansing	2015	SIGMOD	7.7447724e-05
3,057	ZeroER: Entity Resolution using Zero Labeled Examples	2020	SIGMOD	7.6458287e-05
3,313	Efficient and Effective Data Imputation with Influence Functions	2022	VLDB	7.2336734e-05
3,854	Generating Concise Entity Matching Rules	2017	SIGMOD	6.697423e-05
4,271	Cleaning Denial Constraint Violations through Relaxation	2020	SIGMOD	6.2943273e-05
4,596	Data Integration and Machine Learning: A Natural Synergy	2018	SIGMOD	6.0540725e-05
5,150	Horizon: Scalable Dependency-driven Data Cleaning	2021	VLDB	5.6553571e-05
6,348	NADEEF: A Generalized Data Cleaning System	2013	VLDB	5.0969173e-05
6,696	Analyzing How BERT Performs Entity Matching	2022	VLDB	4.9542861e-05
9,073	VerifAI: Verified Generative AI	2024	CIDR	4.396857e-05
9,380	Constraint-Variance Tolerant Data Repairing	2016	SIGMOD	4.3439402e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
5,671	Descriptive and Prescriptive Data Cleaning	2014	SIGMOD	5.3797808e-05
2,830	Interaction between Record Matching and Data Repairing	2011	SIGMOD	8.0515409e-05
879	Don’t be SCAREd: Use SCalable Automatic REpairing with Maximal Likelihood and Bounded Changes	2013	SIGMOD	0.00015649604
9,380	Constraint-Variance Tolerant Data Repairing	2016	SIGMOD	4.3439402e-05
7,013	Qualitative Data Cleaning	2016	VLDB	4.8576683e-05
830	Guided Data Repair	2011	VLDB	0.00016125759
1,403	Detecting Data Errors: Where are we and what needs to be done?	2016	VLDB	0.00012180046
1,629	Data Cleaning: Overview and Emerging Challenges	2016	SIGMOD	0.00011073148
621	Improving Data Quality: Consistency and Accuracy	2007	VLDB	0.00018978331
3,198	Towards Dependable Data Repairing with Fixing Rules	2014	SIGMOD	7.4029546e-05