Database Paper Browser

Detecting Data Errors: Where are we and what needs to be done?

Summary: An empirical study finding that data-cleaning tools miss a large portion of real-world errors and have robustness gaps. The paper proposes a multi-tool workflow that boosts coverage while requiring less verification. It also notes the role of domain-specific tools and enrichment, although some errors remain undetectable. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 11386
Venue: VLDB
Year: 2016
Pagerank: 0.00011142794
Overall Rank: 1,612 | 88.79%
DOI: -

[Figure: Incoming Non-self Citations Over Time]

Incoming Citations (Sorted by Pagerank)

Showing 49 of 49 citing papers.

Rank Citing Paper Year Venue Pagerank
192 HoloClean: Holistic Data Repairs with Probabilistic Inference 2017 VLDB 0.00035728858
517 Can Foundation Models Wrangle Your Data? 2023 VLDB 0.00021169035
1,277 The Data Civilizer System 2017 CIDR 0.00012879695
1,337 HoloDetect: Few-Shot Learning for Error Detection 2019 SIGMOD 0.00012497164
1,831 Synthesizing Entity Matching Rules by Examples 2018 VLDB 0.00010384082
1,894 Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning 2020 VLDB 0.0001018378
1,914 Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks 2020 SIGMOD 0.00010109102
2,158 Uni-Detect: A Unified Approach to Automated Error Detection in Tables 2019 SIGMOD 9.4141354e-05
2,349 RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation 2021 VLDB 8.9876423e-05
2,506 Auto-Detect: Data-Driven Error Detection in Tables 2018 SIGMOD 8.6335464e-05
2,753 Complaint-driven Training Data Debugging for Query 2.0 2020 SIGMOD 8.1724339e-05
2,968 Raha: A Configuration-Free Error Detection System 2019 SIGMOD 7.7985097e-05
3,252 Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks 2020 SIGMOD 7.3178277e-05
3,299 SCODED: Statistical Constraint Oriented Data Error Detection 2020 SIGMOD 7.2546659e-05
3,396 Automatic Data Repair: Are We Ready to Deploy? 2024 VLDB 7.1455126e-05
3,976 UGuide – User-Guided Discovery of FD-Detectable Errors 2017 SIGMOD 6.5736462e-05
5,028 Adaptive Data Augmentation for Supervised Learning over Missing Data 2021 VLDB 5.7506746e-05
5,096 Auto-Transform: Learning-to-Transform by Patterns 2020 VLDB 5.7011825e-05
5,192 Pattern Functional Dependencies for Data Cleaning 2020 VLDB 5.6375087e-05
5,429 DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data 2023 SIGMOD 5.5087325e-05
5,445 QFix: Diagnosing Errors through Query Histories 2017 SIGMOD 5.5020909e-05
5,928 SchemaPile: A Large Collection of Relational Database Schemas 2024 SIGMOD 5.2685946e-05
6,187 Semi-Supervised Data Cleaning with Raha and Baran 2021 CIDR 5.1656857e-05
6,280 Self-supervised and Interpretable Data Cleaning with Sequence Generative Adversarial Networks 2023 VLDB 5.1290457e-05
6,295 Your notebook is not crumby enough, REPLace it 2020 CIDR 5.1249204e-05
6,546 Properties of Inconsistency Measures for Databases 2021 SIGMOD 5.0185588e-05
7,391 Time Series Data Validity 2023 SIGMOD 4.7429293e-05
7,564 PIClean: A Probabilistic and Interactive Data Cleaning System 2019 SIGMOD 4.7093702e-05
8,208 SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions 2024 CIDR 4.5581306e-05
8,472 Rapidash: Efficient Detection of Constraint Violations 2024 VLDB 4.5036378e-05
8,590 Exploratory Training: When Annotators Learn About Data 2023 SIGMOD 4.4896282e-05
8,678 Progressive Deep Web Crawling Through Keyword Queries For Data Enrichment 2019 SIGMOD 4.4702119e-05
8,743 CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning 2024 SIGMOD 4.456315e-05
9,056 A Data Quality Metric (DQM): How to Estimate the Number of Undetected Errors in Data Sets 2017 VLDB 4.4039656e-05
9,077 VerifAI: Verified Generative AI 2024 CIDR 4.4010762e-05
9,118 Towards Observability for Production Machine Learning Pipelines 2022 VLDB 4.3928288e-05
9,306 Debugging Large-Scale Data Science Pipelines using Dagger 2020 VLDB 4.3572942e-05
9,348 GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models 2024 SIGMOD 4.3526427e-05
9,479 Data Imputation with Limited Data Redundancy Using Data Lakes 2025 VLDB 4.3341665e-05
9,577 CoClean: Collaborative Data Cleaning 2020 SIGMOD 4.3248438e-05
9,856 In-Database Data Imputation 2024 SIGMOD 4.269353e-05
9,928 Fainder: A Fast and Accurate Index for Distribution-Aware Dataset Search 2024 VLDB 4.2511622e-05
9,984 Towards Scalable Visual Data Wrangling via Direct Manipulation 2026 CIDR 4.1945683e-05
10,026 Minimum Change ≠ Best Cleaning: Parallel and Incremental Error Detection under Integrity Constraints 2026 SIGMOD 4.1945683e-05
10,463 Zorro: Quantifying Uncertainty in Models & Predictions Arising from Dirty Data 2025 SIGMOD 4.1945683e-05
10,723 UniClean: A Scalable Data Cleaning Solution for Mixed Errors based on Unified Cleaners and Optimized Cleaning Workflow 2025 VLDB 4.1945683e-05
10,821 Demonstrating Matelda for Multi-Table Error Detection 2025 VLDB 4.1945683e-05
11,216 Demystifying the QoS and QoE of Edge-hosted Video Streaming Applications in the Wild with SNESet 2023 SIGMOD 4.1945683e-05
11,529 GEDet: Detecting Erroneous Nodes with A Few Examples 2021 VLDB 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 13 of 13 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers