Database Paper Browser

HoloClean: Holistic Data Repairs with Probabilistic Inference

Summary: HoloClean unifies constraint-driven and statistical data repair by automatically generating a probabilistic program from dirty data. Its inference scales to millions of tuples, and it reports roughly 90% precision and 76% recall, an F1 improvement of more than 2x over state-of-the-art methods. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 11404
Venue: VLDB
Year: 2017
Pagerank: 0.00035728858
Overall Rank: 192 | 98.67%
DOI: -

Incoming Citations (Sorted by Pagerank)

Showing 50 of 133 citing papers (page 2 of 3).

Rank Citing Paper Year Venue Pagerank
7,407 Intermittent Query Processing 2019 VLDB 4.7373205e-05
7,564 PIClean: A Probabilistic and Interactive Data Cleaning System 2019 SIGMOD 4.7093702e-05
7,605 The Computation of Optimal Subset Repairs 2020 VLDB 4.697534e-05
7,634 ReStore - Neural Data Completion for Relational Databases 2021 SIGMOD 4.6911382e-05
7,667 Fast Detection of Denial Constraint Violations 2022 VLDB 4.683767e-05
7,704 ExDRa: Exploratory Data Science on Federated Raw Data 2021 SIGMOD 4.6733838e-05
7,766 ICARUS: Minimizing Human Effort in Iterative Data Completion 2018 VLDB 4.6564959e-05
7,867 Learning Over Dirty Data Without Cleaning 2020 SIGMOD 4.6320452e-05
8,000 Data Civilizer 2.0: A Holistic Framework for Data Preparation and Analytics 2019 VLDB 4.6092803e-05
8,092 Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications 2023 SIGMOD 4.587921e-05
8,121 Automation of Data Prep, ML, and Data Science: New Cure or Snake Oil? 2021 SIGMOD 4.5809305e-05
8,145 Evaluating Top-k Queries with Inconsistency Degrees 2020 VLDB 4.5761263e-05
8,182 SHiFT: An Efficient, Flexible Search Engine for Transfer Learning 2023 VLDB 4.5659133e-05
8,422 Deducing Certain Fixes to Graphs 2019 VLDB 4.5167705e-05
8,472 Rapidash: Efficient Detection of Constraint Violations 2024 VLDB 4.5036378e-05
8,590 Exploratory Training: When Annotators Learn About Data 2023 SIGMOD 4.4896282e-05
8,716 nsDB: Architecting the Next Generation Database by Integrating Neural and Symbolic Systems 2024 VLDB 4.4618187e-05
8,745 Sparcle: Boosting the Accuracy of Data Cleaning Systems through Spatial Awareness 2024 VLDB 4.456315e-05
8,789 Machine Learning Meets Big Spatial Data 2019 VLDB 4.4509194e-05
8,836 Fast Approximate Denial Constraint Discovery 2023 VLDB 4.4393184e-05
8,840 The Cost of Representation by Subset Repairs 2025 VLDB 4.4388652e-05
9,043 Query-Guided Resolution in Uncertain Databases 2023 SIGMOD 4.4039656e-05
9,054 Selecting Data to Clean for Fact Checking: Minimizing Uncertainty vs. Maximizing Surprise 2019 VLDB 4.4039656e-05
9,076 DataDiff: User-Interpretable Data Transformation Summaries for Collaborative Data Analysis 2018 SIGMOD 4.401804e-05
9,077 VerifAI: Verified Generative AI 2024 CIDR 4.4010762e-05
9,118 Towards Observability for Production Machine Learning Pipelines 2022 VLDB 4.3928288e-05
9,192 Hyper-Tune: Towards Efficient Hyper-parameter Tuning at Scale 2022 VLDB 4.3765131e-05
9,240 ZIP: Lazy Imputation during Query Processing 2024 VLDB 4.3690661e-05
9,348 GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models 2024 SIGMOD 4.3526427e-05
9,355 Discovering Top-k Rules using Subjective and Objective Criteria 2023 SIGMOD 4.3514328e-05
9,389 DataVinci: Learning Syntactic and Semantic String Repairs 2025 SIGMOD 4.3441378e-05
9,434 Rock: Cleaning Data by Embedding ML in Logic Rules 2024 SIGMOD 4.3430376e-05
9,438 Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation 2021 CIDR 4.3425082e-05
9,478 Incremental Detection of Denial Constraint Violations 2025 VLDB 4.3341665e-05
9,479 Data Imputation with Limited Data Redundancy Using Data Lakes 2025 VLDB 4.3341665e-05
9,487 Making It Tractable to Catch Duplicates and Conflicts in Graphs 2023 SIGMOD 4.3341665e-05
9,492 Lingua Manga: A Generic Large Language Model Centric System for Data Curation 2023 VLDB 4.3341665e-05
9,560 MTSClean: Efficient Constraint-based Cleaning for Multi-Dimensional Time Series Data 2024 VLDB 4.3254416e-05
9,577 CoClean: Collaborative Data Cleaning 2020 SIGMOD 4.3248438e-05
9,673 Don’t Be a Tattle-Tale: Preventing Leakages through Data Dependencies on Access Control Protected Data 2022 VLDB 4.3055474e-05
9,749 Efficient Differential Dependency Discovery 2024 VLDB 4.2897489e-05
9,771 EasyDR: A Human-in-the-loop Error Detection and Repair Platform for Holistic Table Cleaning 2022 VLDB 4.2856106e-05
9,847 Discovering Top-k Relevant and Diversified Rules 2024 SIGMOD 4.2721228e-05
9,849 Reptile: Aggregation-level Explanations for Hierarchical Data 2022 SIGMOD 4.2721228e-05
9,856 In-Database Data Imputation 2024 SIGMOD 4.269353e-05
9,886 Scalable and Usable Relational Learning With Automatic Language Bias 2021 SIGMOD 4.2621158e-05
9,896 Towards Interpretable and Learnable Risk Analysis for Entity Resolution 2020 SIGMOD 4.2600049e-05
9,924 On Saving Outliers for Better Clustering over Noisy Data 2021 SIGMOD 4.2544238e-05
9,963 Parallel Rule Discovery from Large Datasets by Sampling 2022 SIGMOD 4.2294678e-05
9,984 Towards Scalable Visual Data Wrangling via Direct Manipulation 2026 CIDR 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 23 of 23 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank Cited Paper Year Venue Pagerank
265 A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification 2005 SIGMOD 0.00029763412
322 Record Linkage: Similarity Measures and Algorithms 2006 SIGMOD 0.00027518768
489 Data Curation at Scale: The Data Tamer System 2013 CIDR 0.00022030728
555 Discovering Denial Constraints 2013 VLDB 0.00020254908
560 Dependencies Revisited for Improving Data Quality 2008 PODS 0.00020141923
623 Improving Data Quality: Consistency and Accuracy 2007 VLDB 0.00018996374
656 ERACER: A Database Approach for Statistical Inference and Data Cleaning 2010 SIGMOD 0.00018588729
667 Incremental Knowledge Base Construction Using DeepDive 2015 VLDB 0.00018440557
702 Reasoning about Record Matching Rules 2009 VLDB 0.00017918203
814 Entity Resolution: Theory, Practice & Open Challenges 2012 VLDB 0.00016370594
881 Don’t be SCAREd: Use SCalable Automatic REpairing with Maximal Likelihood and Bounded Changes 2013 SIGMOD 0.00015661103
1,012 NADEEF: A Commodity Data Cleaning System 2013 SIGMOD 0.0001464733
1,014 Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS 2011 VLDB 0.00014640258
1,044 DimmWitted: A Study of Main-Memory Statistical Analytics 2014 VLDB 0.00014475229
1,159 Towards Certain Fixes with Editing Rules and Master Data 2010 VLDB 0.00013592813
1,197 The LLUNATIC Data-Cleaning Framework 2013 VLDB 0.00013390321
1,211 Truth Finding on the Deep Web: Is the Problem Solved? 2013 VLDB 0.00013257101
1,546 KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing 2015 SIGMOD 0.00011446851
1,612 Detecting Data Errors: Where are we and what needs to be done? 2016 VLDB 0.00011142794
1,624 Sampling the Repairs of Functional Dependency Violations under Hard Constraints 2010 VLDB 0.00011099222
3,042 Dichotomies in the Complexity of Preferred Repairs 2015 PODS 7.669374e-05
3,192 Towards Dependable Data Repairing with Fixing Rules 2014 SIGMOD 7.4095761e-05
3,897 SLiMFast: Guaranteed Results for Data Fusion and Source Reliability 2017 SIGMOD 6.6554845e-05

Semantically Similar Papers