Database Paper Browser

HoloClean: Holistic Data Repairs with Probabilistic Inference

Summary: HoloClean unifies constraint-driven and statistical data repair by automatically generating a probabilistic program from dirty data. Its inference scales to millions of tuples, and it achieves precision of ~90% and recall of ~76%, an F1 improvement of more than 2x over state-of-the-art methods. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 11404
Venue: VLDB
Year: 2017
Pagerank: 0.00035728858
Overall Rank: 192 | 98.67%
DOI: -

[Chart: Incoming Non-self Citations Over Time]

Authors

Incoming Citations (Sorted by Pagerank)

Showing 50 of 133 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
254 | Snorkel: Rapid Training Data Creation with Weak Supervision | 2018 | VLDB | 0.00030540555
517 | Can Foundation Models Wrangle Your Data? | 2023 | VLDB | 0.00021169035
1,337 | HoloDetect: Few-Shot Learning for Error Detection | 2019 | SIGMOD | 0.00012497164
1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533
1,894 | Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning | 2020 | VLDB | 0.0001018378
2,122 | SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle | 2020 | CIDR | 9.4989076e-05
2,158 | Uni-Detect: A Unified Approach to Automated Error Detection in Tables | 2019 | SIGMOD | 9.4141354e-05
2,280 | SMOKE: Fine-grained Lineage at Interactive Speed | 2018 | VLDB | 9.1111033e-05
2,302 | Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions | 2021 | VLDB | 9.0668832e-05
2,349 | RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation | 2021 | VLDB | 8.9876423e-05
2,483 | Discovery of Approximate (and Exact) Denial Constraints | 2020 | VLDB | 8.6864916e-05
2,506 | Auto-Detect: Data-Driven Error Detection in Tables | 2018 | SIGMOD | 8.6335464e-05
2,566 | Database Repairs and Consistent Query Answering: Origins and Further Developments | 2019 | PODS | 8.5243847e-05
2,587 | Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks | 2024 | SIGMOD | 8.4924618e-05
2,753 | Complaint-driven Training Data Debugging for Query 2.0 | 2020 | SIGMOD | 8.1724339e-05
2,839 | VolcanoML: Speeding up End-to-End AutoML via Scalable Search Space Decomposition | 2021 | VLDB | 8.0378978e-05
2,958 | The Role of Massively Multi-Task and Weak Supervision in Software 2.0 | 2019 | CIDR | 7.8173975e-05
2,968 | Raha: A Configuration-Free Error Detection System | 2019 | SIGMOD | 7.7985097e-05
3,155 | Ten Years of WebTables | 2018 | VLDB | 7.4672742e-05
3,299 | SCODED: Statistical Constraint Oriented Data Error Detection | 2020 | SIGMOD | 7.2546659e-05
3,311 | Efficient and Effective Data Imputation with Influence Functions | 2022 | VLDB | 7.2406486e-05
3,396 | Automatic Data Repair: Are We Ready to Deploy? | 2024 | VLDB | 7.1455126e-05
3,711 | Saga: A Platform for Continuous Construction and Serving of Knowledge At Scale | 2022 | SIGMOD | 6.823609e-05
3,825 | Cleanits: A Data Cleaning System for Industrial Time Series | 2019 | VLDB | 6.7255837e-05
3,831 | Kamino: Constraint-Aware Differentially Private Data Synthesis | 2021 | VLDB | 6.7181688e-05
4,127 | A Statistical Perspective on Discovering Functional Dependencies in Noisy Data | 2020 | SIGMOD | 6.4310458e-05
4,273 | Cleaning Denial Constraint Violations through Relaxation | 2020 | SIGMOD | 6.3003864e-05
4,471 | GOGGLES: Automatic Image Labeling with Affinity Coding | 2020 | SIGMOD | 6.1555681e-05
4,607 | Data Integration and Machine Learning: A Natural Synergy | 2018 | SIGMOD | 6.0538827e-05
5,028 | Adaptive Data Augmentation for Supervised Learning over Missing Data | 2021 | VLDB | 5.7506746e-05
5,096 | Auto-Transform: Learning-to-Transform by Patterns | 2020 | VLDB | 5.7011825e-05
5,153 | Horizon: Scalable Dependency-driven Data Cleaning | 2021 | VLDB | 5.6607963e-05
5,222 | Enabling SQL-based Training Data Debugging for Federated Learning | 2022 | VLDB | 5.6210545e-05
5,251 | Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale | 2019 | SIGMOD | 5.6029615e-05
5,412 | Mining an "Anti-Knowledge Base" from Wikipedia Updates with Applications to Fact Checking and Beyond | 2020 | VLDB | 5.5207515e-05
5,978 | Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond | 2021 | SIGMOD | 5.2453012e-05
6,134 | Finding Label and Model Errors in Perception Data With Learned Observation Assertions | 2022 | SIGMOD | 5.1943414e-05
6,187 | Semi-Supervised Data Cleaning with Raha and Baran | 2021 | CIDR | 5.1656857e-05
6,280 | Self-supervised and Interpretable Data Cleaning with Sequence Generative Adversarial Networks | 2023 | VLDB | 5.1290457e-05
6,451 | Multivariate Time Series Cleaning under Speed Constraints | 2024 | SIGMOD | 5.0583324e-05
6,477 | Fast Algorithms for Denial Constraint Discovery | 2023 | VLDB | 5.0488285e-05
6,546 | Properties of Inconsistency Measures for Databases | 2021 | SIGMOD | 5.0185588e-05
6,553 | How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses | 2024 | VLDB | 5.0157344e-05
6,683 | Probabilistic Databases for All | 2020 | PODS | 4.9638979e-05
6,690 | Parallel Discrepancy Detection and Incremental Detection | 2021 | VLDB | 4.9621556e-05
6,887 | Synthesizing Linked Data Under Cardinality and Integrity Constraints | 2021 | SIGMOD | 4.8937852e-05
6,944 | DataPrism: Exposing Disconnect between Data and Systems | 2022 | SIGMOD | 4.8912787e-05
7,066 | On Multiple Semantics for Declarative Database Repairs | 2020 | SIGMOD | 4.8445108e-05
7,223 | Akane: Perplexity-Guided Time Series Data Cleaning | 2024 | SIGMOD | 4.7965857e-05
7,243 | Data Integration and Machine Learning: A Natural Synergy | 2018 | VLDB | 4.7913666e-05

Outgoing Citations (Sorted by Pagerank)

Showing 23 of 23 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
265 | A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification | 2005 | SIGMOD | 0.00029763412
322 | Record Linkage: Similarity Measures and Algorithms | 2006 | SIGMOD | 0.00027518768
489 | Data Curation at Scale: The Data Tamer System | 2013 | CIDR | 0.00022030728
555 | Discovering Denial Constraints | 2013 | VLDB | 0.00020254908
560 | Dependencies Revisited for Improving Data Quality | 2008 | PODS | 0.00020141923
623 | Improving Data Quality: Consistency and Accuracy | 2007 | VLDB | 0.00018996374
656 | ERACER: A Database Approach for Statistical Inference and Data Cleaning | 2010 | SIGMOD | 0.00018588729
667 | Incremental Knowledge Base Construction Using DeepDive | 2015 | VLDB | 0.00018440557
702 | Reasoning about Record Matching Rules | 2009 | VLDB | 0.00017918203
814 | Entity Resolution: Theory, Practice & Open Challenges | 2012 | VLDB | 0.00016370594
881 | Don’t be SCAREd: Use SCalable Automatic REpairing with Maximal Likelihood and Bounded Changes | 2013 | SIGMOD | 0.00015661103
1,012 | NADEEF: A Commodity Data Cleaning System | 2013 | SIGMOD | 0.0001464733
1,014 | Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS | 2011 | VLDB | 0.00014640258
1,044 | DimmWitted: A Study of Main-Memory Statistical Analytics | 2014 | VLDB | 0.00014475229
1,159 | Towards Certain Fixes with Editing Rules and Master Data | 2010 | VLDB | 0.00013592813
1,197 | The LLUNATIC Data-Cleaning Framework | 2013 | VLDB | 0.00013390321
1,211 | Truth Finding on the Deep Web: Is the Problem Solved? | 2013 | VLDB | 0.00013257101
1,546 | KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing | 2015 | SIGMOD | 0.00011446851
1,612 | Detecting Data Errors: Where are we and what needs to be done? | 2016 | VLDB | 0.00011142794
1,624 | Sampling the Repairs of Functional Dependency Violations under Hard Constraints | 2010 | VLDB | 0.00011099222
3,042 | Dichotomies in the Complexity of Preferred Repairs | 2015 | PODS | 7.669374e-05
3,192 | Towards Dependable Data Repairing with Fixing Rules | 2014 | SIGMOD | 7.4095761e-05
3,897 | SLiMFast: Guaranteed Results for Data Fusion and Source Reliability | 2017 | SIGMOD | 6.6554845e-05

Semantically Similar Papers