Database Paper Browser

KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing

Summary: KATARA is a data cleaning system powered by knowledge bases (KBs) and crowdsourcing. It interprets table semantics to align the data with a KB, identifies correct and incorrect values, and outputs top-k repairs. The summary also credits it with scalability, cross-domain applicability, and efficient crowd-assisted annotation. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 5048
Venue: SIGMOD
Year: 2015
Pagerank: 0.00011446851
Overall Rank: 1,546 | 89.25%
DOI: 10.1145/2723372.2749431

Incoming Citations (Sorted by Pagerank)

Showing 45 of 45 citing papers.

Rank Citing Paper Year Venue Pagerank
192 HoloClean: Holistic Data Repairs with Probabilistic Inference 2017 VLDB 0.00035728858
517 Can Foundation Models Wrangle Your Data? 2023 VLDB 0.00021169035
1,337 HoloDetect: Few-Shot Learning for Error Detection 2019 SIGMOD 0.00012497164
1,612 Detecting Data Errors: Where are we and what needs to be done? 2016 VLDB 0.00011142794
1,627 Data Cleaning: Overview and Emerging Challenges 2016 SIGMOD 0.00011086905
1,894 Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning 2020 VLDB 0.0001018378
2,158 Uni-Detect: A Unified Approach to Automated Error Detection in Tables 2019 SIGMOD 9.4141354e-05
2,302 Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions 2021 VLDB 9.0668832e-05
2,349 RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation 2021 VLDB 8.9876423e-05
2,506 Auto-Detect: Data-Driven Error Detection in Tables 2018 SIGMOD 8.6335464e-05
2,968 Raha: A Configuration-Free Error Detection System 2019 SIGMOD 7.7985097e-05
3,299 SCODED: Statistical Constraint Oriented Data Error Detection 2020 SIGMOD 7.2546659e-05
3,396 Automatic Data Repair: Are We Ready to Deploy? 2024 VLDB 7.1455126e-05
3,773 Cleaning Crowdsourced Labels Using Oracles for Statistical Classification 2019 VLDB 6.7758649e-05
4,126 Waldo: An Adaptive Human Interface for Crowd Entity Resolution 2017 SIGMOD 6.4314729e-05
4,806 Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers 2019 SIGMOD 5.9092698e-05
5,096 Auto-Transform: Learning-to-Transform by Patterns 2020 VLDB 5.7011825e-05
5,729 KATARA: Reliable Data Cleaning with Knowledge Bases and Crowdsourcing 2015 VLDB 5.3506368e-05
5,978 Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond 2021 SIGMOD 5.2453012e-05
6,182 Top-K Deep Video Analytics: A Probabilistic Approach 2021 SIGMOD 5.1682689e-05
6,187 Semi-Supervised Data Cleaning with Raha and Baran 2021 CIDR 5.1656857e-05
6,416 Synthesizing Type-Detection Logic for Rich Semantic Data Types using Open-source Code 2018 SIGMOD 5.072267e-05
7,013 Qualitative Data Cleaning 2016 VLDB 4.8619024e-05
7,223 Akane: Perplexity-Guided Time Series Data Cleaning 2024 SIGMOD 4.7965857e-05
7,292 Subjective Knowledge Base Construction Powered By Crowdsourcing and Knowledge Base 2018 SIGMOD 4.7740174e-05
7,766 ICARUS: Minimizing Human Effort in Iterative Data Completion 2018 VLDB 4.6564959e-05
9,043 Query-Guided Resolution in Uncertain Databases 2023 SIGMOD 4.4039656e-05
9,221 VisClean: Interactive Cleaning for Progressive Visualization 2020 VLDB 4.3699444e-05
9,240 ZIP: Lazy Imputation during Query Processing 2024 VLDB 4.3690661e-05
9,348 GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models 2024 SIGMOD 4.3526427e-05
9,479 Data Imputation with Limited Data Redundancy Using Data Lakes 2025 VLDB 4.3341665e-05
9,771 EasyDR: A Human-in-the-loop Error Detection and Repair Platform for Holistic Table Cleaning 2022 VLDB 4.2856106e-05
9,777 Data Augmentation for ML-driven Data Preparation and Integration 2021 VLDB 4.2856106e-05
9,896 Towards Interpretable and Learnable Risk Analysis for Entity Resolution 2020 SIGMOD 4.2600049e-05
10,003 Clustering with Set Outliers and Applications in Relational Clustering 2026 PODS 4.1945683e-05
10,026 Minimum Change ≠ Best Cleaning: Parallel and Incremental Error Detection under Integrity Constraints 2026 SIGMOD 4.1945683e-05
11,006 FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data 2024 VLDB 4.1945683e-05
11,069 Hardware-Efficient Data Imputation through DBMS Extensibility 2024 VLDB 4.1945683e-05
11,178 LinCQA: Faster Consistent Query Answering with Linear Time Guarantees 2023 SIGMOD 4.1945683e-05
11,399 ActivePDB: Active Probabilistic Databases 2022 VLDB 4.1945683e-05
11,454 Contextual Data Cleaning with Ontology FDs 2021 SIGMOD 4.1945683e-05
11,536 LOCATER: Cleaning WiFi Connectivity Datasets for Semantic Localization 2021 VLDB 4.1945683e-05
11,680 WiClean: A System for Fixing Wikipedia Interlinks Using Revision History Patterns 2019 VLDB 4.1945683e-05
11,788 CDB: Optimizing Queries with Crowd-Based Selections and Joins 2017 SIGMOD 4.1945683e-05
11,816 DOCS: Domain-Aware Crowdsourcing System 2017 VLDB 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 27 of 27 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank Cited Paper Year Venue Pagerank
112 Potter's Wheel: An Interactive Data Cleaning System 2001 VLDB 0.00047045036
224 CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies 2004 SIGMOD 0.00032746205
263 CrowdER: Crowdsourcing Entity Resolution 2012 VLDB 0.00029862413
265 A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification 2005 SIGMOD 0.00029763412
364 Annotating and Searching Web Tables Using Entities, Types and Relationships 2010 VLDB 0.00025637562
555 Discovering Denial Constraints 2013 VLDB 0.00020254908
560 Dependencies Revisited for Improving Data Quality 2008 PODS 0.00020141923
623 Improving Data Quality: Consistency and Accuracy 2007 VLDB 0.00018996374
656 ERACER: A Database Approach for Statistical Inference and Data Cleaning 2010 SIGMOD 0.00018588729
674 Supporting Top-k Join Queries in Relational Databases 2003 VLDB 0.00018327585
732 Discovering Data Quality Rules 2008 VLDB 0.00017465093
833 Guided Data Repair 2011 VLDB 0.00016138432
881 Don’t be SCAREd: Use SCalable Automatic REpairing with Maximal Likelihood and Bounded Changes 2013 SIGMOD 0.00015661103
1,001 Recovering Semantics of Tables on the Web 2011 VLDB 0.00014706505
1,012 NADEEF: A Commodity Data Cleaning System 2013 SIGMOD 0.0001464733
1,159 Towards Certain Fixes with Editing Rules and Master Data 2010 VLDB 0.00013592813
1,197 The LLUNATIC Data-Cleaning Framework 2013 VLDB 0.00013390321
1,628 PARIS: Probabilistic Alignment of Relations, Instances, and Schema 2012 VLDB 0.00011085347
2,078 Sample-Driven Schema Mapping 2012 SIGMOD 9.599707e-05
2,420 From Data Fusion to Knowledge Fusion 2014 VLDB 8.8530994e-05
2,755 Advanced Processing for Ontological Queries 2010 VLDB 8.1690695e-05
2,823 Interaction between Record Matching and Data Repairing 2011 SIGMOD 8.0593894e-05
2,847 Building, Maintaining, and Using Knowledge Bases: A Report from the Trenches 2013 SIGMOD 8.0224023e-05
3,192 Towards Dependable Data Repairing with Fixing Rules 2014 SIGMOD 7.4095761e-05
5,081 Reducing Uncertainty of Schema Matching via Crowdsourcing 2013 VLDB 5.7132042e-05
5,852 Repairing Vertex Labels under Neighborhood Constraints 2014 VLDB 5.3007132e-05
7,588 Scalable Column Concept Determination for Web Tables Using Large Knowledge Bases 2013 VLDB 4.7030914e-05

Semantically Similar Papers