Database Paper Browser

A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data

Summary: Sample-and-Clean combines sampling-based approximate query processing (SAQP) with cleaning of only a small sample of the data, reducing the bias that dirty data introduces into query results. The paper derives confidence intervals as a function of sample size and reports accuracy gains with speedups on noisy TPC-H, Microsoft Academic, and sensor data. (summarized by gpt-5-nano on Feb 09 2026)
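
The idea in the summary can be illustrated with a minimal sketch, assuming uniform random sampling, a per-row cleaning function, and a normal-approximation (CLT) interval for an average. The names `sample_and_clean_mean`, `rows`, and `clean` are hypothetical and the synthetic data is made up for illustration; this is not the paper's own estimator or code.

```python
import math
import random
import statistics

def sample_and_clean_mean(rows, clean, sample_size, z=1.96, seed=0):
    """Estimate the mean of the cleaned data from a cleaned uniform sample.

    Returns (estimate, half_width), where half_width is the half-width of
    an approximate 95% interval under the normal approximation (assumption).
    """
    rng = random.Random(seed)
    sample = rng.sample(rows, sample_size)   # uniform sample without replacement
    cleaned = [clean(v) for v in sample]     # clean only the sampled rows
    estimate = statistics.fmean(cleaned)
    half_width = z * statistics.stdev(cleaned) / math.sqrt(sample_size)
    return estimate, half_width

# Hypothetical dirty data: every 10th value was recorded 1000x too large.
rows = [float(50 + i % 100) for i in range(10_000)]
rows = [v * 1000 if i % 10 == 0 else v for i, v in enumerate(rows)]

def clean(v):
    # Hypothetical cleaning rule: undo the unit error on implausibly large values.
    return v / 1000 if v >= 1000 else v

print("mean of dirty data:", statistics.fmean(rows))
for k in (100, 500, 2000):
    est, hw = sample_and_clean_mean(rows, clean, k)
    print(f"sample size {k}: {est:.2f} +/- {hw:.2f}")
```

The interval half-width shrinks roughly with the square root of the sample size, which is the trade-off between cleaning cost and accuracy that the summary refers to.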

Paper ID: 4876
Venue: SIGMOD
Year: 2014
Pagerank: 9.3429789e-05
Overall Rank: 2,184 | 84.81%
DOI: 10.1145/2588555.2610505

Incoming Citations (Sorted by Pagerank)

Showing 37 of 37 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
791 | ActiveClean: Interactive Data Cleaning For Statistical Modeling | 2016 | VLDB | 0.00016629664
1,350 | Northstar: An Interactive Data Science System | 2018 | VLDB | 0.00012431059
1,627 | Data Cleaning: Overview and Emerging Challenges | 2016 | SIGMOD | 0.00011086905
1,874 | Knowing When You’re Wrong: Building Fast and Reliable Approximate Query Processing Systems | 2014 | SIGMOD | 0.00010244443
1,882 | Tuplex: Data Science in Python at Native Code Speed | 2021 | SIGMOD | 0.0001021625
1,894 | Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning | 2020 | VLDB | 0.0001018378
2,132 | Towards Sustainable Insights or why polygamy is bad for you | 2017 | CIDR | 9.4770432e-05
2,302 | Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions | 2021 | VLDB | 9.0668832e-05
2,797 | Query-Oriented Data Cleaning with Oracles | 2015 | SIGMOD | 8.1108589e-05
2,946 | BigDansing: A System for Big Data Cleansing | 2015 | SIGMOD | 7.8372441e-05
3,263 | QASCA: A Quality-Aware Task Assignment System for Crowdsourcing Applications | 2015 | SIGMOD | 7.3097573e-05
3,773 | Cleaning Crowdsourced Labels Using Oracles for Statistical Classification | 2019 | VLDB | 6.7758649e-05
3,944 | AQP++: Connecting Approximate Query Processing With Aggregate Precomputation for Interactive Analytics | 2018 | SIGMOD | 6.6078243e-05
4,273 | Cleaning Denial Constraint Violations through Relaxation | 2020 | SIGMOD | 6.3003864e-05
4,375 | Sample Debiasing in the Themis Open World Database System | 2020 | SIGMOD | 6.2427076e-05
4,451 | CLAMShell: Speeding up Crowds for Low-latency Data Labeling | 2016 | VLDB | 6.1738675e-05
4,668 | PrivateClean: Data Cleaning and Differential Privacy | 2016 | SIGMOD | 6.0115918e-05
5,153 | Horizon: Scalable Dependency-driven Data Cleaning | 2021 | VLDB | 5.6607963e-05
5,586 | QuERy: A Framework for Integrating Entity Resolution with Query Processing | 2016 | VLDB | 5.4219548e-05
5,929 | ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning | 2016 | SIGMOD | 5.2682177e-05
6,689 | Efficient Knowledge Graph Accuracy Evaluation | 2019 | VLDB | 4.9623586e-05
6,740 | Combining Aggregation and Sampling (Nearly) Optimally for Approximate Query Processing | 2021 | SIGMOD | 4.944395e-05
7,013 | Qualitative Data Cleaning | 2016 | VLDB | 4.8619024e-05
7,117 | Crowdsourced Data Management: Overview and Challenges | 2017 | SIGMOD | 4.826509e-05
7,237 | CleanM: An Optimizable Query Language for Unified Scale-Out Data Cleaning | 2017 | VLDB | 4.7928651e-05
7,251 | Learning to Sample: Counting with Complex Queries | 2020 | VLDB | 4.7890519e-05
7,634 | ReStore - Neural Data Completion for Relational Databases | 2021 | SIGMOD | 4.6911382e-05
7,766 | ICARUS: Minimizing Human Effort in Iterative Data Completion | 2018 | VLDB | 4.6564959e-05
8,593 | Wisteria: Nurturing Scalable Data Cleaning Infrastructure | 2015 | VLDB | 4.4891474e-05
8,728 | Stale View Cleaning: Getting Fresh Answers from Stale Materialized Views | 2015 | VLDB | 4.4589711e-05
9,043 | Query-Guided Resolution in Uncertain Databases | 2023 | SIGMOD | 4.4039656e-05
9,054 | Selecting Data to Clean for Fact Checking: Minimizing Uncertainty vs. Maximizing Surprise | 2019 | VLDB | 4.4039656e-05
9,056 | A Data Quality Metric (DQM): How to Estimate the Number of Undetected Errors in Data Sets | 2017 | VLDB | 4.4039656e-05
9,196 | QOCO: A Query Oriented Data Cleaning System with Oracles | 2015 | VLDB | 4.3749064e-05
9,348 | GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models | 2024 | SIGMOD | 4.3526427e-05
10,617 | Deduplicated Sampling On-Demand | 2025 | VLDB | 4.1945683e-05
11,029 | Efficient and Reliable Estimation of Knowledge Graph Accuracy | 2024 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 21 of 21 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
14 | Online Aggregation | 1997 | SIGMOD | 0.0010801504
59 | Sampling-Based Estimation of the Number of Distinct Values of an Attribute | 1995 | VLDB | 0.00064501896
263 | CrowdER: Crowdsourcing Entity Resolution | 2012 | VLDB | 0.00029862413
319 | Evaluation of entity resolution approaches on real-world match problems | 2010 | VLDB | 0.00027781866
378 | Towards Estimation Error Guarantees for Distinct Values | 2000 | PODS | 0.0002497492
398 | Big Data Integration | 2013 | VLDB | 0.00024372588
429 | The Aqua Approximate Query Answering System | 1999 | SIGMOD | 0.00023476494
449 | Approximate Query Processing: Taming the TeraBytes! A Tutorial | 2001 | VLDB | 0.00022846068
692 | Pay-as-you-go User Feedback for Dataspace Systems | 2008 | SIGMOD | 0.00018083948
727 | On Synopses for Distinct-Value Estimation Under Multiset Operations | 2007 | SIGMOD | 0.00017508726
739 | Congressional Samples for Approximate Answering of Group-By Queries | 2000 | SIGMOD | 0.00017401518
833 | Guided Data Repair | 2011 | VLDB | 0.00016138432
866 | Leveraging Transitive Relations for Crowdsourced Joins | 2013 | SIGMOD | 0.00015801196
1,012 | NADEEF: A Commodity Data Cleaning System | 2013 | SIGMOD | 0.0001464733
1,159 | Towards Certain Fixes with Editing Rules and Master Data | 2010 | VLDB | 0.00013592813
1,260 | Dynamic Sample Selection for Approximate Query Processing | 2003 | SIGMOD | 0.00012993347
1,464 | Online Aggregation for Large MapReduce Jobs | 2011 | VLDB | 0.00011865546
1,909 | SciBORQ: Scientific data management with Bounds On Runtime and Quality | 2011 | CIDR | 0.00010121304
2,736 | Online Aggregation and Continuous Query support in MapReduce | 2010 | SIGMOD | 8.2043187e-05
3,067 | CrowdFill: Collecting Structured Data from the Crowd | 2014 | SIGMOD | 7.6180371e-05
4,093 | Distributed Online Aggregations | 2009 | VLDB | 6.4558147e-05

Semantically Similar Papers