Cleaning Crowdsourced Labels Using Oracles for Statistical Classification
Summary: Oracle-based label cleaning for crowdsourced classification data. TARS estimates test performance from noisy labels with confidence bounds and selects which labels to clean to boost training accuracy under a limited budget, outperforming existing strategies.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 11972
- Venue: VLDB
- Year: 2019
- Pagerank: 6.7758649e-05
- Overall Rank: 3,773 | 73.76%
- DOI: 10.14778/3297753.3297758
Incoming Citations (Sorted by Pagerank)
Showing 11 of 11 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 2,753 | Complaint-driven Training Data Debugging for Query 2.0 | 2020 | SIGMOD | 8.1724339e-05 |
| 4,424 | PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models | 2020 | SIGMOD | 6.198474e-05 |
| 5,941 | Big Graphs: Challenges and Opportunities | 2022 | VLDB | 5.2635446e-05 |
| 5,978 | Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond | 2021 | SIGMOD | 5.2453012e-05 |
| 6,690 | Parallel Discrepancy Detection and Incremental Detection | 2021 | VLDB | 4.9621556e-05 |
| 7,796 | CHEF: A Cheap and Fast Pipeline for Iteratively Cleaning Label Uncertainties | 2021 | VLDB | 4.6482625e-05 |
| 9,054 | Selecting Data to Clean for Fact Checking: Minimizing Uncertainty vs. Maximizing Surprise | 2019 | VLDB | 4.4039656e-05 |
| 9,434 | Rock: Cleaning Data by Embedding ML in Logic Rules | 2024 | SIGMOD | 4.3430376e-05 |
| 9,487 | Making It Tractable to Catch Duplicates and Conflicts in Graphs | 2023 | SIGMOD | 4.3341665e-05 |
| 9,873 | CORAL: Collaborative Automatic Labeling System based on Large Language Models | 2024 | VLDB | 4.2667743e-05 |
| 11,000 | MisDetect: Iterative Mislabel Detection using Early Loss | 2024 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 22 of 22 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 263 | CrowdER: Crowdsourcing Entity Resolution | 2012 | VLDB | 0.00029862413 |
| 791 | ActiveClean: Interactive Data Cleaning For Statistical Modeling | 2016 | VLDB | 0.00016629664 |
| 833 | Guided Data Repair | 2011 | VLDB | 0.00016138432 |
| 1,242 | Question Selection for Crowd Entity Resolution | 2013 | VLDB | 0.00013096655 |
| 1,491 | CDAS: A Crowdsourcing Data Analytics System | 2012 | VLDB | 0.00011694982 |
| 1,546 | KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing | 2015 | SIGMOD | 0.00011446851 |
| 1,627 | Data Cleaning: Overview and Emerging Challenges | 2016 | SIGMOD | 0.00011086905 |
| 2,175 | Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services | 2017 | SIGMOD | 9.3644117e-05 |
| 2,184 | A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data | 2014 | SIGMOD | 9.3429789e-05 |
| 2,452 | Data Fusion – Resolving Data Conflicts for Integration | 2009 | VLDB | 8.7839322e-05 |
| 2,797 | Query-Oriented Data Cleaning with Oracles | 2015 | SIGMOD | 8.1108589e-05 |
| 2,937 | Truth Inference in Crowdsourcing: Is the Problem Solved? | 2017 | VLDB | 7.853108e-05 |
| 3,067 | CrowdFill: Collecting Structured Data from the Crowd | 2014 | SIGMOD | 7.6180371e-05 |
| 3,118 | Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning | 2015 | VLDB | 7.5379338e-05 |
| 3,263 | QASCA: A Quality-Aware Task Assignment System for Crowdsourcing Applications | 2015 | SIGMOD | 7.3097573e-05 |
| 3,322 | iCrowd: An Adaptive Crowdsourcing Framework | 2015 | SIGMOD | 7.2230626e-05 |
| 3,897 | SLiMFast: Guaranteed Results for Data Fusion and Source Reliability | 2017 | SIGMOD | 6.6554845e-05 |
| 4,104 | Online Entity Resolution Using an Oracle | 2016 | VLDB | 6.4493809e-05 |
| 4,451 | CLAMShell: Speeding up Crowds for Low-latency Data Labeling | 2016 | VLDB | 6.1738675e-05 |
| 4,827 | An Online Cost Sensitive Decision-Making Method in Crowdsourcing Systems | 2013 | SIGMOD | 5.8938399e-05 |
| 5,405 | Truth Discovery and Crowdsourcing Aggregation: A Unified Perspective | 2015 | VLDB | 5.5257718e-05 |
| 8,362 | Minimizing Efforts in Validating Crowd Answers | 2015 | SIGMOD | 4.5366717e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 10,512 | Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables | 2025 | SIGMOD | 4.1945683e-05 |
| 7,178 | Towards Globally Optimal Crowdsourcing Quality Management: The Uniform Worker Setting | 2016 | SIGMOD | 4.8085946e-05 |
| 10,306 | Fault Lines: Benchmarking the Impact of Label Data Quality on ML Robustness and Fairness | 2026 | VLDB | 4.1945683e-05 |
| 11,137 | Generalizable Data Cleaning of Tabular Data in Latent Space | 2024 | VLDB | 4.1945683e-05 |
| 2,797 | Query-Oriented Data Cleaning with Oracles | 2015 | SIGMOD | 8.1108589e-05 |
| 6,868 | Cost-Effective Data Annotation using Game-Based Crowdsourcing | 2019 | VLDB | 4.9010083e-05 |
| 5,029 | Crowdsourced Top-k Queries by Confidence-Aware Pairwise Judgments | 2017 | SIGMOD | 5.7502622e-05 |
| 10,923 | k-Clustering with Comparison and Distance Oracles | 2024 | PODS | 4.1945683e-05 |
| 3,118 | Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning | 2015 | VLDB | 7.5379338e-05 |
| 9,684 | How to Design Robust Algorithms using Noisy Comparison Oracle | 2021 | VLDB | 4.3047774e-05 |