GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data
Summary: GoodCore selects a coreset over incomplete data by modeling missingness as repairs over possible worlds and optimizing the expected subset without cleaning. It proves the problem NP-hard and offers an approximation with imputation-based variants, enabling data-efficient ML.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 6660
- Venue: SIGMOD
- Year: 2023
- Pagerank: 6.4522929e-05
- Overall Rank: 4,102 | 71.47%
- DOI: 10.1145/3589302
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 12 of 12 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,116 | LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes | 2024 | VLDB | 4.581507e-05 |
| 8,281 | Optimizing Data Acquisition to Enhance Machine Learning Performance | 2024 | VLDB | 4.5435639e-05 |
| 9,077 | VerifAI: Verified Generative AI | 2024 | CIDR | 4.4010762e-05 |
| 9,709 | Outlier Summarization via Human Interpretable Rules | 2024 | VLDB | 4.299267e-05 |
| 10,239 | BRIEF: Bi-level Coreset Selection for Efficient Instruction Tuning in LLMs | 2026 | VLDB | 4.1945683e-05 |
| 10,289 | LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning | 2026 | VLDB | 4.1945683e-05 |
| 10,528 | Two Birds with One Stone: Efficient Deep Learning over Mislabeled Data through Subset Selection | 2025 | SIGMOD | 4.1945683e-05 |
| 10,601 | Less is More: Efficient Time Series Dataset Condensation via Two-fold Modal Matching | 2025 | VLDB | 4.1945683e-05 |
| 10,811 | DemandClean: A Multi-Objective Learning Framework for Balancing Model Tolerance to Data Authenticity and Diversity | 2025 | VLDB | 4.1945683e-05 |
| 10,953 | Certain and Approximately Certain Models for Statistical Learning | 2024 | SIGMOD | 4.1945683e-05 |
| 11,000 | MisDetect: Iterative Mislabel Detection using Early Loss | 2024 | VLDB | 4.1945683e-05 |
| 11,041 | QCore: Data-Efficient, On-Device Continual Calibration for Quantized Models | 2024 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 16 of 16 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 49 | Consistent Query Answers in Inconsistent Databases | 1999 | PODS | 0.00067660624 |
| 656 | ERACER: A Database Approach for Statistical Inference and Data Cleaning | 2010 | SIGMOD | 0.00018588729 |
| 791 | ActiveClean: Interactive Data Cleaning For Statistical Modeling | 2016 | VLDB | 0.00016629664 |
| 2,302 | Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions | 2021 | VLDB | 9.0668832e-05 |
| 2,566 | Database Repairs and Consistent Query Answering: Origins and Further Developments | 2019 | PODS | 8.5243847e-05 |
| 3,311 | Efficient and Effective Data Imputation with Influence Functions | 2022 | VLDB | 7.2406486e-05 |
| 4,825 | Synthesizing Natural Language to Visualization (NL2VIS) Benchmarks from NL2SQL Benchmarks | 2021 | SIGMOD | 5.8946721e-05 |
| 5,028 | Adaptive Data Augmentation for Supervised Learning over Missing Data | 2021 | VLDB | 5.7506746e-05 |
| 5,279 | CDB: A Crowd-Powered Database System | 2018 | VLDB | 5.5902418e-05 |
| 5,362 | Cost-Effective Crowdsourced Entity Resolution: A Partial-Order Approach | 2016 | SIGMOD | 5.5473503e-05 |
| 5,381 | Selective Data Acquisition in the Wild for Model Charging | 2022 | VLDB | 5.5399508e-05 |
| 5,963 | Automatic Data Acquisition for Deep Learning | 2021 | VLDB | 5.2526794e-05 |
| 7,179 | Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning | 2023 | VLDB | 4.8078895e-05 |
| 7,575 | Human-in-the-loop Outlier Detection | 2020 | SIGMOD | 4.7068909e-05 |
| 9,221 | VisClean: Interactive Cleaning for Progressive Visualization | 2020 | VLDB | 4.3699444e-05 |
| 11,582 | Interactively Discovering and Ranking Desired Tuples without Writing SQL Queries | 2020 | SIGMOD | 4.1945683e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,840 | The Cost of Representation by Subset Repairs | 2025 | VLDB | 4.4388652e-05 |
| 5,253 | Enriching Data Imputation with Extensive Similarity Neighbors | 2015 | VLDB | 5.6014916e-05 |
| 6,986 | A Cost-based Optimizer for Gradient Descent Optimization | 2017 | SIGMOD | 4.8727048e-05 |
| 2,752 | Composable Core-sets for Diversity and Coverage Maximization | 2014 | PODS | 8.1742326e-05 |
| 3,311 | Efficient and Effective Data Imputation with Influence Functions | 2022 | VLDB | 7.2406486e-05 |
| 11,050 | Win-Win: On Simultaneous Clustering and Imputing over Incomplete Data | 2024 | VLDB | 4.1945683e-05 |
| 2,302 | Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions | 2021 | VLDB | 9.0668832e-05 |
| 10,953 | Certain and Approximately Certain Models for Statistical Learning | 2024 | SIGMOD | 4.1945683e-05 |
| 10,881 | Datamap-Driven Tabular Coreset Selection for Classifier Training | 2025 | VLDB | 4.1945683e-05 |
| 7,179 | Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning | 2023 | VLDB | 4.8078895e-05 |