LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning
Summary: LEAD performs in-loop iterative data selection for instruction tuning. It avoids costly full-dataset inference by estimating sample utility with Instance-Level Dynamic Uncertainty (IDU), which combines the instantaneous loss, a gradient-based approximation of the loss change, and exponential smoothing. A two-stage coarse-to-fine pipeline (multi-armed bandit (MAB) cluster prioritization followed by IDU fine selection) yields roughly 6–11% average gains while using 2.5% of the data, with 5–10× faster training.
(summarized by gpt-5-mini on Mar 13 2026)
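The summary names three ingredients for IDU (current loss, a gradient-based estimate of the loss change, exponential smoothing) plus a bandit over clusters. The sketch below shows one way such pieces might fit together; it is not the paper's formulation. The function names, the smoothing factor `beta`, the first-order estimate `loss - lr * ||grad||^2`, and the use of UCB1 as the bandit rule are all assumptions made for illustration.

```python
import math


def smoothed_utility(prev_utility, loss, grad_sq_norm, lr=0.1, beta=0.9):
    """Hypothetical IDU-style score: exponentially smoothed loss estimate.

    Assumption: the loss after one SGD step is approximated to first order
    as loss - lr * ||grad||^2, so no extra forward pass is required.
    """
    estimated_next_loss = loss - lr * grad_sq_norm
    return beta * prev_utility + (1.0 - beta) * estimated_next_loss


def ucb1_pick(total_reward, pulls, c=2.0):
    """Coarse stage: pick a cluster index with UCB1 (an assumed bandit rule)."""
    # Try every cluster once before relying on the confidence bound.
    for i, n in enumerate(pulls):
        if n == 0:
            return i
    t = sum(pulls)
    scores = [
        total_reward[i] / pulls[i] + math.sqrt(c * math.log(t) / pulls[i])
        for i in range(len(pulls))
    ]
    return max(range(len(pulls)), key=scores.__getitem__)


def select_top_k(utilities, k):
    """Fine stage: indices of the k samples with the highest utility."""
    order = sorted(range(len(utilities)), key=lambda i: utilities[i], reverse=True)
    return order[:k]
```

In this sketch a training loop would call `ucb1_pick` to choose a cluster, update each sample's score in that cluster with `smoothed_utility` from quantities already available during training, and pass the scores to `select_top_k`. How the bandit reward is defined is left open here, since the summary does not specify it.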
- Paper ID: 14329
- Venue: VLDB
- Year: 2026
- Pagerank: 4.1945683e-05
- Overall Rank: 10,289 | 28.43%
- DOI: 10.14778/3778092.3778103
Incoming Non-self Citations Over Time
No non-self incoming citations found for this paper in this database.
Incoming Citations (Sorted by Pagerank)
Showing 0 of 0 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
Outgoing Citations (Sorted by Pagerank)
Showing 10 of 10 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| 254 | Snorkel: Rapid Training Data Creation with Weak Supervision | 2018 | VLDB | 0.00030540555 |
| 4,102 | GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data | 2023 | SIGMOD | 6.4522929e-05 |
| 4,825 | Synthesizing Natural Language to Visualization (NL2VIS) Benchmarks from NL2SQL Benchmarks | 2021 | SIGMOD | 5.8946721e-05 |
| 5,381 | Selective Data Acquisition in the Wild for Model Charging | 2022 | VLDB | 5.5399508e-05 |
| 7,575 | Human-in-the-loop Outlier Detection | 2020 | SIGMOD | 4.7068909e-05 |
| 8,268 | Learned Data-aware Image Representations of Line Charts for Similarity Search | 2023 | SIGMOD | 4.5456668e-05 |
| 9,221 | VisClean: Interactive Cleaning for Progressive Visualization | 2020 | VLDB | 4.3699444e-05 |
| 9,479 | Data Imputation with Limited Data Redundancy Using Data Lakes | 2025 | VLDB | 4.3341665e-05 |
| 10,610 | Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy, Low-Cost, and Explainable Data Transformation | 2025 | VLDB | 4.1945683e-05 |
| 10,837 | Natural Language to SQL: State of the Art and Open Problems | 2025 | VLDB | 4.1945683e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| 10,658 | LLMLog: Advanced Log Template Generation via LLM-driven Multi-Round Annotation | 2025 | VLDB | 4.1945683e-05 |
| 9,235 | ThriftLLM: On Cost-Effective Selection of Large Language Models for Classification Queries | 2025 | VLDB | 4.3690661e-05 |
| 9,985 | Making Prompts First-Class Citizens for Adaptive LLM Pipelines | 2026 | CIDR | 4.1945683e-05 |
| 7,020 | LLM for Data Management | 2024 | VLDB | 4.8595728e-05 |
| 10,452 | ScaleLLM: A Technique for Scalable LLM-augmented Data Systems | 2025 | SIGMOD | 4.1945683e-05 |
| 10,064 | Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees | 2026 | SIGMOD | 4.1945683e-05 |
| 10,595 | Optimized Batch Prompting for Cost-effective LLMs | 2025 | VLDB | 4.1945683e-05 |
| 11,256 | Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages | 2023 | VLDB | 4.1945683e-05 |
| 10,316 | LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning | 2026 | VLDB | 4.1945683e-05 |
| 10,239 | BRIEF: Bi-level Coreset Selection for Efficient Instruction Tuning in LLMs | 2026 | VLDB | 4.1945683e-05 |