Data Imputation with Limited Data Redundancy Using Data Lakes
Summary: LakeFill uses LLMs and data lakes to impute missing values when intra-table redundancy is low. It encodes incomplete tuples and performs tuple-level retrieval to find candidate tuples in other tables, reranks the candidates with a checklist, and applies a two-stage confidence-aware reasoner, outperforming prior methods.
(summarized by gpt-5-mini on Feb 09 2026)
- Paper ID: 13966
- Venue: VLDB
- Year: 2025
- Pagerank: 4.3341665e-05
- Overall Rank: 9,479 | 34.06%
- DOI: 10.14778/3748191.3748200
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 2 of 2 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 14 of 14 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 192 | HoloClean: Holistic Data Repairs with Probabilistic Inference | 2017 | VLDB | 0.00035728858 |
| 221 | Deep Entity Matching with Pre-Trained Language Models | 2021 | VLDB | 0.00033121824 |
| 513 | TURL: Table Understanding through Representation Learning | 2021 | VLDB | 0.00021288342 |
| 517 | Can Foundation Models Wrangle Your Data? | 2023 | VLDB | 0.00021169035 |
| 1,159 | Towards Certain Fixes with Editing Rules and Master Data | 2010 | VLDB | 0.00013592813 |
| 1,546 | KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing | 2015 | SIGMOD | 0.00011446851 |
| 1,612 | Detecting Data Errors: Where are we and what needs to be done? | 2016 | VLDB | 0.00011142794 |
| 1,894 | Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning | 2020 | VLDB | 0.0001018378 |
| 2,349 | RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation | 2021 | VLDB | 8.9876423e-05 |
| 2,587 | Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks | 2024 | SIGMOD | 8.4924618e-05 |
| 3,662 | The Dawn of Natural Language to SQL: Are We Fully Ready? | 2024 | VLDB | 6.8672143e-05 |
| 3,970 | HAIChart: Human and AI Paired Visualization System | 2024 | VLDB | 6.5784767e-05 |
| 8,268 | Learned Data-aware Image Representations of Line Charts for Similarity Search | 2023 | SIGMOD | 4.5456668e-05 |
| 9,077 | VerifAI: Verified Generative AI | 2024 | CIDR | 4.4010762e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 6,800 | DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models | 2024 | SIGMOD | 4.9231471e-05 |
| 1,116 | Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes | 2024 | VLDB | 0.00013890154 |
| 5,462 | RetClean: Retrieval-Based Data Cleaning Using LLMs and Data Lakes | 2024 | VLDB | 5.494769e-05 |
| 10,812 | TARImpute: Task-Aware auto-Recommender System for Missing Value Imputation Algorithms with Clustering Case Studies | 2025 | VLDB | 4.1945683e-05 |
| 10,953 | Certain and Approximately Certain Models for Statistical Learning | 2024 | SIGMOD | 4.1945683e-05 |
| 3,311 | Efficient and Effective Data Imputation with Influence Functions | 2022 | VLDB | 7.2406486e-05 |
| 2,573 | Query Optimization for Dynamic Imputation | 2017 | VLDB | 8.518235e-05 |
| 5,253 | Enriching Data Imputation with Extensive Similarity Neighbors | 2015 | VLDB | 5.6014916e-05 |
| 9,856 | In-Database Data Imputation | 2024 | SIGMOD | 4.269353e-05 |
| 10,675 | On LLM-Enhanced Mixed-Type Data Imputation with High-Order Message Passing | 2025 | VLDB | 4.1945683e-05 |