RetClean: Retrieval-Based Data Cleaning Using LLMs and Data Lakes
Summary: RetClean uses retrieval-augmented LLMs over indexed data lakes to produce provenance-aware imputations, retrieving the top-k relevant tuples for each dirty record. It supports three modes: cloud LLMs for world knowledge, RAG with data-lake evidence for enterprise tables, and fine-tuned local LLMs for privacy. A GUI exposes these modes so users can explore model choices and their trade-offs. (summarized by gpt-5-mini on Feb 09 2026)
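The retrieval step in the summary (top-k relevant tuples per dirty record) can be illustrated with a minimal sketch. This is not RetClean's implementation: the Jaccard token-overlap score, the function names (`tokens`, `jaccard`, `top_k_tuples`), and the toy data are all assumptions made for illustration, standing in for whatever index and ranking the system actually uses. Returning the row index alongside each tuple is one simple way to keep provenance for the value that is later imputed.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]

def tokens(row: Row, skip: str = "") -> Set[str]:
    """Lowercased word tokens from every non-empty value except column `skip`."""
    out: Set[str] = set()
    for col, val in row.items():
        if col != skip and val:
            out.update(val.lower().split())
    return out

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-overlap similarity in [0, 1]; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def top_k_tuples(dirty: Row, target: str, lake: List[Row],
                 k: int = 2) -> List[Tuple[float, int, Row]]:
    """Rank data-lake tuples by similarity to the dirty record.

    Hypothetical stand-in for an indexed retriever. Only tuples that
    actually carry a value for `target` are considered, since the
    others cannot serve as evidence for the imputation.
    """
    query = tokens(dirty, skip=target)
    scored = []
    for i, cand in enumerate(lake):
        if not cand.get(target):
            continue  # no value to offer for the missing attribute
        score = jaccard(query, tokens(cand, skip=target))
        if score > 0:
            scored.append((score, i, cand))
    # Highest score first; ties broken by position in the lake.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:k]

# Toy data (invented for this sketch).
lake = [
    {"name": "Alice Smith", "city": "Springfield", "employer": "Acme"},
    {"name": "Bob Jones", "city": "Riverton", "employer": "Globex"},
    {"name": "Alice Smith", "city": "Riverton", "employer": ""},
]
dirty = {"name": "Alice Smith", "city": "Springfield", "employer": ""}

hits = top_k_tuples(dirty, target="employer", lake=lake, k=2)
for score, row_id, row in hits:
    print(f"lake row {row_id} (score {score:.2f}) suggests employer={row['employer']}")
```

In a RAG setting, the retrieved tuples and their row identifiers would then be placed in the LLM prompt as evidence, so that the imputed value can be traced back to a specific source tuple.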
Incoming Non-self Citations Over Time
Authors
1. Zan Ahmad Naeem
2. Mohammad Shahmeer Ahmad
3. Mohamed Eltabakh
4. Mourad Ouzzani
5. Nan Tang
Incoming Citations (Sorted by Pagerank)
Showing 3 of 3 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 7,931 | In-depth Analysis of Graph-based RAG in a Unified Framework | 2025 | VLDB | 4.613363e-05 |
| 10,064 | Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees | 2026 | SIGMOD | 4.1945683e-05 |
| 10,828 | Buckaroo: A Direct Manipulation Visual Data Wrangler | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 2 of 2 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 517 | Can Foundation Models Wrangle Your Data? | 2023 | VLDB | 0.00021169035 |
| 1,541 | Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes | 2023 | CIDR | 0.00011456579 |