Can Foundation Models Wrangle Your Data?

Summary: Demonstrates that large foundation models, prompted without task-specific fine-tuning, can generalize to five classical data cleaning and integration tasks and achieve state-of-the-art performance. Identifies limits on private and domain-specific data and integration challenges for data management (DM) systems. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 13326
Venue: VLDB
Year: 2023
Pagerank: 0.00021169035
Overall Rank: 517 | 96.41%
DOI: 10.14778/3574245.3574258

Incoming Citations (Sorted by Pagerank)

Showing 5 of 55 citing papers.

Rank	Citing Paper	Year	Venue	Pagerank
10,973	Unstructured Data Fusion for Schema and Data Extraction	2024	SIGMOD	4.1945683e-05
11,047	Blocker and Matcher Can Mutually Benefit: A Co-Learning Framework for Low-Resource Entity Resolution	2024	VLDB	4.1945683e-05
11,054	Enriching Relations with Additional Attributes for ER	2024	VLDB	4.1945683e-05
11,137	Generalizable Data Cleaning of Tabular Data in Latent Space	2024	VLDB	4.1945683e-05
11,297	DataRinse: Semantic Transforms for Data preparation based on Code Mining	2023	VLDB	4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 15 of 15 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
192	HoloClean: Holistic Data Repairs with Probabilistic Inference	2017	VLDB	0.00035728858
221	Deep Entity Matching with Pre-Trained Language Models	2021	VLDB	0.00033121824
263	CrowdER: Crowdsourcing Entity Resolution	2012	VLDB	0.00029862413
300	Deep Learning for Entity Matching: A Design Space Exploration	2018	SIGMOD	0.00028441466
489	Data Curation at Scale: The Data Tamer System	2013	CIDR	0.00022030728
513	TURL: Table Understanding through Representation Learning	2021	VLDB	0.00021288342
643	Corleone: Hands-Off Crowdsourcing for Entity Matching	2014	SIGMOD	0.00018754451
656	ERACER: A Database Approach for Statistical Inference and Data Cleaning	2010	SIGMOD	0.00018588729
1,012	NADEEF: A Commodity Data Cleaning System	2013	SIGMOD	0.0001464733
1,337	HoloDetect: Few-Shot Learning for Error Detection	2019	SIGMOD	0.00012497164
1,546	KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing	2015	SIGMOD	0.00011446851
1,612	Detecting Data Errors: Where are we and what needs to be done?	2016	VLDB	0.00011142794
2,018	Statistical Distortion: Consequences of Data Cleaning	2012	VLDB	9.7764643e-05
3,478	Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations	2018	VLDB	7.054159e-05
4,464	Magellan: Toward Building Entity Matching Management Systems over Data Science Stacks	2016	VLDB	6.1606042e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
10,595	Optimized Batch Prompting for Cost-effective LLMs	2025	VLDB	4.1945683e-05
1,116	Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes	2024	VLDB	0.00013890154
10,973	Unstructured Data Fusion for Schema and Data Extraction	2024	SIGMOD	4.1945683e-05
7,026	Mind the Data Gap: Bridging LLMs to Enterprise Data Integration	2025	CIDR	4.8570811e-05
7,020	LLM for Data Management	2024	VLDB	4.8595728e-05
3,840	Revisiting Prompt Engineering via Declarative Crowdsourcing	2024	CIDR	6.7106924e-05
9,515	Automating the Enterprise with Foundation Models	2024	VLDB	4.3335877e-05
3,015	Chorus: Foundation Models for Unified Data Discovery and Exploration	2024	VLDB	7.7092391e-05
11,317	Data Management Opportunities for Foundation Models	2022	CIDR	4.1945683e-05
8,847	Towards Foundation Database Models	2025	CIDR	4.4371897e-05