Database Paper Browser

Can Foundation Models Wrangle Your Data?

Summary: Demonstrates that large foundation models, prompted without task-specific fine-tuning, generalize to five classical data cleaning and integration tasks and achieve state-of-the-art performance on them. Also identifies limitations on private and domain-specific data, and challenges in integrating these models into data management systems. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 13326
Venue: VLDB
Year: 2023
Pagerank: 0.00021169035
Overall Rank: 517 | 96.41%
DOI: 10.14778/3574245.3574258

Incoming Citations (Sorted by Pagerank)

Showing 50 of 55 citing papers.

Rank Citing Paper Year Venue Pagerank
1,116 Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes 2024 VLDB 0.00013890154
1,541 Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes 2023 CIDR 0.00011456579
2,587 Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks 2024 SIGMOD 8.4924618e-05
3,015 Chorus: Foundation Models for Unified Data Discovery and Exploration 2024 VLDB 7.7092391e-05
3,114 GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization 2024 VLDB 7.5451724e-05
3,335 DeepJoin: Joinable Table Discovery with Pre-trained Language Models 2023 VLDB 7.2065006e-05
3,508 spade: Synthesizing Data Quality Assertions for Large Language Model Pipelines 2024 VLDB 7.0271496e-05
3,840 Revisiting Prompt Engineering via Declarative Crowdsourcing 2024 CIDR 6.7106924e-05
3,876 The Design of an LLM-powered Unstructured Analytics System 2025 CIDR 6.6741456e-05
3,995 How Large Language Models Will Disrupt Data Management 2023 VLDB 6.5513237e-05
4,212 Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration 2023 SIGMOD 6.3555142e-05
4,535 Hybrid Querying Over Relational Databases and Large Language Models 2025 CIDR 6.1049669e-05
5,023 GenRewrite: Query Rewriting via Large Language Models 2026 SIGMOD 5.75363e-05
5,099 ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models 2024 VLDB 5.6997784e-05
5,462 RetClean: Retrieval-Based Data Cleaning Using LLMs and Data Lakes 2024 VLDB 5.494769e-05
5,509 Can Large Language Models Predict Data Correlations from Column Names? 2023 VLDB 5.4703368e-05
5,928 SchemaPile: A Large Collection of Relational Database Schemas 2024 SIGMOD 5.2685946e-05
6,077 The Fast and the Private: Task-based Dataset Search 2024 CIDR 5.2229324e-05
6,092 Observatory: Characterizing Embeddings of Relational Tables 2024 VLDB 5.2138566e-05
6,553 How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses 2024 VLDB 5.0157344e-05
7,026 Mind the Data Gap: Bridging LLMs to Enterprise Data Integration 2025 CIDR 4.8570811e-05
7,048 Magneto: Combining Small and Large Language Models for Schema Matching 2025 VLDB 4.8520651e-05
7,152 Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity 2024 VLDB 4.8154191e-05
8,052 Generating Succinct Descriptions of Database Schemata for Cost-Efficient Prompting of Large Language Models 2024 VLDB 4.5953106e-05
8,204 ELEET: Efficient Learned Query Execution over Text and Tables 2024 VLDB 4.5594273e-05
8,207 SQLStorm: Taking Database Benchmarking into the LLM Era 2025 VLDB 4.5583637e-05
8,208 SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions 2024 CIDR 4.5581306e-05
8,257 Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines 2023 SIGMOD 4.5487511e-05
8,683 FormaT5: Abstention and Examples for Conditional Table Formatting with Natural Language 2024 VLDB 4.4686885e-05
8,736 Unveiling Challenges for LLMs in Enterprise Data Engineering 2026 VLDB 4.456315e-05
8,847 Towards Foundation Database Models 2025 CIDR 4.4371897e-05
9,348 GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models 2024 SIGMOD 4.3526427e-05
9,389 DataVinci: Learning Syntactic and Semantic String Repairs 2025 SIGMOD 4.3441378e-05
9,476 Adda: Towards Efficient in-Database Feature Generation via LLM-based Agents 2025 SIGMOD 4.3341665e-05
9,479 Data Imputation with Limited Data Redundancy Using Data Lakes 2025 VLDB 4.3341665e-05
9,492 Lingua Manga: A Generic Large Language Model Centric System for Data Curation 2023 VLDB 4.3341665e-05
9,515 Automating the Enterprise with Foundation Models 2024 VLDB 4.3335877e-05
10,022 In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration 2026 SIGMOD 4.1945683e-05
10,064 Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees 2026 SIGMOD 4.1945683e-05
10,443 LLM-Matcher: A Name-Based Schema Matching Tool using Large Language Models 2025 SIGMOD 4.1945683e-05
10,465 A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces 2025 SIGMOD 4.1945683e-05
10,512 Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables 2025 SIGMOD 4.1945683e-05
10,595 Optimized Batch Prompting for Cost-effective LLMs 2025 VLDB 4.1945683e-05
10,598 Auto-Prep: Holistic Prediction of Data Preparation Steps for Self-Service Business Intelligence 2025 VLDB 4.1945683e-05
10,610 Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy, Low-Cost, and Explainable Data Transformation 2025 VLDB 4.1945683e-05
10,617 Deduplicated Sampling On-Demand 2025 VLDB 4.1945683e-05
10,628 CatDB: Data-catalog-guided, LLM-based Generation of Data-centric ML Pipelines 2025 VLDB 4.1945683e-05
10,675 On LLM-Enhanced Mixed-Type Data Imputation with High-Order Message Passing 2025 VLDB 4.1945683e-05
10,753 Cents: A Flexible and Cost-Effective Framework for LLM-Based Table Understanding 2025 VLDB 4.1945683e-05
10,835 Large Language Models for Spatial Analysis Queries 2025 VLDB 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 15 of 15 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers