Chorus: Foundation Models for Unified Data Discovery and Exploration

Summary: Applies foundation models to unify data discovery and exploration, outperforming task-specific models and often human experts on three tasks: table-class detection, column-type annotation, and join-column prediction. Evaluates cross-model generalizability and nondeterminism, and argues for unifying data-management tasks under foundation models. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 13443
Venue: VLDB
Year: 2024
Pagerank: 7.7358219e-05
Overall Rank: 3,003 | 79.14%
DOI: 10.14778/3659437.3659461

Incoming Citations (Sorted by Pagerank)

Showing 19 of 19 citing papers.

Rank	Citing Paper	Year	Venue	Pagerank
1,839	DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing	2025	VLDB	0.00010351287
2,585	Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks	2024	SIGMOD	8.4909917e-05
3,639	The Design of an LLM-powered Unstructured Analytics System	2025	CIDR	6.8886648e-05
5,098	ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models	2024	VLDB	5.6943033e-05
5,429	Logical and Physical Optimizations for SQL Query Execution over Large Language Models	2025	SIGMOD	5.511638e-05
5,506	Can Large Language Models Predict Data Correlations from Column Names?	2023	VLDB	5.4711611e-05
5,929	SchemaPile: A Large Collection of Relational Database Schemas	2024	SIGMOD	5.2635297e-05
6,071	Observatory: Characterizing Embeddings of Relational Tables	2024	VLDB	5.2231739e-05
7,027	Mind the Data Gap: Bridging LLMs to Enterprise Data Integration	2025	CIDR	4.8524216e-05
7,045	Magneto: Combining Small and Large Language Models for Schema Matching	2025	VLDB	4.8474104e-05
8,732	Unveiling Challenges for LLMs in Enterprise Data Engineering	2026	VLDB	4.4520434e-05
9,516	Automating the Enterprise with Foundation Models	2024	VLDB	4.3294347e-05
10,142	AutoDDG: Automated Dataset Description Generation using Large Language Models	2026	SIGMOD	4.1905499e-05
10,475	A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces	2025	SIGMOD	4.1905499e-05
10,512	Self-Enhancing Video Data Management System for Compositional Events with Large Language Models	2025	SIGMOD	4.1905499e-05
10,603	Optimized Batch Prompting for Cost-effective LLMs	2025	VLDB	4.1905499e-05
10,618	Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy, Low-Cost, and Explainable Data Transformation	2025	VLDB	4.1905499e-05
10,759	Cents: A Flexible and Cost-Effective Framework for LLM-Based Table Understanding	2025	VLDB	4.1905499e-05
10,864	Exploring Exploratory Querying	2025	VLDB	4.1905499e-05

Outgoing Citations (Sorted by Pagerank)

Showing 18 of 18 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
31	Provenance Semirings	2007	PODS	0.00078516827
108	WebTables: Exploring the Power of Tables on the Web	2008	VLDB	0.00048345996
481	Mining Database Structure; Or, How to Build a Data Quality Browser	2002	SIGMOD	0.000221538
514	TURL: Table Understanding through Representation Learning	2021	VLDB	0.00021280726
516	Can Foundation Models Wrangle Your Data?	2023	VLDB	0.00021194444
936	Data Lake Management: Challenges and Opportunities	2019	VLDB	0.00015197838
1,088	Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes	2024	VLDB	0.00014158762
1,185	JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes	2019	SIGMOD	0.00013432692
1,742	Auctus: A Dataset Search Engine for Data Discovery and Augmentation	2021	VLDB	0.00010695388
2,513	Annotating Columns with Pre-trained Language Models	2022	SIGMOD	8.6155767e-05
2,895	Sato: Contextual Semantic Type Detection in Tables	2020	VLDB	7.9539265e-05
3,001	SANTOS: Relationship-based Semantic Table Union Search	2023	SIGMOD	7.739698e-05
3,254	Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks	2020	SIGMOD	7.3172324e-05
3,326	DeepJoin: Joinable Table Discovery with Pre-trained Language Models	2023	VLDB	7.2148323e-05
3,520	GitTables: A Large-Scale Corpus of Relational Tables	2023	SIGMOD	7.0136102e-05
4,105	Extracting Databases from Dark Data with DeepDive	2016	SIGMOD	6.4409563e-05
5,495	Fast Foreign-Key Detection in Microsoft SQL Server PowerPivot for Excel	2014	VLDB	5.4770205e-05
6,891	Towards NLP-Enhanced Data Profiling Tools	2022	CIDR	4.8891398e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
10,022	In-context Clustering-based Entity Resolution with Large Language Models: A Design Space Exploration	2026	SIGMOD	4.1905499e-05
1,505	Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes	2023	CIDR	0.00011601232
10,828	TableCopilot: A Table Assistant Empowered by Natural Language Conditional Table Discovery	2025	VLDB	4.1905499e-05
3,982	How Large Language Models Will Disrupt Data Management	2023	VLDB	6.5595332e-05
6,798	DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models	2024	SIGMOD	4.9186164e-05
11,319	Data Management Opportunities for Foundation Models	2022	CIDR	4.1905499e-05
10,976	Unstructured Data Fusion for Schema and Data Extraction	2024	SIGMOD	4.1905499e-05
5,506	Can Large Language Models Predict Data Correlations from Column Names?	2023	VLDB	5.4711611e-05
8,847	Towards Foundation Database Models	2025	CIDR	4.4329366e-05
516	Can Foundation Models Wrangle Your Data?	2023	VLDB	0.00021194444