AutoDDG: Automated Dataset Description Generation using Large Language Models

Summary: AutoDDG generates descriptions for tabular datasets by combining data-driven content summarization with LLM-based semantic enrichment, targeting missing or inaccurate metadata in data lakes and open data portals. The paper proposes a multi-faceted evaluation (retrieval-based, reference-based, reference-free, and human) and reports improved dataset search and retrieval at scale. (summarized by gpt-5-mini on Apr 11 2026)

Paper ID: 7453
Venue: SIGMOD
Year: 2026
Pagerank: 4.1905499e-05
Overall Rank: 10,142 | 29.52%
DOI: 10.1145/3786626

Incoming Non-self Citations Over Time

No non-self incoming citations found for this paper in this database.

Incoming Citations (Sorted by Pagerank)

Showing 0 of 0 citing papers.

Outgoing Citations (Sorted by Pagerank)

Showing 12 of 12 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
514	TURL: Table Understanding through Representation Learning	2021	VLDB	0.00021280726
936	Data Lake Management: Challenges and Opportunities	2019	VLDB	0.00015197838
1,185	JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes	2019	SIGMOD	0.00013432692
1,742	Auctus: A Dataset Search Engine for Data Discovery and Augmentation	2021	VLDB	0.00010695388
2,513	Annotating Columns with Pre-trained Language Models	2022	SIGMOD	8.6155767e-05
2,895	Sato: Contextual Semantic Type Detection in Tables	2020	VLDB	7.9539265e-05
3,001	SANTOS: Relationship-based Semantic Table Union Search	2023	SIGMOD	7.739698e-05
3,003	Chorus: Foundation Models for Unified Data Discovery and Exploration	2024	VLDB	7.7358219e-05
3,520	GitTables: A Large-Scale Corpus of Relational Tables	2023	SIGMOD	7.0136102e-05
3,827	Correlation Sketches for Approximate Join-Correlation Queries	2021	SIGMOD	6.7195959e-05
5,098	ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models	2024	VLDB	5.6943033e-05
5,756	Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System	2025	SIGMOD	5.3387063e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
10,285	Relational Deep Dive: Error-Aware Queries Over Unstructured Data	2026	VLDB	4.1905499e-05
5,538	Data-Driven Domain Discovery for Structured Datasets	2020	VLDB	5.4520759e-05
11,394	Automated Relational Data Explanation using External Semantic Knowledge	2022	VLDB	4.1905499e-05
13,112	Demonstrating CatDB: LLM-based Generation of Data-centric ML Pipelines	2025	SIGMOD	-
10,636	CatDB: Data-catalog-guided, LLM-based Generation of Data-centric ML Pipelines	2025	VLDB	4.1905499e-05
9,478	Adda: Towards Efficient in-Database Feature Generation via LLM-based Agents	2025	SIGMOD	4.3300131e-05
10,328	LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning	2026	VLDB	4.1905499e-05
8,157	Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study	2024	SIGMOD	4.5701714e-05
10,341	Revisiting Task-Oriented Dataset Search in the Era of Large Language Models: Challenges, Benchmark, and Solution	2026	VLDB	4.1905499e-05
10,976	Unstructured Data Fusion for Schema and Data Extraction	2024	SIGMOD	4.1905499e-05