Data-Driven Domain Discovery for Structured Datasets
Summary: Data-driven domain discovery across heterogeneous tables uses cross-column value co-occurrence to derive context signatures and infer attribute domains. The approach is robust to incomplete or noisy data, scales to millions of terms, and outperforms state-of-the-art methods on real urban datasets, enabling richer queries and data integration. (summarized by gpt-5-nano on Feb 09 2026)
Incoming Non-self Citations Over Time
Authors
1. Masayo Ota
2. Heiko Müller
3. Juliana Freire
4. Divesh Srivastava
Incoming Citations (Sorted by Pagerank)
Showing 7 of 7 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 2,517 | Annotating Columns with Pre-trained Language Models | 2022 | SIGMOD | 8.6092139e-05 |
| 2,836 | Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning | 2023 | VLDB | 8.0443826e-05 |
| 3,000 | SANTOS: Relationship-based Semantic Table Union Search | 2023 | SIGMOD | 7.7462128e-05 |
| 5,099 | ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models | 2024 | VLDB | 5.6997784e-05 |
| 6,894 | TableDC: Deep Clustering for Tabular Data | 2025 | SIGMOD | 4.8925595e-05 |
| 7,643 | Cross Modal Data Discovery over Structured and Unstructured Data Lakes | 2023 | VLDB | 4.6901105e-05 |
| 10,685 | LakeVisage: Towards Scalable, Flexible and Interactive Visualization Recommendation for Data Discovery over Data Lakes | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 6 of 6 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 420 | InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables | 2012 | SIGMOD | 0.00023719065 |
| 610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674 |
| 939 | Data Lake Management: Challenges and Opportunities | 2019 | VLDB | 0.00015187344 |
| 1,178 | Table Union Search on Open Data | 2018 | VLDB | 0.00013468118 |
| 2,209 | Data Integration: After the Teenage Years | 2017 | PODS | 9.2868035e-05 |
| 3,281 | Constance: An Intelligent Data Lake System | 2016 | SIGMOD | 7.2823287e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 902 | Statistical Schema Matching across Web Query Interfaces | 2003 | SIGMOD | 0.00015486247 |
| 6,780 | Domain-Aware Multi-Truth Discovery from Conflicting Sources | 2018 | VLDB | 4.9277708e-05 |
| 1,001 | Recovering Semantics of Tables on the Web | 2011 | VLDB | 0.00014706505 |
| 3,823 | Automatic Discovery of Attributes in Relational Databases | 2011 | SIGMOD | 6.7261168e-05 |
| 1,178 | Table Union Search on Open Data | 2018 | VLDB | 0.00013468118 |
| 3,426 | Discovering Topical Structures of Databases | 2008 | SIGMOD | 7.1063105e-05 |
| 7,643 | Cross Modal Data Discovery over Structured and Unstructured Data Lakes | 2023 | VLDB | 4.6901105e-05 |
| 11,775 | Building Structured Databases of Factual Knowledge from Massive Text Corpora | 2017 | SIGMOD | 4.1945683e-05 |
| 9,549 | Attribute Domain Discovery for Hidden Web Databases | 2011 | SIGMOD | 4.3258142e-05 |
| 12,223 | Schema Clustering and Retrieval for Multi-domain Pay-As-You-Go Data Integration Systems | 2010 | SIGMOD | 4.1945683e-05 |