Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning
Summary: Starmie combines unsupervised contrastive multi-column pretraining of PLM-based column encoders with cosine-similarity-based unionability scoring and a filter-and-verify pipeline, improving table union search quality by 6.8 in MAP and recall over the best prior solutions. It is the first to use HNSW for table union search (3,000× faster than a linear scan, 400× faster than an LSH index).
(summarized by gpt-5-mini on Feb 09 2026)
- Paper ID: 13031
- Venue: VLDB
- Year: 2023
- Pagerank: 8.0443826e-05
- Overall Rank: 2,836 | 80.28%
- DOI: 10.14778/3587136.3587146
Incoming Citations (Sorted by Pagerank)
Showing 24 of 24 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
| 4,859 | Integrating Data Lake Tables | 2023 | VLDB | 5.8732433e-05 |
| 5,658 | Databases Unbound: Querying All of the World's Bytes with AI | 2024 | VLDB | 5.385675e-05 |
| 6,894 | TableDC: Deep Clustering for Tabular Data | 2025 | SIGMOD | 4.8925595e-05 |
| 7,048 | Magneto: Combining Small and Large Language Models for Schema Matching | 2025 | VLDB | 4.8520651e-05 |
| 7,582 | LakeCompass: An End-to-End System for Data Maintenance, Search and Analysis in Data Lakes | 2024 | VLDB | 4.7046388e-05 |
| 8,116 | LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes | 2024 | VLDB | 4.581507e-05 |
| 8,852 | Watchog: A Light-weight Contrastive Learning based Framework for Column Annotation | 2023 | SIGMOD | 4.4356508e-05 |
| 8,910 | R2D2: Reducing Redundancy and Duplication in Data Lakes | 2023 | SIGMOD | 4.427232e-05 |
| 9,961 | QueryArtisan: Generating Data Manipulation Codes for Ad-hoc Analysis in Data Lakes | 2025 | VLDB | 4.2294678e-05 |
| 10,109 | Retrieve-and-Verify: A Table Context Selection Framework for Accurate Column Annotations | 2026 | SIGMOD | 4.1945683e-05 |
| 10,168 | FlowPilot: A Suggestion System for Designing Scientific Workflows | 2026 | SIGMOD | 4.1945683e-05 |
| 10,197 | Qualitative Join Discovery in Data Lakes using Examples | 2026 | SIGMOD | 4.1945683e-05 |
| 10,510 | Table Overlap Estimation through Graph Embeddings | 2025 | SIGMOD | 4.1945683e-05 |
| 10,589 | Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index | 2025 | VLDB | 4.1945683e-05 |
| 10,685 | LakeVisage: Towards Scalable, Flexible and Interactive Visualization Recommendation for Data Discovery over Data Lakes | 2025 | VLDB | 4.1945683e-05 |
| 10,753 | Cents: A Flexible and Cost-Effective Framework for LLM-Based Table Understanding | 2025 | VLDB | 4.1945683e-05 |
| 10,754 | OmniMatch: Joinability Discovery in Data Products | 2025 | VLDB | 4.1945683e-05 |
| 10,823 | TableCopilot: A Table Assistant Empowered by Natural Language Conditional Table Discovery | 2025 | VLDB | 4.1945683e-05 |
| 10,836 | Data Discovery in Data Lakes: Operations, Indexes, Systems | 2025 | VLDB | 4.1945683e-05 |
| 10,842 | ML-Asset Management: Curation, Discovery, and Utilization | 2025 | VLDB | 4.1945683e-05 |
| 10,844 | Panel on Neural Relational Data: Tabular Foundation Models, LLMs... or both? | 2025 | VLDB | 4.1945683e-05 |
| 11,054 | Enriching Relations with Additional Attributes for ER | 2024 | VLDB | 4.1945683e-05 |
| 11,063 | Searching Data Lakes for Nested and Joined Data | 2024 | VLDB | 4.1945683e-05 |
| 11,097 | Navigating Data Repositories: Utilizing Line Charts to Discover Relevant Datasets | 2024 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 27 of 27 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| 34 | Similarity Search in High Dimensions via Hashing | 1999 | VLDB | 0.00076637636 |
| 107 | WebTables: Exploring the Power of Tables on the Web | 2008 | VLDB | 0.00048377684 |
| 221 | Deep Entity Matching with Pre-Trained Language Models | 2021 | VLDB | 0.00033121824 |
| 364 | Annotating and Searching Web Tables Using Entities, Types and Relationships | 2010 | VLDB | 0.00025637562 |
| 420 | InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables | 2012 | SIGMOD | 0.00023719065 |
| 513 | TURL: Table Understanding through Representation Learning | 2021 | VLDB | 0.00021288342 |
| 518 | Data Integration for the Relational Web | 2009 | VLDB | 0.00021158934 |
| 818 | Finding Related Tables | 2012 | SIGMOD | 0.00016311524 |
| 939 | Data Lake Management: Challenges and Opportunities | 2019 | VLDB | 0.00015187344 |
| 1,001 | Recovering Semantics of Tables on the Web | 2011 | VLDB | 0.00014706505 |
| 1,178 | Table Union Search on Open Data | 2018 | VLDB | 0.00013468118 |
| 1,187 | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes | 2019 | SIGMOD | 0.00013443639 |
| 1,644 | Finding Related Tables in Data Lakes for Interactive Data Science | 2020 | SIGMOD | 0.00011041787 |
| 1,751 | Auctus: A Dataset Search Engine for Data Discovery and Augmentation | 2021 | VLDB | 0.00010683295 |
| 1,914 | Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks | 2020 | SIGMOD | 0.00010109102 |
| 2,141 | LSH Ensemble: Internet-Scale Domain Search | 2016 | VLDB | 9.4542625e-05 |
| 2,517 | Annotating Columns with Pre-trained Language Models | 2022 | SIGMOD | 8.6092139e-05 |
| 2,633 | Schema Extraction for Tabular Data on the Web | 2013 | VLDB | 8.4063569e-05 |
| 2,730 | Open Data Integration | 2018 | VLDB | 8.2126735e-05 |
| 2,888 | Sato: Contextual Semantic Type Detection in Tables | 2020 | VLDB | 7.9594996e-05 |
| 3,000 | SANTOS: Relationship-based Semantic Table Union Search | 2023 | SIGMOD | 7.7462128e-05 |
| 3,797 | Stitching Web Tables for Improving Matching Quality | 2017 | VLDB | 6.7597149e-05 |
| 4,801 | CLAMS: Bringing Quality to Data Lakes | 2016 | SIGMOD | 5.9115269e-05 |
| 4,859 | Integrating Data Lake Tables | 2023 | VLDB | 5.8732433e-05 |
| 4,967 | Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation | 2022 | SIGMOD | 5.7956612e-05 |
| 5,529 | Data-Driven Domain Discovery for Structured Datasets | 2020 | VLDB | 5.4566641e-05 |
| 8,193 | WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses | 2023 | CIDR | 4.5618596e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| 1,187 | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes | 2019 | SIGMOD | 0.00013443639 |
| 11,063 | Searching Data Lakes for Nested and Joined Data | 2024 | VLDB | 4.1945683e-05 |
| 8,116 | LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes | 2024 | VLDB | 4.581507e-05 |
| 1,178 | Table Union Search on Open Data | 2018 | VLDB | 0.00013468118 |
| 10,589 | Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index | 2025 | VLDB | 4.1945683e-05 |
| 7,643 | Cross Modal Data Discovery over Structured and Unstructured Data Lakes | 2023 | VLDB | 4.6901105e-05 |
| 7,582 | LakeCompass: An End-to-End System for Data Maintenance, Search and Analysis in Data Lakes | 2024 | VLDB | 4.7046388e-05 |
| 3,335 | DeepJoin: Joinable Table Discovery with Pre-trained Language Models | 2023 | VLDB | 7.2065006e-05 |
| 10,197 | Qualitative Join Discovery in Data Lakes using Examples | 2026 | SIGMOD | 4.1945683e-05 |
| 3,000 | SANTOS: Relationship-based Semantic Table Union Search | 2023 | SIGMOD | 7.7462128e-05 |