Database Paper Browser

Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning

Summary: Starmie pretrains column encoders built on pre-trained language models (PLMs) with unsupervised, contrastive, multi-column learning. It scores unionability by cosine similarity and searches with a filter-and-verify pipeline, improving table union search by 6.8 in MAP and recall. It is the first to use HNSW for table union search, with a 3,000× speedup over linear scan and 400× over LSH. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 13031
Venue: VLDB
Year: 2023
Pagerank: 8.0443826e-05
Overall Rank: 2,836 | 80.28%
DOI: 10.14778/3587136.3587146

Incoming Citations (Sorted by Pagerank)

Showing 24 of 24 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
4,859 | Integrating Data Lake Tables | 2023 | VLDB | 5.8732433e-05
5,658 | Databases Unbound: Querying All of the World's Bytes with AI | 2024 | VLDB | 5.385675e-05
6,894 | TableDC: Deep Clustering for Tabular Data | 2025 | SIGMOD | 4.8925595e-05
7,048 | Magneto: Combining Small and Large Language Models for Schema Matching | 2025 | VLDB | 4.8520651e-05
7,582 | LakeCompass: An End-to-End System for Data Maintenance, Search and Analysis in Data Lakes | 2024 | VLDB | 4.7046388e-05
8,116 | LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes | 2024 | VLDB | 4.581507e-05
8,852 | Watchog: A Light-weight Contrastive Learning based Framework for Column Annotation | 2023 | SIGMOD | 4.4356508e-05
8,910 | R2D2: Reducing Redundancy and Duplication in Data Lakes | 2023 | SIGMOD | 4.427232e-05
9,961 | QueryArtisan: Generating Data Manipulation Codes for Ad-hoc Analysis in Data Lakes | 2025 | VLDB | 4.2294678e-05
10,109 | Retrieve-and-Verify: A Table Context Selection Framework for Accurate Column Annotations | 2026 | SIGMOD | 4.1945683e-05
10,168 | FlowPilot: A Suggestion System for Designing Scientific Workflows | 2026 | SIGMOD | 4.1945683e-05
10,197 | Qualitative Join Discovery in Data Lakes using Examples | 2026 | SIGMOD | 4.1945683e-05
10,510 | Table Overlap Estimation through Graph Embeddings | 2025 | SIGMOD | 4.1945683e-05
10,589 | Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index | 2025 | VLDB | 4.1945683e-05
10,685 | LakeVisage: Towards Scalable, Flexible and Interactive Visualization Recommendation for Data Discovery over Data Lakes | 2025 | VLDB | 4.1945683e-05
10,753 | Cents: A Flexible and Cost-Effective Framework for LLM-Based Table Understanding | 2025 | VLDB | 4.1945683e-05
10,754 | OmniMatch: Joinability Discovery in Data Products | 2025 | VLDB | 4.1945683e-05
10,823 | TableCopilot: A Table Assistant Empowered by Natural Language Conditional Table Discovery | 2025 | VLDB | 4.1945683e-05
10,836 | Data Discovery in Data Lakes: Operations, Indexes, Systems | 2025 | VLDB | 4.1945683e-05
10,842 | ML-Asset Management: Curation, Discovery, and Utilization | 2025 | VLDB | 4.1945683e-05
10,844 | Panel on Neural Relational Data: Tabular Foundation Models, LLMs... or both? | 2025 | VLDB | 4.1945683e-05
11,054 | Enriching Relations with Additional Attributes for ER | 2024 | VLDB | 4.1945683e-05
11,063 | Searching Data Lakes for Nested and Joined Data | 2024 | VLDB | 4.1945683e-05
11,097 | Navigating Data Repositories: Utilizing Line Charts to Discover Relevant Datasets | 2024 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 27 of 27 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
34 | Similarity Search in High Dimensions via Hashing | 1999 | VLDB | 0.00076637636
107 | WebTables: Exploring the Power of Tables on the Web | 2008 | VLDB | 0.00048377684
221 | Deep Entity Matching with Pre-Trained Language Models | 2021 | VLDB | 0.00033121824
364 | Annotating and Searching Web Tables Using Entities, Types and Relationships | 2010 | VLDB | 0.00025637562
420 | InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables | 2012 | SIGMOD | 0.00023719065
513 | TURL: Table Understanding through Representation Learning | 2021 | VLDB | 0.00021288342
518 | Data Integration for the Relational Web | 2009 | VLDB | 0.00021158934
818 | Finding Related Tables | 2012 | SIGMOD | 0.00016311524
939 | Data Lake Management: Challenges and Opportunities | 2019 | VLDB | 0.00015187344
1,001 | Recovering Semantics of Tables on the Web | 2011 | VLDB | 0.00014706505
1,178 | Table Union Search on Open Data | 2018 | VLDB | 0.00013468118
1,187 | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes | 2019 | SIGMOD | 0.00013443639
1,644 | Finding Related Tables in Data Lakes for Interactive Data Science | 2020 | SIGMOD | 0.00011041787
1,751 | Auctus: A Dataset Search Engine for Data Discovery and Augmentation | 2021 | VLDB | 0.00010683295
1,914 | Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks | 2020 | SIGMOD | 0.00010109102
2,141 | LSH Ensemble: Internet-Scale Domain Search | 2016 | VLDB | 9.4542625e-05
2,517 | Annotating Columns with Pre-trained Language Models | 2022 | SIGMOD | 8.6092139e-05
2,633 | Schema Extraction for Tabular Data on the Web | 2013 | VLDB | 8.4063569e-05
2,730 | Open Data Integration | 2018 | VLDB | 8.2126735e-05
2,888 | Sato: Contextual Semantic Type Detection in Tables | 2020 | VLDB | 7.9594996e-05
3,000 | SANTOS: Relationship-based Semantic Table Union Search | 2023 | SIGMOD | 7.7462128e-05
3,797 | Stitching Web Tables for Improving Matching Quality | 2017 | VLDB | 6.7597149e-05
4,801 | CLAMS: Bringing Quality to Data Lakes | 2016 | SIGMOD | 5.9115269e-05
4,859 | Integrating Data Lake Tables | 2023 | VLDB | 5.8732433e-05
4,967 | Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation | 2022 | SIGMOD | 5.7956612e-05
5,529 | Data-Driven Domain Discovery for Structured Datasets | 2020 | VLDB | 5.4566641e-05
8,193 | WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses | 2023 | CIDR | 4.5618596e-05

Semantically Similar Papers