Database Paper Browser

Back to papers

DeepJoin: Joinable Table Discovery with Pre-trained Language Models

Summary: DeepJoin fine-tunes a PLM to embed textual columns into fixed-length vectors, unifying equi- and semantic join discovery. With ANN for sublinear, GPU-accelerated retrieval and data augmentation, DeepJoin outperforms prior approximate methods and beats expert-labeled semantic exact joins. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID
13094
Venue
VLDB
Year
2023
Pagerank
7.2065006e-05
Overall Rank
3,335 | 76.81%
DOI
10.14778/3603581.3603587

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 14 of 14 citing papers.

Rank Citing Paper Year Venue Pagerank
1,963 DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing 2025 VLDB 9.929429e-05
3,015 Chorus: Foundation Models for Unified Data Discovery and Exploration 2024 VLDB 7.7092391e-05
7,048 Magneto: Combining Small and Large Language Models for Schema Matching 2025 VLDB 4.8520651e-05
8,116 LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes 2024 VLDB 4.581507e-05
8,783 GEqO: ML-Accelerated Semantic Equivalence Detection 2023 SIGMOD 4.452825e-05
10,109 Retrieve-and-Verify: A Table Context Selection Framework for Accurate Column Annotations 2026 SIGMOD 4.1945683e-05
10,197 Qualitative Join Discovery in Data Lakes using Examples 2026 SIGMOD 4.1945683e-05
10,510 Table Overlap Estimation through Graph Embeddings 2025 SIGMOD 4.1945683e-05
10,589 Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index 2025 VLDB 4.1945683e-05
10,685 LakeVisage: Towards Scalable, Flexible and Interactive Visualization Recommendation for Data Discovery over Data Lakes 2025 VLDB 4.1945683e-05
10,754 OmniMatch: Joinability Discovery in Data Products 2025 VLDB 4.1945683e-05
10,823 TableCopilot: A Table Assistant Empowered by Natural Language Conditional Table Discovery 2025 VLDB 4.1945683e-05
10,829 Sort it Like You Mean It: Discovering Semantically Interesting Attribute Augmentations to Sort Tables 2025 VLDB 4.1945683e-05
11,063 Searching Data Lakes for Nested and Joined Data 2024 VLDB 4.1945683e-05
Previous Page 1 / 1 Next

Outgoing Citations (Sorted by Pagerank)

Showing 22 of 22 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank Cited Paper Year Venue Pagerank
221 Deep Entity Matching with Pre-Trained Language Models 2021 VLDB 0.00033121824
513 TURL: Table Understanding through Representation Learning 2021 VLDB 0.00021288342
517 Can Foundation Models Wrangle Your Data? 2023 VLDB 0.00021169035
1,178 Table Union Search on Open Data 2018 VLDB 0.00013468118
1,187 JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes 2019 SIGMOD 0.00013443639
1,463 ARDA: Automatic Relational Data Augmentation for Machine Learning 2020 VLDB 0.00011869295
1,644 Finding Related Tables in Data Lakes for Interactive Data Science 2020 SIGMOD 0.00011041787
2,141 LSH Ensemble: Internet-Scale Domain Search 2016 VLDB 9.4542625e-05
2,349 RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation 2021 VLDB 8.9876423e-05
2,517 Annotating Columns with Pre-trained Language Models 2022 SIGMOD 8.6092139e-05
2,888 Sato: Contextual Semantic Type Detection in Tables 2020 VLDB 7.9594996e-05
3,358 Organizing Data Lakes for Navigation 2020 SIGMOD 7.1784949e-05
3,735 Auto-Join: Joining Tables by Leveraging Transformations 2017 VLDB 6.8061318e-05
3,942 Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins 2022 VLDB 6.6114622e-05
4,278 Similarity Query Processing for High-Dimensional Data 2020 VLDB 6.2953764e-05
4,850 SEMA-JOIN: Joining Semantically-Related Tables Using Big Table Corpora 2015 VLDB 5.8768452e-05
4,859 Integrating Data Lake Tables 2023 VLDB 5.8732433e-05
4,985 Pivot-based Metric Indexing 2017 VLDB 5.7856648e-05
5,096 Auto-Transform: Learning-to-Transform by Patterns 2020 VLDB 5.7011825e-05
5,179 SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints 2017 VLDB 5.6428428e-05
5,794 Discovering Related Data At Scale 2021 VLDB 5.3245122e-05
7,838 Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes 2021 SIGMOD 4.6377995e-05
Previous Page 1 / 1 Next

Semantically Similar Papers