DeepJoin: Joinable Table Discovery with Pre-trained Language Models
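Summary: DeepJoin fine-tunes a pre-trained language model (PLM) to embed textual columns into fixed-length vectors, handling both equi-joins and semantic joins in one framework. Retrieval uses approximate nearest neighbor (ANN) search, which is sublinear in repository size and can be GPU-accelerated; training relies on data augmentation. DeepJoin is more precise than prior approximate methods and, when evaluated against expert labels, more accurate than an exact semantic-join solution.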
(summarized by gpt-5-mini on Feb 09 2026)
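The retrieval idea in the summary (embed each column as a fixed-length vector, then search for nearby vectors) can be illustrated with a minimal sketch. This is not DeepJoin's model: `embed_column` is a hypothetical hashing-based stand-in for the fine-tuned PLM, `top_k` uses brute-force cosine similarity where a real system would use an ANN index, and `DIM`, `repository`, and the sample columns are made up for illustration.

```python
import hashlib
import math

DIM = 64  # fixed embedding length (illustrative choice, not from the paper)


def embed_column(values, dim=DIM):
    """Toy stand-in for a fine-tuned PLM (assumption): hash each cell value
    into one of `dim` buckets, count hits, then L2-normalize the vector."""
    vec = [0.0] * dim
    for value in values:
        digest = hashlib.sha256(value.lower().encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity of two already-normalized vectors (a dot product)."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_values, repository, k=2):
    """Brute-force search over all columns; an ANN index would replace
    this linear scan to get sublinear search time."""
    q = embed_column(query_values)
    scored = [(cosine(q, embed_column(vals)), name)
              for name, vals in repository.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical repository of named columns.
repository = {
    "cities": ["Paris", "Berlin", "Tokyo", "Lima"],
    "capitals": ["paris", "berlin", "madrid", "rome"],
    "fruits": ["apple", "pear", "plum", "fig"],
}

ranked = top_k(["Paris", "Berlin", "Tokyo"], repository, k=2)
print(ranked)
```

Because every column maps to a vector of length `DIM`, the cost of comparing two columns does not depend on how many cells they contain, which is the property the summary attributes to fixed-length embeddings. The toy hashing embedding only captures exact value overlap; capturing semantic similarity is what the fine-tuned PLM is for.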
- Paper ID: 13094
- Venue: VLDB
- Year: 2023
- Pagerank: 7.2065006e-05
- Overall Rank: 3,335 | 76.81%
- DOI: 10.14778/3603581.3603587
Incoming Non-self Citations Over Time (chart not included)
Incoming Citations (Sorted by Pagerank)
Showing 14 of 14 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 1,963 | DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing | 2025 | VLDB | 9.929429e-05 |
| 3,015 | Chorus: Foundation Models for Unified Data Discovery and Exploration | 2024 | VLDB | 7.7092391e-05 |
| 7,048 | Magneto: Combining Small and Large Language Models for Schema Matching | 2025 | VLDB | 4.8520651e-05 |
| 8,116 | LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes | 2024 | VLDB | 4.581507e-05 |
| 8,783 | GEqO: ML-Accelerated Semantic Equivalence Detection | 2023 | SIGMOD | 4.452825e-05 |
| 10,109 | Retrieve-and-Verify: A Table Context Selection Framework for Accurate Column Annotations | 2026 | SIGMOD | 4.1945683e-05 |
| 10,197 | Qualitative Join Discovery in Data Lakes using Examples | 2026 | SIGMOD | 4.1945683e-05 |
| 10,510 | Table Overlap Estimation through Graph Embeddings | 2025 | SIGMOD | 4.1945683e-05 |
| 10,589 | Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index | 2025 | VLDB | 4.1945683e-05 |
| 10,685 | LakeVisage: Towards Scalable, Flexible and Interactive Visualization Recommendation for Data Discovery over Data Lakes | 2025 | VLDB | 4.1945683e-05 |
| 10,754 | OmniMatch: Joinability Discovery in Data Products | 2025 | VLDB | 4.1945683e-05 |
| 10,823 | TableCopilot: A Table Assistant Empowered by Natural Language Conditional Table Discovery | 2025 | VLDB | 4.1945683e-05 |
| 10,829 | Sort it Like You Mean It: Discovering Semantically Interesting Attribute Augmentations to Sort Tables | 2025 | VLDB | 4.1945683e-05 |
| 11,063 | Searching Data Lakes for Nested and Joined Data | 2024 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 22 of 22 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 221 | Deep Entity Matching with Pre-Trained Language Models | 2021 | VLDB | 0.00033121824 |
| 513 | TURL: Table Understanding through Representation Learning | 2021 | VLDB | 0.00021288342 |
| 517 | Can Foundation Models Wrangle Your Data? | 2023 | VLDB | 0.00021169035 |
| 1,178 | Table Union Search on Open Data | 2018 | VLDB | 0.00013468118 |
| 1,187 | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes | 2019 | SIGMOD | 0.00013443639 |
| 1,463 | ARDA: Automatic Relational Data Augmentation for Machine Learning | 2020 | VLDB | 0.00011869295 |
| 1,644 | Finding Related Tables in Data Lakes for Interactive Data Science | 2020 | SIGMOD | 0.00011041787 |
| 2,141 | LSH Ensemble: Internet-Scale Domain Search | 2016 | VLDB | 9.4542625e-05 |
| 2,349 | RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation | 2021 | VLDB | 8.9876423e-05 |
| 2,517 | Annotating Columns with Pre-trained Language Models | 2022 | SIGMOD | 8.6092139e-05 |
| 2,888 | Sato: Contextual Semantic Type Detection in Tables | 2020 | VLDB | 7.9594996e-05 |
| 3,358 | Organizing Data Lakes for Navigation | 2020 | SIGMOD | 7.1784949e-05 |
| 3,735 | Auto-Join: Joining Tables by Leveraging Transformations | 2017 | VLDB | 6.8061318e-05 |
| 3,942 | Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins | 2022 | VLDB | 6.6114622e-05 |
| 4,278 | Similarity Query Processing for High-Dimensional Data | 2020 | VLDB | 6.2953764e-05 |
| 4,850 | SEMA-JOIN: Joining Semantically-Related Tables Using Big Table Corpora | 2015 | VLDB | 5.8768452e-05 |
| 4,859 | Integrating Data Lake Tables | 2023 | VLDB | 5.8732433e-05 |
| 4,985 | Pivot-based Metric Indexing | 2017 | VLDB | 5.7856648e-05 |
| 5,096 | Auto-Transform: Learning-to-Transform by Patterns | 2020 | VLDB | 5.7011825e-05 |
| 5,179 | SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints | 2017 | VLDB | 5.6428428e-05 |
| 5,794 | Discovering Related Data At Scale | 2021 | VLDB | 5.3245122e-05 |
| 7,838 | Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes | 2021 | SIGMOD | 4.6377995e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 3,635 | A Deep Dive into Deep Learning Approaches for Text-to-SQL Systems | 2021 | SIGMOD | 6.8981006e-05 |
| 3,735 | Auto-Join: Joining Tables by Leveraging Transformations | 2017 | VLDB | 6.8061318e-05 |
| 11,063 | Searching Data Lakes for Nested and Joined Data | 2024 | VLDB | 4.1945683e-05 |
| 10,589 | Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index | 2025 | VLDB | 4.1945683e-05 |
| 8,913 | Making Table Understanding Work in Practice | 2022 | CIDR | 4.427232e-05 |
| 2,836 | Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning | 2023 | VLDB | 8.0443826e-05 |
| 10,197 | Qualitative Join Discovery in Data Lakes using Examples | 2026 | SIGMOD | 4.1945683e-05 |
| 1,914 | Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks | 2020 | SIGMOD | 0.00010109102 |
| 8,899 | Fast Approximate Similarity Join in Vector Databases | 2025 | SIGMOD | 4.427232e-05 |
| 6,800 | DTT: An Example-Driven Tabular Transformer for Joinability by Leveraging Large Language Models | 2024 | SIGMOD | 4.9231471e-05 |