Database Paper Browser

JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes

Summary: JOSIE reframes joinable-table discovery in data lakes as an overlap set similarity search: each column is treated as a set of values, and the search returns the columns whose sets have the largest intersection with the query column. An exact, cost-aware algorithm minimizes the cost of reading sets and probing the inverted index, outperforming prior overlap search methods on large data lakes. (summarized by gpt-5-nano on Feb 09 2026)
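
To make the problem statement concrete, here is a minimal sketch of overlap set similarity search over an inverted index. It illustrates only the basic formulation (rank columns by the size of their intersection with a query column) and is not JOSIE's cost-aware algorithm. The function names and the toy columns are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict
import heapq


def build_inverted_index(columns):
    """Map each distinct value to the set of column names that contain it."""
    index = defaultdict(set)
    for name, values in columns.items():
        for value in set(values):
            index[value].add(name)
    return index


def top_k_overlap(query, index, k):
    """Return up to k (column, overlap) pairs with the largest |query ∩ column|."""
    overlap = Counter()
    for value in set(query):
        # Each posting-list hit contributes one shared value to that column.
        for name in index.get(value, ()):
            overlap[name] += 1
    # Largest overlap first; ties are broken by column name for determinism.
    return heapq.nsmallest(k, overlap.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data lake: column name -> cell values.
columns = {
    "cities.name": ["Toronto", "Boston", "Paris", "Lima"],
    "offices.city": ["Boston", "Paris", "Berlin"],
    "fruits.name": ["apple", "pear"],
}
index = build_inverted_index(columns)
result = top_k_overlap(["Paris", "Boston", "Lima", "Oslo"], index, k=2)
print(result)  # [('cities.name', 3), ('offices.city', 2)]
```

This baseline reads every posting list for the query's values. Per the summary above, the paper's contribution is an exact algorithm that uses a cost model to reduce such reads and index probes.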

Paper ID: 5596
Venue: SIGMOD
Year: 2019
Pagerank: 0.00013443639
Overall Rank: 1,187 | 91.75%
DOI: 10.1145/3299869.3300065

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 50 of 51 citing papers.

Rank Citing Paper Year Venue Pagerank
939 Data Lake Management: Challenges and Opportunities 2019 VLDB 0.00015187344
1,644 Finding Related Tables in Data Lakes for Interactive Data Science 2020 SIGMOD 0.00011041787
2,836 Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning 2023 VLDB 8.0443826e-05
3,015 Chorus: Foundation Models for Unified Data Discovery and Exploration 2024 VLDB 7.7092391e-05
3,335 DeepJoin: Joinable Table Discovery with Pre-trained Language Models 2023 VLDB 7.2065006e-05
3,358 Organizing Data Lakes for Navigation 2020 SIGMOD 7.1784949e-05
3,824 Correlation Sketches for Approximate Join-Correlation Queries 2021 SIGMOD 6.7260705e-05
4,540 Automating Exploratory Data Analysis via Machine Learning: An Overview 2020 SIGMOD 6.1033443e-05
4,859 Integrating Data Lake Tables 2023 VLDB 5.8732433e-05
5,024 Towards Distribution-aware Query Answering in Data Markets 2022 VLDB 5.7535043e-05
5,280 Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V 2023 VLDB 5.5896735e-05
5,658 Databases Unbound: Querying All of the World's Bytes with AI 2024 VLDB 5.385675e-05
5,691 Putting Things into Context: Rich Explanations for Query Answers using Join Graphs 2021 SIGMOD 5.3684557e-05
5,794 Discovering Related Data At Scale 2021 VLDB 5.3245122e-05
5,952 Eraser: Eliminating Performance Regression on Learned Query Optimizer 2024 VLDB 5.2591691e-05
5,976 Responsible Data Integration: Next-generation Challenges 2022 SIGMOD 5.245976e-05
6,092 Observatory: Characterizing Embeddings of Relational Tables 2024 VLDB 5.2138566e-05
6,270 MATE: Multi-Attribute Table Extraction 2022 VLDB 5.1337451e-05
6,449 Causal Data Integration 2023 VLDB 5.0587746e-05
7,582 LakeCompass: An End-to-End System for Data Maintenance, Search and Analysis in Data Lakes 2024 VLDB 4.7046388e-05
7,643 Cross Modal Data Discovery over Structured and Unstructured Data Lakes 2023 VLDB 4.6901105e-05
8,116 LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes 2024 VLDB 4.581507e-05
8,193 WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses 2023 CIDR 4.5618596e-05
8,503 A Demonstration of KGLac: A Data Discovery and Enrichment Platform for Data Science 2021 VLDB 4.496339e-05
8,618 Nexus: Correlation Discovery over Collections of Spatio-Temporal Tabular Data 2024 SIGMOD 4.4838259e-05
8,696 Effective Entity Augmentation By Querying External Data Sources 2023 VLDB 4.4660032e-05
8,736 Unveiling Challenges for LLMs in Enterprise Data Engineering 2026 VLDB 4.456315e-05
8,910 R2D2: Reducing Redundancy and Duplication in Data Lakes 2023 SIGMOD 4.427232e-05
8,917 Data Lakes Empowered by Knowledge Graph Technologies 2021 SIGMOD 4.427232e-05
8,974 DataLoom: Simplifying Data Loading with LLMs 2024 VLDB 4.4184286e-05
9,703 CaJaDE: Explaining Query Results by Augmenting Provenance with Context 2022 VLDB 4.3005882e-05
9,961 QueryArtisan: Generating Data Manipulation Codes for Ad-hoc Analysis in Data Lakes 2025 VLDB 4.2294678e-05
10,109 Retrieve-and-Verify: A Table Context Selection Framework for Accurate Column Annotations 2026 SIGMOD 4.1945683e-05
10,142 AutoDDG: Automated Dataset Description Generation using Large Language Models 2026 SIGMOD 4.1945683e-05
10,197 Qualitative Join Discovery in Data Lakes using Examples 2026 SIGMOD 4.1945683e-05
10,510 Table Overlap Estimation through Graph Embeddings 2025 SIGMOD 4.1945683e-05
10,589 Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index 2025 VLDB 4.1945683e-05
10,598 Auto-Prep: Holistic Prediction of Data Preparation Steps for Self-Service Business Intelligence 2025 VLDB 4.1945683e-05
10,685 LakeVisage: Towards Scalable, Flexible and Interactive Visualization Recommendation for Data Discovery over Data Lakes 2025 VLDB 4.1945683e-05
10,725 Suna: Scalable Causal Confounder Discovery over Relational Data 2025 VLDB 4.1945683e-05
10,754 OmniMatch: Joinability Discovery in Data Products 2025 VLDB 4.1945683e-05
10,823 TableCopilot: A Table Assistant Empowered by Natural Language Conditional Table Discovery 2025 VLDB 4.1945683e-05
10,836 Data Discovery in Data Lakes: Operations, Indexes, Systems 2025 VLDB 4.1945683e-05
10,951 Determining the Largest Overlap between Tables 2024 SIGMOD 4.1945683e-05
11,054 Enriching Relations with Additional Attributes for ER 2024 VLDB 4.1945683e-05
11,063 Searching Data Lakes for Nested and Joined Data 2024 VLDB 4.1945683e-05
11,168 Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation 2023 PODS 4.1945683e-05
11,247 A Two-Level Signature Scheme for Stable Set Similarity Joins 2023 VLDB 4.1945683e-05
11,305 TokenJoin: Efficient Filtering for Set Similarity Join with Maximum Weighted Bipartite Matching 2023 VLDB 4.1945683e-05
11,379 Fast Dataset Search with Earth Mover’s Distance 2022 VLDB 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 17 of 17 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
