Database Paper Browser

LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes

Summary: LakeBench is a large-scale benchmark for discovering joinable and unionable tables in data lakes, with 16M real tables (1,600× larger than prior datasets) and more than 10k human-labeled queries across diverse categories. It evaluates the effectiveness, efficiency, and scalability of join and union search methods, and surfaces performance bottlenecks and open research challenges. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 13428
Venue: VLDB
Year: 2024
Pagerank: 4.581507e-05
Overall Rank: 8,116 | 43.54%
DOI: 10.14778/3659437.3659448

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 5 of 5 citing papers.

Outgoing Citations (Sorted by Pagerank)

Showing 24 of 24 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
420 | InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables | 2012 | SIGMOD | 0.00023719065
518 | Data Integration for the Relational Web | 2009 | VLDB | 0.00021158934
610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674
818 | Finding Related Tables | 2012 | SIGMOD | 0.00016311524
903 | To Join or Not to Join? Thinking Twice about Joins before Feature Selection | 2016 | SIGMOD | 0.0001547016
939 | Data Lake Management: Challenges and Opportunities | 2019 | VLDB | 0.00015187344
1,178 | Table Union Search on Open Data | 2018 | VLDB | 0.00013468118
1,187 | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes | 2019 | SIGMOD | 0.00013443639
1,277 | The Data Civilizer System | 2017 | CIDR | 0.00012879695
1,367 | Answering Table Queries on the Web using Column Keywords | 2012 | VLDB | 0.00012349783
1,463 | ARDA: Automatic Relational Data Augmentation for Machine Learning | 2020 | VLDB | 0.00011869295
2,141 | LSH Ensemble: Internet-Scale Domain Search | 2016 | VLDB | 9.4542625e-05
2,836 | Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning | 2023 | VLDB | 8.0443826e-05
2,888 | Sato: Contextual Semantic Type Detection in Tables | 2020 | VLDB | 7.9594996e-05
3,000 | SANTOS: Relationship-based Semantic Table Union Search | 2023 | SIGMOD | 7.7462128e-05
3,335 | DeepJoin: Joinable Table Discovery with Pre-trained Language Models | 2023 | VLDB | 7.2065006e-05
3,750 | Data Acquisition for Improving Machine Learning Models | 2021 | VLDB | 6.7895763e-05
4,102 | GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data | 2023 | SIGMOD | 6.4522929e-05
5,381 | Selective Data Acquisition in the Wild for Model Charging | 2022 | VLDB | 5.5399508e-05
6,270 | MATE: Multi-Attribute Table Extraction | 2022 | VLDB | 5.1337451e-05
7,179 | Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning | 2023 | VLDB | 4.8078895e-05
7,575 | Human-in-the-loop Outlier Detection | 2020 | SIGMOD | 4.7068909e-05
8,678 | Progressive Deep Web Crawling Through Keyword Queries For Data Enrichment | 2019 | SIGMOD | 4.4702119e-05
11,000 | MisDetect: Iterative Mislabel Detection using Early Loss | 2024 | VLDB | 4.1945683e-05

Semantically Similar Papers