Discovering Related Data At Scale
Summary: An ML-driven model that detects related columns across data streams, based on a month of data-lake queries. It scales to tens of millions of column pairs and builds a data-relationship graph over 4.5 PB in ~80 minutes, with ~23% gains over the state of the art on labeled data. (summarized by gpt-5-nano on Feb 09 2026)
Authors
1. Sagar Bharadwaj
2. Praveen Gupta
3. Ranjita Bhagwan
4. Saikat Guha
Incoming Citations (Sorted by Pagerank)
Showing 7 of 7 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 3,335 | DeepJoin: Joinable Table Discovery with Pre-trained Language Models | 2023 | VLDB | 7.2065006e-05 |
| 8,193 | WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses | 2023 | CIDR | 4.5618596e-05 |
| 8,910 | R2D2: Reducing Redundancy and Duplication in Data Lakes | 2023 | SIGMOD | 4.427232e-05 |
| 9,928 | Fainder: A Fast and Accurate Index for Distribution-Aware Dataset Search | 2024 | VLDB | 4.2511622e-05 |
| 10,341 | A Theoretical Framework for Distribution-Aware Dataset Search | 2025 | PODS | 4.1945683e-05 |
| 10,754 | OmniMatch: Joinability Discovery in Data Products | 2025 | VLDB | 4.1945683e-05 |
| 10,836 | Data Discovery in Data Lakes: Operations, Indexes, Systems | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 9 of 9 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 22 | SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets | 2008 | VLDB | 0.0008456613 |
| 125 | Approximate String Joins in a Database (Almost) for Free | 2001 | VLDB | 0.00044847972 |
| 155 | Robust and Efficient Fuzzy Match for Online Data Cleaning | 2003 | SIGMOD | 0.00040637896 |
| 939 | Data Lake Management: Challenges and Opportunities | 2019 | VLDB | 0.00015187344 |
| 1,187 | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes | 2019 | SIGMOD | 0.00013443639 |
| 1,277 | The Data Civilizer System | 2017 | CIDR | 0.00012879695 |
| 2,141 | LSH Ensemble: Internet-Scale Domain Search | 2016 | VLDB | 9.4542625e-05 |
| 4,595 | Juneau: Data Lake Management for Jupyter | 2019 | VLDB | 6.060188e-05 |
| 4,775 | Set Similarity Joins on MapReduce: An Experimental Survey | 2018 | VLDB | 5.9315784e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,510 | Summarizing Relational Databases | 2009 | VLDB | 0.00011606901 |
| 1,914 | Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks | 2020 | SIGMOD | 0.00010109102 |
| 9,776 | Structure-Aware Machine Learning over Multi-Relational Databases | 2021 | SIGMOD | 4.2856106e-05 |
| 7,287 | Discovering Association Rules from Big Graphs | 2022 | VLDB | 4.7762276e-05 |
| 8,917 | Data Lakes Empowered by Knowledge Graph Technologies | 2021 | SIGMOD | 4.427232e-05 |
| 3,823 | Automatic Discovery of Attributes in Relational Databases | 2011 | SIGMOD | 6.7261168e-05 |
| 7,643 | Cross Modal Data Discovery over Structured and Unstructured Data Lakes | 2023 | VLDB | 4.6901105e-05 |
| 11,063 | Searching Data Lakes for Nested and Joined Data | 2024 | VLDB | 4.1945683e-05 |
| 1,644 | Finding Related Tables in Data Lakes for Interactive Data Science | 2020 | SIGMOD | 0.00011041787 |
| 818 | Finding Related Tables | 2012 | SIGMOD | 0.00016311524 |