Cross Modal Data Discovery over Structured and Unstructured Data Lakes
Summary: CMDL enables discovery across structured tables and unstructured documents via a multi-modal embedding that aligns text with tabular columns while preserving table structure. Training is weakly supervised and domain-agnostic. CMDL outperforms search baselines and structured-only methods on three real-world data lakes.
(summarized by gpt-5-mini on Feb 09 2026)
- Paper ID: 13172
- Venue: VLDB
- Year: 2023
- Pagerank: 4.6901105e-05
- Overall Rank: 7,643 | 46.83%
- DOI: 10.14778/3611479.3611533
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 3 of 3 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 16 of 16 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 254 | Snorkel: Rapid Training Data Creation with Weak Supervision | 2018 | VLDB | 0.00030540555 |
| 513 | TURL: Table Understanding through Representation Learning | 2021 | VLDB | 0.00021288342 |
| 610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674 |
| 939 | Data Lake Management: Challenges and Opportunities | 2019 | VLDB | 0.00015187344 |
| 1,178 | Table Union Search on Open Data | 2018 | VLDB | 0.00013468118 |
| 1,187 | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes | 2019 | SIGMOD | 0.00013443639 |
| 1,277 | The Data Civilizer System | 2017 | CIDR | 0.00012879695 |
| 1,552 | Overview of Data Exploration Techniques | 2015 | SIGMOD | 0.00011408814 |
| 2,141 | LSH Ensemble: Internet-Scale Domain Search | 2016 | VLDB | 9.4542625e-05 |
| 2,888 | Sato: Contextual Semantic Type Detection in Tables | 2020 | VLDB | 7.9594996e-05 |
| 3,735 | Auto-Join: Joining Tables by Leveraging Transformations | 2017 | VLDB | 6.8061318e-05 |
| 3,942 | Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins | 2022 | VLDB | 6.6114622e-05 |
| 5,486 | Fast Foreign-Key Detection in Microsoft SQL Server PowerPivot for Excel | 2014 | VLDB | 5.4811603e-05 |
| 5,529 | Data-Driven Domain Discovery for Structured Datasets | 2020 | VLDB | 5.4566641e-05 |
| 6,165 | When the Web is your Data Lake: Creating a Search Engine for Datasets on the Web | 2020 | SIGMOD | 5.1728052e-05 |
| 9,361 | An IDEA: An Ingestion Framework for Data Enrichment in AsterixDB | 2019 | VLDB | 4.3506168e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 9,776 | Structure-Aware Machine Learning over Multi-Relational Databases | 2021 | SIGMOD | 4.2856106e-05 |
| 10,158 | Efficient and Robust Out-Of-Distribution Vector Similarity Search with Cross-Distribution Monotonic Graph | 2026 | SIGMOD | 4.1945683e-05 |
| 10,973 | Unstructured Data Fusion for Schema and Data Extraction | 2024 | SIGMOD | 4.1945683e-05 |
| 1,914 | Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks | 2020 | SIGMOD | 0.00010109102 |
| 3,335 | DeepJoin: Joinable Table Discovery with Pre-trained Language Models | 2023 | VLDB | 7.2065006e-05 |
| 5,794 | Discovering Related Data At Scale | 2021 | VLDB | 5.3245122e-05 |
| 3,426 | Discovering Topical Structures of Databases | 2008 | SIGMOD | 7.1063105e-05 |
| 7,256 | Effective and Efficient Retrieval of Structured Entities | 2020 | VLDB | 4.7869419e-05 |
| 2,836 | Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning | 2023 | VLDB | 8.0443826e-05 |
| 5,529 | Data-Driven Domain Discovery for Structured Datasets | 2020 | VLDB | 5.4566641e-05 |