Database Paper Browser

Data Lake Management: Challenges and Opportunities

Summary: Data lakes create new management problems, including dataset discovery and evolving needs for extraction, cleaning, integration, versioning, and metadata management. This tutorial surveys state-of-the-art approaches, challenges, and opportunities unique to data-lake ecosystems to guide future research. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 11918
Venue: VLDB
Year: 2019
Pagerank: 0.00015187344
Overall Rank: 939 | 93.47%
DOI: 10.14778/3352063.3352116

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 40 of 40 citing papers.

Rank Citing Paper Year Venue Pagerank
1,116 Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes 2024 VLDB 0.00013890154
1,644 Finding Related Tables in Data Lakes for Interactive Data Science 2020 SIGMOD 0.00011041787
2,836 Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning 2023 VLDB 8.0443826e-05
3,000 SANTOS: Relationship-based Semantic Table Union Search 2023 SIGMOD 7.7462128e-05
3,015 Chorus: Foundation Models for Unified Data Discovery and Exploration 2024 VLDB 7.7092391e-05
3,114 GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization 2024 VLDB 7.5451724e-05
4,749 Slice Tuner: A Selective Data Acquisition Framework for Accurate and Fair Machine Learning Models 2021 SIGMOD 5.9503689e-05
4,859 Integrating Data Lake Tables 2023 VLDB 5.8732433e-05
4,863 Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia 2023 SIGMOD 5.8697471e-05
4,957 Doing More with Less: Characterizing Dataset Downsampling for AutoML 2021 VLDB 5.8035715e-05
5,280 Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V 2023 VLDB 5.5896735e-05
5,529 Data-Driven Domain Discovery for Structured Datasets 2020 VLDB 5.4566641e-05
5,794 Discovering Related Data At Scale 2021 VLDB 5.3245122e-05
5,928 SchemaPile: A Large Collection of Relational Database Schemas 2024 SIGMOD 5.2685946e-05
6,081 Subgraph Matching over Graph Federation 2022 VLDB 5.2208051e-05
6,438 RONIN: Data Lake Exploration 2021 VLDB 5.0620163e-05
6,526 Data Collection and Quality Challenges for Deep Learning 2020 VLDB 5.0267429e-05
7,643 Cross Modal Data Discovery over Structured and Unstructured Data Lakes 2023 VLDB 4.6901105e-05
8,008 Entity Resolution On-Demand 2022 VLDB 4.6067684e-05
8,116 LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes 2024 VLDB 4.581507e-05
8,608 Unity Catalog: Open and Universal Governance for the Lakehouse and Beyond 2025 SIGMOD 4.4853979e-05
8,729 OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance From Database Query Event Logs 2023 VLDB 4.4582221e-05
8,974 DataLoom: Simplifying Data Loading with LLMs 2024 VLDB 4.4184286e-05
9,701 Towards Functional Decomposition of Storage Formats 2025 CIDR 4.3008468e-05
9,773 EquiTensors: Learning Fair Integrations of Heterogeneous Urban Data 2021 SIGMOD 4.2856106e-05
9,961 QueryArtisan: Generating Data Manipulation Codes for Ad-hoc Analysis in Data Lakes 2025 VLDB 4.2294678e-05
10,142 AutoDDG: Automated Dataset Description Generation using Large Language Models 2026 SIGMOD 4.1945683e-05
10,197 Qualitative Join Discovery in Data Lakes using Examples 2026 SIGMOD 4.1945683e-05
10,510 Table Overlap Estimation through Graph Embeddings 2025 SIGMOD 4.1945683e-05
10,645 OpenForge: Probabilistic Metadata Integration 2025 VLDB 4.1945683e-05
10,797 A Demonstration of QueryArtisan: Real-Time Data Lake Analysis via Dynamically Generated Data Manipulation Code 2025 VLDB 4.1945683e-05
10,803 GraphAr: An Efficient Storage Scheme for Graph Data in Data Lakes 2025 VLDB 4.1945683e-05
10,829 Sort it Like You Mean It: Discovering Semantically Interesting Attribute Augmentations to Sort Tables 2025 VLDB 4.1945683e-05
10,854 LiquidCache: Efficient Pushdown Caching for Cloud-Native Data Analytics 2025 VLDB 4.1945683e-05
10,895 Towards an Objective Metric for Data Value Through Relevance 2024 CIDR 4.1945683e-05
10,951 Determining the Largest Overlap between Tables 2024 SIGMOD 4.1945683e-05
11,006 FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data 2024 VLDB 4.1945683e-05
11,063 Searching Data Lakes for Nested and Joined Data 2024 VLDB 4.1945683e-05
11,076 KGFabric: A Scalable Knowledge Graph Warehouse for Enterprise Data Interconnection 2024 VLDB 4.1945683e-05
11,420 Detecting Layout Templates in Complex Multiregion Files 2022 VLDB 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 26 of 26 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank Cited Paper Year Venue Pagerank
107 WebTables: Exploring the Power of Tables on the Web 2008 VLDB 0.00048377684
398 Big Data Integration 2013 VLDB 0.00024372588
420 InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables 2012 SIGMOD 0.00023719065
518 Data Integration for the Relational Web 2009 VLDB 0.00021158934
610 Goods: Organizing Google's Datasets 2016 SIGMOD 0.00019232674
818 Finding Related Tables 2012 SIGMOD 0.00016311524
833 Guided Data Repair 2011 VLDB 0.00016138432
1,178 Table Union Search on Open Data 2018 VLDB 0.00013468118
1,187 JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes 2019 SIGMOD 0.00013443639
1,277 The Data Civilizer System 2017 CIDR 0.00012879695
1,281 DataHub: Collaborative Data Science & Dataset Version Management at Scale 2015 CIDR 0.00012854744
1,367 Answering Table Queries on the Web using Column Keywords 2012 VLDB 0.00012349783
1,509 Discovering Queries based on Example Tuples 2014 SIGMOD 0.00011612727
1,565 Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff 2015 VLDB 0.00011345567
1,858 Bootstrapping Pay-As-You-Go Data Integration Systems 2008 SIGMOD 0.00010301124
2,078 Sample-Driven Schema Mapping 2012 SIGMOD 9.599707e-05
2,141 LSH Ensemble: Internet-Scale Domain Search 2016 VLDB 9.4542625e-05
2,269 Ground: A Data Context Service 2017 CIDR 9.147379e-05
2,633 Schema Extraction for Tabular Data on the Web 2013 VLDB 8.4063569e-05
2,730 Open Data Integration 2018 VLDB 8.2126735e-05
3,281 Constance: An Intelligent Data Lake System 2016 SIGMOD 7.2823287e-05
3,690 Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets 2018 SIGMOD 6.8384476e-05
4,106 Extracting Databases from Dark Data with DeepDive 2016 SIGMOD 6.4456184e-05
4,801 CLAMS: Bringing Quality to Data Lakes 2016 SIGMOD 5.9115269e-05
5,789 Interactive Navigation of Open Data Linkages 2017 VLDB 5.3269741e-05
6,817 Error Diagnosis and Data Profiling with Data X-Ray 2015 VLDB 4.9171711e-05

Semantically Similar Papers