Data Wrangling: The Challenging Journey from the Wild to the Lake
Summary: Characterizes data-wrangling pain points for data lakes—difficulties in acquisition, interpretation, description, maintenance, provenance, governance and scaling as sources multiply. Advocates shifting from “raw” lakes to curated data lakes via systematic curation, metadata, quality and governance pipelines to enable truly usable ad‑hoc analytics beyond enterprise IT. (summarized by gpt-5-mini on Feb 09 2026)
Authors
1. Ignacio Terrizzano
2. Peter Schwarz
3. Mary Roth
4. John E. Colino
Incoming Citations (Sorted by Pagerank)
Showing 9 of 9 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674 |
| 1,463 | ARDA: Automatic Relational Data Augmentation for Machine Learning | 2020 | VLDB | 0.00011869295 |
| 3,281 | Constance: An Intelligent Data Lake System | 2016 | SIGMOD | 7.2823287e-05 |
| 3,690 | Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets | 2018 | SIGMOD | 6.8384476e-05 |
| 7,384 | The VADA Architecture for Cost-Effective Data Wrangling | 2017 | SIGMOD | 4.7445719e-05 |
| 7,745 | Crossing the finish line faster when paddling the Data Lake with KAYAK | 2017 | VLDB | 4.6618625e-05 |
| 9,660 | Meta-Mappings for Schema Mapping Reuse | 2019 | VLDB | 4.3107389e-05 |
| 11,217 | Efficient Approximation Framework for Attribute Recommendation | 2023 | SIGMOD | 4.1945683e-05 |
| 11,732 | CoreKG: a Knowledge Lake Service | 2018 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 5 of 5 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 346 | Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources | 1997 | VLDB | 0.00026656272 |
| 483 | Clio Grows Up: From Research Prototype to Industrial Tool | 2005 | SIGMOD | 0.00022125107 |
| 489 | Data Curation at Scale: The Data Tamer System | 2013 | CIDR | 0.00022030728 |
| 893 | Data Integration: The Teenage Years | 2006 | VLDB | 0.00015558352 |
| 4,776 | Exploiting Evidence from Unstructured Data to Enhance Master Data Management | 2012 | VLDB | 5.9314064e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 11,288 | To UDFs and Beyond: Demonstration of a Fully Decomposed Data Processor for General Data Wrangling Tasks | 2023 | VLDB | 4.1945683e-05 |
| 12,286 | The Case for a Structured Approach to Managing Unstructured Data | 2009 | CIDR | 4.1945683e-05 |
| 7,745 | Crossing the finish line faster when paddling the Data Lake with KAYAK | 2017 | VLDB | 4.6618625e-05 |
| 7,117 | Crowdsourced Data Management: Overview and Challenges | 2017 | SIGMOD | 4.826509e-05 |
| 1,377 | Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics | 2021 | CIDR | 0.00012296941 |
| 11,732 | CoreKG: a Knowledge Lake Service | 2018 | VLDB | 4.1945683e-05 |
| 3,974 | Data Extraction and Transformation for the Data Warehouse | 1995 | SIGMOD | 6.573945e-05 |
| 7,384 | The VADA Architecture for Cost-Effective Data Wrangling | 2017 | SIGMOD | 4.7445719e-05 |
| 13,277 | The Challenge of Building Effective Data Lakes | 2020 | SIGMOD | - |
| 939 | Data Lake Management: Challenges and Opportunities | 2019 | VLDB | 0.00015187344 |