R2D2: Reducing Redundancy and Duplication in Data Lakes
Summary: R2D2 tackles table-level containment in data lakes with a three-stage pipeline for scalable detection: a schema containment graph, min-max pruning, and content-level pruning. It trims storage and access costs by deleting redundant datasets and reconstructing them on demand under latency bounds. It is built on Spark (Azure Databricks/ADLS Gen2, AWS) for TB-scale lakes. (summarized by gpt-5-nano on Feb 09 2026)
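The three stages named in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the in-memory row format, and the sample tables are hypothetical, the exact row comparison in the last stage is a simplification for illustration, and plain Python stands in for Spark. It assumes every column holds comparable, non-null values.

```python
from itertools import permutations

# A table is a list of rows; each row maps column name -> value (assumed format).

def columns(table):
    """Column names of a table (empty set for an empty table)."""
    return set(table[0].keys()) if table else set()

def schema_contained(child, parent):
    """Stage 1: the child's columns must all exist in the parent."""
    return columns(child) <= columns(parent)

def min_max_compatible(child, parent):
    """Stage 2: each child column's value range must lie inside the parent's."""
    for col in columns(child):
        child_vals = [row[col] for row in child]
        parent_vals = [row[col] for row in parent]
        if min(child_vals) < min(parent_vals) or max(child_vals) > max(parent_vals):
            return False
    return True

def content_contained(child, parent):
    """Stage 3 (simplified): every child row appears in the parent's projection."""
    cols = sorted(columns(child))
    parent_rows = {tuple(row[c] for c in cols) for row in parent}
    return all(tuple(row[c] for c in cols) in parent_rows for row in child)

def containment_edges(lake):
    """Return (child, parent) pairs that survive all three checks, cheapest first."""
    edges = []
    for child_name, parent_name in permutations(lake, 2):
        child, parent = lake[child_name], lake[parent_name]
        if (schema_contained(child, parent)
                and min_max_compatible(child, parent)
                and content_contained(child, parent)):
            edges.append((child_name, parent_name))
    return edges

# Hypothetical toy lake for illustration only.
lake = {
    "orders": [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}, {"id": 3, "amount": 40}],
    "orders_subset": [{"id": 1, "amount": 10}, {"id": 3, "amount": 40}],
    "orders_ids": [{"id": 2}, {"id": 3}],
    "other": [{"id": 9, "amount": 10}],
}
edges = containment_edges(lake)
```

The checks run in order of increasing cost, so most candidate pairs are discarded by the schema or min-max test before any rows are compared. In the toy lake, `other` shares a schema with `orders` but is rejected at the min-max stage because its `id` range falls outside the parent's.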
Authors
1. Raunak Shah
2. Koyel Mukherjee
3. Atharv Tyagi
4. Sai Keerthana Karnam
5. Dhruv Joshi
6. Shivam Bhosale
7. Subrata Mitra
Incoming Citations (Sorted by Pagerank)
Showing 1 of 1 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 10,867 | T-Assess: An Efficient Data Quality Assessment System Tailored for Trajectory Data | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 15 of 15 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 5,236 | Online Deduplication for Databases | 2017 | SIGMOD | 5.611324e-05 |
| 280 | Eliminating Fuzzy Duplicates in Data Warehouses | 2002 | VLDB | 0.00029113044 |
| 7,059 | Adaptive and Robust Query Execution for Lakehouses at Scale | 2024 | VLDB | 4.8477825e-05 |
| 2,386 | Leveraging Aggregate Constraints For Deduplication | 2007 | SIGMOD | 8.9231895e-05 |
| 9,232 | AutoComp: Automated Data Compaction for Log-Structured Tables in Data Lakes | 2025 | SIGMOD | 4.3690661e-05 |
| 1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533 |
| 9,479 | Data Imputation with Limited Data Redundancy Using Data Lakes | 2025 | VLDB | 4.3341665e-05 |
| 3,528 | Distributed Data Deduplication | 2016 | VLDB | 7.0066139e-05 |
| 7,907 | Petabyte-Scale Row-Level Operations in Data Lakehouses | 2024 | VLDB | 4.6205839e-05 |
| 7,061 | Serving Deep Learning Models with Deduplication from Relational Databases | 2022 | VLDB | 4.8463881e-05 |