Data Augmentation for ML-driven Data Preparation and Integration
Summary: Tutorial on data augmentation (DA) for ML-driven data preparation and integration in data management. It covers task-specific operators, interpolation, conditional generation, and policy learning, and links DA to active learning and weak supervision.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 12523
- Venue: VLDB
- Year: 2021
- Pagerank: 4.2856106e-05
- Overall Rank: 9,777 | 31.99%
- DOI: 10.14778/3476311.3476403
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 2 of 2 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 23 of 23 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 208 | Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach | 2001 | SIGMOD | 0.0003460594 |
| 221 | Deep Entity Matching with Pre-Trained Language Models | 2021 | VLDB | 0.00033121824 |
| 254 | Snorkel: Rapid Training Data Creation with Weak Supervision | 2018 | VLDB | 0.00030540555 |
| 300 | Deep Learning for Entity Matching: A Design Space Exploration | 2018 | SIGMOD | 0.00028441466 |
| 513 | TURL: Table Understanding through Representation Learning | 2021 | VLDB | 0.00021288342 |
| 1,215 | Snuba: Automating Weak Supervision to Label Training Data | 2019 | VLDB | 0.0001323375 |
| 1,267 | Foofah: Transforming Data By Example | 2017 | SIGMOD | 0.00012936483 |
| 1,337 | HoloDetect: Few-Shot Learning for Error Detection | 2019 | SIGMOD | 0.00012497164 |
| 1,533 | Example-driven Design of Efficient Record Matching Queries | 2007 | VLDB | 0.00011471971 |
| 1,546 | KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing | 2015 | SIGMOD | 0.00011446851 |
| 1,894 | Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning | 2020 | VLDB | 0.0001018378 |
| 1,914 | Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks | 2020 | SIGMOD | 0.00010109102 |
| 2,097 | Predictive Interaction for Data Transformation | 2015 | CIDR | 9.5489822e-05 |
| 2,349 | RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation | 2021 | VLDB | 8.9876423e-05 |
| 2,421 | Data Synthesis based on Generative Adversarial Networks | 2018 | VLDB | 8.8514021e-05 |
| 2,767 | A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching | 2020 | SIGMOD | 8.1513883e-05 |
| 2,968 | Raha: A Configuration-Free Error Detection System | 2019 | SIGMOD | 7.7985097e-05 |
| 4,607 | Data Integration and Machine Learning: A Natural Synergy | 2018 | SIGMOD | 6.0538827e-05 |
| 4,884 | Relational Data Synthesis using Generative Adversarial Networks: A Design Space Exploration | 2020 | VLDB | 5.8540287e-05 |
| 5,978 | Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond | 2021 | SIGMOD | 5.2453012e-05 |
| 6,526 | Data Collection and Quality Challenges for Deep Learning | 2020 | VLDB | 5.0267429e-05 |
| 7,613 | ADnEV: Cross-Domain Schema Matching using Deep Similarity Matrix Adjustment and Evaluation | 2020 | VLDB | 4.6961059e-05 |
| 8,042 | Transform-Data-by-Example (TDE): Extensible Data Transformation in Excel | 2018 | SIGMOD | 4.5994569e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 5,963 | Automatic Data Acquisition for Deep Learning | 2021 | VLDB | 5.2526794e-05 |
| 1,420 | Data Management Challenges in Production Machine Learning | 2017 | SIGMOD | 0.00012057956 |
| 6,526 | Data Collection and Quality Challenges for Deep Learning | 2020 | VLDB | 5.0267429e-05 |
| 1,463 | ARDA: Automatic Relational Data Augmentation for Machine Learning | 2020 | VLDB | 0.00011869295 |
| 7,020 | LLM for Data Management | 2024 | VLDB | 4.8595728e-05 |
| 5,976 | Responsible Data Integration: Next-generation Challenges | 2022 | SIGMOD | 5.245976e-05 |
| 5,028 | Adaptive Data Augmentation for Supervised Learning over Missing Data | 2021 | VLDB | 5.7506746e-05 |
| 1,532 | Data Management in Machine Learning: Challenges, Techniques, and Systems | 2017 | SIGMOD | 0.00011472681 |
| 4,607 | Data Integration and Machine Learning: A Natural Synergy | 2018 | SIGMOD | 6.0538827e-05 |
| 7,243 | Data Integration and Machine Learning: A Natural Synergy | 2018 | VLDB | 4.7913666e-05 |