Database Paper Browser

Data Curation at Scale: The Data Tamer System

Summary: Introduces Data Tamer, an end-to-end, scalable data curation system. Given a sequence of data sources, it assembles a composite by applying machine learning to attribute identification, schema/table grouping, transformations, and deduplication, with human-in-the-loop guidance and visualization. Evaluated on real enterprise workloads (up to tens of thousands of sources), it showed a ≈90% reduction in curation cost compared with the deployed production software. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 195
Venue: CIDR
Year: 2013
Pagerank: 0.00022030728
Overall Rank: 489 | 96.60%
DOI: -

Incoming Citations (Sorted by Pagerank)

Showing 38 of 38 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
192 | HoloClean: Holistic Data Repairs with Probabilistic Inference | 2017 | VLDB | 0.00035728858
300 | Deep Learning for Entity Matching: A Design Space Exploration | 2018 | SIGMOD | 0.00028441466
517 | Can Foundation Models Wrangle Your Data? | 2023 | VLDB | 0.00021169035
643 | Corleone: Hands-Off Crowdsourcing for Entity Matching | 2014 | SIGMOD | 0.00018754451
1,267 | Foofah: Transforming Data By Example | 2017 | SIGMOD | 0.00012936483
1,277 | The Data Civilizer System | 2017 | CIDR | 0.00012879695
1,337 | HoloDetect: Few-Shot Learning for Error Detection | 2019 | SIGMOD | 0.00012497164
1,612 | Detecting Data Errors: Where are we and what needs to be done? | 2016 | VLDB | 0.00011142794
1,627 | Data Cleaning: Overview and Emerging Challenges | 2016 | SIGMOD | 0.00011086905
1,833 | Data Wrangling: The Challenging Journey from the Wild to the Lake | 2015 | CIDR | 0.00010378976
2,097 | Predictive Interaction for Data Transformation | 2015 | CIDR | 9.5489822e-05
2,209 | Data Integration: After the Teenage Years | 2017 | PODS | 9.2868035e-05
2,498 | Support the Data Enthusiast: Challenges for Next-Generation Data-Analysis Systems | 2014 | VLDB | 8.6465331e-05
3,711 | Saga: A Platform for Continuous Construction and Serving of Knowledge At Scale | 2022 | SIGMOD | 6.823609e-05
4,451 | CLAMShell: Speeding up Crowds for Low-latency Data Labeling | 2016 | VLDB | 6.1738675e-05
4,607 | Data Integration and Machine Learning: A Natural Synergy | 2018 | SIGMOD | 6.0538827e-05
4,665 | Argonaut: Macrotask Crowdsourcing for Complex Data Processing | 2015 | VLDB | 6.0125329e-05
4,695 | DataXFormer: An Interactive Data Transformation Tool | 2015 | SIGMOD | 5.9927993e-05
5,586 | QuERy: A Framework for Integrating Entity Resolution with Query Processing | 2016 | VLDB | 5.4219548e-05
5,937 | DataXFormer: Leveraging the Web for Semantic Transformations | 2015 | CIDR | 5.2650964e-05
6,065 | APEx: Accuracy-Aware Differentially Private Data Exploration | 2019 | SIGMOD | 5.2291685e-05
6,354 | Characterizing and Selecting Fresh Data Sources | 2014 | SIGMOD | 5.0990729e-05
6,407 | Just-In-Time Data Virtualization: Lightweight Data Management with ViDa | 2015 | CIDR | 5.076547e-05
7,013 | Qualitative Data Cleaning | 2016 | VLDB | 4.8619024e-05
7,026 | Mind the Data Gap: Bridging LLMs to Enterprise Data Integration | 2025 | CIDR | 4.8570811e-05
7,237 | CleanM: An Optimizable Query Language for Unified Scale-Out Data Cleaning | 2017 | VLDB | 4.7928651e-05
7,243 | Data Integration and Machine Learning: A Natural Synergy | 2018 | VLDB | 4.7913666e-05
7,780 | A Natural Language Interface for Querying General and Individual Knowledge | 2015 | VLDB | 4.6533677e-05
8,008 | Entity Resolution On-Demand | 2022 | VLDB | 4.6067684e-05
8,092 | Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications | 2023 | SIGMOD | 4.587921e-05
8,593 | Wisteria: Nurturing Scalable Data Cleaning Infrastructure | 2015 | VLDB | 4.4891474e-05
8,694 | Managing General and Individual Knowledge in Crowd Mining Applications | 2015 | CIDR | 4.4661379e-05
9,171 | InsightNotes: Summary-Based Annotation Management in Relational Databases | 2014 | SIGMOD | 4.3848773e-05
9,492 | Lingua Manga: A Generic Large Language Model Centric System for Data Curation | 2023 | VLDB | 4.3341665e-05
11,316 | Kyrix-J: Visual Discovery of Connected Datasets in a Data Lake | 2022 | CIDR | 4.1945683e-05
11,343 | SPINE: Scaling up Programming-by-Negative-Example for String Filtering and Transformation | 2022 | SIGMOD | 4.1945683e-05
11,896 | An Information Provider's Wish List for a Next Generation Big Data End-to-End Information System | 2015 | CIDR | 4.1945683e-05
11,937 | Mindtagger: A Demonstration of Data Labeling in Knowledge Base Construction | 2015 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 3 of 3 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
107 | WebTables: Exploring the Power of Tables on the Web | 2008 | VLDB | 0.00048377684
112 | Potter's Wheel: An Interactive Data Cleaning System | 2001 | VLDB | 0.00047045036
2,921 | Semi-Automatic Schema Integration in Clio | 2007 | VLDB | 7.8994603e-05

Semantically Similar Papers