Data Curation at Scale: The Data Tamer System
Summary: Introduces Data Tamer, an end-to-end, scalable data curation system that uses machine learning for attribute identification, schema/table grouping, transformations, and deduplication, with human-in-the-loop visualization, to integrate a sequence of data sources into a composite. Evaluated on real enterprise workloads (up to tens of thousands of sources), it showed a ≈90% reduction in curation cost relative to deployed production software. (summarized by gpt-5-mini on Feb 09 2026)
Authors
1. Michael Stonebraker
2. Daniel Bruckner
3. Ihab F. Ilyas
4. George Beskales
5. Mitch Cherniack
6. Stan Zdonik
7. Alexander Pagan
8. Shan Xu
Incoming Citations (Sorted by Pagerank)
Showing 38 of 38 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 3 of 3 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 107 | WebTables: Exploring the Power of Tables on the Web | 2008 | VLDB | 0.00048377684 |
| 112 | Potter's Wheel: An Interactive Data Cleaning System | 2001 | VLDB | 0.00047045036 |
| 2,921 | Semi-Automatic Schema Integration in Clio | 2007 | VLDB | 7.8994603e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 3,724 | Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web | 2005 | CIDR | 6.8173288e-05 |
| 5,058 | A Demo of the Data Civilizer System | 2017 | SIGMOD | 5.7280139e-05 |
| 11,906 | Knowledge Curation and Knowledge Fusion: Challenges, Models, and Applications | 2015 | SIGMOD | 4.1945683e-05 |
| 341 | CURE: An Efficient Clustering Algorithm for Large Databases | 1998 | SIGMOD | 0.00026810548 |
| 6,519 | Expand your Training Limits! Generating Training Data for ML-based Data Management | 2021 | SIGMOD | 5.0316686e-05 |
| 13,171 | Reimagining Deep Learning Systems Through the Lens of Data Systems | 2024 | VLDB | - |
| 5,729 | KATARA: Reliable Data Cleaning with Knowledge Bases and Crowdsourcing | 2015 | VLDB | 5.3506368e-05 |
| 4,619 | Crowd-Based Deduplication: An Adaptive Approach | 2015 | SIGMOD | 6.0444854e-05 |
| 2,946 | BigDansing: A System for Big Data Cleansing | 2015 | SIGMOD | 7.8372441e-05 |
| 1,277 | The Data Civilizer System | 2017 | CIDR | 0.00012879695 |