Data Management Challenges in Production Machine Learning
Summary: Surveys data-management challenges in production ML pipelines, focusing on understanding, validating, cleaning, and enriching training data. Connects these challenges to the database literature and outlines open questions on data quality, provenance, validation, and enrichment that prior work has not yet addressed. (summarized by gpt-5-nano on Feb 09 2026)
Incoming Non-self Citations Over Time
Authors
Incoming Citations (Sorted by Pagerank)
Showing 27 of 27 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 9 of 9 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Overall Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 424 | Tuning Database Configuration Parameters with iTuned | 2009 | VLDB | 0.00023616398 |
| 449 | Approximate Query Processing: Taming the TeraBytes! A Tutorial | 2001 | VLDB | 0.00022846068 |
| 460 | SeeDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics | 2015 | VLDB | 0.00022516069 |
| 610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674 |
| 791 | ActiveClean: Interactive Data Cleaning For Statistical Modeling | 2016 | VLDB | 0.00016629664 |
| 903 | To Join or Not to Join? Thinking Twice about Joins before Feature Selection | 2016 | SIGMOD | 0.0001547016 |
| 1,000 | Intelligent Rollups in Multidimensional OLAP Data | 2001 | VLDB | 0.00014709252 |
| 1,137 | User-adaptive exploration of multidimensional data | 2000 | VLDB | 0.00013730532 |
| 1,161 | Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures | 2008 | VLDB | 0.00013585236 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 11,317 | Data Management Opportunities for Foundation Models | 2022 | CIDR | 4.1945683e-05 |
| 6,228 | Managing ML Pipelines: Feature Stores and the Coming Wave of Embedding Ecosystems | 2021 | VLDB | 5.1470042e-05 |
| 10,842 | ML-Asset Management: Curation, Discovery, and Utilization | 2025 | VLDB | 4.1945683e-05 |
| 4,906 | Machine Learning for Big Data | 2013 | SIGMOD | 5.8389053e-05 |
| 939 | Data Lake Management: Challenges and Opportunities | 2019 | VLDB | 0.00015187344 |
| 9,118 | Towards Observability for Production Machine Learning Pipelines | 2022 | VLDB | 4.3928288e-05 |
| 6,526 | Data Collection and Quality Challenges for Deep Learning | 2020 | VLDB | 5.0267429e-05 |
| 7,655 | Machine Learning for Cloud Data Systems: the Progress so far and the Path Forward | 2021 | VLDB | 4.6872456e-05 |
| 2,456 | Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities | 2021 | SIGMOD | 8.7733773e-05 |
| 1,532 | Data Management in Machine Learning: Challenges, Techniques, and Systems | 2017 | SIGMOD | 0.00011472681 |