Database Paper Browser

Data Management Challenges in Production Machine Learning

Summary: Survey of data-management challenges in production ML pipelines, focusing on understanding, validating, cleaning, and enriching training data. Connects to database literature and outlines open questions on data quality, provenance, validation, and enrichment not yet addressed by prior art. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 5340
Venue: SIGMOD
Year: 2017
Pagerank: 0.00012057956
Overall Rank: 1,420 | 90.13%
DOI: 10.1145/3035918.3054782

Incoming Citations (Sorted by Pagerank)

Showing 27 of 27 citing papers.

| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533 |
| 1,891 | Towards Model-based Pricing for Machine Learning in a Data Marketplace | 2019 | SIGMOD | 0.00010194092 |
| 1,940 | SliceLine: Fast, Linear-Algebra-based Slice Finding for ML Model Debugging | 2021 | SIGMOD | 0.00010020173 |
| 2,456 | Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities | 2021 | SIGMOD | 8.7733773e-05 |
| 2,753 | Complaint-driven Training Data Debugging for Query 2.0 | 2020 | SIGMOD | 8.1724339e-05 |
| 3,145 | Opportunities for Quantum Acceleration of Databases: Optimization of Queries and Transaction Schedules | 2023 | VLDB | 7.4781724e-05 |
| 4,129 | Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers? | 2018 | VLDB | 6.428887e-05 |
| 4,197 | Incremental View Maintenance with Triple Lock Factorization Benefits | 2018 | SIGMOD | 6.367895e-05 |
| 4,424 | PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models | 2020 | SIGMOD | 6.198474e-05 |
| 4,607 | Data Integration and Machine Learning: A Natural Synergy | 2018 | SIGMOD | 6.0538827e-05 |
| 4,787 | The Relational Data Borg is Learning | 2020 | VLDB | 5.9224501e-05 |
| 4,935 | OmniFair: A Declarative System for Model-Agnostic Group Fairness in Machine Learning | 2021 | SIGMOD | 5.8198727e-05 |
| 5,222 | Enabling SQL-based Training Data Debugging for Federated Learning | 2022 | VLDB | 5.6210545e-05 |
| 5,605 | TPCx-AI - An Industry Standard Benchmark for Artificial Intelligence and Machine Learning Systems | 2023 | VLDB | 5.4142007e-05 |
| 5,719 | Survivability of Cloud Databases - Factors and Prediction | 2018 | SIGMOD | 5.3550742e-05 |
| 6,134 | Finding Label and Model Errors in Perception Data With Learned Observation Assertions | 2022 | SIGMOD | 5.1943414e-05 |
| 6,526 | Data Collection and Quality Challenges for Deep Learning | 2020 | VLDB | 5.0267429e-05 |
| 6,993 | Unit Testing Data with Deequ | 2019 | SIGMOD | 4.8693227e-05 |
| 7,243 | Data Integration and Machine Learning: A Natural Synergy | 2018 | VLDB | 4.7913666e-05 |
| 7,411 | ItemSuggest: A Data Management Platform for Machine Learned Ranking Services | 2019 | CIDR | 4.7364436e-05 |
| 7,838 | Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes | 2021 | SIGMOD | 4.6377995e-05 |
| 8,092 | Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications | 2023 | SIGMOD | 4.587921e-05 |
| 8,514 | UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads | 2022 | VLDB | 4.4944285e-05 |
| 9,118 | Towards Observability for Production Machine Learning Pipelines | 2022 | VLDB | 4.3928288e-05 |
| 11,317 | Data Management Opportunities for Foundation Models | 2022 | CIDR | 4.1945683e-05 |
| 11,487 | Toto - Benchmarking the Efficiency of a Cloud Service | 2021 | SIGMOD | 4.1945683e-05 |
| 13,300 | DEEM 2019: Workshop on Data Management for End-to-End Machine Learning | 2019 | SIGMOD | - |

Outgoing Citations (Sorted by Pagerank)

Showing 9 of 9 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers