Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities
Summary: Analyzes 3,000 production ML pipelines at Google, covering 450,000 model training runs, through their provenance graphs to characterize pipeline lifespan, topology, and complexity. Introduces model graphlets, a data model for the repeated components of a pipeline, and identifies optimization opportunities: pruning wasted computation can cut costs by about 50% without delaying model deployment.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 6255
- Venue: SIGMOD
- Year: 2021
- Pagerank: 8.7733773e-05
- Overall Rank: 2,456 | 82.92%
- DOI: 10.1145/3448016.3457566
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 12 of 12 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 5,605 | TPCx-AI - An Industry Standard Benchmark for Artificial Intelligence and Machine Learning Systems | 2023 | VLDB | 5.4142007e-05 |
| 8,257 | Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines | 2023 | SIGMOD | 4.5487511e-05 |
| 8,514 | UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads | 2022 | VLDB | 4.4944285e-05 |
| 8,737 | Scheduling Data Processing Pipelines for Incremental Training on MLP-based Recommendation Models | 2025 | SIGMOD | 4.456315e-05 |
| 8,743 | CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning | 2024 | SIGMOD | 4.456315e-05 |
| 8,786 | AWARE: Workload-aware, Redundancy-exploiting Linear Algebra | 2023 | SIGMOD | 4.4521262e-05 |
| 8,859 | Pipemizer: An Optimizer for Analytics Data Pipelines | 2022 | VLDB | 4.4344107e-05 |
| 9,231 | Modyn: Data-Centric Machine Learning Pipeline Orchestration | 2025 | SIGMOD | 4.3690661e-05 |
| 10,820 | APEX-DAG: Library and Language independent Pipeline EXtraction | 2025 | VLDB | 4.1945683e-05 |
| 11,052 | Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines | 2024 | VLDB | 4.1945683e-05 |
| 11,216 | Demystifying the QoS and QoE of Edge-hosted Video Streaming Applications in the Wild with SNESet | 2023 | SIGMOD | 4.1945683e-05 |
| 11,241 | Enabling Secure and Efficient Data Analytics Pipeline Evolution with Trusted Execution Environment | 2023 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 14 of 14 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 140 | The MADlib Analytics Library or MAD Skills, the SQL | 2012 | VLDB | 0.00042270404 |
| 610 | Goods: Organizing Google's Datasets | 2016 | SIGMOD | 0.00019232674 |
| 1,420 | Data Management Challenges in Production Machine Learning | 2017 | SIGMOD | 0.00012057956 |
| 1,666 | HELIX: Holistic Optimization for Accelerating Iterative Machine Learning | 2019 | VLDB | 0.0001096361 |
| 2,027 | Titian: Data Provenance Support in Spark | 2016 | VLDB | 9.7437067e-05 |
| 2,028 | Putting Lipstick on Pig: Enabling Database-style Workflow Provenance | 2012 | VLDB | 9.7433981e-05 |
| 2,163 | Elastic Machine Learning Algorithms in Amazon SageMaker | 2020 | SIGMOD | 9.3949234e-05 |
| 2,269 | Ground: A Data Context Service | 2017 | CIDR | 9.147379e-05 |
| 2,463 | noWorkflow: a Tool for Collecting, Analyzing, and Managing Provenance from Python Scripts | 2017 | VLDB | 8.7561396e-05 |
| 2,804 | Extending Relational Query Processing with ML Inference | 2020 | CIDR | 8.0935487e-05 |
| 4,787 | The Relational Data Borg is Learning | 2020 | VLDB | 5.9224501e-05 |
| 5,086 | Improving Reproducibility of Data Science Pipelines through Transparent Provenance Capture | 2020 | VLDB | 5.7078462e-05 |
| 5,802 | An Optimal Labeling Scheme for Workflow Provenance Using Skeleton Labels | 2010 | SIGMOD | 5.3209459e-05 |
| 6,299 | Incremental View Maintenance For Collection Programming | 2016 | PODS | 5.1225782e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| --- | --- | --- | --- | --- |
| 2,170 | tf.data: A Machine Learning Data Processing Framework | 2021 | VLDB | 9.3821603e-05 |
| 3,698 | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines | 2022 | SIGMOD | 6.8340435e-05 |
| 11,629 | Leveraging Organizational Resources to Adapt Models to New Data Modalities | 2020 | VLDB | 4.1945683e-05 |
| 6,053 | Optimizing Machine Learning Workloads in Collaborative Environments | 2020 | SIGMOD | 5.2326838e-05 |
| 11,317 | Data Management Opportunities for Foundation Models | 2022 | CIDR | 4.1945683e-05 |
| 8,163 | Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science | 2021 | VLDB | 4.5723431e-05 |
| 11,313 | Towards Observability for Machine Learning Pipelines | 2022 | CIDR | 4.1945683e-05 |
| 1,420 | Data Management Challenges in Production Machine Learning | 2017 | SIGMOD | 0.00012057956 |
| 9,118 | Towards Observability for Production Machine Learning Pipelines | 2022 | VLDB | 4.3928288e-05 |
| 6,469 | Materialization and Reuse Optimizations for Production Data Science Pipelines | 2022 | SIGMOD | 5.0519488e-05 |