Materialization and Reuse Optimizations for Production Data Science Pipelines
Summary: Proposes budgeted materialization to precompute and store pipeline artifacts, reducing redundant data processing in retraining ML pipelines. Introduces a DAG-based reuse planner to fuse pipelines and reuse artifacts, delivering up to 10x training speedups.
(summarized by gpt-5-nano on Feb 09 2026)
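To make the idea of budgeted materialization concrete, the sketch below picks which pipeline artifacts to store under a fixed storage budget. It is an illustration only, not the paper's algorithm: the `Artifact` fields, the `choose_materialized` function, and the greedy "seconds saved per GB" heuristic are all assumptions introduced here, and the sizes and timings are made-up example values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Hypothetical description of one intermediate pipeline output."""
    name: str
    size_gb: float    # storage cost if materialized (assumed > 0)
    compute_s: float  # time to recompute the artifact from scratch
    reuses: int       # expected number of future pipeline runs that need it


def choose_materialized(artifacts, budget_gb):
    """Greedily select artifacts by recompute time saved per GB stored.

    This is a simple knapsack-style heuristic, assumed for illustration;
    it does not guarantee an optimal selection.
    """
    ranked = sorted(
        artifacts,
        key=lambda a: a.compute_s * a.reuses / a.size_gb,
        reverse=True,
    )
    chosen, used_gb = [], 0.0
    for artifact in ranked:
        # Skip anything that would exceed the remaining storage budget.
        if used_gb + artifact.size_gb <= budget_gb:
            chosen.append(artifact.name)
            used_gb += artifact.size_gb
    return chosen


example = [
    Artifact("raw_join", size_gb=8.0, compute_s=400.0, reuses=3),
    Artifact("features", size_gb=2.0, compute_s=300.0, reuses=5),
    Artifact("vocab", size_gb=1.0, compute_s=50.0, reuses=5),
]
print(choose_materialized(example, budget_gb=4.0))  # ['features', 'vocab']
```

With a 4 GB budget, the large `raw_join` artifact is skipped even though it is expensive to recompute, because the two smaller artifacts save more time per GB stored.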
- Paper ID: 6493
- Venue: SIGMOD
- Year: 2022
- Pagerank: 5.0519488e-05
- Overall Rank: 6,469 | 55.00%
- DOI: 10.1145/3514221.3526186
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 4 of 4 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 18 of 18 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 126 | Space-Efficient Online Computation of Quantile Summaries | 2001 | SIGMOD | 0.00044744986 |
| 179 | Efficient and Extensible Algorithms for Multi Query Optimization | 2000 | SIGMOD | 0.00037672155 |
| 667 | Incremental Knowledge Base Construction Using DeepDive | 2015 | VLDB | 0.00018440557 |
| 761 | Materialization Optimizations for Feature Selection Workloads | 2014 | SIGMOD | 0.00017053783 |
| 947 | MRShare: Sharing Across Multiple Queries in MapReduce | 2010 | VLDB | 0.00015114576 |
| 977 | Pipelining in Multi-Query Optimization | 2001 | PODS | 0.0001488881 |
| 1,565 | Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff | 2015 | VLDB | 0.00011345567 |
| 1,666 | HELIX: Holistic Optimization for Accelerating Iterative Machine Learning | 2019 | VLDB | 0.0001096361 |
| 1,788 | On-the-Fly Sharing for Streamed Aggregation | 2006 | SIGMOD | 0.00010555742 |
| 2,152 | MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis | 2018 | SIGMOD | 9.4239787e-05 |
| 2,205 | ReStore: Reusing Results of MapReduce Jobs | 2012 | VLDB | 9.2920002e-05 |
| 3,378 | General Incremental Sliding-Window Aggregation | 2015 | VLDB | 7.1622572e-05 |
| 3,703 | Multi-Query Optimization in MapReduce Framework | 2014 | VLDB | 6.8289978e-05 |
| 4,576 | The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox | 2015 | CIDR | 6.0721464e-05 |
| 6,053 | Optimizing Machine Learning Workloads in Collaborative Environments | 2020 | SIGMOD | 5.2326838e-05 |
| 6,330 | Efficient Construction of Approximate Ad-Hoc ML models Through Materialization and Reuse | 2018 | VLDB | 5.1077416e-05 |
| 8,075 | AJoin: Ad-hoc Stream Joins at Scale | 2020 | VLDB | 4.5917655e-05 |
| 8,653 | ApproxML: Efficient Approximate Ad-Hoc ML Models Through Materialization and Reuse | 2019 | VLDB | 4.475291e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,420 | Data Management Challenges in Production Machine Learning | 2017 | SIGMOD | 0.00012057956 |
| 3,698 | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines | 2022 | SIGMOD | 6.8340435e-05 |
| 2,170 | tf.data: A Machine Learning Data Processing Framework | 2021 | VLDB | 9.3821603e-05 |
| 9,118 | Towards Observability for Production Machine Learning Pipelines | 2022 | VLDB | 4.3928288e-05 |
| 5,567 | Optimizing Data Pipelines for Machine Learning in Feature Stores | 2023 | VLDB | 5.4305348e-05 |
| 4,774 | LIMA: Fine-grained Lineage Tracing and Reuse in Machine Learning Systems | 2021 | SIGMOD | 5.9316087e-05 |
| 8,257 | Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines | 2023 | SIGMOD | 4.5487511e-05 |
| 6,330 | Efficient Construction of Approximate Ad-Hoc ML models Through Materialization and Reuse | 2018 | VLDB | 5.1077416e-05 |
| 6,053 | Optimizing Machine Learning Workloads in Collaborative Environments | 2020 | SIGMOD | 5.2326838e-05 |
| 2,456 | Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities | 2021 | SIGMOD | 8.7733773e-05 |