SparkCruise: Workload Optimization in Managed Spark Clusters at Microsoft
Summary: SparkCruise adds a workload-driven feedback loop to the Spark SQL optimizer, optimizing large workloads without accessing user data. The paper compares production Spark SQL workloads with TPC-DS and demonstrates online learning and a computation-reuse optimization.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 12517
- Venue: VLDB
- Year: 2021
- Pagerank: 4.5607121e-05
- Overall Rank: 8,197 | 42.98%
- DOI: 10.14778/3476311.3476388
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 5 of 5 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 13 of 13 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 22 | SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets | 2008 | VLDB | 0.0008456613 |
| 66 | Spark SQL: Relational Data Processing in Spark | 2015 | SIGMOD | 0.00061639801 |
| 303 | Generic Schema Matching with Cupid | 2001 | VLDB | 0.00028301477 |
| 542 | Shark: SQL and Rich Analytics at Scale | 2013 | SIGMOD | 0.00020595648 |
| 801 | SageDB: A Learned Database System | 2019 | CIDR | 0.00016505496 |
| 1,922 | Selecting Subexpressions to Materialize at Datacenter Scale | 2018 | VLDB | 0.00010082599 |
| 2,083 | Towards a Learning Optimizer for Shared Clouds | 2019 | VLDB | 9.5834572e-05 |
| 2,129 | IDEBench: A Benchmark for Interactive Data Exploration | 2020 | SIGMOD | 9.480002e-05 |
| 3,625 | Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings | 2020 | SIGMOD | 6.9055212e-05 |
| 3,789 | DIAMetrics: Benchmarking Query Engines at Scale | 2020 | VLDB | 6.7644737e-05 |
| 4,174 | Computation Reuse in Analytics Job Service at Microsoft | 2018 | SIGMOD | 6.3856219e-05 |
| 6,040 | Steering Query Optimizers: A Practical Take on Big Data Workloads | 2021 | SIGMOD | 5.2412035e-05 |
| 9,735 | SparkCruise: Handsfree Computation Reuse in Spark | 2019 | VLDB | 4.2942813e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 11,011 | Intelligent Pooling: Proactive Resource Provisioning in Large-scale Cloud Service | 2024 | VLDB | 4.1945683e-05 |
| 10,414 | Rockhopper: A Robust Optimizer for Spark Configuration Tuning in Production Environment | 2025 | SIGMOD | 4.1945683e-05 |
| 9,155 | Towards Resource Efficiency: Practical Insights into Large-Scale Spark Workloads at ByteDance | 2024 | VLDB | 4.3849295e-05 |
| 3,535 | Scaling Spark in the Real World: Performance and Usability | 2015 | VLDB | 6.9992495e-05 |
| 9,124 | Dynamic Speculative Optimizations for SQL Compilation in Apache Spark | 2020 | VLDB | 4.391961e-05 |
| 6,871 | Towards General and Efficient Online Tuning for Spark | 2023 | VLDB | 4.8997004e-05 |
| 5,297 | Continuous Cloud-Scale Query Optimization and Processing | 2013 | VLDB | 5.5801669e-05 |
| 8,506 | New Query Optimization Techniques in the Spark Engine of Azure Synapse | 2022 | VLDB | 4.4957661e-05 |
| 8,617 | A Spark Optimizer for Adaptive, Fine-Grained Parameter Tuning | 2024 | VLDB | 4.4846425e-05 |
| 9,735 | SparkCruise: Handsfree Computation Reuse in Spark | 2019 | VLDB | 4.2942813e-05 |