Database Paper Browser

Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings

Summary: The paper learns workload-driven cost models for big data queries and integrates them into a Cascade-style optimizer to optimize both plans and containers. In production on SCOPE, the learned models (Cleo) are 2–3 orders of magnitude more accurate and 20x more correlated with actual runtimes; about 70% of the resulting plan changes reduce latency and save resources. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 5797
Venue: SIGMOD
Year: 2020
Pagerank: 6.9055212e-05
Overall Rank: 3,625 | 74.79%
DOI: 10.1145/3318464.3380584

Incoming Citations (Sorted by Pagerank)

Showing 40 of 40 citing papers.

Rank Citing Paper Year Venue Pagerank
2,954 Magpie: Python at Speed and Scale using Cloud Backends 2021 CIDR 7.8262582e-05
3,169 QueryFormer: A Tree Transformer Model for Query Plan Representation 2022 VLDB 7.4498425e-05
3,828 Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction 2022 VLDB 6.7208524e-05
3,875 Cloudy with High Chance of DBMS: A 10-year Prediction for Enterprise-Grade ML 2020 CIDR 6.675257e-05
4,690 Deploying a Steered Query Optimizer in Production at Microsoft 2022 SIGMOD 5.997226e-05
5,334 LEON: A New Framework for ML-Aided Query Optimization 2023 VLDB 5.5649836e-05
5,368 Fine-Grained Modeling and Optimization for Intelligent Resource Management in Big Data Processing 2022 VLDB 5.5457532e-05
6,040 Steering Query Optimizers: A Practical Take on Big Data Workloads 2021 SIGMOD 5.2412035e-05
6,261 The Cosmos Big Data Platform at Microsoft: Over a Decade of Progress and a Decade to Look Forward 2021 VLDB 5.1350714e-05
6,775 A Unified Transferable Model for ML-Enhanced DBMS 2022 CIDR 4.9299192e-05
6,879 Detect, Distill and Update: Learned DB Systems Facing Out of Distribution Data 2023 SIGMOD 4.8971368e-05
6,885 PilotScope: Steering Databases with Machine Learning Drivers 2024 VLDB 4.895386e-05
7,467 Yannakakis+: Practical Acyclic Query Evaluation with Theoretical Guarantees 2025 SIGMOD 4.7218691e-05
7,655 Machine Learning for Cloud Data Systems: the Progress so far and the Path Forward 2021 VLDB 4.6872456e-05
7,684 AutoToken: Predicting Peak Parallelism for Big Data Analytics at Microsoft 2020 VLDB 4.6796855e-05
7,753 Rethinking Learned Cost Models: Why Start from Scratch? 2023 SIGMOD 4.660151e-05
7,828 Modeling Shifting Workloads for Learned Database Systems 2024 SIGMOD 4.6407986e-05
7,889 Cost-Intelligent Data Analytics in the Cloud 2024 CIDR 4.6253386e-05
8,041 DISTILL: Low-Overhead Data-Driven Techniques for Filtering and Costing Indexes for Scalable Index Tuning 2022 VLDB 4.5998045e-05
8,131 Sibyl: Forecasting Time-Evolving Query Workloads 2024 SIGMOD 4.5784634e-05
8,197 SparkCruise: Workload Optimization in Managed Spark Clusters at Microsoft 2021 VLDB 4.5607121e-05
8,220 PerfGuard: Deploying ML-for-Systems without Performance Regressions, Almost! 2021 VLDB 4.5557328e-05
8,416 Towards Building Autonomous Data Services on Azure 2023 SIGMOD 4.5196199e-05
8,582 Towards Query Optimizer as a Service (QOaaS) in a Unified LakeHouse Ecosystem: Can One QO Rule Them All? 2025 CIDR 4.492033e-05
8,774 Tiresias: Enabling Predictive Autonomous Storage and Indexing 2022 VLDB 4.4559995e-05
8,834 ByteCard: Enhancing ByteDance’s Data Warehouse with Learned Cardinality Estimation 2024 SIGMOD 4.4394021e-05
8,956 T3: Accurate and Fast Performance Prediction for Relational Database Systems With Compiled Decision Trees 2025 SIGMOD 4.4214154e-05
9,194 Phoebe: A Learning-based Checkpoint Optimizer 2021 VLDB 4.3761777e-05
9,600 Optimizing Dataflow Systems for Scalable Interactive Visualization 2024 SIGMOD 4.3177432e-05
9,852 Machine Unlearning in Learned Databases: An Experimental Analysis 2024 SIGMOD 4.2714575e-05
9,930 Wii: Dynamic Budget Reallocation In Index Tuning 2024 SIGMOD 4.2510122e-05
10,125 Understanding and Detecting Query Performance Regression in Practical Index Tuning: [Experiments & Analysis] 2026 SIGMOD 4.1945683e-05
10,219 Practical Parameterized Query Optimization via Efficient Plan Reuse and List-wise Ranking 2026 SIGMOD 4.1945683e-05
10,271 OBELISK: Efficient Offline Query Planning with Bayesian Optimization-Informed Language Model Reasoning 2026 VLDB 4.1945683e-05
10,491 Intra-Query Runtime Elasticity for Cloud-Native Data Analysis 2025 SIGMOD 4.1945683e-05
10,543 Esc: An Early-Stopping Checker for Budget-aware Index Tuning 2025 VLDB 4.1945683e-05
10,627 Robust Plan Evaluation based on Approximate Probabilistic Machine Learning 2025 VLDB 4.1945683e-05
10,849 AXE: A Task Decomposition Approach to Learned LSM Tuning 2025 VLDB 4.1945683e-05
10,859 Graph Transformers for Query Plan Representation: Potentials and Challenges 2025 VLDB 4.1945683e-05
11,267 Anser: Adaptive Information Sharing Framework of AnalyticDB 2023 VLDB 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 24 of 24 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank Cited Paper Year Venue Pagerank
22 SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets 2008 VLDB 0.0008456613
71 How Good Are Query Optimizers, Really? 2016 VLDB 0.00059038975
102 The Case for Learned Index Structures 2018 SIGMOD 0.00049545203
182 LEO - DB2's LEarning Optimizer 2001 VLDB 0.00036962631
204 Learned Cardinalities: Estimating Correlated Joins with Deep Learning 2019 CIDR 0.00034784455
333 Neo: A Learned Query Optimizer 2019 VLDB 0.00027206884
801 SageDB: A Learned Database System 2019 CIDR 0.00016505496
953 Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance 2010 VLDB 0.00015095431
1,019 Robust Estimation of Resource Consumption for SQL Queries using Statistical Techniques 2012 VLDB 0.00014625603
1,228 Toward a Progress Indicator for Database Queries 2004 SIGMOD 0.00013164884
1,254 Selectivity Estimation for Range Predicates using Lightweight Models 2019 VLDB 0.00013027411
1,512 Estimating Progress of Execution for SQL Queries 2004 SIGMOD 0.00011597041
1,666 HELIX: Holistic Optimization for Accelerating Iterative Machine Learning 2019 VLDB 0.0001096361
1,922 Selecting Subexpressions to Materialize at Datacenter Scale 2018 VLDB 0.00010082599
2,083 Towards a Learning Optimizer for Shared Clouds 2019 VLDB 9.5834572e-05
2,817 Recurring Job Optimization in Scope 2012 SIGMOD 8.0677653e-05
3,038 Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics 2017 SIGMOD 7.6717218e-05
3,901 Automated Verification of Query Equivalence Using Satisfiability Modulo Theories 2019 VLDB 6.6499845e-05
4,132 Advanced Join Strategies for Large-Scale Distributed Computation 2014 VLDB 6.4241067e-05
4,174 Computation Reuse in Analytics Job Service at Microsoft 2018 SIGMOD 6.3856219e-05
5,297 Continuous Cloud-Scale Query Optimization and Processing 2013 VLDB 5.5801669e-05
7,387 Bubble Execution: Resource-aware Reliable Analytics at Cloud Scale 2018 VLDB 4.7438193e-05
9,266 Redoop Infrastructure for Recurring Big Data Queries 2014 VLDB 4.3667196e-05
9,735 SparkCruise: Handsfree Computation Reuse in Spark 2019 VLDB 4.2942813e-05

Semantically Similar Papers