Towards an Optimized GROUP BY Abstraction for Large-Scale Machine Learning
Summary: Proposes grouped learning, a GROUP BY-like abstraction for training ML models over subgroups of a dataset. Presents Gradient Accumulation Parallelism (GAP) and a hybrid task-parallel/data-parallel approach, implemented in Kingpin on Ray, delivering up to 4x–14x speedups over state-of-the-art systems.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 12410
- Venue: VLDB
- Year: 2021
- Pagerank: 4.3698672e-05
- Overall Rank: 9,222 | 35.85%
- DOI: 10.14778/3476249.3476284
Incoming Citations (Sorted by Pagerank)
Showing 4 of 4 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 19 of 19 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 140 | The MADlib Analytics Library or MAD Skills, the SQL | 2012 | VLDB | 0.00042270404 |
| 411 | PyTorch Distributed: Experiences on Accelerating Data Parallel Training | 2020 | VLDB | 0.00023906921 |
| 557 | SystemML: Declarative Machine Learning on Spark | 2016 | VLDB | 0.00020197988 |
| 658 | Towards a Unified Architecture for in-RDBMS Analytics | 2012 | SIGMOD | 0.00018506577 |
| 683 | Cerebro: A Data System for Optimized Deep Learning Model Selection | 2020 | VLDB | 0.00018195476 |
| 834 | Learning Linear Regression Models over Factorized Joins | 2016 | SIGMOD | 0.00016135159 |
| 850 | Scaling Factorization Machines to Relational Data | 2013 | VLDB | 0.00015955971 |
| 1,167 | Learning Generalized Linear Models Over Normalized Data | 2015 | SIGMOD | 0.00013547713 |
| 1,279 | Towards Linear Algebra over Normalized Data | 2017 | VLDB | 0.00012868394 |
| 1,391 | Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads | 2018 | VLDB | 0.0001223506 |
| 1,402 | Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML | 2014 | VLDB | 0.00012180605 |
| 2,122 | SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle | 2020 | CIDR | 9.4989076e-05 |
| 2,194 | Enabling and Optimizing Non-linear Feature Interactions in Factorized Linear Algebra | 2019 | SIGMOD | 9.3138337e-05 |
| 3,918 | On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML | 2018 | VLDB | 6.6315176e-05 |
| 4,159 | F: Regression Models over Factorized Views | 2016 | VLDB | 6.3993326e-05 |
| 4,785 | Demonstration of Santoku: Optimizing Machine Learning over Normalized Data | 2015 | VLDB | 5.9236989e-05 |
| 4,975 | An Experimental Evaluation of Large Scale GBDT Systems | 2019 | VLDB | 5.79026e-05 |
| 8,864 | Cerebro: A Layered Data Platform for Scalable Deep Learning | 2021 | CIDR | 4.4326439e-05 |
| 9,117 | Ease.ml in Action: Towards Multi-tenant Declarative Learning Services | 2018 | VLDB | 4.3928617e-05 |
Semantically Similar Papers