Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization
Summary: Malleus makes hybrid parallel training resilient to stragglers by profiling each GPU and running a planning algorithm that jointly optimizes GPU grouping, pipeline construction, layer assignment, and training-data assignment. When the straggler distribution changes, it re-plans and migrates model state on the fly to keep training stable, delivering a 2.63–5.28x efficiency improvement on LLMs of up to 110B parameters.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 7241
- Venue: SIGMOD
- Year: 2025
- Pagerank: 4.1945683e-05
- Overall Rank: 10,492 | 27.01%
- DOI: 10.1145/3725322
Incoming Non-self Citations Over Time
No non-self incoming citations found for this paper in this database.
Incoming Citations (Sorted by Pagerank)
Showing 1 of 1 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 18 of 18 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,942 | Heterogeneity-aware Distributed Parameter Servers | 2017 | SIGMOD | 0.00010012691 |
| 2,902 | PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel | 2023 | VLDB | 7.93939e-05 |
| 3,114 | GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization | 2024 | VLDB | 7.5451724e-05 |
| 3,662 | The Dawn of Natural Language to SQL: Are We Fully Ready? | 2024 | VLDB | 6.8672143e-05 |
| 3,698 | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines | 2022 | SIGMOD | 6.8340435e-05 |
| 3,808 | SketchML: Accelerating Distributed Machine Learning with Data Sketches | 2018 | SIGMOD | 6.7455428e-05 |
| 5,099 | ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models | 2024 | VLDB | 5.6997784e-05 |
| 5,552 | GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning | 2023 | SIGMOD | 5.4402488e-05 |
| 5,720 | BAGUA: Scaling up Distributed Learning with System Relaxations | 2022 | VLDB | 5.3527734e-05 |
| 6,377 | Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism | 2023 | VLDB | 5.0911095e-05 |
| 7,152 | Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | 2024 | VLDB | 4.8154191e-05 |
| 7,227 | Data and AI Model Markets: Opportunities for Data and Model Sharing, Discovery, and Integration | 2023 | VLDB | 4.7952919e-05 |
| 7,536 | Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent | 2023 | VLDB | 4.7176331e-05 |
| 8,080 | Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines | 2024 | VLDB | 4.5911668e-05 |
| 8,092 | Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications | 2023 | SIGMOD | 4.587921e-05 |
| 8,808 | FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement | 2023 | SIGMOD | 4.4454035e-05 |
| 9,319 | How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study | 2024 | VLDB | 4.3556432e-05 |
| 9,326 | BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach | 2023 | SIGMOD | 4.3556432e-05 |
Semantically Similar Papers