Database Paper Browser

Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization

Summary: Malleus makes hybrid parallel training resilient to stragglers by profiling each GPU and running a planning algorithm that jointly optimizes GPU grouping, pipeline stages, layer assignment, and data distribution. When the straggler distribution changes, it re-plans and migrates model state on the fly to keep training stable, and it reports a 2.63–5.28x efficiency improvement on LLMs of up to 110B parameters. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 7241
Venue: SIGMOD
Year: 2025
Pagerank: 4.1945683e-05
Overall Rank: 10,492 | 27.01%
DOI: 10.1145/3725322

Incoming Non-self Citations Over Time

No non-self incoming citations found for this paper in this database.

Authors

Incoming Citations (Sorted by Pagerank)

Showing 1 of 1 citing papers.

Outgoing Citations (Sorted by Pagerank)

Showing 18 of 18 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
1,942 | Heterogeneity-aware Distributed Parameter Servers | 2017 | SIGMOD | 0.00010012691
2,902 | PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel | 2023 | VLDB | 7.93939e-05
3,114 | GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization | 2024 | VLDB | 7.5451724e-05
3,662 | The Dawn of Natural Language to SQL: Are We Fully Ready? | 2024 | VLDB | 6.8672143e-05
3,698 | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines | 2022 | SIGMOD | 6.8340435e-05
3,808 | SketchML: Accelerating Distributed Machine Learning with Data Sketches | 2018 | SIGMOD | 6.7455428e-05
5,099 | ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models | 2024 | VLDB | 5.6997784e-05
5,552 | GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning | 2023 | SIGMOD | 5.4402488e-05
5,720 | BAGUA: Scaling up Distributed Learning with System Relaxations | 2022 | VLDB | 5.3527734e-05
6,377 | Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism | 2023 | VLDB | 5.0911095e-05
7,152 | Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | 2024 | VLDB | 4.8154191e-05
7,227 | Data and AI Model Markets: Opportunities for Data and Model Sharing, Discovery, and Integration | 2023 | VLDB | 4.7952919e-05
7,536 | Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent | 2023 | VLDB | 4.7176331e-05
8,080 | Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines | 2024 | VLDB | 4.5911668e-05
8,092 | Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications | 2023 | SIGMOD | 4.587921e-05
8,808 | FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement | 2023 | SIGMOD | 4.4454035e-05
9,319 | How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study | 2024 | VLDB | 4.3556432e-05
9,326 | BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach | 2023 | SIGMOD | 4.3556432e-05
Semantically Similar Papers