SDPipe: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training
Summary: SDPipe is a semi-decentralized pipeline-parallel training framework that decentralizes heavy model synchronization while centralizing lightweight group scheduling to tolerate cloud heterogeneity. This hybrid design reduces synchronization overhead and straggler impact, improving scalability and convergence relative to pure parameter-server or All-Reduce approaches. (summarized by gpt-5-mini on Feb 09 2026)
Authors
1. Xupeng Miao
2. Yining Shi
3. Zhi Yang
4. Bin Cui
5. Zhihao Jia
Incoming Citations (Sorted by Pagerank)
Showing 3 of 3 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 9,677 | Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving | 2025 | SIGMOD | 4.3047774e-05 |
| 9,805 | MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training | 2025 | SIGMOD | 4.2805224e-05 |
| 10,298 | NeutronCloud: Resource-Aware Distributed GNN Training in Fluctuating Cloud Environments | 2026 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 9 of 9 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 411 | PyTorch Distributed: Experiences on Accelerating Data Parallel Training | 2020 | VLDB | 0.00023906921 |
| 1,044 | DimmWitted: A Study of Main-Memory Statistical Analytics | 2014 | VLDB | 0.00014475229 |
| 1,942 | Heterogeneity-aware Distributed Parameter Servers | 2017 | SIGMOD | 0.00010012691 |
| 2,677 | HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework | 2022 | VLDB | 8.3268401e-05 |
| 2,791 | Towards Demystifying Serverless Machine Learning Training | 2021 | SIGMOD | 8.1206618e-05 |
| 5,333 | Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce | 2021 | SIGMOD | 5.5656575e-05 |
| 6,377 | Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism | 2023 | VLDB | 5.0911095e-05 |
| 7,536 | Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent | 2023 | VLDB | 4.7176331e-05 |
| 8,808 | FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement | 2023 | SIGMOD | 4.4454035e-05 |