PyTorch Distributed: Experiences on Accelerating Data Parallel Training
Summary: Presents the design, implementation, and evaluation of PyTorch Distributed Data Parallel for scalable data-parallel training on GPUs. Its practical optimizations (gradient bucketing, overlapping computation with communication, and skipping gradient synchronization) achieve near-linear scaling to 256 GPUs.
(summarized by gpt-5-nano on Feb 09 2026)
- Paper ID: 12186
- Venue: VLDB
- Year: 2020
- Pagerank: 0.00023906921
- Overall Rank: 411 | 97.15%
- DOI: 10.14778/3415478.3415530
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 28 of 28 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
| 2,677 | HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework | 2022 | VLDB | 8.3268401e-05 |
| 2,791 | Towards Demystifying Serverless Machine Learning Training | 2021 | SIGMOD | 8.1206618e-05 |
| 2,902 | PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel | 2023 | VLDB | 7.93939e-05 |
| 3,025 | NeutronStar: Distributed GNN Training with Hybrid Dependency Management | 2022 | SIGMOD | 7.6906935e-05 |
| 3,254 | Query Processing on Tensor Computation Runtimes | 2022 | VLDB | 7.3161051e-05 |
| 5,052 | HET-GMP: A Graph-based System Approach to Scaling Large Embedding Model Training | 2022 | SIGMOD | 5.7337977e-05 |
| 5,333 | Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce | 2021 | SIGMOD | 5.5656575e-05 |
| 5,720 | BAGUA: Scaling up Distributed Learning with System Relaxations | 2022 | VLDB | 5.3527734e-05 |
| 5,821 | Tensor Relational Algebra for Distributed Machine Learning System Design | 2021 | VLDB | 5.3134851e-05 |
| 6,377 | Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism | 2023 | VLDB | 5.0911095e-05 |
| 7,152 | Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | 2024 | VLDB | 4.8154191e-05 |
| 8,126 | SDPipe: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training | 2023 | VLDB | 4.5796615e-05 |
| 8,520 | mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs | 2025 | VLDB | 4.4937074e-05 |
| 8,607 | Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers | 2022 | VLDB | 4.4855009e-05 |
| 8,712 | ANN Softmax: Acceleration of Extreme Classification Training | 2022 | VLDB | 4.4626362e-05 |
| 8,808 | FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement | 2023 | SIGMOD | 4.4454035e-05 |
| 8,864 | Cerebro: A Layered Data Platform for Scalable Deep Learning | 2021 | CIDR | 4.4326439e-05 |
| 9,222 | Towards an Optimized GROUP BY Abstraction for Large-Scale Machine Learning | 2021 | VLDB | 4.3698672e-05 |
| 9,319 | How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study | 2024 | VLDB | 4.3556432e-05 |
| 9,326 | BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach | 2023 | SIGMOD | 4.3556432e-05 |
| 9,603 | Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads | 2024 | VLDB | 4.3177432e-05 |
| 9,694 | EinDecomp: Decomposition of Declaratively-Specified Machine Learning and Numerical Computations for Parallel Execution | 2025 | VLDB | 4.3025567e-05 |
| 10,089 | Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment | 2026 | SIGMOD | 4.1945683e-05 |
| 10,580 | GPEmu: A GPU Emulator for Faster and Cheaper Prototyping and Evaluation of Deep Learning System Research | 2025 | VLDB | 4.1945683e-05 |
| 10,626 | LobRA: Multi-tenant Fine-tuning over Heterogeneous Data | 2025 | VLDB | 4.1945683e-05 |
| 10,638 | Heta: Distributed Training of Heterogeneous Graph Neural Networks | 2025 | VLDB | 4.1945683e-05 |
| 10,656 | Effective and Efficient Distributed Temporal Graph Learning through Hotspot Memory Sharing | 2025 | VLDB | 4.1945683e-05 |
| 13,122 | DECK: Experiences on Delta Checkpointing for Industrial Recommendation Systems | 2025 | VLDB | - |
Outgoing Citations (Sorted by Pagerank)
Showing 0 of 0 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
| 6,471 | Dynamic Parameter Allocation in Parameter Servers | 2020 | VLDB | 5.0511668e-05 |
| 1,103 | Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture | 2021 | VLDB | 0.00014025101 |
| 5,377 | Parallel Training of Knowledge Graph Embedding Models: A Comparison of Techniques | 2022 | VLDB | 5.5410858e-05 |
| 9,395 | NeutronTP: Load-Balanced Distributed Full-Graph GNN Training with Tensor Parallelism | 2025 | VLDB | 4.3441378e-05 |
| 8,126 | SDPipe: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training | 2023 | VLDB | 4.5796615e-05 |
| 9,596 | Scalable Graph Convolutional Network Training on Distributed-Memory Systems | 2023 | VLDB | 4.319218e-05 |
| 8,735 | TensorSocket: Shared Data Loading for Deep Learning Training | 2026 | SIGMOD | 4.456315e-05 |
| 9,965 | Distributed Learning of Fully Connected Neural Networks using Independent Subnet Training | 2022 | VLDB | 4.2269436e-05 |
| 5,333 | Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce | 2021 | SIGMOD | 5.5656575e-05 |
| 2,902 | PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel | 2023 | VLDB | 7.93939e-05 |