PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Summary: Introduces PyTorch Fully Sharded Data Parallel (FSDP), an industry-grade, non-intrusive sharding framework co-designed with PyTorch internals (Tensor, dispatcher, CUDA allocator) to train models much larger than DDP can accommodate. FSDP bundles memory and communication optimizations for a variety of hardware configurations, achieving near-linear TFLOPS scalability and DDP-comparable throughput with a much smaller memory footprint. (summarized by gpt-5-mini on Feb 09 2026)
Authors
1. Yanli Zhao
2. Andrew Gu
3. Rohan Varma
4. Liang Luo
5. Chien-Chin Huang
6. Min Xu
7. Less Wright
8. Hamid Shojanazeri
9. Myle Ott
10. Sam Shleifer
11. Alban Desmaison
12. Can Balioglu
13. Pritam Damania
14. Bernard Nguyen
15. Geeta Chauhan
16. Yuchen Hao
17. Ajit Mathews
18. Shen Li
Incoming Citations (Sorted by Pagerank)
Showing 11 of 11 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 2 of 2 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 411 | PyTorch Distributed: Experiences on Accelerating Data Parallel Training | 2020 | VLDB | 0.00023906921 |
| 2,352 | MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud | 2023 | VLDB | 0.000089766205 |