DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud
Summary: An elastic framework for DLRM training that builds a DLRM-specific resource–performance model and uses a three-stage heuristic to automatically allocate and dynamically adjust GPU, CPU, and memory resources, improving utilization. It also mitigates cloud instability. Deployed at AntGroup, it reports 31% lower job completion time (JCT), 15% higher CPU utilization, and 20% higher memory utilization. (summarized by gpt-5-mini on Feb 09 2026)
Authors
1. Qinlong Wang
2. Tingfeng Lan
3. Yinghao Tang
4. Bo Sang
5. Ziling Huang
6. Yiheng Du
7. Haitao Zhang
8. Jian Sha
9. Hui Lu
10. Yuanchun Zhou
11. Ke Zhang
12. Mingjie Tang
Incoming Citations (Sorted by Pagerank)
Showing 1 of 1 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 10,532 | IncrCP: Decomposing and Orchestrating Incremental Checkpoints for Effective Recommendation Model Training | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 6 of 6 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 2,688 | Accelerating Recommendation System Training by Leveraging Popular Choices | 2022 | VLDB | 8.2991144e-05 |
| 3,114 | GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization | 2024 | VLDB | 7.5451724e-05 |
| 4,180 | FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline | 2023 | VLDB | 6.3793352e-05 |
| 4,616 | Eigen: End-to-end Resource Optimization for Large-Scale Databases on the Cloud | 2023 | VLDB | 6.045723e-05 |
| 4,802 | Resource Elasticity for Large-Scale Machine Learning | 2015 | SIGMOD | 5.9114415e-05 |
| 7,536 | Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent | 2023 | VLDB | 4.7176331e-05 |