Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
Summary: ECRec applies erasure coding tailored to the large, sparse embedding tables of deep learning recommendation models (DLRMs), using a hybrid erasure-code/replication strategy that correctly and efficiently updates redundant parameters. Implemented on XDL, it avoids pausing training on failure, cuts training-time overhead by up to 66%, speeds up recovery by up to 9.8×, and continues training during recovery with only a 7–13% throughput loss. (summarized by gpt-5-mini on Feb 09 2026)
Authors
1. Tianyu Zhang
2. Kaige Liu
3. Jack Kosaian
4. Juncheng Yang
5. Rashmi Vinayak
Incoming Citations (Sorted by Pagerank)
Showing 1 of 1 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 10,532 | IncrCP: Decomposing and Orchestrating Incremental Checkpoints for Effective Recommendation Model Training | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 3 of 3 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 50 | A Case for Redundant Arrays of Inexpensive Disks (RAID) | 1988 | SIGMOD | 0.00067394827 |
| 2,688 | Accelerating Recommendation System Training by Leveraging Popular Choices | 2022 | VLDB | 8.2991144e-05 |
| 3,669 | XORing Elephants: Novel Erasure Codes for Big Data | 2013 | VLDB | 6.8584744e-05 |