Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
Summary: Flash-LLM is a tensor-core inference framework for large generative models with unstructured sparsity. Its “Load-as-Sparse, Compute-as-Dense” SpMM approach loads weights in sparse form to reduce memory-bandwidth demand and then computes on them as dense tiles, accepting some redundant computation in exchange for better tensor-core utilization. It delivers 1.5–2.9× faster SpMM than Sputnik and SparTA, and up to 3.8× higher end-to-end tokens/sec than DeepSpeed and FasterTransformer. (summarized by gpt-5-mini on Feb 09 2026)
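The “Load-as-Sparse, Compute-as-Dense” idea in the summary can be illustrated with a toy example. This is only a sketch under stated assumptions: the `(row, col, value)` storage and the function names `compress`, `expand`, `dense_matmul`, and `spmm_load_sparse_compute_dense` are hypothetical and chosen for readability. They are not Flash-LLM's actual tiled format or GPU kernels.

```python
# Illustrative only: a toy "load as sparse, compute as dense" matrix multiply.
# The storage format and all names here are assumptions, not Flash-LLM's API.

def compress(dense):
    """Keep only nonzeros as (row, col, value) triples: the form that is 'loaded'."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0]

def expand(triples, rows, cols):
    """Rebuild a dense tile locally (stands in for fast on-chip memory)."""
    tile = [[0.0] * cols for _ in range(rows)]
    for i, j, v in triples:
        tile[i][j] = v
    return tile

def dense_matmul(a, b):
    """Plain dense matmul; zeros are multiplied too (the redundant compute)."""
    inner, out_cols = len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(out_cols)]
            for i in range(len(a))]

def spmm_load_sparse_compute_dense(triples, shape, b):
    """Move only nonzeros, then run an ordinary dense multiply on the rebuilt tile."""
    rows, cols = shape
    return dense_matmul(expand(triples, rows, cols), b)

A = [[0.0, 2.0, 0.0],
     [1.0, 0.0, 0.0]]          # sparse weight matrix (2 of 6 entries nonzero)
B = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]               # dense activation matrix

nnz = compress(A)              # only 2 triples need to be "loaded"
C = spmm_load_sparse_compute_dense(nnz, (2, 3), B)
print(C)                       # [[6.0, 8.0], [1.0, 2.0]]
```

In this sketch the data moved is proportional to the number of nonzeros, while the arithmetic is the same as a dense multiply. That is the trade the summary describes: fewer bytes loaded at the cost of some redundant computation.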
Incoming Non-self Citations Over Time
Authors
1. Haojun Xia
2. Zhen Zheng
3. Yuchao Li
4. Donglin Zhuang
5. Zhongzhu Zhou
6. Xiafei Qiu
7. Yong Li
8. Wei Lin
9. Shuaiwen Leon Song
Incoming Citations (Sorted by Pagerank)
Showing 3 of 3 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,520 | mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs | 2025 | VLDB | 4.4937074e-05 |
| 9,677 | Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving | 2025 | SIGMOD | 4.3047774e-05 |
| 10,492 | Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization | 2025 | SIGMOD | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 11 of 11 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.