RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
Summary: RetroInfer reframes KV-cache management for long-context LLM inference as a vector storage problem: it offloads the KV cache to CPU and retrieves only the attention-relevant tokens. Its wave index and wave buffer jointly address the accuracy/cost trade-off of sparse attention and the cost of GPU-CPU data movement, yielding full-attention accuracy with much higher throughput.
(summarized by gpt-5.4-mini on Apr 12 2026)
- Paper ID: 14256
- Venue: VLDB
- Year: 2026
- Pagerank: 4.1945683e-05
- Overall Rank: 10,222 | 28.89%
- DOI: 10.14778/3796195.3796212
Incoming Non-self Citations Over Time
No non-self incoming citations found for this paper in this database.
Incoming Citations (Sorted by Pagerank)
Showing 0 of 0 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
Outgoing Citations (Sorted by Pagerank)
Showing 20 of 20 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 212 | Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph | 2019 | VLDB | 0.00033913475 |
| 495 | Milvus: A Purpose-Built Vector Data Management System | 2021 | SIGMOD | 0.00021767688 |
| 1,636 | PASE: PostgreSQL Ultra-High-Dimensional Approximate Nearest Neighbor Search Extension | 2020 | SIGMOD | 0.00011053863 |
| 2,262 | Manu: A Cloud Native Vector Database Management System | 2022 | VLDB | 9.1624446e-05 |
| 2,320 | High-Throughput Vector Similarity Search in Knowledge Graphs | 2023 | SIGMOD | 9.0366225e-05 |
| 2,523 | ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data | 2024 | SIGMOD | 8.604576e-05 |
| 2,725 | HVS: Hierarchical Graph Structure Based on Voronoi Diagrams for Solving Approximate Nearest Neighbor Search | 2022 | VLDB | 8.2294908e-05 |
| 2,971 | Towards Efficient Index Construction and Approximate Nearest Neighbor Search in High-Dimensional Spaces | 2023 | VLDB | 7.7970531e-05 |
| 3,225 | DeltaPQ: Lossless Product Quantization Code Compression for High Dimensional Similarity Search | 2020 | VLDB | 7.3463484e-05 |
| 3,680 | SingleStore-V: An Integrated Vector Database System in SingleStore | 2024 | VLDB | 6.8496415e-05 |
| 4,544 | ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA | 2022 | SIGMOD | 6.1000636e-05 |
| 4,583 | Virtual-Memory Assisted Buffer Management | 2023 | SIGMOD | 6.0676378e-05 |
| 5,233 | RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search | 2024 | VLDB | 5.6131833e-05 |
| 6,357 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | 2025 | SIGMOD | 5.0970739e-05 |
| 6,376 | DET-LSH: A Locality-Sensitive Hashing Scheme with Dynamic Encoding Tree for Approximate Nearest Neighbor Search | 2024 | VLDB | 5.0916875e-05 |
| 6,389 | Chat2Data: An Interactive Data Analysis System with RAG, Vector Databases and LLMs | 2024 | VLDB | 5.0844009e-05 |
| 6,840 | LeanStore: A High-Performance Storage Engine for NVMe SSDs | 2024 | VLDB | 4.9109345e-05 |
| 8,687 | TigerVector: Supporting Vector Search in Graph Databases for Advanced RAGs | 2025 | SIGMOD | 4.4675056e-05 |
| 9,103 | AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference | 2025 | SIGMOD | 4.3958197e-05 |
| 9,677 | Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving | 2025 | SIGMOD | 4.3047774e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 10,776 | GaussDB-Vector: A Large-Scale Persistent Real-Time Vector Database for LLM Applications | 2025 | VLDB | 4.1945683e-05 |
| 13,138 | Database Perspective on LLM Inference Systems | 2025 | VLDB | - |
| 9,103 | AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference | 2025 | SIGMOD | 4.3958197e-05 |
| 10,066 | DepCache: A KV Cache Management Framework for GraphRAG with Dependency Attention | 2026 | SIGMOD | 4.1945683e-05 |
| 13,135 | ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models | 2025 | VLDB | - |
| 9,805 | MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training | 2025 | SIGMOD | 4.2805224e-05 |
| 10,170 | From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation | 2026 | SIGMOD | 4.1945683e-05 |
| 10,020 | HotPrefix: Hotness-Aware KV Cache Scheduling for Efficient Prefix Sharing in LLM Inference Systems | 2026 | SIGMOD | 4.1945683e-05 |
| 3,565 | Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation | 2025 | SIGMOD | 6.9655362e-05 |
| 6,357 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | 2025 | SIGMOD | 5.0970739e-05 |