Database Paper Browser

tf.data: A Machine Learning Data Processing Framework

Summary: tf.data provides composable operators, parameterized with user-defined computation, for building efficient ML input pipelines; its runtime overlaps I/O and compute with minimal tuning. An analysis of Google workloads shows that input data processing consumes a significant fraction of job resources, which motivates sharing computation across jobs and pushing data projection down to the storage layer. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 12503
Venue: VLDB
Year: 2021
Pagerank: 9.3821603e-05
Overall Rank: 2,170 | 84.91%
DOI: 10.14778/3476311.3476374

Incoming Non-self Citations Over Time

Authors

Incoming Citations (Sorted by Pagerank)

Showing 16 of 16 citing papers.

Rank | Citing Paper | Year | Venue | Pagerank
3,407 | End-to-end Optimization of Machine Learning Prediction Queries | 2022 | SIGMOD | 7.1295646e-05
3,698 | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines | 2022 | SIGMOD | 6.8340435e-05
4,180 | FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline | 2023 | VLDB | 6.3793352e-05
5,552 | GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning | 2023 | SIGMOD | 5.4402488e-05
6,057 | Progressive Compressed Records: Taking a Byte out of Deep Learning Data | 2021 | VLDB | 5.2317752e-05
7,469 | Bullion: A Column Store for Machine Learning | 2025 | CIDR | 4.7204398e-05
8,348 | FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation | 2024 | VLDB | 4.5410024e-05
8,514 | UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads | 2022 | VLDB | 4.4944285e-05
8,735 | TensorSocket: Shared Data Loading for Deep Learning Training | 2026 | SIGMOD | 4.456315e-05
8,737 | Scheduling Data Processing Pipelines for Incremental Training on MLP-based Recommendation Models | 2025 | SIGMOD | 4.456315e-05
9,231 | Modyn: Data-Centric Machine Learning Pipeline Orchestration | 2025 | SIGMOD | 4.3690661e-05
9,805 | MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training | 2025 | SIGMOD | 4.2805224e-05
10,183 | Mixtera: A Data Plane for Foundation Model Training | 2026 | SIGMOD | 4.1945683e-05
10,220 | FlatStor: An Efficient Embedded-Index Based Columnar Data Layout for Multimodal Data Workloads | 2026 | VLDB | 4.1945683e-05
10,580 | GPEmu: A GPU Emulator for Faster and Cheaper Prototyping and Evaluation of Deep Learning System Research | 2025 | VLDB | 4.1945683e-05
10,770 | cedar: Optimized and Unified Machine Learning Input Data Pipelines | 2025 | VLDB | 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 6 of 6 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers