tf.data: A Machine Learning Data Processing Framework
Summary: tf.data offers composable, parameterized operators for efficient ML input pipelines; the runtime overlaps I/O and compute with minimal tuning. Google workloads show data processing dominates resources and motivates cross-job sharing and storage projection.
(summarized by gpt-5-nano on Feb 09 2026)
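The two ideas in the summary, composable operators and overlapping production with consumption, can be illustrated with a minimal standard-library sketch. This is not tf.data's API or implementation: the names `map_op`, `batch_op`, `prefetch_op`, and `build_pipeline` are hypothetical, chosen only to mirror the `map`, `batch`, and `prefetch` transformations that `tf.data.Dataset` provides.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")

_DONE = object()  # sentinel marking the end of the stream


def map_op(fn: Callable[[T], U], source: Iterable[T]) -> Iterator[U]:
    """Lazily apply fn to each element of source."""
    for item in source:
        yield fn(item)


def batch_op(size: int, source: Iterable[T]) -> Iterator[List[T]]:
    """Group consecutive elements into lists of at most `size` items."""
    batch: List[T] = []
    for item in source:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def prefetch_op(buffer_size: int, source: Iterable[T]) -> Iterator[T]:
    """Produce elements on a background thread into a bounded buffer,
    so the consumer can work on one element while the next is prepared."""
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        try:
            for item in source:
                buf.put(item)  # blocks while the buffer is full
        finally:
            buf.put(_DONE)  # always signal the end, even if source raises

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


def build_pipeline() -> Iterator[List[int]]:
    """Compose the operators: double each value, batch by 4, prefetch 2 batches."""
    doubled = map_op(lambda x: x * 2, range(10))
    batched = batch_op(4, doubled)
    return prefetch_op(2, batched)
```

Iterating over `build_pipeline()` yields batches of doubled integers. Each operator takes an iterable and returns an iterator, which is what lets them be chained in any order. The bounded queue in `prefetch_op` is the part that decouples the producer from the consumer.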
- Paper ID: 12503
- Venue: VLDB
- Year: 2021
- Pagerank: 9.3821603e-05
- Overall Rank: 2,170 | 84.91%
- DOI: 10.14778/3476311.3476374
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 16 of 16 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 3,407 | End-to-end Optimization of Machine Learning Prediction Queries | 2022 | SIGMOD | 7.1295646e-05 |
| 3,698 | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines | 2022 | SIGMOD | 6.8340435e-05 |
| 4,180 | FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline | 2023 | VLDB | 6.3793352e-05 |
| 5,552 | GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning | 2023 | SIGMOD | 5.4402488e-05 |
| 6,057 | Progressive Compressed Records: Taking a Byte out of Deep Learning Data | 2021 | VLDB | 5.2317752e-05 |
| 7,469 | Bullion: A Column Store for Machine Learning | 2025 | CIDR | 4.7204398e-05 |
| 8,348 | FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation | 2024 | VLDB | 4.5410024e-05 |
| 8,514 | UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads | 2022 | VLDB | 4.4944285e-05 |
| 8,735 | TensorSocket: Shared Data Loading for Deep Learning Training | 2026 | SIGMOD | 4.456315e-05 |
| 8,737 | Scheduling Data Processing Pipelines for Incremental Training on MLP-based Recommendation Models | 2025 | SIGMOD | 4.456315e-05 |
| 9,231 | Modyn: Data-Centric Machine Learning Pipeline Orchestration | 2025 | SIGMOD | 4.3690661e-05 |
| 9,805 | MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training | 2025 | SIGMOD | 4.2805224e-05 |
| 10,183 | Mixtera: A Data Plane for Foundation Model Training | 2026 | SIGMOD | 4.1945683e-05 |
| 10,220 | FlatStor: An Efficient Embedded-Index Based Columnar Data Layout for Multimodal Data Workloads | 2026 | VLDB | 4.1945683e-05 |
| 10,580 | GPEmu: A GPU Emulator for Faster and Cheaper Prototyping and Evaluation of Deep Learning System Research | 2025 | VLDB | 4.1945683e-05 |
| 10,770 | cedar: Optimized and Unified Machine Learning Input Data Pipelines | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 6 of 6 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,348 | FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation | 2024 | VLDB | 4.5410024e-05 |
| 4,003 | Data Platform for Machine Learning | 2019 | SIGMOD | 6.54347e-05 |
| 5,552 | GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning | 2023 | SIGMOD | 5.4402488e-05 |
| 8,514 | UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads | 2022 | VLDB | 4.4944285e-05 |
| 2,172 | Spinning Fast Iterative Data Flows | 2012 | VLDB | 9.3706587e-05 |
| 2,456 | Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities | 2021 | SIGMOD | 8.7733773e-05 |
| 3,698 | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines | 2022 | SIGMOD | 6.8340435e-05 |
| 10,770 | cedar: Optimized and Unified Machine Learning Input Data Pipelines | 2025 | VLDB | 4.1945683e-05 |
| 3,491 | TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines | 2020 | SIGMOD | 7.0451276e-05 |
| 4,180 | FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline | 2023 | VLDB | 6.3793352e-05 |