Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines

Summary: Analyzes data preprocessing pipelines from four machine learning domains, exposing bottlenecks and trade-offs among throughput, preprocessing time, and storage consumption. Presents an open-source profiling library that automatically selects a preprocessing strategy, delivering 3x to 13x higher throughput than an untuned system while keeping the pipeline functionally identical. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 6301
Venue: SIGMOD
Year: 2022
Pagerank: 7.4298791e-05
Overall Rank: 3,506 | 75.64%
DOI: 10.1145/3514221.3517848

Incoming Citations (Sorted by Pagerank)

Showing 9 of 9 citing papers.

Overall Rank	Citing Paper	Year	Venue	Pagerank
3,715	FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline	2023	VLDB	7.243454e-05
5,219	GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning	2023	SIGMOD	6.3813971e-05
8,774	TensorSocket: Shared Data Loading for Deep Learning Training	2026	SIGMOD	5.4311509e-05
8,776	Scheduling Data Processing Pipelines for Incremental Training on MLP-based Recommendation Models	2025	SIGMOD	5.4311509e-05
9,661	Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving	2025	SIGMOD	5.2956801e-05
9,777	MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training	2025	SIGMOD	5.2743647e-05
9,778	The Image Calculator: 10x Faster Image-AI Inference by Replacing JPEG with Self-designing Storage Format	2024	SIGMOD	5.2743647e-05
10,502	Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization	2025	SIGMOD	5.1725247e-05
10,776	cedar: Optimized and Unified Machine Learning Input Data Pipelines	2025	VLDB	5.1725247e-05

Outgoing Citations (Sorted by Pagerank)

Showing 3 of 3 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Overall Rank	Cited Paper	Year	Venue	Pagerank
1,424	Analyzing and Mitigating Data Stalls in DNN Training	2021	VLDB	0.00010921601
1,992	tf.data: A Machine Learning Data Processing Framework	2021	VLDB	9.4299705e-05
3,225	Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics	2021	VLDB	7.6926736e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
6,235	Materialization and Reuse Optimizations for Production Data Science Pipelines	2022	SIGMOD	6.0072575e-05
8,774	TensorSocket: Shared Data Loading for Deep Learning Training	2026	SIGMOD	5.4311509e-05
5,219	GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning	2023	SIGMOD	6.3813971e-05
3,225	Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics	2021	VLDB	7.6926736e-05
2,623	Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities	2021	SIGMOD	8.3971735e-05
5,586	DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data	2023	SIGMOD	6.2252661e-05
1,992	tf.data: A Machine Learning Data Processing Framework	2021	VLDB	9.4299705e-05
1,424	Analyzing and Mitigating Data Stalls in DNN Training	2021	VLDB	0.00010921601
8,180	FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation	2024	VLDB	5.5389119e-05
3,715	FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline	2023	VLDB	7.243454e-05