Expand your Training Limits! Generating Training Data for ML-based Data Management
Summary: DataFarm generates and labels large, heterogeneous query workloads for training ML-based data management components. Its data-driven white-box approach learns from small existing workloads and data to synthesize new jobs, reporting up to 9x better labeling quality (measured by R^2) and up to 54x lower cost than prior work. (summarized by gpt-5-nano on Feb 09 2026)
Incoming Citations (Sorted by Pagerank)
Showing 9 of 9 citing papers.
| Overall Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 3,828 | Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction | 2022 | VLDB | 6.7208524e-05 |
| 7,753 | Rethinking Learned Cost Models: Why Start from Scratch? | 2023 | SIGMOD | 4.660151e-05 |
| 8,735 | TensorSocket: Shared Data Loading for Deep Learning Training | 2026 | SIGMOD | 4.456315e-05 |
| 8,956 | T3: Accurate and Fast Performance Prediction for Relational Database Systems With Compiled Decision Trees | 2025 | SIGMOD | 4.4214154e-05 |
| 9,006 | Hit the Gym: Accelerating Query Execution to Efficiently Bootstrap Behavior Models for Self-Driving Database Management Systems | 2024 | VLDB | 4.4101482e-05 |
| 9,292 | Farm Your ML-based Query Optimizer's Food! - Human-Guided Training Data Generation - | 2022 | CIDR | 4.3619543e-05 |
| 9,467 | Database Gyms | 2023 | CIDR | 4.3346412e-05 |
| 9,600 | Optimizing Dataflow Systems for Scalable Interactive Visualization | 2024 | SIGMOD | 4.3177432e-05 |
| 10,840 | Learned Cost Models for Query Optimization: From Batch to Streaming Systems | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 21 of 21 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 5,861 | Machine Learning for Databases | 2021 | VLDB | 5.298883e-05 |
| 1,532 | Data Management in Machine Learning: Challenges, Techniques, and Systems | 2017 | SIGMOD | 1.1472681e-04 |
| 6,775 | A Unified Transferable Model for ML-Enhanced DBMS | 2022 | CIDR | 4.9299192e-05 |
| 4,549 | Database-Agnostic Workload Management | 2019 | CIDR | 6.0926728e-05 |
| 8,637 | Machine Learning for Data Management: Problems and Solutions | 2018 | SIGMOD | 4.479892e-05 |
| 9,776 | Structure-Aware Machine Learning over Multi-Relational Databases | 2021 | SIGMOD | 4.2856106e-05 |
| 608 | DeepDB: Learn from Data, not from Queries! | 2020 | VLDB | 1.9235898e-04 |
| 11,650 | Query-Driven Learning for Next Generation Predictive Modeling & Analytics | 2019 | SIGMOD | 4.1945683e-05 |
| 10,843 | Machine Learning for Graph Data Management and Query Processing | 2025 | VLDB | 4.1945683e-05 |
| 9,292 | Farm Your ML-based Query Optimizer's Food! - Human-Guided Training Data Generation - | 2022 | CIDR | 4.3619543e-05 |