SubStrat: A Subset-Based Optimization Strategy for Faster AutoML
Summary: SubStrat is a wrapper that speeds up AutoML by optimizing the dataset size rather than the configuration search. It uses a genetic algorithm to find a small representative subset that preserves target characteristics of the full dataset, runs AutoML on that subset, and then refines the resulting pipeline with a short, restricted AutoML run on the full data. Across Auto-Sklearn, TPOT, and H2O, it achieves about 76% runtime reduction with about 4.15% average accuracy loss. (summarized by gpt-5-mini on Feb 09 2026)
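The subset-search step described in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not say which characteristic the subset should preserve, so the sketch assumes the class-label distribution as a stand-in, selects rows only, and uses hypothetical function names (`find_subset`, `fitness`, `mutate`, `crossover`) and parameter defaults.

```python
import random
from collections import Counter


def label_distribution(labels):
    """Return the relative frequency of each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def fitness(subset, labels, full_dist):
    """Negative L1 distance between subset and full label distributions.

    Assumption: label balance stands in for the 'target characteristics'
    mentioned in the summary. Higher is better; 0 is a perfect match.
    """
    sub_dist = label_distribution([labels[i] for i in subset])
    return -sum(abs(full_dist[k] - sub_dist.get(k, 0.0)) for k in full_dist)


def mutate(subset, n_rows, rng):
    """Swap one selected row index for a currently unselected one."""
    child = list(subset)
    chosen = set(child)
    candidate = rng.randrange(n_rows)
    while candidate in chosen:  # requires len(subset) < n_rows
        candidate = rng.randrange(n_rows)
    child[rng.randrange(len(child))] = candidate
    return child


def crossover(a, b, size, rng):
    """Draw a child of the given size from the union of two parents."""
    pool = sorted(set(a) | set(b))
    return rng.sample(pool, size)


def find_subset(labels, size, pop_size=20, generations=30, seed=0):
    """Genetic search for `size` row indices whose labels mirror the full data."""
    rng = random.Random(seed)
    n_rows = len(labels)
    full_dist = label_distribution(labels)
    population = [rng.sample(range(n_rows), size) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged (elitism), refill with mutated offspring.
        population.sort(key=lambda s: fitness(s, labels, full_dist), reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            children.append(mutate(crossover(a, b, size, rng), n_rows, rng))
        population = survivors + children
    return max(population, key=lambda s: fitness(s, labels, full_dist))


if __name__ == "__main__":
    labels = [0] * 70 + [1] * 30  # toy dataset: 70/30 class split
    rows = find_subset(labels, size=20)
    print(sorted(rows))
    print(fitness(rows, labels, label_distribution(labels)))
```

In the workflow the summary describes, the selected rows would then be passed to an AutoML tool, and the pipeline it returns would be refined with a short, restricted AutoML run on the full data. Those two steps depend on the chosen AutoML library and are left out of this sketch.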
Incoming Citations (Sorted by Pagerank)
Showing 2 of 2 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 10,252 | CAPS: Cost-Aware ML Pipeline Selection | 2026 | VLDB | 4.1945683e-05 |
| 10,881 | Datamap-Driven Tabular Coreset Selection for Classifier Training | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 7 of 7 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1,391 | Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads | 2018 | VLDB | 0.0001223506 |
| 2,122 | SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle | 2020 | CIDR | 9.4989076e-05 |
| 2,163 | Elastic Machine Learning Algorithms in Amazon SageMaker | 2020 | SIGMOD | 9.3949234e-05 |
| 2,384 | Oracle AutoML: A Fast and Predictive AutoML Pipeline | 2020 | VLDB | 8.925354e-05 |
| 2,839 | VolcanoML: Speeding up End-to-End AutoML via Scalable Search Space Decomposition | 2021 | VLDB | 8.0378978e-05 |
| 5,304 | A Scalable AutoML Approach Based on Graph Neural Networks | 2022 | VLDB | 5.5779335e-05 |
| 8,382 | Assassin: an Automatic classification system based on algorithm selection | 2021 | VLDB | 4.5309467e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,653 | ApproxML: Efficient Approximate Ad-Hoc ML Models Through Materialization and Reuse | 2019 | VLDB | 4.475291e-05 |
| 5,963 | Automatic Data Acquisition for Deep Learning | 2021 | VLDB | 5.2526794e-05 |
| 10,504 | Subgroup Discovery with Small and Alternative Feature Sets | 2025 | SIGMOD | 4.1945683e-05 |
| 10,252 | CAPS: Cost-Aware ML Pipeline Selection | 2026 | VLDB | 4.1945683e-05 |
| 5,242 | Towards Benchmarking Feature Type Inference for AutoML Platforms | 2021 | SIGMOD | 5.6074743e-05 |
| 8,257 | Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines | 2023 | SIGMOD | 4.5487511e-05 |
| 2,839 | VolcanoML: Speeding up End-to-End AutoML via Scalable Search Space Decomposition | 2021 | VLDB | 8.0378978e-05 |
| 5,304 | A Scalable AutoML Approach Based on Graph Neural Networks | 2022 | VLDB | 5.5779335e-05 |
| 2,384 | Oracle AutoML: A Fast and Predictive AutoML Pipeline | 2020 | VLDB | 8.925354e-05 |
| 4,957 | Doing More with Less: Characterizing Dataset Downsampling for AutoML | 2021 | VLDB | 5.8035715e-05 |