Towards Benchmarking Feature Type Inference for AutoML Platforms
Summary: Presents the first benchmark for ML-driven feature type inference in AutoML, built on a 9,921-sample, 9-class labeled dataset that standardizes evaluation. ML-based typing yields a 14% average lift (up to 38%) and beats industrial tools on 47 of 60 downstream models. The dataset, models, and leaderboards are publicly released. (summarized by gpt-5-nano on Feb 09 2026)
Authors
1. Vraj Shah
2. Jonathan Lacanlale
3. Premanand Kumar
4. Kevin Yang
5. Arun Kumar
Incoming Citations (Sorted by Pagerank)
Showing 9 of 9 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 3,934 | SimpleTS: An Efficient and Universal Model Selection Framework for Time Series Forecasting | 2023 | VLDB | 6.6175631e-05 |
| 5,280 | Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V | 2023 | VLDB | 5.5896735e-05 |
| 5,928 | SchemaPile: A Large Collection of Relational Database Schemas | 2024 | SIGMOD | 5.2685946e-05 |
| 6,553 | How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses | 2024 | VLDB | 5.0157344e-05 |
| 7,807 | Pollock: A Data Loading Benchmark | 2023 | VLDB | 4.6457732e-05 |
| 8,121 | Automation of Data Prep, ML, and Data Science: New Cure or Snake Oil? | 2021 | SIGMOD | 4.5809305e-05 |
| 8,514 | UPLIFT: Parallelization Strategies for Feature Transformations in Machine Learning Workloads | 2022 | VLDB | 4.4944285e-05 |
| 9,379 | GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by Example | 2023 | SIGMOD | 4.3462787e-05 |
| 10,628 | CatDB: Data-catalog-guided, LLM-based Generation of Data-centric ML Pipelines | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 9 of 9 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 254 | Snorkel: Rapid Training Data Creation with Weak Supervision | 2018 | VLDB | 0.00030540555 |
| 1,215 | Snuba: Automating Weak Supervision to Label Training Data | 2019 | VLDB | 0.0001323375 |
| 1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533 |
| 3,478 | Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations | 2018 | VLDB | 7.054159e-05 |
| 4,196 | Overton: A Data System for Monitoring and Improving Machine-Learned Products | 2020 | CIDR | 6.3686231e-05 |
| 4,749 | Slice Tuner: A Selective Data Acquisition Framework for Accurate and Fair Machine Learning Models | 2021 | SIGMOD | 5.9503689e-05 |
| 5,929 | ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning | 2016 | SIGMOD | 5.2682177e-05 |
| 6,416 | Synthesizing Type-Detection Logic for Rich Semantic Data Types using Open-source Code | 2018 | SIGMOD | 5.072267e-05 |
| 7,812 | Foofah: A Programming-By-Example System for Synthesizing Data Transformation Programs | 2017 | SIGMOD | 4.6443197e-05 |