To Join or Not to Join? Thinking Twice about Joins before Feature Selection
Summary: This paper studies when joins can be safely avoided before feature selection over normalized datasets: in many cases, the features a join would bring in can be dropped without hurting ML accuracy. Experiments on real normalized datasets show that the safety predictions are accurate and that avoiding joins yields substantial runtime savings for popular feature selection methods. (summarized by gpt-5-nano on Feb 09 2026)
Incoming Non-self Citations Over Time
Authors
1. Arun Kumar
2. Jeffrey Naughton
3. Jignesh M. Patel
4. Xiaojin Zhu
Incoming Citations (Sorted by Pagerank)
Showing 38 of 38 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 8 of 8 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 140 | The MADlib Analytics Library or MAD Skills, the SQL | 2012 | VLDB | 0.00042270404 |
| 543 | MLbase: A Distributed Machine-learning System | 2013 | CIDR | 0.00020526854 |
| 679 | Skew-Aware Automatic Database Partitioning in Shared-Nothing, Parallel OLTP Systems | 2012 | SIGMOD | 0.00018215154 |
| 761 | Materialization Optimizations for Feature Selection Workloads | 2014 | SIGMOD | 0.00017053783 |
| 1,158 | Simulation of Database-Valued Markov Chains Using SimSQL | 2013 | SIGMOD | 0.0001361064 |
| 1,167 | Learning Generalized Linear Models Over Normalized Data | 2015 | SIGMOD | 0.00013547713 |
| 2,915 | Brainwash: A Data System for Feature Engineering | 2013 | CIDR | 7.9078385e-05 |
| 7,273 | Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System | 2013 | VLDB | 4.7810804e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 7,819 | Main Memory Adaptive Denormalization | 2016 | SIGMOD | 4.6432769e-05 |
| 5,511 | On Producing Join Results Early | 2003 | PODS | 5.4699346e-05 |
| 8,417 | The Case for Learned In-Memory Joins | 2023 | VLDB | 4.5194164e-05 |
| 5,567 | Optimizing Data Pipelines for Machine Learning in Feature Stores | 2023 | VLDB | 5.4305348e-05 |
| 3,990 | FactorJoin: A New Cardinality Estimation Framework for Join Queries | 2023 | SIGMOD | 6.5581983e-05 |
| 3,735 | Auto-Join: Joining Tables by Leveraging Transformations | 2017 | VLDB | 6.8061318e-05 |
| 7,179 | Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning | 2023 | VLDB | 4.8078895e-05 |
| 1,279 | Towards Linear Algebra over Normalized Data | 2017 | VLDB | 0.00012868394 |
| 4,129 | Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers? | 2018 | VLDB | 6.428887e-05 |
| 1,167 | Learning Generalized Linear Models Over Normalized Data | 2015 | SIGMOD | 0.00013547713 |