Database Paper Browser

To Join or Not to Join? Thinking Twice about Joins before Feature Selection

Summary: The paper studies when joins can be safely avoided before feature selection over normalized datasets, arguing that many features brought in by joins can be dropped without hurting ML accuracy. Experiments on real normalized datasets show that its safety predictions are accurate and that avoiding joins yields substantial runtime savings for popular feature selection methods. (summarized by gpt-5-nano on Feb 09 2026)

Paper ID: 5143
Venue: SIGMOD
Year: 2016
Pagerank: 0.0001547016
Overall Rank: 903 | 93.72%
DOI: 10.1145/2882903.2882952

Incoming Citations (Sorted by Pagerank)

Showing 38 of 38 citing papers.

Rank Citing Paper Year Venue Pagerank
1,279 Towards Linear Algebra over Normalized Data 2017 VLDB 0.00012868394
1,420 Data Management Challenges in Production Machine Learning 2017 SIGMOD 0.00012057956
1,463 ARDA: Automatic Relational Data Augmentation for Machine Learning 2020 VLDB 0.00011869295
1,532 Data Management in Machine Learning: Challenges, Techniques, and Systems 2017 SIGMOD 0.00011472681
1,644 Finding Related Tables in Data Lakes for Interactive Data Science 2020 SIGMOD 0.00011041787
1,666 HELIX: Holistic Optimization for Accelerating Iterative Machine Learning 2019 VLDB 0.0001096361
2,154 DIFF: A Relational Interface for Large-Scale Data Explanation 2019 VLDB 9.4208667e-05
2,194 Enabling and Optimizing Non-linear Feature Interactions in Factorized Linear Algebra 2019 SIGMOD 9.3138337e-05
3,277 A Layered Aggregate Engine for Analytics Workloads 2019 SIGMOD 7.2871625e-05
3,443 Distributed Join Algorithms on Thousands of Cores 2017 VLDB 7.0887214e-05
3,473 AI Meets Database: AI4DB and DB4AI 2021 SIGMOD 7.062864e-05
3,750 Data Acquisition for Improving Machine Learning Models 2021 VLDB 6.7895763e-05
3,942 Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins 2022 VLDB 6.6114622e-05
4,129 Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers? 2018 VLDB 6.428887e-05
4,395 Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models 2017 VLDB 6.2244283e-05
4,967 Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation 2022 SIGMOD 5.7956612e-05
5,691 Putting Things into Context: Rich Explanations for Query Answers using Join Graphs 2021 SIGMOD 5.3684557e-05
5,806 BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees 2019 SIGMOD 5.3200643e-05
5,978 Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond 2021 SIGMOD 5.2453012e-05
6,270 MATE: Multi-Attribute Table Extraction 2022 VLDB 5.1337451e-05
6,330 Efficient Construction of Approximate Ad-Hoc ML models Through Materialization and Reuse 2018 VLDB 5.1077416e-05
6,378 Mitigating the Impedance Mismatch between Prediction Query Execution and Database Engine 2025 SIGMOD 5.0909804e-05
6,538 Tuple-oriented Compression for Large-scale Mini-batch Stochastic Gradient Descent 2019 SIGMOD 5.023239e-05
7,179 Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning 2023 VLDB 4.8078895e-05
7,602 Causal Feature Selection for Algorithmic Fairness 2022 SIGMOD 4.6988081e-05
7,723 Mind the Gap: Bridging Multi-Domain Query Workloads with EmptyHeaded 2017 VLDB 4.6676712e-05
8,116 LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes 2024 VLDB 4.581507e-05
8,281 Optimizing Data Acquisition to Enhance Machine Learning Performance 2024 VLDB 4.5435639e-05
8,432 SPRINTER: A Fast n-ary Join Query Processing Method for Complex OLAP Queries 2020 SIGMOD 4.5153924e-05
8,653 ApproxML: Efficient Approximate Ad-Hoc ML Models Through Materialization and Reuse 2019 VLDB 4.475291e-05
8,921 Leveraging Similarity Joins for Signal Reconstruction 2018 VLDB 4.427232e-05
10,177 InferF: Declarative Factorization of AI/ML Inferences over Joins 2026 SIGMOD 4.1945683e-05
10,269 Database Views as Explanations for Relational Deep Learning 2026 VLDB 4.1945683e-05
11,035 Relational Query Synthesis ⋈ Decision Tree Learning 2024 VLDB 4.1945683e-05
11,054 Enriching Relations with Additional Attributes for ER 2024 VLDB 4.1945683e-05
11,187 Regularized Pairwise Relationship based Analytics for Structured Data 2023 SIGMOD 4.1945683e-05
11,476 Enforcing Constraints for Machine Learning Systems via Declarative Feature Selection: An Experimental Study 2021 SIGMOD 4.1945683e-05
11,742 Learning Efficiently Over Heterogeneous Databases 2018 VLDB 4.1945683e-05

Outgoing Citations (Sorted by Pagerank)

Showing 8 of 8 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Semantically Similar Papers