Database Paper Browser

Can Large Language Models Predict Data Correlations from Column Names?

Summary: Introduces a Kaggle-derived benchmark for data-correlation analysis and systematically evaluates several language models on predicting correlated column pairs from column names alone, across multiple correlation metrics and accuracy measures. Finds that schema text carries useful signal: prediction quality varies with column-name length, word ratio, and column type. The results inform NLP-enhanced tuning and profiling. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 13298
Venue: VLDB
Year: 2023
Pagerank: 5.4703368e-05
Overall Rank: 5,509 | 61.68%
DOI: 10.14778/3625054.3625066

Incoming Citations (Sorted by Pagerank)

Showing 4 of 4 citing papers.

Outgoing Citations (Sorted by Pagerank)

Showing 26 of 26 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank | Cited Paper | Year | Venue | Pagerank
1 | Access Path Selection in a Relational Database Management System | 1979 | SIGMOD | 0.0040449103
71 | How Good Are Query Optimizers, Really? | 2016 | VLDB | 0.00059038975
224 | CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies | 2004 | SIGMOD | 0.00032746205
333 | Neo: A Learned Query Optimizer | 2019 | VLDB | 0.00027206884
517 | Can Foundation Models Wrangle Your Data? | 2023 | VLDB | 0.00021169035
535 | ATHENA: An Ontology-Driven System for Natural Language Querying over Relational Data Stores | 2016 | VLDB | 0.00020727678
567 | NaLIR: An Interactive Natural Language Interface for Querying Relational Databases | 2014 | SIGMOD | 0.00019966681
1,116 | Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes | 2024 | VLDB | 0.00013890154
1,407 | DB-BERT: A Database Tuning Tool that "Reads the Manual" | 2022 | SIGMOD | 0.00012146739
1,547 | Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions | 2011 | VLDB | 0.00011442359
1,625 | Data Profiling with Metanome | 2015 | VLDB | 0.00011094926
1,643 | CodexDB: Synthesizing Code for Query Processing from Natural Language Instructions using GPT-3 Codex | 2022 | VLDB | 0.0001104256
1,737 | QuickSel: Quick Selectivity Learning with Mixture Models | 2020 | SIGMOD | 0.00010720294
1,974 | BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data | 2003 | VLDB | 9.8866171e-05
2,057 | From Natural Language Processing to Neural Databases | 2021 | VLDB | 9.6624862e-05
2,219 | SkinnerDB: Regret-Bounded Query Evaluation via Reinforcement Learning | 2019 | SIGMOD | 9.2623533e-05
2,349 | RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation | 2021 | VLDB | 8.9876423e-05
3,015 | Chorus: Foundation Models for Unified Data Discovery and Exploration | 2024 | VLDB | 7.7092391e-05
3,651 | Conditional Selectivity for Statistics on Query Expressions | 2004 | SIGMOD | 6.8768678e-05
4,784 | Divide & Conquer-based Inclusion Dependency Discovery | 2015 | VLDB | 5.9240851e-05
4,816 | Scrutinizer: Fact Checking Statistical Claims | 2020 | VLDB | 5.900769e-05
4,913 | UDO: Universal Database Optimization using Reinforcement Learning | 2021 | VLDB | 5.8316231e-05
4,934 | From BERT to GPT-3 Codex: Harnessing the Potential of Very Large Language Models for Data Management | 2022 | VLDB | 5.8198826e-05
5,981 | DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python | 2021 | SIGMOD | 5.2448986e-05
6,890 | Towards NLP-Enhanced Data Profiling Tools | 2022 | CIDR | 4.8928923e-05
8,615 | The Case for NLP-Enhanced Database Tuning: Towards Tuning Tools that "Read the Manual" | 2021 | VLDB | 4.484683e-05

Semantically Similar Papers