Can Large Language Models Predict Data Correlations from Column Names?
Summary: Introduces a Kaggle-derived benchmark for data-correlation analysis and systematically evaluates multiple language models on predicting correlated column pairs from column names alone, across several correlation metrics and accuracy measures. Finds that schema text carries useful signal: prediction quality varies with name length, word ratio, and column types, which informs NLP-enhanced tuning and profiling.
(summarized by gpt-5-mini on Feb 09 2026)
- Paper ID: 13298
- Venue: VLDB
- Year: 2023
- Pagerank: 5.4703368e-05
- Overall Rank: 5,509 | 61.68%
- DOI: 10.14778/3625054.3625066
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 4 of 4 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 26 of 26 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 1 | Access Path Selection in a Relational Database Management System | 1979 | SIGMOD | 0.0040449103 |
| 71 | How Good Are Query Optimizers, Really? | 2016 | VLDB | 0.00059038975 |
| 224 | CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies | 2004 | SIGMOD | 0.00032746205 |
| 333 | Neo: A Learned Query Optimizer | 2019 | VLDB | 0.00027206884 |
| 517 | Can Foundation Models Wrangle Your Data? | 2023 | VLDB | 0.00021169035 |
| 535 | ATHENA: An Ontology-Driven System for Natural Language Querying over Relational Data Stores | 2016 | VLDB | 0.00020727678 |
| 567 | NaLIR: An Interactive Natural Language Interface for Querying Relational Databases | 2014 | SIGMOD | 0.00019966681 |
| 1,116 | Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes | 2024 | VLDB | 0.00013890154 |
| 1,407 | DB-BERT: A Database Tuning Tool that "Reads the Manual" | 2022 | SIGMOD | 0.00012146739 |
| 1,547 | Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions | 2011 | VLDB | 0.00011442359 |
| 1,625 | Data Profiling with Metanome | 2015 | VLDB | 0.00011094926 |
| 1,643 | CodexDB: Synthesizing Code for Query Processing from Natural Language Instructions using GPT-3 Codex | 2022 | VLDB | 0.0001104256 |
| 1,737 | QuickSel: Quick Selectivity Learning with Mixture Models | 2020 | SIGMOD | 0.00010720294 |
| 1,974 | BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data | 2003 | VLDB | 9.8866171e-05 |
| 2,057 | From Natural Language Processing to Neural Databases | 2021 | VLDB | 9.6624862e-05 |
| 2,219 | SkinnerDB: Regret-Bounded Query Evaluation via Reinforcement Learning | 2019 | SIGMOD | 9.2623533e-05 |
| 2,349 | RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation | 2021 | VLDB | 8.9876423e-05 |
| 3,015 | Chorus: Foundation Models for Unified Data Discovery and Exploration | 2024 | VLDB | 7.7092391e-05 |
| 3,651 | Conditional Selectivity for Statistics on Query Expressions | 2004 | SIGMOD | 6.8768678e-05 |
| 4,784 | Divide & Conquer-based Inclusion Dependency Discovery | 2015 | VLDB | 5.9240851e-05 |
| 4,816 | Scrutinizer: Fact Checking Statistical Claims | 2020 | VLDB | 5.900769e-05 |
| 4,913 | UDO: Universal Database Optimization using Reinforcement Learning | 2021 | VLDB | 5.8316231e-05 |
| 4,934 | From BERT to GPT-3 Codex: Harnessing the Potential of Very Large Language Models for Data Management | 2022 | VLDB | 5.8198826e-05 |
| 5,981 | DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python | 2021 | SIGMOD | 5.2448986e-05 |
| 6,890 | Towards NLP-Enhanced Data Profiling Tools | 2022 | CIDR | 4.8928923e-05 |
| 8,615 | The Case for NLP-Enhanced Database Tuning: Towards Tuning Tools that "Read the Manual" | 2021 | VLDB | 4.484683e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 984 | Natural language to SQL: Where are we today? | 2020 | VLDB | 0.00014857465 |
| 3,824 | Correlation Sketches for Approximate Join-Correlation Queries | 2021 | SIGMOD | 6.7260705e-05 |
| 8,155 | Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study | 2024 | SIGMOD | 4.5745248e-05 |
| 3,995 | How Large Language Models Will Disrupt Data Management | 2023 | VLDB | 6.5513237e-05 |
| 5,928 | SchemaPile: A Large Collection of Relational Database Schemas | 2024 | SIGMOD | 5.2685946e-05 |
| 10,217 | This is Going to Sound Crazy, But What If We Used Large Language Models to Boost Automatic Database Tuning Algorithms By Leveraging Prior History? We Will Find Better Configurations More Quickly Than Retraining From Scratch! | 2026 | SIGMOD | 4.1945683e-05 |
| 4,908 | Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL | 2024 | VLDB | 5.8339245e-05 |
| 5,437 | SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference | 2025 | SIGMOD | 5.5033018e-05 |
| 6,890 | Towards NLP-Enhanced Data Profiling Tools | 2022 | CIDR | 4.8928923e-05 |
| 2,517 | Annotating Columns with Pre-trained Language Models | 2022 | SIGMOD | 8.6092139e-05 |