SchemaPile: A Large Collection of Relational Database Schemas
Summary: SchemaPile is a large corpus of 221K relational database schemas mined from GitHub, comprising 1.7M tables, 10M columns, 700K foreign keys, and rich integrity metadata and content, which goes well beyond single-table corpora. It positions itself as schema-level training and evaluation data for LLMs and for data management tasks such as foreign key detection, header detection, and SQL parsing.
(summarized by gpt-5.4-mini on May 24 2026)
- Paper ID: 6935
- Venue: SIGMOD
- Year: 2024
- Pagerank: 5.2685946e-05
- Overall Rank: 5,928 | 58.77%
- DOI: 10.1145/3654975
Incoming Non-self Citations Over Time
Incoming Citations (Sorted by Pagerank)
Showing 3 of 3 citing papers.
Outgoing Citations (Sorted by Pagerank)
Showing 15 of 15 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 107 | WebTables: Exploring the Power of Tables on the Web | 2008 | VLDB | 0.00048377684 |
| 185 | DuckDB: an Embeddable Analytical Database | 2019 | SIGMOD | 0.00036538405 |
| 517 | Can Foundation Models Wrangle Your Data? | 2023 | VLDB | 0.00021169035 |
| 939 | Data Lake Management: Challenges and Opportunities | 2019 | VLDB | 0.00015187344 |
| 1,001 | Recovering Semantics of Tables on the Web | 2011 | VLDB | 0.00014706505 |
| 1,482 | Automating Large-Scale Data Quality Verification | 2018 | VLDB | 0.00011725533 |
| 1,612 | Detecting Data Errors: Where are we and what needs to be done? | 2016 | VLDB | 0.00011142794 |
| 2,945 | Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning | 2023 | SIGMOD | 7.8377395e-05 |
| 2,965 | SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment | 2016 | SIGMOD | 7.8059273e-05 |
| 3,015 | Chorus: Foundation Models for Unified Data Discovery and Exploration | 2024 | VLDB | 7.7092391e-05 |
| 3,520 | GitTables: A Large-Scale Corpus of Relational Tables | 2023 | SIGMOD | 7.0131061e-05 |
| 3,995 | How Large Language Models Will Disrupt Data Management | 2023 | VLDB | 6.5513237e-05 |
| 5,242 | Towards Benchmarking Feature Type Inference for AutoML Platforms | 2021 | SIGMOD | 5.6074743e-05 |
| 7,102 | Mondrian: Spreadsheet Layout Detection | 2022 | SIGMOD | 4.8307982e-05 |
| 7,807 | Pollock: A Data Loading Benchmark | 2023 | VLDB | 4.6457732e-05 |
Semantically Similar Papers
| Overall Rank | Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 10,973 | Unstructured Data Fusion for Schema and Data Extraction | 2024 | SIGMOD | 4.1945683e-05 |
| 4,908 | Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL | 2024 | VLDB | 5.8339245e-05 |
| 10,785 | GalaxyWeaver: Autonomous Table-to-Graph Conversion and Schema Optimization with Large Language Models | 2025 | VLDB | 4.1945683e-05 |
| 1,422 | SchemaSQL - A Language for Interoperability in Relational Multi-database Systems | 1996 | VLDB | 0.00012056887 |
| 10,305 | Schuyler: Self-Supervised Clustering of Tables in Relational Databases | 2026 | VLDB | 4.1945683e-05 |
| 5,509 | Can Large Language Models Predict Data Correlations from Column Names? | 2023 | VLDB | 5.4703368e-05 |
| 5,437 | SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference | 2025 | SIGMOD | 5.5033018e-05 |
| 8,052 | Generating Succinct Descriptions of Database Schemata for Cost-Efficient Prompting of Large Language Models | 2024 | VLDB | 4.5953106e-05 |
| 10,210 | SchemaRAG: A Schema-aware Retrieval-Augmented Generation Framework for Text-to-SQL | 2026 | SIGMOD | 4.1945683e-05 |
| 3,520 | GitTables: A Large-Scale Corpus of Relational Tables | 2023 | SIGMOD | 7.0131061e-05 |