SchemaPile: A Large Collection of Relational Database Schemas

Summary: SchemaPile is a large corpus of relational database schemas mined from GitHub: 221K schemas with 1.7M tables, 10M columns, and 700K foreign keys (FKs), plus rich integrity metadata and content, going well beyond single-table corpora. It is positioned as schema-level training and evaluation data for LLMs and for data management tasks such as FK detection, header detection, and SQL parsing. (summarized by gpt-5.4-mini on May 24, 2026)

Paper ID: 6936
Venue: SIGMOD
Year: 2024
Pagerank: 5.2635297e-05
Overall Rank: 5,929 | 58.80%
DOI: 10.1145/3654975

Incoming Citations (Sorted by Pagerank)

Showing 3 of 3 citing papers.

Rank	Citing Paper	Year	Venue	Pagerank
3,978	OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale	2025	VLDB	6.5662694e-05
5,448	SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference	2025	SIGMOD	5.4980173e-05
10,197	Qualitative Join Discovery in Data Lakes using Examples	2026	SIGMOD	4.1905499e-05

Outgoing Citations (Sorted by Pagerank)

Showing 15 of 15 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
108	WebTables: Exploring the Power of Tables on the Web	2008	VLDB	0.00048345996
185	DuckDB: an Embeddable Analytical Database	2019	SIGMOD	0.00036529607
516	Can Foundation Models Wrangle Your Data?	2023	VLDB	0.00021194444
936	Data Lake Management: Challenges and Opportunities	2019	VLDB	0.00015197838
1,005	Recovering Semantics of Tables on the Web	2011	VLDB	0.00014694038
1,403	Detecting Data Errors: Where are we and what needs to be done?	2016	VLDB	0.00012180046
1,481	Automating Large-Scale Data Quality Verification	2018	VLDB	0.00011715754
2,948	Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning	2023	SIGMOD	7.8301799e-05
2,959	SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment	2016	SIGMOD	7.8116974e-05
3,003	Chorus: Foundation Models for Unified Data Discovery and Exploration	2024	VLDB	7.7358219e-05
3,520	GitTables: A Large-Scale Corpus of Relational Tables	2023	SIGMOD	7.0136102e-05
3,982	How Large Language Models Will Disrupt Data Management	2023	VLDB	6.5595332e-05
5,244	Towards Benchmarking Feature Type Inference for AutoML Platforms	2021	SIGMOD	5.6021738e-05
7,100	Mondrian: Spreadsheet Layout Detection	2022	SIGMOD	4.826163e-05
7,810	Pollock: A Data Loading Benchmark	2023	VLDB	4.6415099e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
4,908	Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL	2024	VLDB	5.835596e-05
10,791	GalaxyWeaver: Autonomous Table-to-Graph Conversion and Schema Optimization with Large Language Models	2025	VLDB	4.1905499e-05
1,420	SchemaSQL - A Language for Interoperability in Relational Multi-database Systems	1996	VLDB	0.00012068815
10,285	Relational Deep Dive: Error-Aware Queries Over Unstructured Data	2026	VLDB	4.1905499e-05
10,317	Schuyler: Self-Supervised Clustering of Tables in Relational Databases	2026	VLDB	4.1905499e-05
5,506	Can Large Language Models Predict Data Correlations from Column Names?	2023	VLDB	5.4711611e-05
5,448	SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference	2025	SIGMOD	5.4980173e-05
8,054	Generating Succinct Descriptions of Database Schemata for Cost-Efficient Prompting of Large Language Models	2024	VLDB	4.5909042e-05
10,210	SchemaRAG: A Schema-aware Retrieval-Augmented Generation Framework for Text-to-SQL	2026	SIGMOD	4.1905499e-05
3,520	GitTables: A Large-Scale Corpus of Relational Tables	2023	SIGMOD	7.0136102e-05