OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale
Summary: A scalable synthesis framework that produces SynSQL‑2.5M, a dataset of 2.5M text-to-SQL samples across ~16k synthetic databases. Each sample comprises a database, a SQL query, a natural-language question, and a chain-of-thought. The work addresses data scarcity and reliance on closed-source prompting. The data is used to train OmniSQL (7B/14B/32B), a family of open-source models that match or surpass larger closed- and open-source LLMs (e.g., GPT‑4o, DeepSeek‑V3). (summarized by gpt-5-mini on Feb 09 2026)
Authors
1. Haoyang Li
2. Shang Wu
3. Xiaokang Zhang
4. Xinmei Huang
5. Jing Zhang
6. Fuxin Jiang
7. Shuai Wang
8. Tieying Zhang
9. Jianjun Chen
10. Rui Shi
11. Hong Chen
12. Cuiping Li
Incoming Citations (Sorted by Pagerank)
Showing 9 of 9 citing papers.
| Rank | Citing Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 8,896 | SQL-Factory: A Multi-Agent Framework for High-Quality and Large-Scale SQL Generation | 2026 | VLDB | 4.427232e-05 |
| 9,995 | Text-to-SQL Benchmarks are Broken: An In-Depth Analysis of Annotation Errors | 2026 | CIDR | 4.1945683e-05 |
| 10,155 | DIVER: A Robust Text-to-SQL System with Dynamic Interactive Value Linking and Evidence Reasoning | 2026 | SIGMOD | 4.1945683e-05 |
| 10,194 | PRISM: Navigating Cost–Accuracy Trade-offs for NL2SQL | 2026 | SIGMOD | 4.1945683e-05 |
| 10,221 | NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions | 2026 | VLDB | 4.1945683e-05 |
| 10,242 | SQL-Exchange: Transforming SQL Queries Across Domains | 2026 | VLDB | 4.1945683e-05 |
| 10,268 | OpenSQL: Data-Efficient Text-to-SQL for Open-Source LLMs via Synthesized Intermediate Supervision | 2026 | VLDB | 4.1945683e-05 |
| 10,327 | Pervasive Annotation Errors Break Text-to-SQL Benchmarks and Leaderboards | 2026 | VLDB | 4.1945683e-05 |
| 10,837 | Natural Language to SQL: State of the Art and Open Problems | 2025 | VLDB | 4.1945683e-05 |
Outgoing Citations (Sorted by Pagerank)
Showing 8 of 8 cited papers.
Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.
| Rank | Cited Paper | Year | Venue | Pagerank |
|---|---|---|---|---|
| 369 | Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation | 2024 | VLDB | 0.0002547515 |
| 998 | CodeS: Towards Building Open-source Language Models for Text-to-SQL | 2024 | SIGMOD | 0.00014729379 |
| 1,732 | CatSQL: Towards Real World Natural Language to SQL Applications | 2023 | VLDB | 0.00010732004 |
| 2,321 | DBPal: A Fully Pluggable NL2SQL Training Pipeline | 2020 | SIGMOD | 9.03609e-05 |
| 2,433 | ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems | 2024 | VLDB | 8.8285962e-05 |
| 2,945 | Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning | 2023 | SIGMOD | 7.8377395e-05 |
| 3,520 | GitTables: A Large-Scale Corpus of Relational Tables | 2023 | SIGMOD | 7.0131061e-05 |
| 5,928 | SchemaPile: A Large Collection of Relational Database Schemas | 2024 | SIGMOD | 5.2685946e-05 |