Text-to-SQL Benchmarks are Broken: An In-Depth Analysis of Annotation Errors

Summary: An audit of BIRD and Spider 2.0‑Snow finds annotation error rates of 52.8% and 66.1%, respectively (wrong gold SQL queries, ambiguous questions), undermining much of the signal these benchmarks provide. Re-evaluating five models yields shifts of −3% to +31% and ranking changes of up to three positions, which calls for higher-quality benchmarks and improved annotation pipelines. (summarized by gpt-5-mini on Feb 09 2026)

Paper ID: 598
Venue: CIDR
Year: 2026
Pagerank: 4.1905499e-05
Overall Rank: 9,994 | 30.55%
DOI: -

Incoming Non-self Citations Over Time

No non-self incoming citations found for this paper in this database.

Incoming Citations (Sorted by Pagerank)

Showing 0 of 0 citing papers.

Outgoing Citations (Sorted by Pagerank)

Showing 6 of 6 cited papers.

Citations counted here include only citations to other VLDB/SIGMOD/CIDR/PODS papers in this database.

Rank	Cited Paper	Year	Venue	Pagerank
10	Benchmarking Database Systems: A Systematic Approach	1983	VLDB	0.0012093675
366	Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation	2024	VLDB	0.00025580097
659	The Making of TPC-DS	2006	VLDB	0.00018514913
1,968	A Methodology for Database System Performance Evaluation	1984	SIGMOD	9.9099939e-05
3,862	OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment	2025	SIGMOD	6.68436e-05
3,978	OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale	2025	VLDB	6.5662694e-05

Semantically Similar Papers

Overall Rank	Paper	Year	Venue	Pagerank
973	Natural language to SQL: Where are we today?	2020	VLDB	0.0001488435
2,435	ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems	2024	VLDB	8.8218963e-05
9,973	BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation	2026	CIDR	4.1905499e-05
3,666	The Dawn of Natural Language to SQL: Are We Fully Ready?	2024	VLDB	6.8606092e-05
5,363	An In-Depth Benchmarking of Text-to-SQL Systems	2021	SIGMOD	5.5467941e-05
7,137	Automated Validating and Fixing of Text-to-SQL Translation with Execution Consistency	2025	SIGMOD	4.8165495e-05
7,351	Reliable Text-to-SQL with Adaptive Abstention	2025	SIGMOD	4.7484027e-05
10,221	NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions	2026	VLDB	4.1905499e-05
366	Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation	2024	VLDB	0.00025580097
10,339	Pervasive Annotation Errors Break Text-to-SQL Benchmarks and Leaderboards	2026	VLDB	4.1905499e-05