
A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks

[05/30/2025]:
The first release of LiveSQLBench is out! It contains our initial version: LiveSQLBench-Base-Lite. Download it and test your text-to-SQL LLMs or agents in a contamination-free way!


LiveSQLBench (BIRD-SQL Pro v0.5) is a contamination-free, continuously evolving benchmark designed to evaluate LLMs on complex, real-world text-to-SQL tasks, featuring diverse real-world user queries, including Business Intelligence (BI), CRUD operations, etc. Each release will include 50 new, fully open-source DBs curated by the BIRD team through expert collaboration and continuous improvement. It will cover a wide range of database sizes, from end-user level (around 127 columns) to industrial level (1340+ columns). Here are the features of the LiveSQLBench benchmark:
- Live Databases: Constructed dynamically from extensive and regularly updated CSV datasets, with both base (end-user level) and large (industrial level, 1340+ columns per DB) versions to test scalability.
- Live User Queries and SQL: Each task pairs an unambiguous user query with an annotated, gold-standard SQL statement. The user queries are grounded in external knowledge, and the SQL statements range from medium to hard complexity.
- Contextual Reasoning (HKB): Every DB includes a hierarchical knowledge base (HKB) in which each knowledge entry is related to others, so answering a query requires multi-hop reasoning. Two HKB formats are provided: (1) structured JSON format, and (2) unstructured Document format.
- The First Full SQL Spectrum: Supports not just SELECT (Business Intelligence) queries, but also CRUD queries (e.g., UPDATE, CREATE, and other database management operations).
- Automated Evaluation: Each question includes verifiable test cases for accurate, reproducible scoring.
- Truly Live & Hidden Test: New databases and tasks are added over time. Each release features both open development and hidden test phases. The hidden test set from each release becomes the open development set for the next release, ensuring continuous evolution and fair evaluation.
We have currently released LiveSQLBench-Base-Lite for trial use, featuring 18 end-user level databases with 270 tasks, HKB-JSON, and JSON operations in SQL.
LiveSQLBench's continuously updated databases, tasks, and HKB also support BIRD-Interact's conversational and agentic evaluation. BIRD-Interact evaluates LLMs' text-to-SQL ability in dynamic interactive settings with database and user simulation.
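To make the multi-hop requirement of the HKB concrete, here is a minimal sketch of how related knowledge entries might be resolved before writing SQL. The entry layout and field names (`id`, `name`, `definition`, `depends_on`) and the sample entries are assumptions made for illustration; they are not the actual HKB-JSON schema.

```python
import json

# Hypothetical HKB-JSON entries; field names and contents are illustrative only.
HKB_JSON = """
[
  {"id": 0, "name": "Net Revenue",
   "definition": "gross revenue minus refunds", "depends_on": []},
  {"id": 1, "name": "Revenue per Active User",
   "definition": "Net Revenue divided by active users", "depends_on": [0]},
  {"id": 2, "name": "High-Value Segment",
   "definition": "users whose Revenue per Active User exceeds 100", "depends_on": [1]}
]
"""


def resolve(entries, target_id):
    """Return the names of the target's prerequisites, then the target, in dependency order."""
    by_id = {entry["id"]: entry for entry in entries}
    ordered, seen = [], set()

    def visit(entry_id):
        if entry_id in seen:
            return
        seen.add(entry_id)
        # Visit prerequisites first so they appear before the entry that uses them.
        for dep in by_id[entry_id]["depends_on"]:
            visit(dep)
        ordered.append(by_id[entry_id]["name"])

    visit(target_id)
    return ordered


entries = json.loads(HKB_JSON)
chain = resolve(entries, 2)
print(chain)  # ['Net Revenue', 'Revenue per Active User', 'High-Value Segment']
```

In this sketch, a query about the "High-Value Segment" needs two hops through the knowledge base before the underlying columns can be identified.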
LiveSQLBench-Base-Lite Data
Please explore LiveSQLBench-Base-Lite examples with DBs, tasks, and HKB-JSON, our initial release featuring 270 tasks across 18 end-user level databases. Each task features unambiguous and straightforward user queries grounded in external knowledge, with medium to hard complexity SQL statements.
- 180 SELECT Queries (Base Version)
- 90 Management SQLs (Base Version)
- 360 Avg SQL Tokens (Current Avg)
- 18 Databases (Base Version)
Preview: Large Version (Industrial Level) DBs and unstructured Knowledge Base (Document) will be supported in the future LiveSQLBench-Full version.
LiveSQLBench Leaderboard
Success Rate: the ratio of the number of tasks that pass their test cases to the total number of tasks.
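As a small worked example of this metric, the sketch below computes a success rate from per-task pass/fail outcomes. The function name and the 121-of-270 split are hypothetical, chosen only to show the arithmetic on a 270-task set.

```python
def success_rate(passed: list[bool]) -> float:
    """Percentage of tasks whose test cases passed."""
    if not passed:
        raise ValueError("no tasks to score")
    return 100.0 * sum(passed) / len(passed)


# Hypothetical run: 121 of 270 tasks pass.
outcomes = [True] * 121 + [False] * 149
print(f"{success_rate(outcomes):.2f}")  # 44.81
```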
Evaluation Methodology
- SELECT queries: Compare execution results with golden SQL outputs
- Management SQLs: Verify through comprehensive test cases
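The two checks above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's evaluation harness: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server, it compares SELECT results as order-insensitive multisets (the source does not say how row order is treated), and the `orders` table and helper names are hypothetical.

```python
import sqlite3
from collections import Counter


def same_result(conn, predicted_sql, gold_sql):
    """Return True if both queries yield the same multiset of rows (order ignored)."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run cannot match
    gold = conn.execute(gold_sql).fetchall()
    return Counter(predicted) == Counter(gold)


def passes_test_case(conn, management_sql, check_sql, expected):
    """Run a management statement, then verify the database state with a check query."""
    try:
        conn.executescript(management_sql)
    except sqlite3.Error:
        return False
    return conn.execute(check_sql).fetchall() == expected


# Hypothetical toy database standing in for a benchmark DB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'open', 10.0), (2, 'open', 25.5), (3, 'closed', 7.0);
""")

select_ok = same_result(
    conn,
    "SELECT id FROM orders WHERE status = 'open' ORDER BY id DESC",
    "SELECT id FROM orders WHERE status = 'open'",
)
update_ok = passes_test_case(
    conn,
    "UPDATE orders SET status = 'closed' WHERE total > 20;",
    "SELECT status FROM orders WHERE id = 2",
    [("closed",)],
)
print(select_ok, update_ok)  # True True
```

Comparing executed results rather than SQL text lets differently written but equivalent queries pass, while management statements are judged by the state they leave behind.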
Last Updated: 05/28/2025
| Rank | Date | Organization | Success Rate (%) | |
| --- | --- | --- | --- | --- |
| 🥇 1 | 2025-05-28 | OpenAI | 44.81 | 0.0233 |
| 🥈 2 | 2025-05-28 | OpenAI | 40.00 | 0.0336 |
| 🥉 3 | 2025-05-28 | OpenAI | 37.80 | 0.0231 |
| 4 | 2025-05-28 | OpenAI | 37.40 | 0.2129 |
| 5 | 2025-05-28 | OpenAI | 37.03 | 0.4310 |
| 6 | 2025-05-28 | | 37.03 | 0.0165 |
| 7 | 2025-05-28 | Anthropic | 36.70 | 0.0623 |
| 8 | 2025-05-28 | Qwen | 34.81 | 0.0043 |
| 9 | 2025-05-28 | Anthropic | 34.81 | 0.0771 |
| 10 | 2025-05-28 | Anthropic | 34.44 | 0.0619 |
| 11 | 2025-05-28 | OpenAI | 32.96 | 0.0788 |
| 12 | 2025-05-28 | OpenAI | 31.48 | 0.0412 |
| 13 | 2025-05-28 | | 30.37 | 0.0027 |
| 14 | 2025-05-28 | DeepSeek | 30.37 | 0.0047 |
| 15 | 2025-05-28 | DeepSeek | 27.78 | 0.0165 |
| 16 | 2025-05-28 | Meta | 27.40 | 0.0029 |
| 17 | 2025-05-28 | Meta | 16.70 | 0.0014 |
Note: Results are based on LiveSQLBench-Base-Lite (270 tasks across 18 end-user level databases with HKB-JSON), including both SELECT queries and management operations. Models with reasoning ability are marked with a grayed-out logo.
Discussion
Current Model Performance
LiveSQLBench-Base-Lite evaluates LLMs on PostgreSQL, a widely used, feature-rich open-source database system. Our benchmark provides Docker-based evaluation environments for easy deployment and reproducibility. We conduct separate evaluations by category: (1) Model Base, direct SQL generation without external tools, and (2) Agent, models with external tool orchestration. Initial results on Model Base reveal significant challenges: the best-performing model (o3-mini) achieves a 44.81% success rate, so even the leader fails more than half of the tasks. The performance gap between models is notable. A cluster of top models (o3-mini, GPT-4.1, o4-mini, o1-preview, and Gemini 2.5 Flash with thinking) scores in the 37-45% range, while the remaining models score lower, down to 16.70%. This suggests that while there is improvement at the top end, complex SQL generation remains a difficult task for most current LLMs. Reasoning-specific models and newer model families such as OpenAI's 'o' series and Google's Gemini 2.5 show promise, but the results highlight the ongoing need for advances in this domain.
Stay tuned!
We are developing several new versions of LiveSQLBench for the first release:
- LiveSQLBench-Base-Full with 600 BI tasks, 200 Management tasks and HKB-Documents
- LiveSQLBench-Large-Lite featuring industrial-scale databases with 1340+ columns
- LiveSQLBench-Large-Full containing complete large version DBs and tasks
Additionally, we are expanding to multi-dialect support, starting with SQLite for research purposes, with plans to add more dialects based on community voting.
Each new version will include both open development and hidden test sets, with hidden tests becoming the next version's open development set.
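The rotation between versions can be sketched as a small data model. The `Release` class, the version labels, and the task identifiers below are hypothetical and only illustrate the hand-over rule described above; they do not reflect how the benchmark is actually packaged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    name: str
    open_dev: tuple[str, ...]     # tasks published for development
    hidden_test: tuple[str, ...]  # tasks held back for scoring


def next_release(prev: Release, name: str, new_hidden: list[str]) -> Release:
    """The previous hidden test set becomes the new open development set."""
    return Release(name=name, open_dev=prev.hidden_test, hidden_test=tuple(new_hidden))


# Hypothetical version labels and task IDs.
v1 = Release("v1", open_dev=("task_a", "task_b"), hidden_test=("task_c", "task_d"))
v2 = next_release(v1, "v2", ["task_e", "task_f"])
print(v2.open_dev)     # ('task_c', 'task_d')
print(v2.hidden_test)  # ('task_e', 'task_f')
```

Because scored tasks are only published after a newer hidden set replaces them, a model cannot have seen the tasks it is currently being ranked on.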
Citation
LiveSQLBench Citation
BirdBench Citation
Submit feedback to questions in the dataset via this form