A Comprehensive Chinese Text-to-SQL Benchmark for Complex, Cross-Domain Analytical Scenarios
Current evaluation results on the Falcon benchmark (500 questions), sorted by Execution Accuracy (EX). "VES" denotes the Valid Executable SQL rate.
| Rank | Model | Execution Accuracy (EX) | Executable Rate (VES) | Date |
|---|---|---|---|---|
| 1 🏆 | Falcon-Agent + DeepSeek R1 (SOTA) | 45.2% (226/500) | 83.8% (419/500) | 2025-09 |
| 2 🥈 | Falcon-Agent + o1 | 43.0% (215/500) | 84.6% (423/500) | 2025-09 |
| 3 🥉 | Falcon-Agent + o3-mini | 42.2% (211/500) | 82.0% (410/500) | 2025-09 |
| 4 | Falcon-Agent + Claude 3.7 Sonnet (Thinking) (NEW) | 41.0% (205/500) | 80.6% (403/500) | 2025-09 |
| 5 | Falcon-Agent + GPT-4.1 (Preview) | 40.2% (201/500) | 76.2% (381/500) | 2025-09 |
| 6 | Falcon-Agent + Claude 3.7 Sonnet | 40.0% (200/500) | 79.0% (395/500) | 2025-09 |
| 7 | Falcon-Agent + Qwen3-Coder-480B-Instruct | 37.0% (185/500) | 77.4% (387/500) | 2025-09 |
| 8 | Falcon-Agent + Gemini 2.5 Pro | 36.8% (184/500) | 84.2% (421/500) | 2025-09 |
| 10 | Falcon-Agent + Llama 3.3 70B | 24.0% (120/500) | 65.6% (328/500) | 2025-09 |
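For concreteness, the two leaderboard metrics can be reproduced from per-question execution outcomes: EX is the fraction of questions whose predicted SQL both executes and returns the gold result, while VES is the fraction whose predicted SQL merely executes without error. The sketch below is illustrative only, not the official scorer, and its result-record field names are assumptions:

```python
# Illustrative sketch (NOT the official Falcon scorer): compute
# Execution Accuracy (EX) and Valid Executable SQL rate (VES)
# from per-question outcomes. The 'executable'/'match' field
# names are assumptions for this example.

def score(results):
    """results: list of dicts with boolean 'executable' and 'match' keys."""
    n = len(results)
    executable = sum(1 for r in results if r["executable"])
    correct = sum(1 for r in results if r["executable"] and r["match"])
    return {"EX": correct / n, "VES": executable / n}

# Example matching the rank-1 row: 500 questions,
# 226 correct, 419 executable (193 executable but wrong).
demo = ([{"executable": True, "match": True}] * 226
        + [{"executable": True, "match": False}] * 193
        + [{"executable": False, "match": False}] * 81)
print(score(demo))  # {'EX': 0.452, 'VES': 0.838}
```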
Falcon is a continuously evolving, high-quality benchmark designed to bridge the gap between academic Text-to-SQL datasets and real-world enterprise requirements. Unlike traditional benchmarks, Falcon focuses on MaxCompute/Hive dialects and stresses models with the complex SQL patterns and linguistic ambiguities common in production environments.
The Falcon benchmark is split into a Development Set (with ground truth) and a blind Test Set. The repository structure is designed for easy integration with evaluation pipelines.
```
FALCON/
├── dev_data/                    # Development Set
│   ├── dev.json                 # Questions, SQL, and execution results
│   ├── tables.json              # Schema definitions (PK/FK/columns)
│   └── dev_databases/           # SQLite/CSV source files for execution
│
├── test_data/                   # Test Set
│   ├── test.json                # Questions ONLY (ground truth hidden)
│   ├── tables.json              # Schema definitions
│   └── test_databases/          # SQLite/CSV source files
│
├── simple_agent/                # [NEW] Lightweight evaluation scripts
│   ├── comparator.py            # SQL execution result comparator
│   ├── utils.py                 # Utilities for SQL extraction
│   └── simple_benchmark.py      # Main script to run dev/test evaluation
│
├── submission/                  # [NEW] Submission helpers & examples
│   ├── example_submission_csv/  # Example CSV files for the leaderboard
│   ├── example_submission_sql/  # Example SQL files for the leaderboard
│   └── format_submission.py     # Helper to convert DB-GPT Excel output
│
└── README.md
```
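The `simple_agent/comparator.py` script compares SQL execution results. The sketch below is our own minimal, self-contained version of that idea, not the repository's code: two queries "match" when they return the same multiset of rows, ignoring row order.

```python
import sqlite3

# Minimal sketch of an execution-result comparator in the spirit of
# simple_agent/comparator.py (our own version, NOT the repo's code).
def results_match(db: sqlite3.Connection, pred_sql: str, gold_sql: str) -> bool:
    try:
        pred = db.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a non-executable prediction can never match
    gold = db.execute(gold_sql).fetchall()
    # Compare as sorted row lists so result ordering does not matter.
    return sorted(map(tuple, pred)) == sorted(map(tuple, gold))

# Demo on an in-memory database
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sales(region TEXT, amount INT);
    INSERT INTO sales VALUES ('north', 10), ('south', 20);
""")
print(results_match(db,
                    "SELECT region FROM sales ORDER BY amount DESC",
                    "SELECT region FROM sales"))  # True: same rows, order ignored
```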
We provide two methods for evaluating your models on the Falcon benchmark.

The first uses the lightweight `simple_agent` pipeline:
```shell
pip install openai pandas tqdm

# For the Development Set
python simple_benchmark.py dev
# For the Test Set
python simple_benchmark.py test
```

This generates a `submission.zip` automatically.
The second evaluates models and agents through a visual interface (DB-GPT) and converts its Excel output for submission:

```shell
python submission/format_submission.py \
    --input <dbgpt_output.xlsx> \
    --output submission.zip
```
If you use Falcon in your research or development, please cite our paper:
@article{falcon2025,
title={Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation},
author={Luo, Wenzhen and Guan, Wei and Yao, Yifan and Pan, Yimin and Wang, Feng and Yu, Zhipeng and Wen, Zhe and Chen, Liang and Zhuang, Yihong},
journal={arXiv preprint arXiv:2510.24762},
year={2025},
url={https://arxiv.org/abs/2510.24762}
}