🦅 Falcon: Enterprise-Grade Text-to-SQL Benchmark

A Comprehensive Chinese Text-to-SQL Benchmark for Complex, Cross-Domain Analytical Scenarios

🏆 Leaderboard

Current evaluation results on the Falcon benchmark (500 questions), sorted by Execution Accuracy (EX). "VES" denotes the valid-executable-SQL rate.

| Rank | Model | Execution Accuracy (EX) | Executable Rate (VES) | Date |
|------|-------|-------------------------|-----------------------|------|
| 1 🏆 | Falcon-Agent + DeepSeek R1 (SOTA, Ant Group) | 45.2% (226 / 500) | 83.8% (419 / 500) | 2025-09 |
| 2 🥈 | Falcon-Agent + o1 (Ant Group) | 43.0% (215 / 500) | 84.6% (423 / 500) | 2025-09 |
| 3 🥉 | Falcon-Agent + o3-mini (Ant Group) | 42.2% (211 / 500) | 82.0% (410 / 500) | 2025-09 |
| 4 | Falcon-Agent + Claude 3.7 Sonnet (Thinking) (NEW, Ant Group) | 41.0% (205 / 500) | 80.6% (403 / 500) | 2025-09 |
| 5 | Falcon-Agent + GPT4.1 (Preview) (Ant Group) | 40.2% (201 / 500) | 76.2% (381 / 500) | 2025-09 |
| 6 | Falcon-Agent + Claude 3.7 Sonnet (Ant Group) | 40.0% (200 / 500) | 79.0% (395 / 500) | 2025-09 |
| 7 | Falcon-Agent + Qwen3-Coder-480B-Instruct (Ant Group) | 37.0% (185 / 500) | 77.4% (387 / 500) | 2025-09 |
| 8 | Falcon-Agent + Gemini 2.5 Pro (Ant Group) | 36.8% (184 / 500) | 84.2% (421 / 500) | 2025-09 |
| 10 | Falcon-Agent + Llama 3.3 70B (Ant Group) | 24.0% (120 / 500) | 65.6% (328 / 500) | 2025-09 |
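The two leaderboard columns are related but distinct: VES counts queries that merely execute, while EX additionally requires the execution result to match the gold answer. A minimal sketch of how both rates fall out of per-question outcomes ("executed" and "matched" are illustrative field names, not the benchmark's actual output format):

```python
# Hedged sketch of the two leaderboard metrics; the per-question record
# shape here is invented for illustration.
def score(outcomes):
    """Return (EX, VES) as fractions of the question set."""
    n = len(outcomes)
    executable = sum(1 for o in outcomes if o["executed"])
    correct = sum(1 for o in outcomes if o["executed"] and o["matched"])
    return correct / n, executable / n

# Reproducing the rank-1 row: 226 correct, 419 executable, 500 total.
outcomes = (
    [{"executed": True, "matched": True}] * 226
    + [{"executed": True, "matched": False}] * 193
    + [{"executed": False, "matched": False}] * 81
)
ex, ves = score(outcomes)  # 0.452 and 0.838, i.e. 45.2% EX, 83.8% VES
```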

📖 Introduction

Falcon is a continuously evolving, high-quality benchmark designed to bridge the gap between academic Text-to-SQL datasets and real-world enterprise requirements. Unlike traditional benchmarks, Falcon focuses on MaxCompute/Hive dialects and stresses models with the complex SQL patterns and linguistic ambiguities common in production environments.

SQL Complexity

  • Heavy focus on multi-table joins (77% of samples)
  • Nested CTEs (Common Table Expressions)
  • Window functions & ranking
  • Type casting & regular-expression filters
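As an illustration only (not drawn from the benchmark itself), a query combining two of these patterns, a CTE feeding a window function, can be exercised against an in-memory SQLite database:

```python
import sqlite3

# Illustrative only: schema and data are invented for this sketch; the
# benchmark itself targets MaxCompute/Hive dialects, not SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('north', 100), ('north', 300), ('south', 200), ('south', 50);
""")

sql = """
WITH regional AS (                               -- CTE: pre-aggregate per region
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
)
SELECT region, total,
       RANK() OVER (ORDER BY total DESC) AS rnk  -- window function: ranking
FROM regional
ORDER BY total DESC
"""
rows = conn.execute(sql).fetchall()
# north totals 400 (rank 1), south totals 250 (rank 2)
```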

Linguistic Challenges

  • Chinese fuzzy time expressions
  • Colloquial business jargon
  • Ellipsis (omitted information)
  • Multi-intent questions

Enterprise Scale

  • Schemas with denormalized fields
  • Implicit foreign keys
  • Domain-specific synonyms
  • Real-world "dirty" data scenarios

📂 Dataset Structure

The Falcon benchmark is split into a Development Set (with ground truth) and a blind Test Set. The repository structure is designed for easy integration with evaluation pipelines.

Repository Layout
FALCON/
├── dev_data/                   # Development Set
│   ├── dev.json                # Questions, SQL, and execution results
│   ├── tables.json             # Schema definitions (PK/FK/columns)
│   └── dev_databases/          # SQLite/CSV source files for execution
│
├── test_data/                  # Test Set
│   ├── test.json               # Questions ONLY (ground truth hidden)
│   ├── tables.json             # Schema definitions
│   └── test_databases/         # SQLite/CSV source files
│
├── simple_agent/               # [NEW] Lightweight evaluation scripts
│   ├── comparator.py           # SQL execution result comparator
│   ├── utils.py                # Utilities for SQL extraction
│   └── simple_benchmark.py     # Main script to run dev/test evaluation
│
├── submission/                 # [NEW] Submission helpers & examples
│   ├── example_submission_csv/ # Example CSV files for leaderboard
│   ├── example_submission_sql/ # Example SQL files for leaderboard
│   └── format_submission.py    # Helper to convert DB-GPT Excel output
│
└── README.md
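A minimal sketch of consuming the development set. The field names (question, sql, db_id) and the .sqlite file naming are assumptions for illustration, not the benchmark's documented format; check dev.json for the actual keys.

```python
import json
import sqlite3
from pathlib import Path

def iter_dev_examples(root="dev_data"):
    """Yield (question, gold_sql, db_path) triples from the development set.
    Assumes dev.json is a JSON list with 'question', 'sql', 'db_id' keys."""
    root = Path(root)
    examples = json.loads((root / "dev.json").read_text(encoding="utf-8"))
    for ex in examples:
        db_path = root / "dev_databases" / f"{ex['db_id']}.sqlite"
        yield ex["question"], ex["sql"], db_path

def execute(db_path, sql):
    """Run a query and return its rows as a set, for order-insensitive comparison."""
    with sqlite3.connect(db_path) as conn:
        return set(map(tuple, conn.execute(sql).fetchall()))
```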

🚀 Getting Started

We provide two methods for evaluating your models on the Falcon benchmark.

Method 1: Script-based

Use the lightweight simple_agent pipeline.

1. Clone & Install: pip install openai pandas tqdm
2. Run Evaluation
# For the Development Set
python simple_agent/simple_benchmark.py dev

# For the Test Set
python simple_agent/simple_benchmark.py test

This generates a submission.zip automatically.
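Under the hood, scoring hinges on comparing execution results. A hedged sketch in the spirit of simple_agent/comparator.py (the actual script's logic may differ): two queries are counted as matching when they return the same multiset of rows, ignoring row order.

```python
from collections import Counter

def results_match(pred_rows, gold_rows):
    """Order-insensitive comparison of two query result sets.
    A Counter of row tuples treats results as multisets, so duplicate
    rows must match in count but row order is irrelevant."""
    return Counter(map(tuple, pred_rows)) == Counter(map(tuple, gold_rows))
```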

Method 2: DB-GPT (GUI)

Evaluate models and agents via a visual interface.

1. Run Evaluation: use the "Models Evaluation" module in DB-GPT to generate an Excel report.
2. Format Submission:
python submission/format_submission.py \
--input <dbgpt_output.xlsx> \
--output submission.zip

📝 Citation

If you use Falcon in your research or development, please cite our paper:

@article{falcon2025,
  title={Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation},
  author={Luo, Wenzhen and Guan, Wei and Yao, Yifan and Pan, Yimin and Wang, Feng and Yu, Zhipeng and Wen, Zhe and Chen, Liang and Zhuang, Yihong},
  journal={arXiv preprint arXiv:2510.24762},
  year={2025},
  url={https://arxiv.org/abs/2510.24762}
}