🦅 Falcon: Enterprise-Grade Text-to-SQL Benchmark

A Comprehensive Chinese Text-to-SQL Benchmark for Complex, Cross-Domain Analytical Scenarios

🏆 Leaderboard

Current evaluation results on the Falcon benchmark (500 questions), sorted by Execution Accuracy (EX). "VES" denotes the valid-executable-SQL rate.

| Rank | Model | Execution Accuracy (EX) | Executable Rate (VES) | Date |
|------|-------|-------------------------|-----------------------|------|
| 1 🏆 | Falcon-Agent + DeepSeek R1 (SOTA, Ant Group) | 45.2% (226 / 500) | 83.8% (419 / 500) | 2025-09 |
| 2 🥈 | Falcon-Agent + o1 (Ant Group) | 43.0% (215 / 500) | 84.6% (423 / 500) | 2025-09 |
| 3 🥉 | Falcon-Agent + o3-mini (Ant Group) | 42.2% (211 / 500) | 82.0% (410 / 500) | 2025-09 |
| 4 | Falcon-Agent + Claude 3.7 Sonnet (Thinking) (NEW, Ant Group) | 41.0% (205 / 500) | 80.6% (403 / 500) | 2025-09 |
| 5 | Falcon-Agent + GPT4.1 (Preview) (Ant Group) | 40.2% (201 / 500) | 76.2% (381 / 500) | 2025-09 |
| 6 | Falcon-Agent + Claude 3.7 Sonnet (Ant Group) | 40.0% (200 / 500) | 79.0% (395 / 500) | 2025-09 |
| 7 | Falcon-Agent + Qwen3-Coder-480B-Instruct (Ant Group) | 37.0% (185 / 500) | 77.4% (387 / 500) | 2025-09 |
| 8 | Falcon-Agent + Gemini 2.5 Pro (Ant Group) | 36.8% (184 / 500) | 84.2% (421 / 500) | 2025-09 |
| 10 | Falcon-Agent + Llama 3.3 70B (Ant Group) | 24.0% (120 / 500) | 65.6% (328 / 500) | 2025-09 |
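The two leaderboard columns are related but distinct: VES counts queries that merely execute, while EX additionally requires the execution result to match the gold answer. A minimal sketch of how both rates fall out of per-question outcomes ("executed" and "matched" are illustrative field names, not the benchmark's actual output format):

```python
# Hedged sketch of the two leaderboard metrics; the per-question record
# shape here is invented for illustration.
def score(outcomes):
    """Return (EX, VES) as fractions of the question set."""
    n = len(outcomes)
    executable = sum(1 for o in outcomes if o["executed"])
    correct = sum(1 for o in outcomes if o["executed"] and o["matched"])
    return correct / n, executable / n

# Reproducing the rank-1 row: 226 correct, 419 executable, 500 total.
outcomes = (
    [{"executed": True, "matched": True}] * 226
    + [{"executed": True, "matched": False}] * 193
    + [{"executed": False, "matched": False}] * 81
)
ex, ves = score(outcomes)  # 0.452 and 0.838, i.e. 45.2% EX, 83.8% VES
```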

📖 Introduction

Falcon is a continuously evolving, high-quality benchmark designed to bridge the gap between academic Text-to-SQL datasets and real-world enterprise requirements. Unlike traditional benchmarks, Falcon focuses on MaxCompute/Hive dialects and stresses models with the complex SQL patterns and linguistic ambiguities common in production environments.

SQL Complexity

  • Heavy focus on multi-table joins (77% of samples)
  • Nested CTEs (Common Table Expressions)
  • Window functions & ranking
  • Type casting & regular-expression filters
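As an illustration only (not drawn from the benchmark itself), a query combining two of these patterns, a CTE feeding a window function, can be exercised against an in-memory SQLite database:

```python
import sqlite3

# Illustrative only: schema and data are invented for this sketch; the
# benchmark itself targets MaxCompute/Hive dialects, not SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('north', 100), ('north', 300), ('south', 200), ('south', 50);
""")

sql = """
WITH regional AS (                               -- CTE: pre-aggregate per region
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
)
SELECT region, total,
       RANK() OVER (ORDER BY total DESC) AS rnk  -- window function: ranking
FROM regional
ORDER BY total DESC
"""
rows = conn.execute(sql).fetchall()
# north totals 400 (rank 1), south totals 250 (rank 2)
```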

Linguistic Challenges

  • Chinese fuzzy time expressions
  • Colloquial business jargon
  • Ellipsis (omitted information)
  • Multi-intent questions

Enterprise Scale

  • Schemas with denormalized fields
  • Implicit foreign keys
  • Domain-specific synonyms
  • Real-world "dirty" data scenarios

📂 Dataset Structure

The Falcon benchmark is split into a Development Set (with ground truth) and a blind Test Set. The repository structure is designed for easy integration with evaluation pipelines.

Repository Layout
FALCON/
├── dev_data/                   # Development Set
│   ├── dev.json                # Questions, SQL, and execution results
│   ├── tables.json             # Schema definitions (PK/FK/columns)
│   └── dev_databases/          # SQLite/CSV source files for execution
│
├── test_data/                  # Test Set
│   ├── test.json               # Questions ONLY (ground truth hidden)
│   ├── tables.json             # Schema definitions
│   └── test_databases/         # SQLite/CSV source files
│
├── simple_agent/               # [NEW] Lightweight evaluation scripts
│   ├── comparator.py           # SQL execution result comparator
│   ├── utils.py                # Utilities for SQL extraction
│   └── simple_benchmark.py     # Main script to run dev/test evaluation
│
├── submission/                 # [NEW] Submission helpers & examples
│   ├── example_submission_csv/ # Example CSV files for leaderboard
│   ├── example_submission_sql/ # Example SQL files for leaderboard
│   └── format_submission.py    # Helper to convert DB-GPT Excel output
│
└── README.md
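A minimal sketch of consuming the development set. The field names (question, sql, db_id) and the .sqlite file naming are assumptions for illustration, not the benchmark's documented format; check dev.json for the actual keys.

```python
import json
import sqlite3
from pathlib import Path

def iter_dev_examples(root="dev_data"):
    """Yield (question, gold_sql, db_path) triples from the development set.
    Assumes dev.json is a JSON list with 'question', 'sql', 'db_id' keys."""
    root = Path(root)
    examples = json.loads((root / "dev.json").read_text(encoding="utf-8"))
    for ex in examples:
        db_path = root / "dev_databases" / f"{ex['db_id']}.sqlite"
        yield ex["question"], ex["sql"], db_path

def execute(db_path, sql):
    """Run a query and return its rows as a set, for order-insensitive comparison."""
    with sqlite3.connect(db_path) as conn:
        return set(map(tuple, conn.execute(sql).fetchall()))
```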

🚀 Getting Started

We provide two methods for evaluating your models on the Falcon benchmark.

Method 1: Script-based

Use the lightweight simple_agent pipeline.

1. Clone & Install: pip install openai pandas tqdm
2. Run Evaluation
# For the Development Set
python simple_agent/simple_benchmark.py dev

# For the Test Set
python simple_agent/simple_benchmark.py test

This generates a submission.zip automatically.
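Under the hood, scoring hinges on comparing execution results. A hedged sketch in the spirit of simple_agent/comparator.py (the actual script's logic may differ): two queries are counted as matching when they return the same multiset of rows, ignoring row order.

```python
from collections import Counter

def results_match(pred_rows, gold_rows):
    """Order-insensitive comparison of two query result sets.
    A Counter of row tuples treats results as multisets, so duplicate
    rows must match in count but row order is irrelevant."""
    return Counter(map(tuple, pred_rows)) == Counter(map(tuple, gold_rows))
```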

Method 2: DB-GPT (GUI)

Evaluate models and agents via a visual interface.

1. Run Evaluation: use the "Models Evaluation" module in DB-GPT to generate an Excel report.
2. Format Submission:
python submission/format_submission.py \
--input <dbgpt_output.xlsx> \
--output submission.zip

📝 Citation

If you use Falcon in your research or development, please cite our paper:

@article{falcon2025,
  title={Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation},
  author={Luo, Wenzhen and Guan, Wei and Yao, Yifan and Pan, Yimin and Wang, Feng and Yu, Zhipeng and Wen, Zhe and Chen, Liang and Zhuang, Yihong},
  journal={arXiv preprint arXiv:2510.24762},
  year={2025},
  url={https://arxiv.org/abs/2510.24762}
}