This repository contains the official implementation of the paper: "BFCS: A Large-Scale Execution-Based Benchmark for Function Calling in Science".
BFCS is the first execution-based benchmark specifically designed to evaluate the function-calling capabilities of Large Language Models (LLMs) in scientific domains. Unlike static benchmarks, BFCS adopts an execution-first philosophy:
- Real-World Scale: Includes 1,648 function-query-answer pairs across chemistry, biology, pharmacy, medicine, and materials science.
- Standardized Environment: Integrated with 48 real scientific Python libraries (e.g., RDKit, Biopython) and 2,100 executable tools.
- Rigorous Evaluation: Uses Apptainer for container-native isolation to ensure reproducibility and verify functional correctness (ESR) and semantic accuracy (AMR).
| Model | Simple ESR | Simple AMR | Simple Gap↓ | Multiple ESR | Multiple AMR | Multiple Gap↓ | Parallel ESR | Parallel AMR | Parallel Gap↓ | Overall ESR | Overall AMR | Overall Gap↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | | | | |
| Claude-Opus-4.5 | 98.94 | 69.74 | 29.21 | 99.39 | 65.97 | 33.42 | 93.85 | 75.89 | 17.96 | 97.39 | 70.53 | 26.86 |
| Claude-Sonnet-4.5 | 97.71 | 60.11 | 37.60 | 95.09 | 63.82 | 31.27 | 95.47 | 68.47 | 27.00 | 96.09 | 64.13 | 31.96 |
| Gemini-3-Pro | 97.15 | 69.69 | 27.47 | 93.81 | 63.03 | 30.78 | 92.28 | 73.30 | 18.99 | 94.42 | 68.67 | 25.74 |
| Gemini-3-Flash | 94.47 | 64.50 | 29.96 | 92.18 | 61.09 | 31.09 | 85.71 | 65.68 | 20.03 | 90.79 | 63.76 | 27.03 |
| GPT-5.2 | 92.80 | 65.23 | 27.56 | 93.13 | 64.44 | 28.69 | 92.41 | 61.12 | 31.28 | 92.78 | 63.60 | 29.18 |
| Doubao-Seed-1.8 | 95.61 | 50.19 | 45.42 | 98.18 | 36.00 | 62.18 | 93.21 | 43.73 | 49.48 | 95.67 | 43.31 | 52.36 |
| **Open-Weight Models** | | | | | | | | | | | | |
| DeepSeek-V3.2 | 89.12 | 54.77 | 34.35 | 91.27 | 54.91 | 36.36 | 98.26 | 4.53 | 93.73 | 92.88 | 38.07 | 54.81 |
| GLM-4.7 | 91.27 | 48.28 | 42.99 | 96.26 | 61.07 | 35.19 | 91.02 | 58.23 | 32.79 | 92.85 | 55.86 | 36.99 |
| Kimi-k2.5 | 91.98 | 44.85 | 47.14 | 89.64 | 29.09 | 60.55 | 84.84 | 40.42 | 44.43 | 88.82 | 38.12 | 50.70 |
| Mistral-Large-3 | 100.00 | 48.28 | 51.72 | 100.00 | 49.64 | 50.36 | 91.29 | 43.90 | 47.39 | 97.10 | 47.27 | 49.82 |
| Qwen3-235B | 97.62 | 59.21 | 38.42 | 99.69 | 60.58 | 39.12 | 94.58 | 57.69 | 36.89 | 97.30 | 59.16 | 38.14 |
| Qwen3-30B | 93.13 | 42.07 | 51.06 | 97.49 | 38.63 | 58.86 | 90.54 | 34.24 | 56.30 | 93.72 | 38.31 | 55.41 |
Note: ESR (Execution Success Rate) measures whether the code runs; AMR (Answer Match Rate) measures whether the scientific logic is correct. Gap = ESR − AMR, where a large positive gap indicates potential silent failures. All values are percentages.
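As a quick sanity check on the Gap metric, the definition above can be applied directly to a row of the table (values copied from DeepSeek-V3.2's Parallel columns):

```python
# Gap = ESR - AMR: the share of runs that execute successfully
# but produce scientifically wrong answers ("silent failures").
def gap(esr: float, amr: float) -> float:
    return round(esr - amr, 2)

# DeepSeek-V3.2, Parallel scenario (values from the table above)
print(gap(98.26, 4.53))  # -> 93.73: nearly every run executes, very few are correct
```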
We use Apptainer (formerly Singularity) to manage complex scientific dependencies.
```shell
# Install Apptainer (refer to the official docs for details)
sudo apt-get update && sudo apt-get install -y apptainer
```

## Dataset

The benchmark data is ready to use and located directly in the `./data` directory. It is stratified into three cognitive scenarios:

- `./data/simple.jsonl`: atomic instruction synthesis without distractors.
- `./data/multiple.jsonl`: tool selection among semantically similar distractors.
- `./data/parallel.jsonl`: compositional batch processing requiring multiple independent calls.
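Each scenario file is standard JSON Lines (one record per line). A minimal loading sketch; the field names in the usage comment are illustrative assumptions, not guaranteed by the repo:

```python
import json

def load_scenario(path: str) -> list[dict]:
    """Read one JSON record per line from a JSONL scenario file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Usage (field names shown for illustration only):
# records = load_scenario("./data/simple.jsonl")
# print(len(records), records[0].keys())
```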
BFCS relies on isolated Apptainer environments. The specific installation methods and build instructions for each package are detailed in ./containers/config.yaml.
To save time, you do not need to build these from scratch. We have published the 9 base Apptainer images and 48 package-specific Apptainer images directly to the repository's GitHub Packages (GHCR). You can download or pull them directly:
```shell
# Clone the repository
git clone https://github.com/ChampionZhong/BFCS.git
cd BFCS

# Example: pull a specific package container directly from GitHub Packages
# Replace <package_name> with the target tool (e.g., rdkit, biopython)
apptainer pull docker://ghcr.io/championzhong/bfcs/bfcs-<package_name>:latest
```

To run the evaluation on a given scenario:

```shell
python evaluation/run_eval.py --model_name your_model_name --scenario simple
```

The table below gives an overview of the repositories and the corresponding packages that can be imported in a Python environment, covering the category assigned to each repository and the total count of available wrapper functions.
| # | Repo Name (Original Repo) | Package Name (Package Card) | Category | Tools | LICENSE |
|---|---|---|---|---|---|
| 1 | AiZynthFinder | aizynthfinder | Pharmacy | 3 | MIT |
| 2 | anndata | anndata | Biology | 8 | BSD-3-Clause |
| 3 | batchgenerators | batchgenerators | Medicine | 26 | Apache-2.0 |
| 4 | bioemu | bioemu | Biology | 153 | MIT |
| 5 | biopython | Bio | Biology | 4 | BSD-3-Clause |
| 6 | boltz | boltz | Pharmacy | 46 | MIT |
| 7 | CEBRA | cebra | Biology | 14 | Apache-2.0 |
| 8 | chai-lab | chai_lab | Biology | 12 | Apache-2.0 |
| 9 | chembl-downloader | chembl_downloader | Pharmacy | 2 | MIT |
| 10 | ChemInformant | ChemInformant | Chemistry | 5 | MIT |
| 11 | chemprop | chemprop | Pharmacy | 4 | MIT |
| 12 | chempy | chempy | Chemistry | 45 | BSD-2-Clause |
| 13 | CIRpy | cirpy | Chemistry | 6 | MIT |
| 14 | datamol | datamol | Chemistry | 13 | Apache-2.0 |
| 15 | deepchem | deepchem | Pharmacy | 128 | MIT |
| 16 | DeepPurpose | DeepPurpose | Pharmacy | 31 | BSD-3-Clause |
| 17 | descriptastorus | descriptastorus | Chemistry | 3 | BSD-3-Clause |
| 18 | drugbank_downloader | drugbank_downloader | Pharmacy | 1 | MIT |
| 19 | dscribe | dscribe | Material | 7 | Apache-2.0 |
| 20 | gpaw | gpaw | Material | 263 | GPLv3+ |
| 21 | guacamol | guacamol | Pharmacy | 5 | MIT |
| 22 | lungmask | lungmask | Medicine | 8 | Apache-2.0 |
| 23 | mace | mace | Material | 7 | MIT |
| 24 | MedCLIP | medclip | Medicine | 3 | Unknown |
| 25 | mendeleev | mendeleev | Chemistry | 23 | MIT |
| 26 | molmass | molmass | Chemistry | 14 | BSD-3-Clause |
| 27 | MONAI | monai | Medicine | 96 | Apache-2.0 |
| 28 | mordred | mordred | Chemistry | 1 | BSD-3-Clause |
| 29 | ncbi-genome-download | ncbi_genome_download | Biology | 21 | Apache-2.0 |
| 30 | NistChemPy | nistchempy | Chemistry | 1 | MIT |
| 31 | nnUNet | nnunetv2 | Medicine | 20 | Apache-2.0 |
| 32 | periodictable | periodictable | Chemistry | 21 | BSD-3-Clause |
| 33 | PubChemPy | pubchempy | Chemistry | 2 | MIT |
| 34 | pybel | pybel | Biology | 46 | MIT |
| 35 | pyEQL | pyEQL | Chemistry | 4 | LGPLv3 |
| 36 | pyRiemann | pyriemann | Medicine | 100 | BSD-3-Clause |
| 37 | pyscf | pyscf | Chemistry | 449 | Apache-2.0 |
| 38 | rdkit | rdkit | Chemistry | 110 | BSD-3-Clause |
| 39 | robert | robert | Chemistry | 38 | MIT |
| 40 | scanpy | scanpy | Biology | 21 | BSD-3-Clause |
| 41 | selfies | selfies | Chemistry | 13 | Apache-2.0 |
| 42 | spikeinterface | spikeinterface | Biology | 159 | MIT |
| 43 | stk | stk | Chemistry | 12 | MIT |
| 44 | tape | tape | Biology | 7 | BSD-3-Clause |
| 45 | TDC | tdc | Pharmacy | 118 | MIT |
| 46 | torchdrug | torchdrug | Pharmacy | 16 | Apache-2.0 |
| 47 | torchio | torchio | Medicine | 3 | Apache-2.0 |
| 48 | useful_rdkit_utils | useful_rdkit_utils | Chemistry | 8 | MIT |
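With 48 package images published to GHCR, pulling them one at a time is tedious. A dry-run shell sketch that prints the pull command for a few example packages from the table above (remove the `echo` to actually pull):

```shell
# Dry run: print the pull command for a few example packages.
# Remove 'echo' to execute the pulls for real.
for pkg in rdkit biopython pubchempy; do
  echo apptainer pull "docker://ghcr.io/championzhong/bfcs/bfcs-${pkg}:latest"
done
```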
The source code of the wrappers and the build scripts in this repository are licensed under the Apache License 2.0.
However, the software packages installed within the containers retain their original licenses. Users are responsible for complying with the licenses of the underlying packages when using them:
**Attention (GPLv3+):** `gpaw` (please note that using gpaw may impose copyleft obligations).
If you find this work helpful, please cite it:
```bibtex
@misc{zhong2026bfcs,
  title        = {BFCS: A Large-Scale Execution-Based Benchmark for Function Calling in Science},
  author       = {Zhong, Zhanping and Su, Xuerui and Zhang, Wei and Pei, Qizhi and Wang, Zun and He, Conghui and Wu, Lijun},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ChampionZhong/BFCS}},
  year         = {2026},
}
```