
# BFCS: A Large-Scale Execution-Based Benchmark for Function Calling in Science

This repository contains the official implementation of the paper: "BFCS: A Large-Scale Execution-Based Benchmark for Function Calling in Science".

*Figure: BFCS pipeline overview.*

## 🌟 Overview

BFCS is the first execution-based benchmark specifically designed to evaluate the function-calling capabilities of Large Language Models (LLMs) in scientific domains. Unlike static benchmarks, BFCS adopts an execution-first philosophy:

  • Real-World Scale: Includes 1,648 function-query-answer pairs across chemistry, biology, pharmacy, medicine, and materials science.
  • Standardized Environment: Integrated with 48 real scientific Python libraries (e.g., RDKit, Biopython) and 2,100 executable tools.
  • Rigorous Evaluation: Uses Apptainer for container-native isolation to ensure reproducibility and verify functional correctness (ESR) and semantic accuracy (AMR).

## 📊 Main Results

| Model | Simple ESR | Simple AMR | Simple Gap↓ | Multiple ESR | Multiple AMR | Multiple Gap↓ | Parallel ESR | Parallel AMR | Parallel Gap↓ | Overall ESR | Overall AMR | Overall Gap↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | | | | |
| Claude-Opus-4.5 | 98.94 | 69.74 | 29.21 | 99.39 | 65.97 | 33.42 | 93.85 | 75.89 | 17.96 | 97.39 | 70.53 | 26.86 |
| Claude-Sonnet-4.5 | 97.71 | 60.11 | 37.60 | 95.09 | 63.82 | 31.27 | 95.47 | 68.47 | 27.00 | 96.09 | 64.13 | 31.96 |
| Gemini-3-Pro | 97.15 | 69.69 | 27.47 | 93.81 | 63.03 | 30.78 | 92.28 | 73.30 | 18.99 | 94.42 | 68.67 | 25.74 |
| Gemini-3-Flash | 94.47 | 64.50 | 29.96 | 92.18 | 61.09 | 31.09 | 85.71 | 65.68 | 20.03 | 90.79 | 63.76 | 27.03 |
| GPT-5.2 | 92.80 | 65.23 | 27.56 | 93.13 | 64.44 | 28.69 | 92.41 | 61.12 | 31.28 | 92.78 | 63.60 | 29.18 |
| Doubao-Seed-1.8 | 95.61 | 50.19 | 45.42 | 98.18 | 36.00 | 62.18 | 93.21 | 43.73 | 49.48 | 95.67 | 43.31 | 52.36 |
| **Open-Weight Models** | | | | | | | | | | | | |
| DeepSeek-V3.2 | 89.12 | 54.77 | 34.35 | 91.27 | 54.91 | 36.36 | 98.26 | 4.53 | 93.73 | 92.88 | 38.07 | 54.81 |
| GLM-4.7 | 91.27 | 48.28 | 42.99 | 96.26 | 61.07 | 35.19 | 91.02 | 58.23 | 32.79 | 92.85 | 55.86 | 36.99 |
| Kimi-k2.5 | 91.98 | 44.85 | 47.14 | 89.64 | 29.09 | 60.55 | 84.84 | 40.42 | 44.43 | 88.82 | 38.12 | 50.70 |
| Mistral-Large-3 | 100.00 | 48.28 | 51.72 | 100.00 | 49.64 | 50.36 | 91.29 | 43.90 | 47.39 | 97.10 | 47.27 | 49.82 |
| Qwen3-235B | 97.62 | 59.21 | 38.42 | 99.69 | 60.58 | 39.12 | 94.58 | 57.69 | 36.89 | 97.30 | 59.16 | 38.14 |
| Qwen3-30B | 93.13 | 42.07 | 51.06 | 97.49 | 38.63 | 58.86 | 90.54 | 34.24 | 56.30 | 93.72 | 38.31 | 55.41 |

Note: ESR (Execution Success Rate) measures whether the generated code executes successfully; AMR (Answer Match Rate) measures whether the scientific result is correct. Gap = ESR − AMR; a positive gap indicates potential silent failures. All values are percentages.
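As a quick sanity check, the Gap column can be recomputed from the other two (a minimal sketch; the example values are taken from the Mistral-Large-3 row of the table above):

```python
def gap(esr: float, amr: float) -> float:
    """Gap = ESR - AMR, in percentage points; positive means potential silent failures."""
    return round(esr - amr, 2)

# Mistral-Large-3, Simple scenario (values from the results table)
print(gap(100.00, 48.28))  # 51.72, matching the reported Simple Gap
```

Small discrepancies in other rows (e.g. 29.20 vs. 29.21) come from the reported values being rounded independently of the underlying unrounded scores.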

## 🚀 Getting Started

### 1. Prerequisites

We use Apptainer (formerly Singularity) to manage complex scientific dependencies.

```bash
# Install Apptainer (refer to official docs for details)
sudo apt-get update && sudo apt-get install -y apptainer
```

### 2. Download Data & Containers

**Dataset.** The benchmark data is ready to use and located in the `./data` directory. It is stratified into three cognitive scenarios:

  • ./data/simple.jsonl: Atomic instruction synthesis without distractors.
  • ./data/multiple.jsonl: Tool selection among semantically similar distractors.
  • ./data/parallel.jsonl: Compositional batch processing requiring multiple independent calls.

BFCS relies on isolated Apptainer environments. The specific installation methods and build instructions for each package are detailed in `./containers/config.yaml`.

To save time, you do not need to build these from scratch. We have published the 9 base Apptainer images and 48 package-specific Apptainer images directly to the repository's GitHub Packages (GHCR). You can download or pull them directly:

```bash
# Clone the repository
git clone https://github.com/ChampionZhong/BFCS.git
cd BFCS

# Example: pull a specific package container directly from GitHub Packages
# Replace <package_name> with the target tool (e.g., rdkit, biopython)
apptainer pull docker://ghcr.io/championzhong/bfcs/bfcs-<package_name>:latest
```
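To fetch several containers in one go, the pull can be scripted. A small sketch, assuming the URI pattern shown above and that image names follow the package names from the taxonomy table (both assumptions; verify the exact tags on GHCR):

```python
import subprocess

GHCR_TEMPLATE = "docker://ghcr.io/championzhong/bfcs/bfcs-{name}:latest"

def image_uri(package_name: str) -> str:
    """Build the GHCR image URI for one package container (assumed naming pattern)."""
    return GHCR_TEMPLATE.format(name=package_name)

def pull_all(packages, dry_run=True):
    """Run `apptainer pull` for each package; dry_run only prints the commands."""
    for pkg in packages:
        cmd = ["apptainer", "pull", image_uri(pkg)]
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)

# Sample package names taken from the taxonomy table below
pull_all(["rdkit", "biopython", "pyscf"])
```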

### 3. Run Evaluation

```bash
python evaluation/run_eval.py --model_name your_model_name --scenario simple
```

## 📂 Dataset Taxonomy

*Figure: category distribution of the integrated packages.*

Overview of the repositories and their corresponding Python packages (importable names), covering the category assigned to each repository and the total count of available wrapper functions.

| # | Repo Name (Original Repo) | Package Name (Package Card) | Category | Tools | License |
|---|---|---|---|---|---|
| 1 | AiZynthFinder | aizynthfinder | Pharmacy | 3 | MIT |
| 2 | anndata | anndata | Biology | 8 | BSD-3-Clause |
| 3 | batchgenerators | batchgenerators | Medicine | 26 | Apache-2.0 |
| 4 | bioemu | bioemu | Biology | 153 | MIT |
| 5 | biopython | Bio | Biology | 4 | BSD-3-Clause |
| 6 | boltz | boltz | Pharmacy | 46 | MIT |
| 7 | CEBRA | cebra | Biology | 14 | Apache-2.0 |
| 8 | chai-lab | chai_lab | Biology | 12 | Apache-2.0 |
| 9 | chembl-downloader | chembl_downloader | Pharmacy | 2 | MIT |
| 10 | ChemInformant | ChemInformant | Chemistry | 5 | MIT |
| 11 | chemprop | chemprop | Pharmacy | 4 | MIT |
| 12 | chempy | chempy | Chemistry | 45 | BSD-2-Clause |
| 13 | CIRpy | cirpy | Chemistry | 6 | MIT |
| 14 | datamol | datamol | Chemistry | 13 | Apache-2.0 |
| 15 | deepchem | deepchem | Pharmacy | 128 | MIT |
| 16 | DeepPurpose | DeepPurpose | Pharmacy | 31 | BSD-3-Clause |
| 17 | descriptastorus | descriptastorus | Chemistry | 3 | BSD-3-Clause |
| 18 | drugbank_downloader | drugbank_downloader | Pharmacy | 1 | MIT |
| 19 | dscribe | dscribe | Material | 7 | Apache-2.0 |
| 20 | gpaw | gpaw | Material | 263 | GPLv3+ |
| 21 | guacamol | guacamol | Pharmacy | 5 | MIT |
| 22 | lungmask | lungmask | Medicine | 8 | Apache-2.0 |
| 23 | mace | mace | Material | 7 | MIT |
| 24 | MedCLIP | medclip | Medicine | 3 | Unknown |
| 25 | mendeleev | mendeleev | Chemistry | 23 | MIT |
| 26 | molmass | molmass | Chemistry | 14 | BSD-3-Clause |
| 27 | MONAI | monai | Medicine | 96 | Apache-2.0 |
| 28 | mordred | mordred | Chemistry | 1 | BSD-3-Clause |
| 29 | ncbi-genome-download | ncbi_genome_download | Biology | 21 | Apache-2.0 |
| 30 | NistChemPy | nistchempy | Chemistry | 1 | MIT |
| 31 | nnUNet | nnunetv2 | Medicine | 20 | Apache-2.0 |
| 32 | periodictable | periodictable | Chemistry | 21 | BSD-3-Clause |
| 33 | PubChemPy | pubchempy | Chemistry | 2 | MIT |
| 34 | pybel | pybel | Biology | 46 | MIT |
| 35 | pyEQL | pyEQL | Chemistry | 4 | LGPLv3 |
| 36 | pyRiemann | pyriemann | Medicine | 100 | BSD-3-Clause |
| 37 | pyscf | pyscf | Chemistry | 449 | Apache-2.0 |
| 38 | rdkit | rdkit | Chemistry | 110 | BSD-3-Clause |
| 39 | robert | robert | Chemistry | 38 | MIT |
| 40 | scanpy | scanpy | Biology | 21 | BSD-3-Clause |
| 41 | selfies | selfies | Chemistry | 13 | Apache-2.0 |
| 42 | spikeinterface | spikeinterface | Biology | 159 | MIT |
| 43 | stk | stk | Chemistry | 12 | MIT |
| 44 | tape | tape | Biology | 7 | BSD-3-Clause |
| 45 | TDC | tdc | Pharmacy | 118 | MIT |
| 46 | torchdrug | torchdrug | Pharmacy | 16 | Apache-2.0 |
| 47 | torchio | torchio | Medicine | 3 | Apache-2.0 |
| 48 | useful_rdkit_utils | useful_rdkit_utils | Chemistry | 8 | MIT |

## 📃 License & Acknowledgements

The source code of the wrappers and the build scripts in this repository are licensed under the Apache License 2.0.

However, the software packages installed within the containers retain their original licenses. Users are responsible for complying with the licenses of the underlying packages when using them:

**Attention (GPLv3+):** `gpaw` — please note that using gpaw may impose copyleft obligations.

## ✍️ Citation

If you find this work helpful, please cite it:

```bibtex
@misc{zhong2026bfcs,
  title        = {BFCS: A Large-Scale Execution-Based Benchmark for Function Calling in Science},
  author       = {Zhong, Zhanping and Su, Xuerui and Zhang, Wei and Pei, Qizhi and Wang, Zun and He, Conghui and Wu, Lijun},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ChampionZhong/BFCS}},
  year         = {2026},
}
```
