This repository contains the official implementation of the paper: "BFCS: A Large-Scale Execution-Based Benchmark for Function Calling in Science".
BFCS is the first execution-based benchmark specifically designed to evaluate the function-calling capabilities of Large Language Models (LLMs) in scientific domains. Unlike static benchmarks, BFCS adopts an execution-first philosophy:
- Real-World Scale: Includes 1,648 function-query-answer pairs across chemistry, biology, pharmacy, medicine, and materials science.
- Standardized Environment: Integrated with 48 real scientific Python libraries (e.g., RDKit, Biopython) and 2,100 executable tools.
- Rigorous Evaluation: Uses Apptainer for container-native isolation to ensure reproducibility and verify functional correctness (ESR) and semantic accuracy (AMR).
| Model | Simple ESR | Simple AMR | Simple Gap↓ | Multiple ESR | Multiple AMR | Multiple Gap↓ | Parallel ESR | Parallel AMR | Parallel Gap↓ | Overall ESR | Overall AMR | Overall Gap↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | | | | |
| Claude-Opus-4.5 | 98.94 | 69.74 | 29.21 | 99.39 | 65.97 | 33.42 | 93.85 | 75.89 | 17.96 | 97.39 | 70.53 | 26.86 |
| Claude-Sonnet-4.5 | 97.71 | 60.11 | 37.60 | 95.09 | 63.82 | 31.27 | 95.47 | 68.47 | 27.00 | 96.09 | 64.13 | 31.96 |
| Gemini-3-Pro | 97.15 | 69.69 | 27.47 | 93.81 | 63.03 | 30.78 | 92.28 | 73.30 | 18.99 | 94.42 | 68.67 | 25.74 |
| Gemini-3-Flash | 94.47 | 64.50 | 29.96 | 92.18 | 61.09 | 31.09 | 85.71 | 65.68 | 20.03 | 90.79 | 63.76 | 27.03 |
| GPT-5.2 | 92.80 | 65.23 | 27.56 | 93.13 | 64.44 | 28.69 | 92.41 | 61.12 | 31.28 | 92.78 | 63.60 | 29.18 |
| Doubao-Seed-1.8 | 95.61 | 50.19 | 45.42 | 98.18 | 36.00 | 62.18 | 93.21 | 43.73 | 49.48 | 95.67 | 43.31 | 52.36 |
| **Open-Weight Models** | | | | | | | | | | | | |
| DeepSeek-V3.2 | 89.12 | 54.77 | 34.35 | 91.27 | 54.91 | 36.36 | 98.26 | 4.53 | 93.73 | 92.88 | 38.07 | 54.81 |
| GLM-4.7 | 91.27 | 48.28 | 42.99 | 96.26 | 61.07 | 35.19 | 91.02 | 58.23 | 32.79 | 92.85 | 55.86 | 36.99 |
| Kimi-k2.5 | 91.98 | 44.85 | 47.14 | 89.64 | 29.09 | 60.55 | 84.84 | 40.42 | 44.43 | 88.82 | 38.12 | 50.70 |
| Mistral-Large-3 | 100.00 | 48.28 | 51.72 | 100.00 | 49.64 | 50.36 | 91.29 | 43.90 | 47.39 | 97.10 | 47.27 | 49.82 |
| Qwen3-235B | 97.62 | 59.21 | 38.42 | 99.69 | 60.58 | 39.12 | 94.58 | 57.69 | 36.89 | 97.30 | 59.16 | 38.14 |
| Qwen3-30B | 93.13 | 42.07 | 51.06 | 97.49 | 38.63 | 58.86 | 90.54 | 34.24 | 56.30 | 93.72 | 38.31 | 55.41 |
Note: ESR (Execution Success Rate) measures whether the code runs; AMR (Answer Match Rate) measures whether the scientific logic is correct. Gap = ESR − AMR, where a large positive gap indicates potential silent failures. All values are percentages.
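As a quick sanity check on the Gap metric, the definition above can be applied directly to a row of the table (values copied from DeepSeek-V3.2's Parallel columns):

```python
# Gap = ESR - AMR: the share of runs that execute successfully
# but produce scientifically wrong answers ("silent failures").
def gap(esr: float, amr: float) -> float:
    return round(esr - amr, 2)

# DeepSeek-V3.2, Parallel scenario (values from the table above)
print(gap(98.26, 4.53))  # -> 93.73: nearly every run executes, very few are correct
```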
We use Apptainer (formerly Singularity) to manage complex scientific dependencies.
```shell
# Install Apptainer (refer to the official docs for details)
sudo apt-get update && sudo apt-get install -y apptainer
```

## Dataset

The benchmark data is ready to use and located directly in the `./data` directory. It is stratified into three cognitive scenarios:

- `./data/simple.jsonl`: atomic instruction synthesis without distractors.
- `./data/multiple.jsonl`: tool selection among semantically similar distractors.
- `./data/parallel.jsonl`: compositional batch processing requiring multiple independent calls.
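Each scenario file is standard JSON Lines (one record per line). A minimal loading sketch; the field names in the usage comment are illustrative assumptions, not guaranteed by the repo:

```python
import json

def load_scenario(path: str) -> list[dict]:
    """Read one JSON record per line from a JSONL scenario file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Usage (field names shown for illustration only):
# records = load_scenario("./data/simple.jsonl")
# print(len(records), records[0].keys())
```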
BFCS relies on isolated Apptainer environments. The specific installation methods and build instructions for each package are detailed in ./containers/config.yaml.
To save time, you do not need to build these from scratch. We have published the 9 base Apptainer images and 48 package-specific Apptainer images directly to the repository's GitHub Packages (GHCR). You can download or pull them directly:
```shell
# Clone the repository
git clone https://github.com/ChampionZhong/BFCS.git
cd BFCS

# Example: pull a specific package container directly from GitHub Packages
# Replace <package_name> with the target tool (e.g., rdkit, biopython)
apptainer pull docker://ghcr.io/championzhong/bfcs/bfcs-<package_name>:latest
```

To run the evaluation on a given scenario:

```shell
python evaluation/run_eval.py --model_name your_model_name --scenario simple
```

The table below gives an overview of the repositories and the corresponding packages that can be imported in a Python environment, covering the category assigned to each repository and the total count of available wrapper functions.
| # | Repo Name (Original Repo) | Package Name (Package Card) | Category | Tools | LICENSE |
|---|---|---|---|---|---|
| 1 | AiZynthFinder | aizynthfinder | Pharmacy | 3 | MIT |
| 2 | anndata | anndata | Biology | 8 | BSD-3-Clause |
| 3 | batchgenerators | batchgenerators | Medicine | 26 | Apache-2.0 |
| 4 | bioemu | bioemu | Biology | 153 | MIT |
| 5 | biopython | Bio | Biology | 4 | BSD-3-Clause |
| 6 | boltz | boltz | Pharmacy | 46 | MIT |
| 7 | CEBRA | cebra | Biology | 14 | Apache-2.0 |
| 8 | chai-lab | chai_lab | Biology | 12 | Apache-2.0 |
| 9 | chembl-downloader | chembl_downloader | Pharmacy | 2 | MIT |
| 10 | ChemInformant | ChemInformant | Chemistry | 5 | MIT |
| 11 | chemprop | chemprop | Pharmacy | 4 | MIT |
| 12 | chempy | chempy | Chemistry | 45 | BSD-2-Clause |
| 13 | CIRpy | cirpy | Chemistry | 6 | MIT |
| 14 | datamol | datamol | Chemistry | 13 | Apache-2.0 |
| 15 | deepchem | deepchem | Pharmacy | 128 | MIT |
| 16 | DeepPurpose | DeepPurpose | Pharmacy | 31 | BSD-3-Clause |
| 17 | descriptastorus | descriptastorus | Chemistry | 3 | BSD-3-Clause |
| 18 | drugbank_downloader | drugbank_downloader | Pharmacy | 1 | MIT |
| 19 | dscribe | dscribe | Material | 7 | Apache-2.0 |
| 20 | gpaw | gpaw | Material | 263 | GPLv3+ |
| 21 | guacamol | guacamol | Pharmacy | 5 | MIT |
| 22 | lungmask | lungmask | Medicine | 8 | Apache-2.0 |
| 23 | mace | mace | Material | 7 | MIT |
| 24 | MedCLIP | medclip | Medicine | 3 | Unknown |
| 25 | mendeleev | mendeleev | Chemistry | 23 | MIT |
| 26 | molmass | molmass | Chemistry | 14 | BSD-3-Clause |
| 27 | MONAI | monai | Medicine | 96 | Apache-2.0 |
| 28 | mordred | mordred | Chemistry | 1 | BSD-3-Clause |
| 29 | ncbi-genome-download | ncbi_genome_download | Biology | 21 | Apache-2.0 |
| 30 | NistChemPy | nistchempy | Chemistry | 1 | MIT |
| 31 | nnUNet | nnunetv2 | Medicine | 20 | Apache-2.0 |
| 32 | periodictable | periodictable | Chemistry | 21 | BSD-3-Clause |
| 33 | PubChemPy | pubchempy | Chemistry | 2 | MIT |
| 34 | pybel | pybel | Biology | 46 | MIT |
| 35 | pyEQL | pyEQL | Chemistry | 4 | LGPLv3 |
| 36 | pyRiemann | pyriemann | Medicine | 100 | BSD-3-Clause |
| 37 | pyscf | pyscf | Chemistry | 449 | Apache-2.0 |
| 38 | rdkit | rdkit | Chemistry | 110 | BSD-3-Clause |
| 39 | robert | robert | Chemistry | 38 | MIT |
| 40 | scanpy | scanpy | Biology | 21 | BSD-3-Clause |
| 41 | selfies | selfies | Chemistry | 13 | Apache-2.0 |
| 42 | spikeinterface | spikeinterface | Biology | 159 | MIT |
| 43 | stk | stk | Chemistry | 12 | MIT |
| 44 | tape | tape | Biology | 7 | BSD-3-Clause |
| 45 | TDC | tdc | Pharmacy | 118 | MIT |
| 46 | torchdrug | torchdrug | Pharmacy | 16 | Apache-2.0 |
| 47 | torchio | torchio | Medicine | 3 | Apache-2.0 |
| 48 | useful_rdkit_utils | useful_rdkit_utils | Chemistry | 8 | MIT |
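With 48 package images published to GHCR, pulling them one at a time is tedious. A dry-run shell sketch that prints the pull command for a few example packages from the table above (remove the `echo` to actually pull):

```shell
# Dry run: print the pull command for a few example packages.
# Remove 'echo' to execute the pulls for real.
for pkg in rdkit biopython pubchempy; do
  echo apptainer pull "docker://ghcr.io/championzhong/bfcs/bfcs-${pkg}:latest"
done
```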
The source code of the wrappers and the build scripts in this repository are licensed under the Apache License 2.0.
However, the software packages installed within the containers retain their original licenses. Users are responsible for complying with the licenses of the underlying packages when using them:
**Attention (GPLv3+):** `gpaw` (please note that using gpaw may impose copyleft obligations).
If you find this work helpful, please cite it:
```bibtex
@misc{zhong2026bfcs,
  title        = {BFCS: A Large-Scale Execution-Based Benchmark for Function Calling in Science},
  author       = {Zhong, Zhanping and Su, Xuerui and Zhang, Wei and Pei, Qizhi and Wang, Zun and He, Conghui and Wu, Lijun},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ChampionZhong/BFCS}},
  year         = {2026},
}
```