ToolMaster is a framework that shifts tool learning from static imitation to a trial-and-execution paradigm, enabling Large Language Models (LLMs) to actively master tools. It trains agents to first conduct autonomous tool trials during a "trial phase" to accumulate experiential knowledge. This experience informs the subsequent "execution phase," where the model performs planning and solving while explicitly employing self-correction to rectify errors based on environmental feedback. By leveraging Supervised Fine-Tuning (SFT) on teacher-synthesized trajectories—encompassing both tool trials and self-correction behaviors—followed by Reinforcement Learning (RL) to coordinate these phases, ToolMaster empowers agents to dynamically adapt to unfamiliar tools, significantly enhancing generalization and robustness.
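For intuition, the two phases can be pictured as the control loop below. This is a schematic sketch only; function names such as `propose_trial` and `self_correct` are hypothetical and do not correspond to the repository's actual API.

```python
# Schematic of the trial-and-execution paradigm. Illustrative only: the
# method names below are hypothetical, not ToolMaster's real interface.

def solve(task, tools, llm, env, max_steps: int = 10):
    # Trial phase: probe each tool to accumulate experiential knowledge.
    experience = []
    for tool in tools:
        trial = llm.propose_trial(task, tool)          # self-chosen probe call
        experience.append((trial, env.execute(trial)))

    # Execution phase: plan, act, and self-correct on environment feedback.
    plan = llm.plan(task, experience)
    for _ in range(max_steps):
        action = llm.next_action(task, plan, experience)
        feedback = env.execute(action)
        if feedback.is_error:                          # self-correction step
            action = llm.self_correct(action, feedback)
            feedback = env.execute(action)
        experience.append((action, feedback))
        if feedback.is_final:
            return feedback.answer
    return None
```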
## 1. Set up ToolMaster
First, clone the repository and install the dependencies.
```bash
git clone
cd ToolMaster
conda create --name toolmaster python=3.11
conda activate toolmaster
pip install -e ./
```

## 2. Set up API Server (StableToolBench)
Then, prepare the environment for running the tool server.
```bash
conda create --name api_server python=3.9
conda activate api_server
pip install -r requirements_server.txt
```

To initialize the StableToolBench tool server, you need to download the tool environment dataset, which is available on HuggingFace.
To use the server, you will also need a ToolBench key. You can apply for one via this form.
After downloading, extract and organize the data into the following directory structure:
```
├── data
│   └── stabletoolbench
│       └── ToolEnv2404
│           └── tools
```

## 3. Data Synthesis

Note: This step is optional. You can skip the manual synthesis by directly using our pre-processed SFT dataset from `data/for_sft`.
If you prefer to synthesize the data from scratch, follow the pipeline below:
### Step 1: Start the API Server

Launch the tool server to enable environment interaction. For efficiency, we employ MirrorAPI-Cache.
```bash
# [Terminal 1] Serve the MirrorAPI model
conda activate toolmaster
vllm serve 'path-to-MirrorAPI-Cache-model' --api-key EMPTY --port 12345 --served-model-name MirrorAPI-Cache --gpu-memory-utilization 0.8 --tensor-parallel-size 1

# [Terminal 2] Start the API wrapper
conda activate api_server
python ./src/api_server/main_mirrorapi_cache.py
```
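Before proceeding, you may want to confirm the vLLM endpoint is reachable. A minimal check, assuming the `vllm serve` command above is running on localhost with the settings shown:

```python
# Sanity-check the vLLM OpenAI-compatible endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:12345/v1", api_key="EMPTY")
print([m.id for m in client.models.list()])  # should include "MirrorAPI-Cache"
```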
### Step 2: Run Data Synthesis & Filtering

Use the teacher model to generate tool-use trajectories and filter them to ensure quality.
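As a rough illustration, a quality filter of this kind might keep only trajectories that actually exercise tools and end in a verified answer. The logic below is hypothetical; the actual criteria live in `scripts/conduct_data.sh` and the code it invokes, which you then run:

```python
# Hypothetical trajectory filter (illustrative only; not the repository's
# actual filtering rules). Assumes a simple dict-based trajectory record.

def keep_trajectory(traj: dict) -> bool:
    calls = traj.get("tool_calls", [])
    if not calls:                                   # must actually use tools
        return False
    if any(c.get("status") == "error" and not c.get("corrected") for c in calls):
        return False                                # drop uncorrected failures
    return traj.get("final_answer_correct", False)  # answer must pass verification
```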
```bash
bash scripts/conduct_data.sh
```

## 4. SFT Training

Note: If you do not want to train the model from scratch, you may skip this section and proceed directly to Evaluation.
First, download the backbone model Qwen2.5-7B-Instruct from HuggingFace.
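One way to fetch the backbone, assuming `huggingface_hub` is installed; the local path is an example, so check the training script for the path it actually expects:

```python
# Download Qwen2.5-7B-Instruct from the HuggingFace Hub.
# The local_dir below is an assumption, not a path the scripts require.
from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen2.5-7B-Instruct", local_dir="models/Qwen2.5-7B-Instruct")
```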
Then, fine-tune the model using the curated dataset to establish baseline tool-use capabilities:

```bash
bash scripts/train/train_sft.sh
```

## 5. RL Training

We use Group Relative Policy Optimization (GRPO) for the reinforcement learning stage, implemented on top of the verifiers framework.
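For intuition, GRPO scores each sampled rollout against the other rollouts in its group rather than against a learned value function. A schematic of the group-relative advantage (a sketch, not the training code in `scripts/train_rl.sh`):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: normalize each rollout's reward by the
    mean/std of its group, so no value network is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts sampled for the same prompt
print(group_relative_advantages(np.array([1.0, 0.0, 0.5, 1.0])))
```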
This process involves multiple interacting components, so we recommend opening separate terminals for each step.
### Step 1: Start the API Server
Launch the simulated tool environment using MirrorAPI-Cache. This provides a cost-effective and stable alternative to real-time API calls.
Note: This requires two sub-processes: the Model Server (vLLM) and the API Wrapper.
```bash
# [Terminal 1] Serve the MirrorAPI model
conda activate toolmaster
vllm serve 'path-to-MirrorAPI-Cache-model' \
    --api-key EMPTY \
    --port 12345 \
    --served-model-name MirrorAPI-Cache \
    --gpu-memory-utilization 0.8 \
    --tensor-parallel-size 1

# [Terminal 2] Start the API wrapper
conda activate api_server
python ./src/api_server/main_mirrorapi_cache.py
```

### Step 2: Start the Inference Server
Initialize the inference backend to generate rollouts (trajectories) during training.
```bash
# [Terminal 3]
conda activate toolmaster
bash scripts/inference_serve.sh
```

### Step 3: Start RL Training
Once all servers are running, initiate the GRPO training process.
```bash
# [Terminal 4]
conda activate toolmaster
bash scripts/train_rl.sh
```

## 6. Evaluation

You can directly download our pre-trained models from HuggingFace.
### ToolHop

Evaluate the model's generalization in multi-hop reasoning scenarios, i.e., queries that require chaining several tool calls, where the output of one call feeds the next.
```bash
conda activate toolmaster

# 1. Run inference to generate trajectories
bash scripts/inference_ToolHop.sh

# 2. Compute evaluation metrics
bash scripts/eval_ToolHop.sh
```

### TMDB

Evaluate the model's generalization in tool-rich environments (a movie database).
Prerequisite: You need a valid TMDB API Key to run this benchmark. You can apply for one here.
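A quick way to confirm your key works before running the benchmark; the `TMDB_API_KEY` environment variable name here is an assumption, so check `scripts/inference_TMDB.sh` for the name it actually reads:

```python
# Verify a TMDB API key against the public TMDB API.
# The env var name is an assumption, not one the scripts are known to use.
import os
import requests

key = os.environ.get("TMDB_API_KEY", "")
r = requests.get("https://api.themoviedb.org/3/movie/550", params={"api_key": key})
print(r.status_code)  # 200 means the key is valid; 401 means it is not
```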
```bash
conda activate toolmaster

# 1. Run inference
bash scripts/inference_TMDB.sh

# 2. Compute metrics
bash scripts/eval_TMDB.sh
```

### StableToolBench

For StableToolBench, you must launch a dedicated API server environment (simulating the GPT-4o-based environment) before running inference.
Prerequisite: To access the server, you need a valid ToolBench Key. You can apply for one using this form.
### Step 1: Launch the API Server
```bash
# [Terminal 1] Start the environment server
conda activate api_server
python ./src/api_server/main.py
```

### Step 2: Inference & Evaluation

Once the server is running, execute the evaluation scripts in a separate terminal.
```bash
# [Terminal 2]
conda activate toolmaster

# Run inference
bash scripts/inference_StableToolBench.sh

# Compute metrics
bash scripts/eval_StableToolBench.sh
```
## 7. Contact

If you have questions, suggestions, or bug reports, please email:
gaoxingjie@mails.neu.edu.cn
