MARSHAL: Incentivizing Multi-Agent Reasoning
via Self-Play with Strategic LLMs

🎉 Accepted by ICLR 2026

🌐 Project Page | 📝 Paper | 🤗 Models

📖 Overview

We introduce MARSHAL, an end-to-end reinforcement learning framework designed to incentivize Multi-Agent Reasoning through Self-play witH strAtegic LLMs in a diverse range of competitive and cooperative games.

MARSHAL addresses the challenge of credit assignment in multi-agent multi-turn self-play through two core mechanisms:

- Turn-level Advantage Estimator: Enables fine-grained credit assignment, allowing the model to accurately attribute long-term outcomes to individual actions and provide learning signals across multiple turns.
- Agent-specific Advantage Normalization: Stabilizes the training process by calibrating advantage estimates relative to the performance of each agent.
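
To make the two ideas concrete, here is a minimal sketch and not the estimator from the paper. It assumes that a turn's credit is its discounted return-to-go and that normalization means standardizing those values separately for each agent. The function names, the `gamma` parameter, and the reward layout are all hypothetical.

```python
from statistics import mean, pstdev


def turn_level_returns(rewards, gamma=1.0):
    """Discounted return-to-go for every turn of one agent's episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def agent_normalized_advantages(rewards_by_agent, gamma=1.0, eps=1e-8):
    """Standardize turn-level returns separately for each agent.

    rewards_by_agent maps an agent id to a list of episodes, where each
    episode is the list of per-turn rewards that agent received.
    (Assumed layout, chosen for illustration only.)
    """
    advantages = {}
    for agent, episodes in rewards_by_agent.items():
        returns = [turn_level_returns(ep, gamma) for ep in episodes]
        flat = [g for ep in returns for g in ep]
        mu, sigma = mean(flat), pstdev(flat)
        advantages[agent] = [
            [(g - mu) / (sigma + eps) for g in ep] for ep in returns
        ]
    return advantages


if __name__ == "__main__":
    # Two agents, two episodes each; only the final turn carries a reward.
    rewards = {
        "player_0": [[0.0, 0.0, 1.0], [0.0, -1.0]],
        "player_1": [[0.0, -1.0], [0.0, 0.0, 1.0]],
    }
    for agent, adv in agent_normalized_advantages(rewards, gamma=0.9).items():
        print(agent, adv)
```

Because statistics are computed per agent, an agent that tends to win (or lose) more often does not shift the baseline used for the other agent's turns.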

🔥 Key Results

By leveraging self-play across strategic games, MARSHAL (based on Qwen3-4B) demonstrates notable generalization capabilities:

- Strategic Games: Achieves up to 28.7% performance improvement on held-out games.
- Reasoning Benchmarks: When integrated into leading multi-agent systems (MASs), MARSHAL yields consistent gains:
  - up to +10.0% on AIME
  - up to +7.6% on GPQA-Diamond
  - +3.5% on average across all tested benchmarks

🎮 Featured Games

- Competitive, perfect-information: Tic-Tac-Toe, Connect Four.
- Competitive, imperfect-information: Kuhn Poker, Leduc Hold'em.
- Cooperative, imperfect-information: Mini Hanabi, Simple Hanabi.

🚀 Method

Figure 1: Overview of MARSHAL. Left: Generating player trajectories via self-play in strategic games. Middle: Naive advantage estimation (e.g., GRPO) often fails in multi-turn settings. Right: MARSHAL's advantage estimation ensures accurate credit assignment for multi-turn, multi-agent interactions.
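
The limitation described for the middle panel can be illustrated with a small sketch of trajectory-level, group-normalized advantages in the style of GRPO. This is an illustration under simplifying assumptions (one scalar outcome per trajectory, population standard deviation), not code from this repository; the function name and arguments are hypothetical.

```python
from statistics import mean, pstdev


def trajectory_level_advantages(final_rewards, turns_per_trajectory, eps=1e-8):
    """Group-normalize one scalar outcome per trajectory, then copy it to every turn."""
    mu, sigma = mean(final_rewards), pstdev(final_rewards)
    advantages = []
    for reward, n_turns in zip(final_rewards, turns_per_trajectory):
        a = (reward - mu) / (sigma + eps)
        # Every turn in the trajectory receives the same value, so a strong
        # move and a blunder in the same game are credited identically.
        advantages.append([a] * n_turns)
    return advantages


if __name__ == "__main__":
    # One win over three turns, one loss over two turns.
    print(trajectory_level_advantages([1.0, -1.0], [3, 2]))
```

With a single value per trajectory there is no signal that distinguishes turns within a game, which is the gap that turn-level credit assignment is meant to close.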

📊 Results

Figure 2: Performance Comparison. Evaluation of MARSHAL against baselines on strategic games and reasoning benchmarks. MARSHAL not only masters strategic games but also generalizes effectively to complex reasoning tasks within multi-agent frameworks like MAD and AutoGen.

🛠️ Installation

The MARSHAL project is built upon the ROLL framework.

1. Install the ROLL framework. Please follow the official guide to ensure environment and backend compatibility: ROLL Docs – Getting Started
2. Install OpenSpiel. MARSHAL uses OpenSpiel for game environments:
```
pip install open_spiel
```
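
A quick way to confirm the OpenSpiel installation is to play one random self-play episode. The sketch below uses OpenSpiel's Python module `pyspiel`; the helper name `random_selfplay_returns` is hypothetical and not part of this repository. It returns `None` when the module cannot be imported, so it can also serve as a simple availability check.

```python
import random


def random_selfplay_returns(game_name="tic_tac_toe", seed=0):
    """Play one episode with uniformly random moves; return per-player returns.

    Returns None if OpenSpiel's Python module is not importable.
    """
    try:
        import pyspiel
    except ImportError:
        return None

    rng = random.Random(seed)
    game = pyspiel.load_game(game_name)
    state = game.new_initial_state()
    # Tic-Tac-Toe has no chance nodes, so every node is a player decision.
    while not state.is_terminal():
        state.apply_action(rng.choice(state.legal_actions()))
    return state.returns()


if __name__ == "__main__":
    result = random_selfplay_returns()
    print("OpenSpiel not installed" if result is None else result)
```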

⚡ Training

Agentic RL Pipeline

Use the following scripts to reproduce our training results.

```
# ==============================
# 1. Self-play Training
# ==============================

# Generalist Agent (Multi-Game)
bash examples/multi_games/run_agentic_pipeline_multi_games_selfplay.sh

# Specialist Agent (e.g., Tic-Tac-Toe)
bash examples/tictactoe/run_agentic_pipeline_tictactoe_selfplay.sh

# ==============================
# 2. Training with Fixed Opponent
# ==============================
bash examples/tictactoe/run_agentic_pipeline_tictactoe_single.sh

# ==============================
# 3. Debugging / Rollout
# ==============================
# Rollout only (no gradient updates)
bash examples/tictactoe/run_agentic_rollout_tictactoe.sh
```

Monitoring

Track training progress using TensorBoard:

```
tensorboard --logdir=runs/tictactoe_selfplay/
```

🧪 Evaluation

1. Export Model Checkpoint

Convert the trained checkpoint for evaluation:

```
bash model_convert.sh
```

2. Strategic Ability Evaluation

Evaluate the agent's performance in game environments:

```
bash examples/model_game_eval/run_agentic_rollout_eval.sh
```

3. Generalization to Multi-Agent Systems

We evaluate MARSHAL on 7 math and QA benchmarks using MASLab.

MATH: GSM8K, MATH500, AQUA, AIME24, AMC23
QA: GPQA-Diamond, MMLU-STEM

🖊️ Citation

If you find our work helpful, please cite:

```
@misc{yuan2025marshal,
      title={MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs},
      author={Huining Yuan and Zelai Xu and Zheyue Tan and Xiangmin Yi and Mo Guang and Kaiwen Long and Haojia Hui and Boxun Li and Xinlei Chen and Bo Zhao and Xiao-Ping Zhang and Chao Yu and Yu Wang},
      year={2025},
      eprint={2510.15414},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.15414},
}
```
