
OpenManus-RL Development Guide

Project Overview

OpenManus-RL is a reinforcement learning framework designed for training large language models (LLMs) to perform agent tasks. The project combines two main repositories:

  1. AgentGym: Provides environments, rewards, and evaluation tools for agent tasks
  2. Verl: Handles RL training, rollout methods, and reward computation

The training process follows a pipeline architecture:

  1. Start AgentGym environment services
  2. Initialize reward manager and rollout worker group
  3. Generate trajectories via OpenManus agent
  4. Run PPO or GRPO training to update the LLM
  5. Save checkpoints and repeat from step 3
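
In pseudocode, one training iteration looks roughly like this (a minimal sketch with illustrative names, not the actual APIs):

# Illustrative pipeline sketch; the real orchestration lives in the
# training scripts and the Verl trainer.
env_server = start_agentgym_server(env_name)          # step 1
reward_manager, rollout_wg = init_workers(config)     # step 2
for step in range(total_steps):
    batch = agent.run_llm_loop(gen_batch)             # step 3: generate trajectories
    trainer.update_policy(batch)                      # step 4: PPO/GRPO update
    if step % save_interval == 0:
        trainer.save_checkpoint()                     # step 5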

Key Components

  • Data Representation: Uses parquet files (typically prepared with Hugging Face datasets tooling) for input and DataProto for internal data representation
  • Training Scripts: train_ppo.sh and train_grpo.sh orchestrate the entire training process
  • Base Agent: Implemented in openmanus_rl/llm_agent/openmanus.py, handles environment interaction
  • Reward Calculation: Managed in verl/utils/reward_score/agentgym.py, computes cumulative rewards from AgentGym

Core Components

Verl Framework

Verl is the underlying reinforcement learning framework; it handles the training loop, rollout mechanisms, and reward computation.

DataProto

DataProto is the core data structure used throughout the framework:

  • Encapsulates both tensor-based data (stored in .batch) and non-tensor metadata (stored in .meta_info)
  • Provides methods for batch manipulation (slicing, merging, etc.)
  • Handles device placement and data consistency

Example:

from verl import DataProto  # DataProto is exported from the verl package

# Tensor data goes into .batch ...
data = DataProto.from_dict({
    'input_ids': input_tensor,
    'attention_mask': mask_tensor,
    'position_ids': position_tensor
})
# ... while non-tensor metadata lives in .meta_info
data.meta_info['task_idx'] = task_indices
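
The batch-manipulation methods can be used, for example, to split a batch across workers and merge the results back. A hedged sketch (chunk and concat follow verl's DataProto API; check verl/protocol.py for the exact signatures):

# Split into equal shards, e.g. one per rollout worker ...
shards = data.chunk(chunks=4)
# ... and merge per-worker results back into a single batch
merged = DataProto.concat(shards)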

Ray Trainer

The Ray-based trainer (verl/trainer/ppo/ray_trainer.py) implements distributed PPO training:

  • RayPPOTrainer: Manages the entire training process, including:
    • Environment initialization
    • Worker group coordination
    • Advantage computation
    • Policy updates
    • Validation

Key methods:

  • init_workers(): Initializes different worker roles
  • fit(): Main training loop
  • _validate(): Runs validation on the current policy
  • _save_checkpoint(): Saves model checkpoints
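
At a high level, the trainer is driven by two calls (a minimal sketch; constructor arguments are elided and depend on configuration):

# Sketch of driving the trainer; see verl/trainer/ppo/ray_trainer.py
# for the full constructor signature.
trainer = RayPPOTrainer(
    config=config,
    # ... other constructor args (tokenizer, worker mappings) elided
)
trainer.init_workers()  # spin up the worker groups
trainer.fit()           # main loop: rollout -> advantage -> update -> validate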

Rollout Worker Group

Rollout workers generate trajectories from the current policy:

  • Implemented as a Ray-based worker group that can be distributed across multiple nodes
  • Handles generation, log probability computation, and policy updates
  • Uses vLLM for efficient inference
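
Conceptually, the trainer hands a DataProto of prompts to the rollout worker group and gets generated sequences back (a one-line sketch; generate_sequences is the worker-group entry point used by verl's PPO trainer):

# DataProto of prompts in, DataProto of generated sequences out
gen_output = actor_rollout_wg.generate_sequences(gen_batch)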

Reward Computation

Reward computation is handled by dedicated modules:

  • verl/utils/reward_score/agentgym.py: Specific to AgentGym environments
  • Various reward modules support different types of rewards (EM scores, BLEU, etc.)
  • apply_kl_penalty(): Adds KL divergence penalties to raw rewards
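
Schematically, the KL penalty subtracts a per-token divergence term from the raw reward (a sketch of the idea, not the exact implementation):

def apply_kl_penalty_sketch(raw_rewards, old_log_probs, ref_log_probs, kl_coef=0.05):
    # Approximate per-token KL as the log-prob gap to the reference policy
    kl = old_log_probs - ref_log_probs
    # Penalize the environment reward by kl_coef times that gap
    return raw_rewards - kl_coef * kl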

OpenManus Agent

The OpenManus agent (openmanus_rl/llm_agent/openmanus.py) serves as the interface between the RL framework and environment.

Key Classes

  • AgentConfig: Configuration for the agent
  • OpenManusAgent: Main agent class that handles environment interaction

Critical Methods

  1. run_llm_loop

    def run_llm_loop(self, gen_batch: DataProto, output_dir: str = None, global_steps: int = 0) -> DataProto:

    This method orchestrates the interaction loop for a batch of environments (a sketch of the full rollout path follows this list):

    • Takes initial prompts as input
    • Runs parallel rollouts using thread pool
    • Collects trajectories and rewards
    • Formats results into DataProto for training
    • Handles visualization if enabled
  2. _run_single_rollout

    def _run_single_rollout(self, initial_prompt_ids: torch.Tensor, task_idx: int) -> Dict[str, Any]:

    Executes a single environment interaction:

    • Resets environment with task index
    • Runs the interaction loop for multiple turns
    • Generates responses using the LLM
    • Processes responses and executes actions
    • Collects rewards and observations
  3. _convert_rollout_results_to_dataproto

    def _convert_rollout_results_to_dataproto(self, results: List[Dict], original_batch: DataProto) -> DataProto:

    Converts rollout results to trainable format:

    • Aligns rewards with token sequences
    • Creates token-level reward tensors
    • Concatenates and pads conversation segments
    • Preserves metadata from original batch
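
Together these methods form the rollout path (a simplified sketch of the data flow; the thread pool, visualization, and error handling are omitted):

# Simplified data flow inside run_llm_loop
results = []
for i, prompt_ids in enumerate(initial_prompt_batch):  # parallelized in practice
    results.append(agent._run_single_rollout(prompt_ids, task_idx=i))
# Align rewards with tokens and pack everything for the trainer
train_batch = agent._convert_rollout_results_to_dataproto(results, gen_batch)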

Training Scripts

The training scripts (train_ppo.sh and train_grpo.sh) orchestrate the entire process:

  1. Initialize the environment:

    • Parse command-line arguments
    • Create a dedicated conda environment for the selected AgentGym environment
    • Start the environment server
  2. Set up training:

    • Configure data paths and experiment names
    • Initialize logging
    • Set hyperparameters
  3. Run training:

    • Launch the Verl trainer with the appropriate algorithm (PPO or GRPO)
    • Monitor training progress
    • Save checkpoints

Development Guide

Adding New Reward Methods

To add new reward methods (e.g., process reward, outcome reward):

  1. Create a new reward module:

    # Create a new file in the reward_score directory
    touch verl/utils/reward_score/my_reward.py
  2. Implement the reward function:

    # my_reward.py
    def compute_score(solution_str, ground_truth, **kwargs):
        # Your reward computation logic; return a scalar score, e.g. exact match
        return float(solution_str.strip() == ground_truth.strip())
  3. Register the reward in __init__.py:

    # Add to verl/utils/reward_score/__init__.py
    from .my_reward import compute_score as my_reward_compute_score
    
    SUPPORTED_REWARD_SCORE_FNS = {
        # ... existing rewards
        'my_reward': my_reward_compute_score,
    }
  4. Modify agent to collect appropriate information:

    • Update OpenManusAgent._run_single_rollout to collect required information
    • Modify _convert_rollout_results_to_dataproto to format rewards properly
  5. Use the new reward in training script:

    # In train_ppo.sh or train_grpo.sh, add:
    algorithm.reward_score_fn=my_reward

For process rewards specifically:

  • Modify _run_single_rollout to track intermediate steps
  • Update reward computation to consider the process (steps taken) rather than just the outcome
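
As an illustration, a process-style reward could average per-step scores over the trajectory instead of scoring only the final answer (a hypothetical example; score_step and the trajectory_steps argument are assumptions, and the exact compute_score contract should match the existing modules in verl/utils/reward_score/):

def score_step(step, ground_truth):
    # Toy heuristic (hypothetical): reward steps that mention ground-truth tokens
    gt_tokens = set(ground_truth.lower().split())
    step_tokens = set(step.lower().split())
    return len(gt_tokens & step_tokens) / max(len(gt_tokens), 1)

def compute_score(solution_str, ground_truth, trajectory_steps=None, **kwargs):
    if not trajectory_steps:  # fall back to outcome-only scoring
        return float(solution_str.strip() == ground_truth.strip())
    return sum(score_step(s, ground_truth) for s in trajectory_steps) / len(trajectory_steps)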

Adding New Environments

To integrate a new environment from AgentGym:

  1. Prepare the environment package:

    • Create a dedicated directory in openmanus_rl/agentgym/agentenv-<env_name>/
    • Include environment.yml for conda environment specs
    • Add setup.sh for any additional setup steps
  2. Update training scripts:

    • Add the new environment to the case statement in train_ppo.sh and train_grpo.sh:
    new_env)
        LAUNCH_CMD="new_env --host $AGENTGYM_HOST --port \$AGENTGYM_PORT"
        DEFAULT_PORT=XXXX
        ;;
  3. Update OpenManus agent:

    • Add the new environment to ENV_TO_TASK_CLASS in _init_env_client
    ENV_TO_TASK_CLASS = {
        # ... existing environments
        "new_env": "NewEnvTask",
    }
  4. Prepare training data (see the parquet sketch after this list):

    • Create parquet files for training/validation in data/<env_name>/
    • Define appropriate reward models in the data
  5. Test the integration:

    ./train_ppo.sh --env_name new_env
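
For step 4, a minimal way to produce the parquet files is with pandas (a hedged sketch; the column schema here is hypothetical and must match what the repository's dataset loader expects):

import pandas as pd

# Hypothetical schema: adapt the columns to the loader's expectations.
rows = [
    {"prompt": "Task description for episode 0", "task_idx": 0},
    {"prompt": "Task description for episode 1", "task_idx": 1},
]
pd.DataFrame(rows).to_parquet("data/new_env/train.parquet", index=False)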

Extending Rollout Methods

To add new rollout or action template methods:

  1. Modify the OpenManus agent:

    • Add new parsing logic in postprocess_predictions (a fuller standalone sketch follows this list):
    def postprocess_predictions(self, predictions: List[Any]) -> Tuple[List[str], List[str]]:
        # Add new template patterns
        new_pattern = r'<new_action>(.*?)</new_action>'
        # ... process accordingly
  2. Add new action execution logic:

    • Update _run_single_rollout to handle new action types
    • Modify the action execution logic to process new templates
  3. Update the prompt template:

    • Modify create_react_prompt to include instructions for the new action templates
    def create_react_prompt(task_description, tool_manager):
        # Add instructions for new action templates
  4. Configure the agent:

    • Update AgentConfig if new parameters are needed
    • Modify training scripts to pass appropriate configurations
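
The parsing logic from step 1 might look like the following standalone sketch (the <new_action> tag is the hypothetical template from above; the real method also handles the existing action types):

import re
from typing import List, Tuple

def parse_new_actions(predictions: List[str]) -> Tuple[List[str], List[str]]:
    actions, contents = [], []
    pattern = r'<new_action>(.*?)</new_action>'
    for pred in predictions:
        match = re.search(pattern, pred, re.DOTALL)
        if match:
            actions.append('new_action')
            contents.append(match.group(1).strip())
        else:  # no recognized template in this prediction
            actions.append('unknown')
            contents.append('')
    return actions, contents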

Advanced Modifications

For more advanced modifications, such as changing the training algorithm or reward structure:

  1. Modifying the PPO algorithm:

    • Update verl/trainer/ppo/core_algos.py for algorithm changes
    • Modify the advantage calculation in compute_advantage (see the GAE sketch after this list)
  2. Changing the rollout worker:

    • Create a new worker class in verl/single_controller/ray/
    • Register the worker in the appropriate factory methods
  3. Custom data processing:

    • Modify _convert_rollout_results_to_dataproto for custom data formats
    • Update DataProto methods if needed
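
For reference, the standard GAE recurrence that compute_advantage variants typically implement (a schematic sketch, not the exact verl code):

# Schematic GAE: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
#                A_t     = delta_t + gamma * lam * A_{t+1}
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    advantages, last_adv, next_value = [], 0.0, 0.0
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v
        last_adv = delta + gamma * lam * last_adv
        advantages.append(last_adv)
        next_value = v
    return list(reversed(advantages))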

Conclusion

OpenManus-RL provides a flexible framework for reinforcement learning with LLMs in agent environments. By understanding the core components and following this development guide, you can extend the framework to support new environments, reward structures, and action templates.

For more detailed information on AgentGym integration, refer to the documentation in openmanus_rl/agentgym/2nd_dev_docs.