OpenManus-RL is a reinforcement learning framework designed for training large language models (LLMs) to perform agent tasks. The project combines two main repositories:
- AgentGym: Provides environments, rewards, and evaluation tools for agent tasks
- Verl: Handles RL training, rollout methods, and reward computation
The training process follows a pipeline architecture:
- Start AgentGym environment services
- Initialize reward manager and rollout worker group
- Generate trajectories via OpenManus agent
- Run PPO or GRPO training to update the LLM
- Save checkpoints and repeat from step 3
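The steps above can be sketched as a simple loop. This is a hypothetical illustration only: every function name below is a placeholder, not the real OpenManus-RL API (the actual orchestration lives in the training scripts and the Ray-based trainer).

```python
# Hypothetical sketch of the OpenManus-RL training pipeline loop.
# All names here are illustrative placeholders, not the real API.

def start_agentgym_services():
    return {"host": "127.0.0.1", "port": 8000}      # stand-in for a running env server

def init_training_state():
    return {"kl_coef": 0.01}, ["rollout_worker_0"]  # reward manager cfg + worker group

def generate_trajectories(workers, env_client):
    # In the real framework, the OpenManus agent drives the environment here.
    return [{"reward": 1.0} for _ in workers]

def update_policy(workers, reward_manager, trajectories):
    # Stand-in for a PPO/GRPO update on the collected trajectories.
    return sum(t["reward"] for t in trajectories)

def train(num_iterations, checkpoint_every=1):
    env_client = start_agentgym_services()           # 1. start environment services
    reward_manager, workers = init_training_state()  # 2. reward manager + rollout workers
    checkpoints = []
    for step in range(1, num_iterations + 1):
        trajs = generate_trajectories(workers, env_client)  # 3. roll out trajectories
        update_policy(workers, reward_manager, trajs)       # 4. PPO/GRPO update
        if step % checkpoint_every == 0:
            checkpoints.append(f"checkpoint-{step}")        # 5. save, then repeat from 3
    return checkpoints
```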
- Data Representation: Uses Hugging Face parquet files for input and `DataProto` for internal data representation
- Training Scripts: `train_ppo.sh` and `train_grpo.sh` orchestrate the entire training process
- Base Agent: Implemented in `openmanus_rl/llm_agent/openmanus.py`; handles environment interaction
- Reward Calculation: Managed in `verl/utils/reward_score/agentgym.py`; computes cumulative rewards from AgentGym
Verl is the underlying reinforcement learning framework that handles the RL training loop, rollout mechanisms, and reward computation.
`DataProto` is the core data structure used throughout the framework:
- Encapsulates both tensor-based data (stored in `.batch`) and non-tensor metadata (stored in `.meta_info`)
- Provides methods for batch manipulation (slicing, merging, etc.)
- Handles device placement and data consistency
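The real `DataProto` stores torch tensors; as a rough behavioral illustration of the slicing, merging, and `meta_info` semantics described above, here is a pure-Python stand-in (`MiniDataProto` is invented for this sketch and is not part of the framework):

```python
# Pure-Python stand-in for DataProto's slicing/merging/meta_info semantics.
# Illustrative only: the real DataProto holds torch tensors in .batch.

class MiniDataProto:
    def __init__(self, batch, meta_info=None):
        self.batch = batch                # dict of equal-length lists ("tensors")
        self.meta_info = meta_info or {}  # non-tensor metadata

    @classmethod
    def from_dict(cls, batch):
        return cls(batch)

    def __getitem__(self, idx):
        # Slice every field consistently, keeping metadata.
        return MiniDataProto({k: v[idx] for k, v in self.batch.items()},
                             dict(self.meta_info))

    @staticmethod
    def concat(parts):
        # Merge batches field-wise; metadata taken from the first part.
        keys = parts[0].batch.keys()
        return MiniDataProto({k: sum((p.batch[k] for p in parts), []) for k in keys},
                             dict(parts[0].meta_info))

proto = MiniDataProto.from_dict({'input_ids': [[1, 2], [3, 4]],
                                 'attention_mask': [[1, 1], [1, 1]]})
proto.meta_info['task_idx'] = [0, 1]
half = proto[0:1]                                    # slicing keeps fields aligned
merged = MiniDataProto.concat([half, proto[1:2]])    # merging restores the batch
```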
Example:

```python
data = DataProto.from_dict({
    'input_ids': input_tensor,
    'attention_mask': mask_tensor,
    'position_ids': position_tensor
})
data.meta_info['task_idx'] = task_indices
```

The Ray-based trainer (`verl/trainer/ppo/ray_trainer.py`) implements distributed PPO training:
`RayPPOTrainer`: Manages the entire training process, including:
- Environment initialization
- Worker group coordination
- Advantage computation
- Policy updates
- Validation
Key methods:
- `init_workers()`: Initializes the different worker roles
- `fit()`: Main training loop
- `_validate()`: Runs validation on the current policy
- `_save_checkpoint()`: Saves model checkpoints
Rollout workers generate trajectories from the current policy:
- Implemented as a Ray-based worker group that can be distributed across multiple nodes
- Handles generation, log probability computation, and policy updates
- Uses vLLM for efficient inference
Reward computation is handled by dedicated modules:
- `verl/utils/reward_score/agentgym.py`: Specific to AgentGym environments
- Various reward modules support different reward types (EM scores, BLEU, etc.)
- `apply_kl_penalty()`: Adds KL divergence penalties to raw rewards
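The effect of a KL penalty on rewards can be illustrated with a minimal scalar sketch. This is not verl's implementation, which operates on token-level tensors and uses the framework's own KL estimator; the signature below is invented for illustration:

```python
# Illustrative, scalar version of a KL penalty on rewards.
# The real apply_kl_penalty() in verl works on token-level tensors.

def kl_estimate(logprob_policy, logprob_ref):
    # Simple per-sample KL estimate between policy and reference model.
    return logprob_policy - logprob_ref

def apply_kl_penalty(raw_rewards, policy_logprobs, ref_logprobs, kl_coef=0.1):
    """Subtract kl_coef * KL from each raw reward."""
    return [r - kl_coef * kl_estimate(lp, lr)
            for r, lp, lr in zip(raw_rewards, policy_logprobs, ref_logprobs)]

penalized = apply_kl_penalty([1.0, 0.5], [-0.5, -1.0], [-1.0, -1.0], kl_coef=0.1)
```

A larger `kl_coef` pulls the policy more strongly back toward the reference model; deviating from it costs reward.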
The OpenManus agent (`openmanus_rl/llm_agent/openmanus.py`) serves as the interface between the RL framework and the environment.
- `AgentConfig`: Configuration for the agent
- `OpenManusAgent`: Main agent class that handles environment interaction

Key methods:

`run_llm_loop`:

```python
def run_llm_loop(self, gen_batch: DataProto, output_dir: str = None, global_steps: int = 0) -> DataProto:
```
This method orchestrates the interaction loop for a batch of environments:
- Takes initial prompts as input
- Runs parallel rollouts using thread pool
- Collects trajectories and rewards
- Formats results into DataProto for training
- Handles visualization if enabled
`_run_single_rollout`:

```python
def _run_single_rollout(self, initial_prompt_ids: torch.Tensor, task_idx: int) -> Dict[str, Any]:
```
Executes a single environment interaction:
- Resets environment with task index
- Runs the interaction loop for multiple turns
- Generates responses using the LLM
- Processes responses and executes actions
- Collects rewards and observations
`_convert_rollout_results_to_dataproto`:

```python
def _convert_rollout_results_to_dataproto(self, results: List[Dict], original_batch: DataProto) -> DataProto:
```
Converts rollout results to trainable format:
- Aligns rewards with token sequences
- Creates token-level reward tensors
- Concatenates and pads conversation segments
- Preserves metadata from original batch
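The reward-alignment step can be sketched as follows. This is a list-based stand-in, not the real tensor code: it illustrates the common pattern of placing each rollout's scalar reward on the final response token and padding sequences to a common length (the function name and signature here are invented for the sketch):

```python
# Illustrative reward alignment: place each rollout's scalar reward on the
# last token of its response, then pad all sequences to a common length.
# The real code builds torch tensors inside _convert_rollout_results_to_dataproto.

def align_rewards(responses, rewards, pad_token_id=0):
    max_len = max(len(r) for r in responses)
    padded_tokens, token_rewards = [], []
    for tokens, reward in zip(responses, rewards):
        pad = max_len - len(tokens)
        padded_tokens.append(tokens + [pad_token_id] * pad)
        # Zero everywhere except the final real token, which carries the reward.
        row = [0.0] * max_len
        row[len(tokens) - 1] = reward
        token_rewards.append(row)
    return padded_tokens, token_rewards

tokens, token_rewards = align_rewards([[5, 6, 7], [8, 9]], [1.0, -0.5])
```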
The training scripts (`train_ppo.sh` and `train_grpo.sh`) orchestrate the entire process:

1. Initialize the environment:
   - Parse command line arguments
   - Create a dedicated conda environment for the specific AgentGym environment
   - Start the environment server

2. Set up training:
   - Configure data paths and experiment names
   - Initialize logging
   - Set hyperparameters

3. Run training:
   - Launch the Verl trainer with the appropriate algorithm (PPO or GRPO)
   - Monitor training progress
   - Save checkpoints
To add new reward methods (e.g., process reward, outcome reward):
1. Create a new reward module:

   ```bash
   # Create a new file in the reward_score directory
   touch /home/kunlunz2/github_repos/OpenManus-RL/verl/utils/reward_score/my_reward.py
   ```

2. Implement the reward function:

   ```python
   # my_reward.py
   def compute_score(solution_str, ground_truth, **kwargs):
       # Your reward computation logic
       return reward_tensor
   ```

3. Register the reward in `__init__.py`:

   ```python
   # Add to verl/utils/reward_score/__init__.py
   from .my_reward import compute_score as my_reward_compute_score

   SUPPORTED_REWARD_SCORE_FNS = {
       # ... existing rewards
       'my_reward': my_reward_compute_score,
   }
   ```

4. Modify the agent to collect the appropriate information:
   - Update `OpenManusAgent._run_single_rollout` to collect the required information
   - Modify `_convert_rollout_results_to_dataproto` to format the rewards properly

5. Use the new reward in the training script:

   ```bash
   # In train_ppo.sh or train_grpo.sh, add:
   algorithm.reward_score_fn=my_reward
   ```
For process rewards specifically:
- Modify `_run_single_rollout` to track intermediate steps
- Update the reward computation to consider the process (steps taken) rather than just the outcome
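As a concrete illustration, a process reward might credit correct intermediate steps rather than only the final outcome. The function below is a hypothetical example mirroring the `compute_score` pattern above; the step-matching scheme and weights are assumptions, not the framework's method:

```python
# Hypothetical process reward: credit each correct intermediate step,
# not just the final outcome. Signature mirrors the compute_score sketch above.

def compute_score(solution_steps, ground_truth_steps,
                  step_weight=0.5, outcome_weight=0.5):
    if not ground_truth_steps:
        return 0.0
    # Process component: fraction of steps that match the reference trajectory.
    matched = sum(1 for s, g in zip(solution_steps, ground_truth_steps) if s == g)
    process_score = matched / len(ground_truth_steps)
    # Outcome component: did the final step match?
    outcome_score = 1.0 if solution_steps and \
        solution_steps[-1] == ground_truth_steps[-1] else 0.0
    return step_weight * process_score + outcome_weight * outcome_score

score = compute_score(["look", "pick", "place"], ["look", "grab", "place"])
```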
To integrate a new environment from AgentGym:
1. Prepare the environment package:
   - Create a dedicated directory in `openmanus_rl/agentgym/agentenv-<env_name>/`
   - Include `environment.yml` for the conda environment spec
   - Add `setup.sh` for any additional setup steps

2. Update training scripts:
   - Add the new environment to the case statement in `train_ppo.sh` and `train_grpo.sh`:

   ```bash
   new_env)
       LAUNCH_CMD="new_env --host $AGENTGYM_HOST --port \$AGENTGYM_PORT"
       DEFAULT_PORT=XXXX
       ;;
   ```

3. Update the OpenManus agent:
   - Add the new environment to `ENV_TO_TASK_CLASS` in `_init_env_client`:

   ```python
   ENV_TO_TASK_CLASS = {
       # ... existing environments
       "new_env": "NewEnvTask",
   }
   ```

4. Prepare training data:
   - Create parquet files for training/validation in `data/<env_name>/`
   - Define appropriate reward models in the data

5. Test the integration:

   ```bash
   ./train_ppo.sh --env_name new_env
   ```
To add new rollout or action template methods:

1. Modify the OpenManus agent:
   - Add new parsing logic in `postprocess_predictions`:

   ```python
   def postprocess_predictions(self, predictions: List[Any]) -> Tuple[List[str], List[str]]:
       # Add new template patterns
       new_pattern = r'<new_action>(.*?)</new_action>'
       # ... process accordingly
   ```

2. Add new action execution logic:
   - Update `_run_single_rollout` to handle the new action types
   - Modify the action execution logic to process the new templates

3. Update the prompt template:
   - Modify `create_react_prompt` to include instructions for the new action templates:

   ```python
   def create_react_prompt(task_description, tool_manager):
       # Add instructions for new action templates
   ```

4. Configure the agent:
   - Update `AgentConfig` if new parameters are needed
   - Modify the training scripts to pass the appropriate configurations
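A runnable sketch of the parsing logic from step 1, using the hypothetical `<new_action>` template from the snippet above (the real `postprocess_predictions` is a method on the agent; this standalone version only illustrates the regex-matching pattern):

```python
# Illustrative parser for a hypothetical <new_action> template, in the
# spirit of postprocess_predictions: returns (action_types, action_contents).

import re

def postprocess_predictions(predictions):
    patterns = {
        'new_action': re.compile(r'<new_action>(.*?)</new_action>', re.DOTALL),
        'action': re.compile(r'<action>(.*?)</action>', re.DOTALL),
    }
    types, contents = [], []
    for text in predictions:
        for name, pattern in patterns.items():
            match = pattern.search(text)
            if match:
                types.append(name)
                contents.append(match.group(1).strip())
                break
        else:
            types.append('invalid')   # no recognized template in the output
            contents.append('')
    return types, contents

types, contents = postprocess_predictions(
    ["I will act. <new_action>open door</new_action>", "no template here"])
```

Non-greedy matching (`.*?`) keeps each match confined to one tag pair, and the `invalid` fallback lets the rollout loop handle malformed model outputs explicitly.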
For more advanced modifications, such as changing the training algorithm or reward structure:

1. Modifying the PPO algorithm:
   - Update `verl/trainer/ppo/core_algos.py` for algorithm changes
   - Modify the advantage calculation in `compute_advantage`

2. Changing the rollout worker:
   - Create a new worker class in `verl/single_controller/ray/`
   - Register the worker in the appropriate factory methods

3. Custom data processing:
   - Modify `_convert_rollout_results_to_dataproto` for custom data formats
   - Update `DataProto` methods if needed
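For orientation when touching `compute_advantage`: PPO commonly uses Generalized Advantage Estimation (GAE). Below is a minimal list-based sketch of the recurrence, assuming a single unmasked episode; the actual verl code is tensorized and handles attention masks:

```python
# Minimal GAE sketch: A_t = delta_t + gamma * lam * A_{t+1},
# where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
# Illustrative only; not verl's tensorized implementation.

def compute_advantage(rewards, values, gamma=0.99, lam=0.95):
    advantages = [0.0] * len(rewards)
    next_value, running = 0.0, 0.0        # V(s_T) = 0 at episode end
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

adv = compute_advantage([0.0, 0.0, 1.0], [0.5, 0.5, 0.5])
```

Setting `lam=0` reduces this to one-step TD advantages, while `lam=1` recovers full Monte Carlo returns minus the baseline; the recursion runs backward so each advantage folds in all later TD errors.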
OpenManus-RL provides a flexible framework for reinforcement learning with LLMs in agent environments. By understanding the core components and following this development guide, you can extend the framework to support new environments, reward structures, and action templates.
For more detailed information on AgentGym integration, refer to the documentation in `/home/kunlunz2/github_repos/OpenManus-RL/openmanus_rl/agentgym/2nd_dev_docs`.