Skip to content

Comments

[config] remove deprecated actor_rollout_ref.rollout.mode config field#5352

Open
dhruvnigam93 wants to merge 2 commits intoverl-project:mainfrom
dhruvnigam93:feat/ENG-4604-remove-rollout-mode-config
Open

[config] remove deprecated actor_rollout_ref.rollout.mode config field#5352
dhruvnigam93 wants to merge 2 commits intoverl-project:mainfrom
dhruvnigam93:feat/ENG-4604-remove-rollout-mode-config

Conversation

@dhruvnigam93
Copy link

What does this PR do?

The mode field in RolloutConfig (i.e. actor_rollout_ref.rollout.mode) only ever supported "async" — sync mode was removed in v0.7. Every internal code path already hard-coded async behaviour, leaving ~50+ lines of dead conditional branches throughout the codebase.

This PR removes all internal branching on mode / async_rollout_mode, simplifying the codebase to always use the async code path, while keeping the field in the config schema as Optional[str] = None so that old configs with mode: async continue to load without errors and display a clear deprecation message.

Related issue: #4604


Checklist Before Starting


Test

All changes are pure refactoring (no algorithmic or behavioral change). Tested as follows:

  1. Pre-commit hooks — all pass:

    • ruff (lint) ✅
    • ruff format
    • mypy
    • autogen-trainer-cfg (generated YAMLs regenerated — mode field no longer appears) ✅
    • check-docstring-coverage
    • check-license
    • compile-python
  2. CPU unit testspytest -s --asyncio-mode=auto tests/295 passed, 16 failed, 27 skipped. The 16 failures are pre-existing (unrelated huggingface_hub validation error on absolute local paths). Zero new failures introduced.

  3. End-to-end PPO training — a full training job ran successfully through multiple steps with healthy metrics (actor loss, critic loss, grad norms, ~1482 tokens/sec throughput) and no DeprecationWarning for rollout.mode, confirming the production config does not set mode.


API and Usage Example

Old configs (still work, emit a deprecation warning):

actor_rollout_ref:
  rollout:
    mode: async   # DeprecationWarning shown; field has no effect
    name: vllm
    ...

New configs (no warning):

actor_rollout_ref:
  rollout:
    name: vllm
    ...

Deprecation warning text (emitted from RolloutConfig.__post_init__):

DeprecationWarning: The 'mode' field in RolloutConfig is deprecated and has no effect.
All rollouts now use async mode. Please remove 'mode: async' from your configuration file.
This field will be removed in a future version.

Rollout class lookup — simplified API (internal):

# Before
from verl.workers.rollout.base import get_rollout_class
cls = get_rollout_class("vllm", mode="async")

# After
cls = get_rollout_class("vllm")

Design & Code Changes

High-level design

Keep the mode field in the dataclass as Optional[str] = None (backward-compatible default) but never read it anywhere in the code. A DeprecationWarning is emitted in __post_init__ when the field is explicitly set, guiding users to remove it from their configs. All internal code always uses the async path directly.

Specific changes

File Change
verl/workers/config/rollout.py mode: str = "async"mode: Optional[str] = None; __post_init__ emits DeprecationWarning when field is set; removed old ValueError for sync mode
verl/workers/rollout/base.py _ROLLOUT_REGISTRY keys changed from (name, mode) tuples to plain name strings; get_rollout_class(name, mode)get_rollout_class(name)
verl/workers/fsdp_workers.py get_rollout_class(name, mode)get_rollout_class(name)
verl/workers/megatron_workers.py Same
verl/workers/engine_workers.py Same
verl/checkpoint_engine/base.py Same
verl/trainer/ppo/ray_trainer.py Removed self.async_rollout_mode = True assignment and associated dead comment
verl/experimental/fully_async_policy/fully_async_rollouter.py Removed assert rollout.mode == "async" and self.async_rollout_mode = True
verl/experimental/fully_async_policy/fully_async_trainer.py Same
verl/experimental/one_step_off_policy/ray_trainer.py Same
verl/experimental/transfer_queue/ray_trainer.py Removed if rollout.mode == "async": conditional; always sets up async manager directly
verl/experimental/vla/rob_ray_trainer.py Removed mode check; always uses EnvLoop directly
verl/experimental/vla/sac/sac_ray_trainer.py Same
verl/trainer/config/rollout/rollout.yaml Removed mode: async default line
verl/trainer/config/_generated_ppo_trainer.yaml Regenerated (no mode field)
verl/trainer/config/_generated_ppo_megatron_trainer.yaml Regenerated
verl/trainer/config/_generated_ppo_veomni_trainer.yaml Regenerated
tests/experimental/agent_loop/agent_utils.py Always uses AsyncActorRolloutRefWorker; removed conditional and unused ActorRolloutRefWorker import
tests/trainer/config/test_legacy_config_on_cpu.py Added "mode" to ignored_keys (intentional schema field removal)

Checklist Before Submitting

  • Read the Contribute Guide.
  • Applied pre-commit checks: pre-commit run --all-files — all hooks pass ✅
  • Add / Update the documentation — deferred - will make the changes if they are good to go by maitainers. see below.
  • Unit tests pass on CPU (pytest --asyncio-mode=auto tests/); no new failures
  • GPU CI requested — pending (will notify via Slack/Feishu once PR is open)
  • recipe submodule — not affected by this PR

Open Questions for Maintainers

The following decisions are not blocking this PR, but I will make more changes based on the answers.

Q1: Deprecation timeline — when to remove mode entirely?

The mode field is now Optional[str] = None with a DeprecationWarning so that this is not a breaking change. No behavior depends on it. Should we remove it now or in the next minor/major release?


Q2: Should example scripts and tutorial configs remove mode: async now?

I have not changed the examples and docs yet, but this is a straightforward change. Let me know if you want me to include it in this PR or defer to a follow-up.


Q3: Should the 10 GPU test files that set mode = "async" be cleaned up?

The following test files still set config.actor_rollout_ref.rollout.mode = "async" and will emit a DeprecationWarning in CI logs:

  • tests/workers/rollout/rollout_vllm/test_vllm_abort.py
  • tests/workers/rollout/perf/vllm_async_rollout.py
  • tests/workers/rollout/rollout_trtllm/test_adapter.py
  • tests/workers/rollout/rollout_trtllm/test_async_server.py
  • tests/experimental/agent_loop/test_basic_agent_loop.py
  • tests/experimental/agent_loop/test_multi_modal.py
  • tests/experimental/agent_loop/test_standalone_rollout.py
  • tests/experimental/reward_loop/test_agent_reward_loop_colocate.py
  • tests/experimental/reward_loop/test_agent_reward_loop_standalone.py
  • tests/experimental/reward_loop/test_math_verify.py

Additionally, tests/experimental/reward_loop/test_agent_reward_loop_colocate.py:83 has a dead code branch that still reads rollout.mode to select a worker class — this should be fixed regardless of the deprecation timeline. Happy to include it in this PR.


Q4: Phase 7 documentation — now or deferred?

docs/advance/fully_async.md and related docs still reference actor_rollout_ref.rollout.mode in example scripts. Should it be removed now or later?

The `mode` field in `RolloutConfig` only ever supported `"async"` (sync
mode was already removed). This PR removes all internal branching on
`mode`/`async_rollout_mode` and simplifies the codebase to always use
the async code path.

Changes:
- `verl/workers/config/rollout.py`: change `mode: str = "async"` to
  `mode: Optional[str] = None`; emit `DeprecationWarning` in
  `__post_init__` when field is set, so old configs continue to work
- `verl/workers/rollout/base.py`: simplify `_ROLLOUT_REGISTRY` keys
  from `(name, mode)` tuples to plain `name` strings; remove `mode`
  parameter from `get_rollout_class()`
- `verl/workers/fsdp_workers.py`, `megatron_workers.py`,
  `engine_workers.py`, `checkpoint_engine/base.py`: update all
  `get_rollout_class(name, mode)` call sites to `get_rollout_class(name)`
- `verl/trainer/ppo/ray_trainer.py`: remove `self.async_rollout_mode`
  assignment and dead conditional branches
- `verl/experimental/fully_async_policy/`, `one_step_off_policy/`,
  `transfer_queue/`: remove `assert rollout.mode == "async"` guards and
  dead `async_rollout_mode` assignments
- `verl/experimental/vla/rob_ray_trainer.py`,
  `sac/sac_ray_trainer.py`: remove `mode == "async_envloop"` check;
  always use `EnvLoop`
- `verl/trainer/config/rollout/rollout.yaml`: remove `mode: async`
  default; generated reference YAMLs regenerated accordingly
- `tests/experimental/agent_loop/agent_utils.py`: always use
  `AsyncActorRolloutRefWorker`; remove dead sync-mode branch
- `tests/trainer/config/test_legacy_config_on_cpu.py`: add `"mode"` to
  `ignored_keys` to account for intentional field removal from schema

Old configs with `mode: async` continue to load without errors and
display a clear deprecation message guiding users to remove the field.

Co-authored-by: Cursor <cursoragent@cursor.com>
@CLAassistant
Copy link

CLAassistant commented Feb 19, 2026

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively removes the deprecated rollout.mode configuration field, simplifying the codebase by eliminating dead conditional branches. The changes are well-executed, replacing conditional logic with a single async code path and updating the configuration schema to issue a DeprecationWarning for backward compatibility. The refactoring is consistent, clean, and improves maintainability. Both the code review and security review found no critical or high-severity issues or vulnerabilities in the submitted changes.

@vermouth1992
Copy link
Collaborator

Could you resolve conflict?

@dhruvnigam93
Copy link
Author

resolved conflict, was a minor whitespace issue.

@dhruvnigam93
Copy link
Author

@vermouth1992 please have a look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants