7 changes: 1 addition & 6 deletions .github/workflows/vllm.yml
@@ -94,7 +94,6 @@ jobs:
- name: Install the current repository
run: |
pip3 install -e .[test]
pip3 install vllm==0.5.4
- name: Download Model to Use
run: |
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
@@ -103,10 +102,6 @@ jobs:
huggingface-cli download 'deepseek-ai/deepseek-llm-7b-chat'
export HF_HUB_OFFLINE=1
# Disable requests to avoid network errors
- name: Running vllm tests on 8 L20 GPUs
run: |
cd tests/workers/rollout/rollout_vllm
torchrun --standalone --nnodes=1 --nproc_per_node=8 $(which pytest) -s test_vllm_hf_loader.py
- name: Test the latest vLLM
run: |
pip3 install --upgrade vllm==0.7.3
@@ -129,4 +124,4 @@ jobs:
pip3 install --upgrade vllm==0.8.3 tensordict==0.7.2
pytest -svvv tests/workers/rollout/rollout_vllm/test_vllm_chat_scheduler.py
ROLLOUT_NAME=vllm pytest -svvv tests/experimental/agent_loop/test_basic_agent_loop.py
# Note(haibin.lin): for any new test, please update gpu_unit_tests.yaml to avoid repeated tests
2 changes: 1 addition & 1 deletion docs/README_vllm0.7.md
@@ -53,7 +53,7 @@ actor_rollout_ref.rollout.free_cache_engine=True \

```

For a typical job like examples/ppo_trainer/run_qwen2-7b_seq_balance.sh, the rollout generation time is 115 seconds with vLLM0.6.3, while it is 85 seconds with vLLM0.7.0. By enabling the cudagraph, the generation duration is further reduced to 62 seconds.
For a typical job like examples/ppo_trainer/run_qwen2-7b_seq_balance.sh, the rollout generation time is 85 seconds with vLLM0.7.0. By enabling the cudagraph, the generation duration is further reduced to 62 seconds.

**Note:** Currently, if `n` is greater than 1 in `SamplingParams` with vLLM>=0.7, there is a potential stability issue in rollout generation time when using vLLM's V0 engine (some iterations see bursts in generation time).
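
For context, verl example scripts typically set `n` through the `actor_rollout_ref.rollout.n` override, which is forwarded to the rollout engine's sampling parameters; a hedged sketch of a launch command affected by this note (most overrides omitted) looks like:

```bash
# Sketch only: requests 4 samples per prompt; with vLLM>=0.7 on the V0 engine,
# n > 1 is the configuration affected by the generation-time bursts noted above.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.n=4 \
    "$@"   # remaining data/model/trainer overrides omitted for brevity
```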

5 changes: 0 additions & 5 deletions docs/README_vllm0.8.md
@@ -41,11 +41,6 @@ actor_rollout_ref.rollout.free_cache_engine=True \

and also **remove** the environment variable if it exists:

```bash
# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS
```
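
For example, a minimal way to clear it from the current shell (and from any launcher script that still exports it) is:

```bash
# Remove the legacy attention-backend override before starting Ray / the trainer.
unset VLLM_ATTENTION_BACKEND
```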

## Notes

If you directly upgrade to vllm>=0.8, some dependency packages may undergo version changes. If you encounter the following problems:
26 changes: 0 additions & 26 deletions docs/advance/megatron_extension.rst
@@ -18,29 +18,3 @@ We list the steps here:
3. Use the right ``LayerSpec``, ``TransformerConfig`` and ``HuggingfaceConfig``
as arguments to initialize the GPTModel.
4. Finally, return the model.


Add Models with old version of verl
-----------------------------------


The most challenging aspect of using the Megatron-LM backend is implementing
the models for training. Currently, we implement a Llama model that
supports data parallelism, tensor parallelism, pipeline parallelism (including
vPP) and sequence parallelism. We also implement remove padding (sequence packing) for the Llama
model, which can be found in `modeling_llama_megatron.py <https://github.com/volcengine/verl/blob/main/verl/models/llama/megatron/modeling_llama_megatron.py>`_.

To support other models, users are required to implement:

1. Implement a model similar to ``modeling_llama_megatron.py`` that satisfies the
parallelism requirements of Megatron-LM. Then register your model in
the `registry.py <https://github.com/volcengine/verl/blob/main/verl/models/registry.py>`_.
2. Checkpoint utils that can load a full checkpoint (e.g. a huggingface
checkpoint) into partitioned models at runtime. Then register
your loader to ``weight_loader_registry`` in `weight_loader_registry.py <https://github.com/volcengine/verl/blob/main/verl/models/weight_loader_registry.py>`_.
3. A weight loader that synchronizes the weights from the Megatron model to the rollout
(vLLM) model. Note that both the actor model and rollout model are
partitioned at runtime, so it's advisable to keep the parameter names in the
actor model implementation aligned with the rollout model. Otherwise, you may need an additional
name mapping and even a weight transformation. The weight loader implementation
is in `megatron_weight_loaders.py <https://github.com/volcengine/verl/blob/main/verl/third_party/vllm/vllm_v_0_6_3/megatron_weight_loaders.py>`_.
2 changes: 0 additions & 2 deletions docs/amd_tutorial/amd_build_dockerfile_page.rst
@@ -407,8 +407,6 @@ slurm_script.sh
echo "IP Head: $ip_head"

# make sure we set environment variables before Ray initialization
# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS

# Print out all env variables
printenv
3 changes: 0 additions & 3 deletions docs/examples/config.rst
@@ -309,9 +309,6 @@ Reference model will be enabled when ``actor.use_kl_loss`` or/and ``algorithm.us

- ``actor_rollout_ref.rollout.gpu_memory_utilization``:

- For vLLM v0.5.4 and v0.6.3: The proportion of the **remaining** GPU memory
allocated for kv cache after other models have initialized when using
vLLM.
- For vLLM v0.7.0 and later: The fraction of **total** GPU memory to be used for the vLLM instance.
- For SGLang: Corresponding to ``mem_fraction_static``, the fraction of the free GPU memory used for **static** memory like model weights and KV cache.
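
For instance (an illustrative calculation, not a recommendation), with ``gpu_memory_utilization=0.6`` on an 80 GB GPU, vLLM v0.7.0+ would cap the whole vLLM instance at roughly 48 GB of total memory, whereas SGLang would reserve roughly 48 GB for static allocations such as model weights and KV cache and may still use part of the remaining memory during inference.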

9 changes: 1 addition & 8 deletions docs/faq/faq.rst
@@ -102,14 +102,7 @@ Solution 2nd:
Illegal memory access
---------------------------------

If you encounter the error message like ``CUDA error: an illegal memory access was encountered`` during rollout, most likely it is due to a known issue from vllm(<=0.6.3).
Please set the following environment variable. The env var must be set before the ``ray start`` command if any.

.. code:: bash

export VLLM_ATTENTION_BACKEND=XFORMERS

If in doubt, print this env var in each rank to make sure it is properly set.
If you encounter an error message like ``CUDA error: an illegal memory access was encountered`` during rollout, please check the vLLM documentation for troubleshooting steps specific to your vLLM version.
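
As a generic first debugging step (not specific to verl or any particular vLLM version), re-running with synchronous CUDA kernel launches usually makes the faulting operation easier to localize:

.. code:: bash

    # Generic CUDA debugging aid: launches kernels synchronously so the error is
    # reported at (or near) the offending call instead of at a later sync point.
    export CUDA_LAUNCH_BLOCKING=1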

Checkpoints
------------------------
7 changes: 1 addition & 6 deletions docs/hybrid_flow.rst
@@ -254,12 +254,7 @@ Important code files in the repository are organized as below:
weight_loader_registry.py # registry of weight loaders for loading hf ckpt into Megatron
third_party
vllm # adaptor for vllm's usage in RL
vllm_v_0_6_3 # vllm v0.6.3 adaptor
llm.py # entrypoints for generate, sync_model_weight, offload_model_weights
parallel_state.py # vllm related device mesh and process groups
dtensor_weight_loaders.py # weight loader for huggingface models with FSDP
megatron_weight_loaders.py # weight loader for Megatron models
vllm_spmd # vllm >= v0.7 adaptor (coming soon)
vllm_spmd # vllm >= v0.7 adaptor
examples # example scripts
tests # integration and unit tests
.github # the configuration of continuous integration tests
3 changes: 1 addition & 2 deletions docs/perf/perf_tuning.rst
@@ -28,7 +28,6 @@ Below are key factors for tuning vLLM-based rollout. Before tuning, we recommend

- Increase ``gpu_memory_utilization``.

- For vLLM v0.5.4 and v0.6.3, the vLLM pre-allocates GPU KVCache by using gpu_memory_utilization of the **remaining** memory.
- For vLLM v0.7.0 and later, the vLLM instance will only use gpu_memory_utilization of the **total** memory.
- For SGLang, it's the fraction of the free GPU memory used for **static** memory like model weights and KV cache. However, the remaining (1-gpu_memory_utilization) will also be used during inference.

@@ -51,7 +50,7 @@ Below are key factors for tuning vLLM-based rollout. Before tuning, we recommend
More tuning details such as dealing with Preemption and Chunked-prefill
can be found in `vLLM official tuning guide <https://docs.vllm.ai/en/latest/performance/optimization.html>`_

The performance of vllm can be further increased if upgrading from v0.6.3 to v0.7. See https://github.com/volcengine/verl/blob/main/docs/README_vllm0.7.md for details on how to upgrade.
For optimal performance, we recommend using vLLM v0.8.3 or later. See https://github.com/volcengine/verl/blob/main/docs/README_vllm0.8.md for details.

Enable remove padding (sequence packing)
-----------------------------------------
2 changes: 1 addition & 1 deletion docs/start/install.rst
@@ -27,7 +27,7 @@ For users who pursue better scalability, we recommend using **Megatron-LM** back

2. Inference:

For inference, vllm 0.6.3 and 0.8.2 have been tested for stability. Avoid using vllm 0.7x due to reported issues with its functionality.
For inference, vllm 0.8.3 and later versions have been tested for stability. We recommend setting the environment variable `VLLM_USE_V1=1` for optimal performance.
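
For example, assuming you launch training from a shell script, the switch can simply be exported before the trainer starts:

.. code:: bash

    # Enable vLLM's V1 engine for rollout generation (vLLM >= 0.8).
    export VLLM_USE_V1=1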

For SGLang, refer to the :doc:`SGLang Backend<../workers/sglang_worker>` for detailed installation and usage instructions. **SGLang offers better throughput and is under extensive development.** We encourage users to report any issues or provide feedback via the `SGLang Issue Tracker <https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/issues/106>`_.

2 changes: 0 additions & 2 deletions docs/start/multinode.rst
@@ -454,8 +454,6 @@ slurm_script.sh
echo "IP Head: $ip_head"

# make sure we set environment variables before Ray initialization
# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS

# Print out all env variables
printenv
4 changes: 1 addition & 3 deletions examples/grpo_trainer/run_deepseek7b_llm_math.sh
@@ -1,7 +1,5 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS

gsm8k_train_path=$HOME/data/gsm8k/train.parquet
gsm8k_test_path=$HOME/data/gsm8k/test.parquet
@@ -48,4 +46,4 @@ python3 -m verl.trainer.main_ppo \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
4 changes: 1 addition & 3 deletions examples/grpo_trainer/run_deepseek7b_llm_math_megatron.sh
@@ -1,7 +1,5 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_DEVICE_MAX_CONNECTIONS=1 # For megatron communication/computation overlapping

gsm8k_train_path=$HOME/data/gsm8k/train.parquet
@@ -50,4 +48,4 @@ python3 -m verl.trainer.main_ppo --config-path=config \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
2 changes: 0 additions & 2 deletions examples/grpo_trainer/run_minicpmo2_6.sh
@@ -1,6 +1,4 @@
set -x
# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
4 changes: 1 addition & 3 deletions examples/grpo_trainer/run_qwen2-7b.sh
@@ -1,7 +1,5 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
@@ -40,4 +38,4 @@ python3 -m verl.trainer.main_ppo \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
4 changes: 1 addition & 3 deletions examples/grpo_trainer/run_qwen2-7b_math.sh
@@ -1,7 +1,5 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS

gsm8k_train_path=$HOME/data/gsm8k/train.parquet
gsm8k_test_path=$HOME/data/gsm8k/test.parquet
@@ -48,4 +46,4 @@ python3 -m verl.trainer.main_ppo \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
2 changes: 0 additions & 2 deletions examples/grpo_trainer/run_qwen2-7b_math_megatron.sh
@@ -1,7 +1,5 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_DEVICE_MAX_CONNECTIONS=1 # For megatron communication/computation overlapping

rollout_mode="sync"
4 changes: 1 addition & 3 deletions examples/grpo_trainer/run_qwen2-7b_seq_balance.sh
@@ -1,7 +1,5 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS

# For async rollout mode, dataset should return raw chat.
rollout_mode="async"
@@ -51,4 +49,4 @@ python3 -m verl.trainer.main_ppo \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
@@ -1,7 +1,5 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_DEVICE_MAX_CONNECTIONS=1 # For megatron communication/computation overlapping

gsm8k_train_path=$HOME/data/gsm8k/train.parquet
@@ -51,4 +49,4 @@ python3 -m verl.trainer.main_ppo --config-path=config \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
4 changes: 1 addition & 3 deletions examples/grpo_trainer/run_qwen2_5-3b_gsm8k_grpo_lora.sh
@@ -1,7 +1,5 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
@@ -46,4 +44,4 @@ python3 -m verl.trainer.main_ppo \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
@@ -1,7 +1,5 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_DEVICE_MAX_CONNECTIONS=1 # For megatron communication/computation overlapping

gsm8k_train_path=$HOME/data/gsm8k/train.parquet
@@ -50,4 +48,4 @@ python3 -m verl.trainer.main_ppo --config-path=config \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
4 changes: 1 addition & 3 deletions examples/grpo_trainer/run_qwen2_5_vl-7b.sh
@@ -1,7 +1,5 @@
set -x
ENGINE=${1:-vllm}
# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
@@ -45,4 +43,4 @@ python3 -m verl.trainer.main_ppo \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
4 changes: 1 addition & 3 deletions examples/grpo_trainer/run_qwen2_5_vl-7b_seq_balance.sh
@@ -1,7 +1,5 @@
set -x
ENGINE=${1:-vllm}
# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
@@ -44,4 +42,4 @@ python3 -m verl.trainer.main_ppo \
trainer.nnodes=1 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
4 changes: 1 addition & 3 deletions examples/grpo_trainer/run_qwen3moe-30b_megatron.sh
@@ -5,8 +5,6 @@ DIST_CKPT_PATH=${DIST_CKPT_PATH}

python scripts/converter_hf_to_mcore.py --hf_model_path $HF_MODEL_PATH --output_path $DIST_CKPT_PATH

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_DEVICE_MAX_CONNECTIONS=1 # For megatron communication/computation overlapping

python3 -m verl.trainer.main_ppo --config-path=config \
@@ -53,4 +51,4 @@ python3 -m verl.trainer.main_ppo --config-path=config \
trainer.nnodes=4 \
trainer.save_freq=20 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
2 changes: 0 additions & 2 deletions examples/ppo_trainer/run_deepseek_math_gsm8k_megatron.sh
@@ -2,8 +2,6 @@ set -x

# Example runnable on H20 * 8

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_DEVICE_MAX_CONNECTIONS=1 # For megatron communication/computation overlapping

gsm8k_train_path=$HOME/data/gsm8k/train.parquet
2 changes: 0 additions & 2 deletions examples/ppo_trainer/run_deepseek_math_gsm8k_megatron_nsys.sh
@@ -2,8 +2,6 @@ set -x

# Example runnable on H20 * 8

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_DEVICE_MAX_CONNECTIONS=1 # For megatron communication/computation overlapping

gsm8k_train_path=$HOME/data/gsm8k/train.parquet
2 changes: 0 additions & 2 deletions examples/ppo_trainer/run_moonlight16b_a3b_gsm8k_megatron.sh
@@ -1,7 +1,5 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_DEVICE_MAX_CONNECTIONS=1 # For megatron communication/computation overlapping


2 changes: 0 additions & 2 deletions examples/ppo_trainer/run_qwen1.5_moe_a2.7b-gsm8k_megatron.sh
@@ -1,7 +1,5 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_DEVICE_MAX_CONNECTIONS=1 # For megatron communication/computation overlapping

# 0. download the model
2 changes: 0 additions & 2 deletions examples/ppo_trainer/run_qwen2-7b_math_gsm8k_megatron.sh
@@ -1,7 +1,5 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_DEVICE_MAX_CONNECTIONS=1 # For megatron communication/computation overlapping

gsm8k_train_path=$HOME/data/gsm8k/train.parquet
1 change: 0 additions & 1 deletion examples/ppo_trainer/run_qwen2-7b_rm.sh
@@ -15,7 +15,6 @@ math_test_path=$HOME/data/math/test.parquet
train_files="['$gsm8k_train_path', '$math_train_path']"
test_files="['$gsm8k_test_path', '$math_test_path']"

export VLLM_ATTENTION_BACKEND=XFORMERS # vllm + qwen2-7b with flash_attn has some issues

# prepare model ckpt
huggingface-cli download Qwen/Qwen2-7B-Instruct --local-dir $HOME/models/Qwen2-7B-Instruct &