feat: add Qwen3.5-9B task config and version bumps#277

Open
dzorlu wants to merge 59 commits into feat/vl-multimodal-support from feat/qwen3.5-9b

Conversation


@dzorlu dzorlu commented Mar 4, 2026

Summary

  • Add task YAML (openenv-fleet-grpo-qwen3_5-9b.yaml) for Qwen/Qwen3.5-9B — a natively multimodal model (early fusion, GatedDeltaNet hybrid attention) as a drop-in replacement for Qwen3-VL-8B
  • Bump vLLM from ==0.13.0 to >=0.16.1.dev0 (nightly required; Qwen3.5 support landed after the 0.16.0 branch cut, so the first stable release with it will be v0.17.0)
  • Bump transformers from >=4.51.0 to >=4.57.0 (Qwen3.5 model class registration)
  • Task YAML uses --extra-index-url https://wheels.vllm.ai/nightly in both the setup and run sections to resolve nightly wheels

Risk areas (no pre-emptive changes — test first)

  • collective_rpc for weight sync (vllm_engine.py:338-342) — internal API; may break across the four-minor-version vLLM jump
  • output_processor.request_states (vllm_engine.py:318-326) — internal API
  • OpenAI serving imports (vllm_engine.py:16-43) — existing try/except should absorb changes

Verified (no changes needed)

  • model_wrapper.py:100 — hasattr(model_config, "vision_config") correctly detects Qwen3.5-9B as VL
  • generators/utils.py — chat template uses the same <|im_start|> format
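
The `vision_config` check above can be sketched standalone (using `SimpleNamespace` stand-ins for the HF config objects, since the real check runs against `AutoConfig` output):

```python
from types import SimpleNamespace

def is_vision_language(model_config):
    # Multimodal HF configs nest a vision_config; text-only configs don't.
    return hasattr(model_config, "vision_config")

vl_config = SimpleNamespace(vision_config=SimpleNamespace())  # stand-in for Qwen3.5-9B
text_config = SimpleNamespace()                               # stand-in for a text-only model
```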

Test plan

  • uv sync --extra vllm --extra-index-url https://wheels.vllm.ai/nightly resolves successfully
  • AutoConfig.from_pretrained("Qwen/Qwen3.5-9B") has vision_config
  • vllm serve Qwen/Qwen3.5-9B starts without error on nightly
  • Launch training run with the new task YAML on a test cluster
  • Verify weight sync (collective_rpc) works end-to-end
  • Switch to vllm>=0.17.0 once stable release ships (~mid March 2026)

🤖 Generated with Claude Code

Deniz and others added 30 commits March 3, 2026 18:38
Add task YAML for Qwen/Qwen3.5-9B (natively multimodal, early fusion)
and bump dependencies to support it:
- vllm: ==0.13.0 → >=0.16.1.dev0 (nightly required until v0.17.0)
- transformers: >=4.51.0 → >=4.57.0 (Qwen3.5 model class support)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vLLM nightly (0.16.1rc1.dev) requires torch==2.10.0, conflicting
with the previous torch==2.9.0 pin.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The global bump broke sglang resolution since sglang==0.4.8.post1
pins transformers==4.52.3. Move the bump into the vllm extra where
it's needed — uv resolves conflicting extras in separate splits.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Revert pyproject.toml vllm extra to stable pins (vllm==0.13.0,
  torch==2.9.0) so uv sync resolves cleanly across all extras
- Override with nightly via uv pip install in the task YAML only
- Use python -m instead of uv run --isolated in run section to
  avoid re-resolving against pyproject.toml
- Set MAX_ATTEMPTS=1 in workflow to fail fast on setup errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv sync installs torchvision for torch==2.9.0, then the nightly
override bumps torch to 2.10.0 causing ABI mismatch
(torchvision::nms operator not found).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The latest nightly (.dev202) only has an aarch64 wheel. unsafe-best-match
lets uv check both the nightly and PyPI indexes to find a version with
x86_64 wheels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8x B200/H200 exhausted everywhere. Add 4x fallbacks so SkyPilot
can grab whatever is available. Restore retry logic since
provisioning failures are transient.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
torchvision installed by uv sync (for torch 2.9.0) was not being
properly overridden because the pip install line only had the vllm
nightly index. torchvision needs the pytorch cu128 index to get a
build matching torch 2.10.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv pip install was seeing torchvision as already satisfied (installed
by uv sync for torch 2.9.0) and skipping the upgrade. Split into two
steps and use --reinstall-package to force re-resolution from cu128.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv pip's resolution was keeping the old torchvision (built for torch
2.9.0) even with --reinstall-package. Switch to pip with
--force-reinstall --no-deps from cu128 index to guarantee matching
torch+torchvision pair. Falls back to nightly cu128 if stable doesn't
have the version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use vllm's recommended install method which automatically handles
torch+torchvision ABI compatibility instead of manual version juggling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv run re-syncs the venv from pyproject.toml before executing, which
reverts torch from 2.10.0 (installed by vllm nightly) back to 2.9.0
(pinned in pyproject.toml). This causes the torchvision ABI mismatch
(operator torchvision::nms does not exist) at runtime even though
setup correctly installs matching versions.

Changed both `uv run python` calls to plain `python` since the venv
is already activated and has the correct packages installed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vLLM nightly pulls in Ray 2.44+ which removed
ray.experimental.collective.util.get_address_and_port.
Fall back to a simple socket-based implementation that does the
same thing: get node IP via ray.util and bind to a free port.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
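
A minimal sketch of that fallback (`get_address_and_port` matches the removed Ray helper's name; the UDP-connect trick for finding the node IP is a stand-in for `ray.util.get_node_ip_address()`):

```python
import socket

def get_address_and_port():
    """Return (node_ip, free_port), replacing the helper removed in Ray 2.44+."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # No packets are sent; connecting just selects the outbound interface.
        s.connect(("8.8.8.8", 80))
        address = s.getsockname()[0]
    except OSError:
        address = "127.0.0.1"
    finally:
        s.close()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((address, 0))  # port 0 asks the OS for any free port
        port = srv.getsockname()[1]
    return address, port
```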
flash-attn 2.8.3 only supports torch<=2.9, but vLLM nightly requires
torch 2.10. No prebuilt wheels exist for torch 2.10+cu130.

- Uninstall flash-attn after vllm nightly install
- Use SDPA attention (PyTorch's built-in F.scaled_dot_product_attention
  which includes FlashAttention v2 as an internal backend)
- Disable sample packing (requires flash_attention_2 attn impl)
- vLLM uses FlashInfer for its attention kernels, not flash-attn

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv sync installs flashinfer-jit-cache 0.5.3+cu128, but vllm nightly
upgrades flashinfer to 0.6.4 without updating the JIT cache package.
Remove the stale cache so flashinfer 0.6.4 regenerates it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
transformers 4.57.6 (latest stable) doesn't have qwen3_5 in its auto
model mapping. FSDP workers call AutoConfig.from_pretrained() which
requires the model type to be registered. Install from HF main branch
until a stable release (>=4.58.0) includes Qwen3.5 support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ce retries to 1

AutoModelForVision2Seq was removed in transformers 5.0 (main branch).
Replace with AutoModelForImageTextToText, falling back to the old name
for older transformers versions. Also reduce retry attempts to 1 for
faster debugging iteration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
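
The newest-name-first fallback can be sketched generically (the `SimpleNamespace` below stands in for an older transformers module that only exposes the legacy class):

```python
from types import SimpleNamespace

def resolve_auto_class(module, preferred, fallback):
    """Return the first auto-class available on `module`, trying the newest
    name first: AutoModelForImageTextToText (transformers >= 5.0), then
    AutoModelForVision2Seq for older versions."""
    for name in (preferred, fallback):
        cls = getattr(module, name, None)
        if cls is not None:
            return cls
    raise AttributeError(f"neither {preferred} nor {fallback} is available")

# Stand-in for an old transformers module that only has the legacy name:
legacy_transformers = SimpleNamespace(AutoModelForVision2Seq=object)
```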
accelerate's register_empty_parameter passes _is_hf_initialized to
Parameter.__new__() which torch 2.10 doesn't accept. Installing
accelerate from git main fixes this compatibility issue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
accelerate's register_empty_parameter passes param.__dict__ (including
_is_hf_initialized from transformers) as kwargs to Parameter.__new__(),
which torch 2.10+ rejects. Patch Parameter.__new__ to accept and ignore
extra kwargs. Also revert accelerate-from-source (code patch handles it).

Ref: verl-project/verl#4522

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
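
The patch pattern looks roughly like this (a sketch: the real fix targets `torch.nn.Parameter.__new__` and the `_is_hf_initialized` kwarg; `Strict` here is a hypothetical stand-in so the example runs without torch):

```python
def patch_new_to_ignore(cls, *bad_kwargs):
    """Wrap cls.__new__ so it silently drops the named kwargs that the
    base __new__ rejects (shape of the verl-project/verl#4522 workaround)."""
    orig_new = cls.__new__
    def patched_new(klass, *args, **kwargs):
        for key in bad_kwargs:
            kwargs.pop(key, None)  # discard kwargs __new__ won't accept
        return orig_new(klass, *args, **kwargs)
    cls.__new__ = staticmethod(patched_new)

class Strict:
    """Stand-in for torch.nn.Parameter: rejects unknown kwargs in __new__."""
    def __new__(cls, **kwargs):
        if kwargs:
            raise TypeError(f"unexpected kwargs: {kwargs}")
        return object.__new__(cls)

patch_new_to_ignore(Strict, "_is_hf_initialized")
```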
transformers 5.0/main returns a set for _no_split_modules instead of a
list. Convert to list before subscripting with [0].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vLLM nightly uses pidfd_getfd for inter-process CUDA memory sharing
during collective_rpc weight sync. This syscall requires ptrace
permissions blocked in containerized environments (RunPod). Set
PR_SET_PTRACER via sitecustomize.py so all Python processes (Ray
workers, vLLM engines) get the permission on startup.

Ref: verl-project/verl#3377

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ainer compat

pidfd_getfd syscall is blocked by seccomp in containerized envs (RunPod).
Two fixes:
- VLLM_ENABLE_V1_MULTIPROCESSING=0: keeps engine in single process, avoids CUDA IPC
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False: avoids pidfd_getfd in allocator IPC path

Also removes failed sitecustomize.py prctl workaround (seccomp blocks at kernel level).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rnels)

FlashInfer needs nvcc to JIT-compile GDN attention kernels for Qwen3.5.
SkyPilot containers may not have /usr/local/cuda; detect from common
CUDA paths and fall back to apt-get nvidia-cuda-toolkit.

Also removes unnecessary VLLM_ENABLE_V1_MULTIPROCESSING=0 from YAML
(already set by vllm_engine.py) and restores expandable_segments:True.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After apt-get install nvidia-cuda-toolkit, nvcc is at /usr/bin/nvcc
but CUDA_HOME was never set. Derive CUDA_HOME from nvcc binary path
as fallback. Fixes bash unbound variable error with set -euo pipefail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
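
A Python sketch of the same derivation (the YAML does this in bash; `ensure_cuda_home` is a hypothetical helper name):

```python
import os
import shutil

def cuda_home_from_nvcc(nvcc_path):
    # CUDA_HOME sits two directory levels above the nvcc binary,
    # e.g. /usr/local/cuda/bin/nvcc -> /usr/local/cuda, /usr/bin/nvcc -> /usr
    return os.path.dirname(os.path.dirname(nvcc_path))

def ensure_cuda_home():
    # Only derive a fallback when CUDA_HOME is genuinely unset, so nothing
    # downstream trips over an unbound variable.
    if "CUDA_HOME" not in os.environ:
        nvcc = shutil.which("nvcc")
        if nvcc:
            os.environ["CUDA_HOME"] = cuda_home_from_nvcc(nvcc)
```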
…uption)

apt-get nvidia-cuda-toolkit was breaking the Python venv (wandb import
failed after install). Instead: detect nvcc via find, fall back to pip
nvidia-cuda-nvcc-cu12. Adds diagnostics if nvcc can't be found.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-cu12)

The find command couldn't locate nvcc in site-packages. Use Python's
nvidia.cuda_nvcc module path directly to derive CUDA_HOME.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…da_nvcc

nvidia.cuda_nvcc is a namespace package with __file__=None.
Use __path__[0] to get the package directory path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
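
The `__file__`-vs-`__path__` distinction can be sketched generically (nvidia.cuda_nvcc is only present on machines with the pip package installed, so the helper takes any package name):

```python
import importlib
import os

def package_dir(pkg_name):
    """On-disk directory of a package. PEP 420 namespace packages (like
    nvidia.cuda_nvcc) have __file__ = None, so fall back to __path__[0]."""
    mod = importlib.import_module(pkg_name)
    if getattr(mod, "__file__", None):
        return os.path.dirname(mod.__file__)
    return list(mod.__path__)[0]  # __path__ is a _NamespacePath, not a plain list
```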
nvcc binary from pip package may not be in PATH or may lack execute
permission. Use full path and add ls/chmod for diagnostics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deniz and others added 29 commits March 4, 2026 04:11
…ns nvcc

nvidia-cuda-nvcc-cu12 only has ptxas/headers, NOT nvcc binary.
The CUDA 13 package (nvidia-cuda-nvcc) includes the full compiler.
torch 2.10+cu130 from vLLM nightly needs CUDA 13 nvcc for FlashInfer JIT.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n torch 2.10)

vLLM asserts that expandable_segments:True is not set when using its memory pool.
See pytorch/pytorch#147851.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FlashInfer JIT runs inside Ray actor processes which have different
working directories. Relative CUDA_HOME paths (`.venv/...`) cause
`/bin/sh: 1: .venv/.../nvcc: not found` errors in Ninja builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pip CUDA packages (nvidia-cuda-nvcc) have incomplete headers: cuda_fp16.h
references nv/target, which isn't found. Install cuda-nvcc-12-8 from NVIDIA's
official apt repo, which provides a complete, working toolkit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NVIDIA's CUDA repo uses x86_64 directory naming, not dpkg's amd64.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FlashInfer's fmha_gen kernels need cublasLt.h (from libcublas-dev-12-8)
and nvrtc.h (from cuda-nvrtc-dev-12-8) in addition to cuda-nvcc-12-8.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Transformers from source (post-5.0) changed apply_chat_template to
return BatchEncoding by default instead of List[int]. This caused:
- "unsupported operand type(s) for +: 'BatchEncoding' and 'list'"
- "TypeError: new(): invalid data type 'str'" (fatal crash)

Also: require 8x GPU, increase max_input_length to 131K, bump
gpu_memory_utilization to 0.85.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
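
The version-tolerant fix can be sketched as a small normalizer (assuming, as the errors above suggest, that the BatchEncoding is mapping-like with token ids under "input_ids"):

```python
def to_token_ids(result):
    """Normalize apply_chat_template output across transformers versions:
    pre-5.0 returns List[int]; from-source/5.0 returns a BatchEncoding,
    read here via its "input_ids" mapping key."""
    if isinstance(result, list):
        return result
    ids = result["input_ids"]
    # Batched encodings nest one id list per sequence; unwrap the single case.
    if ids and isinstance(ids[0], list):
        return ids[0]
    return list(ids)
```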
forums-homes is a tool_use env incorrectly tagged as computer_use in v5,
causing init failures (no 'computer' tool) during CUA training. Removed
175 forums-homes tasks (112 CU + 63 TU) and uploaded as v51 to S3.

- Upload v51 dataset to s3://fleet-internal-datasets/v51/openenv/
- Add forums-homes to EXCLUDED_ENVS in prepare_dataset.py
- Update GHA workflow default to v51, restore MAX_ATTEMPTS=4
- Add Dataset Versions section to fleet-training.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
forums-homes removal is handled at the dataset level (v51) not in code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restored all_tool_use.json and all_tasks.json to match v5 exactly.
Only all_computer_use.json has forums-homes removed (112 tasks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Added vast, primeintellect to provider list. Use H200:8 (not H200-SXM:8)
to match the naming convention from the 8b-8gpu task YAML.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- RunPod/Lambda/Vast: H200-SXM:8 (not H200:8)
- PrimeIntellect: H200:8
- Vast: no B200, only H200-SXM:8
- Nebius: B200:8 and H200-SXM:8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove PrimeIntellect ($0 wallet), Vast (no B200/H200-SXM), Nebius
(not enabled on runner). Only RunPod (H200-SXM:8, B200:8) and
Lambda (B200:8) are viable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The nebius SDK was not being installed on the GHA runner, causing
`sky check nebius` to fail. Added [nebius] extra to skypilot-nightly.
Restored Vast, Nebius, and PrimeIntellect GPU providers in task YAML.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The prime CLI only sets api_key, not team_id. Without team_id,
API calls hit the personal context ($0) instead of team ($250).
Write config.json directly with both fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove xlam-70b, glm-4.7-flash, and harbor-grpo-qwen3-8b — no longer used.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
H200/B200 availability is tight — H100s are much more widely available
and 9B fits fine on 8x H100-80GB.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PrimeIntellect provisioning is unreliable — remove from both H200 and H100 entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
v52 filters computer_use to instacart (120), walmart (145), zillow (348) —
the 3 envs with shortest avg trajectories, for training signal with small models.
Tool-use dataset carried over from v51 unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
H100s don't support Qwen3.5's GatedDeltaNet kernels (FlashInfer JIT).
Only B200/H200 GPUs work for Qwen3.5 training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6x B200/H200 preferred, 8x fallback. All params are dynamic via
$SKYPILOT_NUM_GPUS_PER_NODE so no other changes needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vLLM uses paged attention (no padding), so this just raises the
ceiling on conversation length. Safe with batched=false.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
72 eval trajectories with a 50%+ timeout rate were taking >1 hr before
training could start. 12 * 3 samples = 36 trajectories should be
much faster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tracks generate time vs env.step time per trajectory, logged to WandB:
- timing/total_generate_secs, timing/total_env_step_secs
- timing/avg_generate_per_turn_secs, timing/avg_env_step_per_turn_secs
- timing/pct_env_step (% of trajectory time spent in env interaction)
- timing/num_turns

Also reduces eval_batch_size from 24 to 12 for 9B config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
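
A sketch of the per-trajectory timing split (the `timing/*` keys mirror the WandB metric names listed above; `TurnTimer` itself is a hypothetical helper):

```python
import time

class TurnTimer:
    """Accumulate generate vs env.step wall time for one trajectory."""
    def __init__(self):
        self.generate_secs = 0.0
        self.env_step_secs = 0.0
        self.num_turns = 0

    def timed(self, bucket, fn, *args, **kwargs):
        # Wrap either the generate call or env.step and bank its wall time.
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - t0
        if bucket == "generate":
            self.generate_secs += elapsed
        else:
            self.env_step_secs += elapsed
        return result

    def metrics(self):
        total = self.generate_secs + self.env_step_secs
        turns = max(self.num_turns, 1)  # avoid division by zero
        return {
            "timing/total_generate_secs": self.generate_secs,
            "timing/total_env_step_secs": self.env_step_secs,
            "timing/avg_generate_per_turn_secs": self.generate_secs / turns,
            "timing/avg_env_step_per_turn_secs": self.env_step_secs / turns,
            "timing/pct_env_step": 100 * self.env_step_secs / total if total else 0.0,
            "timing/num_turns": self.num_turns,
        }
```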
… GPUs

- MAX_INPUT_LENGTH 262144→131072: caps max training sequence length to reduce
  activation memory during loss.backward() (OOM at 29.62 GiB allocation)
- Reorder GPU preferences: 8x before 6x for better FSDP memory sharding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>