
[Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding#37884

Merged
Isotr0py merged 2 commits into vllm-project:main from he-yufeng:fix/pooling-split-sizes-mismatch on Mar 23, 2026

Conversation

@he-yufeng
Contributor

Purpose

Fix a crash in all RoBERTa-based pooling/embedding models (BGE-M3, XLM-RoBERTa, stsb-roberta, bge-reranker-v2-m3) when CUDA graphs are enabled: after roughly 4,000 sequential requests the server dies with an out-of-bounds position-embedding index.

Root cause: replace_roberta_positions() did an in-place position_ids += padding_idx + 1 on the persistent GPU positions buffer. On each request the model runner refreshes only the first num_scheduled_tokens entries via copy_to_gpu; the remaining CUDA-graph padding slots keep their stale values. Every request therefore adds another +(padding_idx + 1) to those slots, until their values exceed max_position_embeddings.

For BAAI/bge-m3 (padding_idx=1, so an offset of 2) with short inputs padded to 8 tokens, a padding slot overflows after (8194 - V_init) / 2 ≈ 4000 requests, where V_init is the slot's initial value and 8194 is max_position_embeddings.
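The accumulation can be reproduced with a toy plain-Python simulation of the persistent buffer (sizes are illustrative; the real buffer is a torch tensor on the GPU, and here the padding slots start at 0):

```python
# Toy model of the persistent positions buffer under CUDA-graph padding.
PADDING_IDX = 1          # as for BAAI/bge-m3
MAX_POS = 8194           # max_position_embeddings for bge-m3
BUF_LEN = 8              # graph-padded length
NUM_TOKENS = 4           # actual scheduled tokens per request

buf = [0] * BUF_LEN
requests = 0
while max(buf) < MAX_POS:
    buf[:NUM_TOKENS] = range(NUM_TOKENS)  # copy_to_gpu refreshes only real tokens
    for i in range(BUF_LEN):              # in-place += touches padding slots too
        buf[i] += PADDING_IDX + 1
    requests += 1

print(requests)  # 4097 — padding slots hit MAX_POS after ~4000 requests
```

The real slots are rewritten every iteration and stay small; only the never-refreshed padding slots grow by 2 per request until they index past the embedding table.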

Fix: move the padding_idx + 1 offset into RobertaEmbedding.forward as a non-in-place add (position_ids + offset instead of position_ids += offset). This computes the correct positions each call without mutating the persistent buffer. Also fix the same in-place += pattern in the transformers LegacyMixin.
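The two behaviors can be contrasted with a plain-Python stand-in for the tensor (the actual code operates on a torch.Tensor inside RobertaEmbedding.forward; the function names here are illustrative, not vLLM's):

```python
OFFSET = 2  # padding_idx + 1 for bge-m3

def forward_buggy(position_ids):
    # In-place add: mutates the caller's persistent buffer.
    for i in range(len(position_ids)):
        position_ids[i] += OFFSET
    return position_ids

def forward_fixed(position_ids):
    # Non-in-place add: returns fresh values, buffer untouched.
    return [p + OFFSET for p in position_ids]

buf = [0, 1, 2, 3]
forward_buggy(buf)
print(buf)   # [2, 3, 4, 5] — persistent buffer was mutated

buf = [0, 1, 2, 3]
out = forward_fixed(buf)
print(buf)   # [0, 1, 2, 3] — buffer unchanged
print(out)   # [2, 3, 4, 5] — correct positions computed per call
```

With tensors the same distinction holds: `position_ids += offset` rewrites the shared storage, while `position_ids + offset` allocates a new tensor and leaves the model runner's buffer intact.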

Fixes #37648
Fixes #37868

Related: #37873 (alternative fix that zeroes the padding region in _preprocess)

Test Plan

Reproduce the bug (before fix)

vllm serve BAAI/bge-m3 --port 9001 \
  --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'

for i in $(seq 1 5000); do
  resp=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed", "input": ["test '$i'"]}')
  [ "$resp" != "200" ] && echo "FAIL at $i (HTTP $resp)" && break
done

Existing tests

pytest tests/models/language/pooling/test_bge_m3.py -v -s
pytest tests/models/language/pooling/test_embedding.py -v -s -k "stsb-roberta"
pytest tests/models/language/pooling/test_scoring.py -v -s

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

replace_roberta_positions() did an in-place += on the persistent
positions buffer. CUDA graph padding slots aren't refreshed between
requests, so the offset kept accumulating until the values overflowed
max_position_embeddings (~4000 requests for BGE-M3).

Move the padding_idx + 1 offset into RobertaEmbedding.forward as a
non-in-place add, which avoids mutating the shared buffer entirely.
Also fix the same pattern in the transformers legacy mixin.

Fixes vllm-project#37648
Fixes vllm-project#37868
@he-yufeng he-yufeng requested a review from hmellor as a code owner March 23, 2026 11:44
@mergify mergify bot added the nvidia and bug (Something isn't working) labels Mar 23, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical bug in RoBERTa-based models related to position_ids accumulation when using CUDA graphs. The root cause was an in-place modification of a persistent GPU buffer. The fix correctly replaces the in-place addition (+=) with a non-in-place operation in both vllm/model_executor/models/roberta.py and vllm/model_executor/models/transformers/legacy.py. Additionally, the changes in roberta.py refactor the code by removing a helper function and moving the position adjustment logic into the RobertaEmbedding.forward method, which is a more appropriate location. The changes are well-reasoned and appear to be a solid fix for the described issue.

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 23, 2026
@Isotr0py Isotr0py enabled auto-merge (squash) March 23, 2026 13:01
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed)
@Isotr0py Isotr0py merged commit ec22806 into vllm-project:main Mar 23, 2026
54 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 23, 2026
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…llm-project#37884)

Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026

Labels

bug: Something isn't working
nvidia
ready: ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

  • [Bug]: bge-m3 /pooling endpoint breaks in the latest main branch
  • BGE-M3 /pooling endpoint crashes with split_with_sizes error after ~50-100 requests

2 participants