
[Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding#37884

Merged
Isotr0py merged 2 commits into vllm-project:main from he-yufeng:fix/pooling-split-sizes-mismatch on Mar 23, 2026

Conversation

@he-yufeng
Contributor

Purpose

Fix a crash in all RoBERTa-based pooling/embedding models (BGE-M3, XLM-RoBERTa, stsb-roberta, bge-reranker-v2-m3) when CUDA graphs are enabled: after roughly 4,000 sequential requests the server dies with an out-of-bounds position-embedding index.

Root cause: replace_roberta_positions() did an in-place position_ids += padding_idx + 1 on the persistent GPU positions buffer. On each request the model runner refreshes only the first num_scheduled_tokens entries via copy_to_gpu; the remaining CUDA-graph padding slots keep their stale values. Every request therefore adds another +(padding_idx + 1) to those slots, until their values exceed max_position_embeddings.

For BAAI/bge-m3 (padding_idx=1, so an offset of 2) with short inputs padded to 8 tokens, a padding slot overflows after (8194 - V_init) / 2 ≈ 4000 requests, where V_init is the slot's initial value and 8194 is max_position_embeddings.
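The accumulation can be reproduced with a toy plain-Python simulation of the persistent buffer (sizes are illustrative; the real buffer is a torch tensor on the GPU, and here the padding slots start at 0):

```python
# Toy model of the persistent positions buffer under CUDA-graph padding.
PADDING_IDX = 1          # as for BAAI/bge-m3
MAX_POS = 8194           # max_position_embeddings for bge-m3
BUF_LEN = 8              # graph-padded length
NUM_TOKENS = 4           # actual scheduled tokens per request

buf = [0] * BUF_LEN
requests = 0
while max(buf) < MAX_POS:
    buf[:NUM_TOKENS] = range(NUM_TOKENS)  # copy_to_gpu refreshes only real tokens
    for i in range(BUF_LEN):              # in-place += touches padding slots too
        buf[i] += PADDING_IDX + 1
    requests += 1

print(requests)  # 4097 — padding slots hit MAX_POS after ~4000 requests
```

The real slots are rewritten every iteration and stay small; only the never-refreshed padding slots grow by 2 per request until they index past the embedding table.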

Fix: move the padding_idx + 1 offset into RobertaEmbedding.forward as a non-in-place add (position_ids + offset instead of position_ids += offset). This computes the correct positions each call without mutating the persistent buffer. Also fix the same in-place += pattern in the transformers LegacyMixin.
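The two behaviors can be contrasted with a plain-Python stand-in for the tensor (the actual code operates on a torch.Tensor inside RobertaEmbedding.forward; the function names here are illustrative, not vLLM's):

```python
OFFSET = 2  # padding_idx + 1 for bge-m3

def forward_buggy(position_ids):
    # In-place add: mutates the caller's persistent buffer.
    for i in range(len(position_ids)):
        position_ids[i] += OFFSET
    return position_ids

def forward_fixed(position_ids):
    # Non-in-place add: returns fresh values, buffer untouched.
    return [p + OFFSET for p in position_ids]

buf = [0, 1, 2, 3]
forward_buggy(buf)
print(buf)   # [2, 3, 4, 5] — persistent buffer was mutated

buf = [0, 1, 2, 3]
out = forward_fixed(buf)
print(buf)   # [0, 1, 2, 3] — buffer unchanged
print(out)   # [2, 3, 4, 5] — correct positions computed per call
```

With tensors the same distinction holds: `position_ids += offset` rewrites the shared storage, while `position_ids + offset` allocates a new tensor and leaves the model runner's buffer intact.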

Fixes #37648
Fixes #37868

Related: #37873 (alternative fix that zeroes the padding region in _preprocess)

Test Plan

Reproduce the bug (before fix)

vllm serve BAAI/bge-m3 --port 9001 \
  --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'

for i in $(seq 1 5000); do
  resp=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed", "input": ["test '$i'"]}')
  [ "$resp" != "200" ] && echo "FAIL at $i (HTTP $resp)" && break
done

Existing tests

pytest tests/models/language/pooling/test_bge_m3.py -v -s
pytest tests/models/language/pooling/test_embedding.py -v -s -k "stsb-roberta"
pytest tests/models/language/pooling/test_scoring.py -v -s

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

replace_roberta_positions() did an in-place += on the persistent
positions buffer. CUDA graph padding slots aren't refreshed between
requests, so the offset kept accumulating until the values overflowed
max_position_embeddings (~4000 requests for BGE-M3).

Move the padding_idx + 1 offset into RobertaEmbedding.forward as a
non-in-place add, which avoids mutating the shared buffer entirely.
Also fix the same pattern in the transformers legacy mixin.

Fixes vllm-project#37648
Fixes vllm-project#37868
@he-yufeng he-yufeng requested a review from hmellor as a code owner March 23, 2026 11:44
@mergify mergify bot added the nvidia and bug (Something isn't working) labels Mar 23, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical bug in RoBERTa-based models related to position_ids accumulation when using CUDA graphs. The root cause was an in-place modification of a persistent GPU buffer. The fix correctly replaces the in-place addition (+=) with a non-in-place operation in both vllm/model_executor/models/roberta.py and vllm/model_executor/models/transformers/legacy.py. Additionally, the changes in roberta.py refactor the code by removing a helper function and moving the position adjustment logic into the RobertaEmbedding.forward method, which is a more appropriate location. The changes are well-reasoned and appear to be a solid fix for the described issue.

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 23, 2026
@Isotr0py Isotr0py enabled auto-merge (squash) March 23, 2026 13:01
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed)
@Isotr0py Isotr0py merged commit ec22806 into vllm-project:main Mar 23, 2026
54 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 23, 2026
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…llm-project#37884)

Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026

Labels

bug: Something isn't working
nvidia
ready: ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

  • [Bug]: bge-m3 /pooling endpoint breaks in the latest main branch
  • BGE-M3 /pooling endpoint crashes with split_with_sizes error after ~50-100 requests

2 participants