[bug][rocm]fix qr when variable inp #11609
Merged
+104 −8
When the tensor parallelism (TP) degree is set to 4 or 8, frequent changes in the input shape can cause QuickReduce to hang (this issue has been observed with the gpt_oss model).
We have identified that the root cause is overlapping flag memory addresses between consecutive AllReduce operations.
For most models, the hidden size remains relatively stable, so this issue does not occur.
Our current solution is to allocate separate memory regions for the flags and data of the two AllReduce phases in each operation.
(Note: The data region must also be separated, as overlapping would lead to correctness issues.)
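As a rough illustration of the intended layout (the names and sizes below are hypothetical, not the actual QuickReduce buffer constants), each phase's flags and data get their own fixed-offset region sized for the maximum message, so the offsets no longer depend on the current input shape:

```python
# Hypothetical scratch-buffer layout; constants and names are illustrative only.
MAX_MSG_BYTES = 2 * 1024 * 1024      # assumed per-rank maximum payload
FLAG_REGION_BYTES = 64 * 1024        # assumed space for per-block flags

def region_offsets():
    """Fixed offsets: phase-1/phase-2 flags and data never overlap,
    regardless of how large the current message actually is."""
    off = 0
    layout = {}
    for name, size in [
        ("phase1_flags", FLAG_REGION_BYTES),
        ("phase2_flags", FLAG_REGION_BYTES),
        ("phase1_data",  MAX_MSG_BYTES),
        ("phase2_data",  MAX_MSG_BYTES),
    ]:
        layout[name] = (off, off + size)
        off += size
    return layout

print(region_offsets())
```

Because the offsets are derived from a fixed maximum rather than the current message size, a [256, 2048] call can no longer land its phase-1 writes on top of the phase-2 region still in use by an in-flight [256, 1024] call.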
To reproduce the error:
1. Install SGLang.
2. Run python3 this_script.py.
The same command can also be used to verify that the hang is resolved and to check whether the results are reasonable.
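For reference, the failing pattern is simply back-to-back allreduces whose payload size keeps changing. The sketch below is purely illustrative: it uses stock torch.distributed rather than SGLang's QuickReduce path, so it will not reproduce the hang by itself; the actual reproduction goes through SGLang with the gpt_oss model at TP 4 or 8.

```python
# Illustrative only: mimics the alternating-shape allreduce pattern that
# triggers the hang, but uses plain torch.distributed instead of QuickReduce.
import torch
import torch.distributed as dist

def alternating_allreduce_loop(device, iters=1000):
    shapes = [(256, 1024), (256, 2048)]   # shapes taken from the observed logs
    for i in range(iters):
        x = torch.ones(shapes[i % len(shapes)], dtype=torch.float16, device=device)
        dist.all_reduce(x)                # consecutive calls with different payload sizes

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    alternating_allreduce_loop(torch.device("cuda", rank))
    dist.destroy_process_group()
```

Launching with torchrun --nproc_per_node=4 (or 8) matches the TP degrees where the hang was observed.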
A more detailed explanation
Why does a frequently changing input shape cause problems?
1. From the logs I obtained, the program does not actually hang at the [256, 2048] stage, but at the previous one, [256, 1024]. My hypothesis is that the n-th allreduce and the (n+1)-th allreduce overlap in time: when the (n+1)-th allreduce executes its phase 1, it modifies the flags still being used by the n-th allreduce's phase 2. For [256, 2048], the phase-1 address range completely overlaps the phase-1 + phase-2 range of [256, 1024] (a worked example follows after this list).
2. If we use dist.barrier(group=cpu_group) to guarantee that all ranks block at the same point, the program does not hang even after running for an hour.
3. Referring to vLLM's communication reduction (CR) implementation, using isolated addresses to distinguish the different phases of different allreduce batches is necessary to prevent interference between the n-th and (n+1)-th allreduce operations.
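To make item 1 concrete, assume (purely for illustration) that a call's phase-1 region starts at offset 0 of the shared scratch buffer and its phase-2 region starts right after the current payload. Then the phase boundaries move with the input shape, and the overlap follows from simple arithmetic:

```python
# Worked example for item 1. The offset rule (phase 2 starts right after the
# current payload) is an assumption used to illustrate the overlap; it is not
# copied from the QuickReduce source.
ELEM_BYTES = 2  # float16

def phase_ranges(shape):
    msg_bytes = shape[0] * shape[1] * ELEM_BYTES
    phase1 = (0, msg_bytes)                # [start, end) within the scratch buffer
    phase2 = (msg_bytes, 2 * msg_bytes)
    return phase1, phase2

small = phase_ranges((256, 1024))   # n-th allreduce
big   = phase_ranges((256, 2048))   # (n+1)-th allreduce

print("n-th     [256,1024]: phase1 =", small[0], "phase2 =", small[1])
print("(n+1)-th [256,2048]: phase1 =", big[0])
# phase1 of [256,2048] spans (0, 1 MiB), which covers both phase1 (0, 512 KiB)
# and phase2 (512 KiB, 1 MiB) of [256,1024]: the (n+1)-th call's phase-1 writes
# can clobber flags that the n-th call's phase 2 is still waiting on.
```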
Why don't other models have this issue?

For typical models, the hidden size is fixed, so the input shape does not change frequently, and phase 2 of the n-th allreduce and phase 1 of the (n+1)-th allreduce do not share addresses. However, for models such as GPT-OSS with variable-length inputs, conflicts can occur. What we need to do is completely isolate the addresses so that no conflict is possible.
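One common way to achieve that complete isolation (a sketch of the general technique, not the exact scheme used by this PR or by vLLM) is to key the flag slot on both the phase and a call counter, so consecutive allreduce calls can never reuse each other's flags:

```python
# Sketch of per-call, per-phase flag isolation; the slot layout is hypothetical.
NUM_PHASES = 2
NUM_CALL_SLOTS = 2   # double-buffer the flags across consecutive allreduce calls

_call_counter = 0

def flag_slot(phase):
    """Return a flag-region index that is unique per (call parity, phase)."""
    call_slot = _call_counter % NUM_CALL_SLOTS
    return call_slot * NUM_PHASES + phase

def next_allreduce():
    global _call_counter
    slots = [flag_slot(p) for p in range(NUM_PHASES)]
    _call_counter += 1
    return slots

print(next_allreduce())  # n-th call     -> slots [0, 1]
print(next_allreduce())  # (n+1)-th call -> slots [2, 3]
print(next_allreduce())  # (n+2)-th call -> slots [0, 1] again, safe once the n-th call has finished
```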