Custom All Reduce for Piecewise Cuda Graph#15356
Conversation
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
ref: #14193
```diff
 class CustomAllreduce:
     _SUPPORTED_WORLD_SIZES = [2, 4, 6, 8]
-    _MAX_CAR_SIZE = 8192 * 1024
+    _MAX_CAR_SIZE = 8192 * 1024 * 4
```

This should be a fix: we need to calculate the max_size for the all_reduce.

Can we add a comment here explaining how it's calculated?
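One way such a cap could be documented, treating the all-reduce payload as tokens × hidden size × dtype bytes. This is an illustrative sketch only; the function name, variable names, and example numbers are assumptions, not sglang's actual derivation:

```python
# Hypothetical sketch of how an all-reduce buffer cap could be derived.
# All names and numbers here are illustrative assumptions, not sglang's
# actual calculation.
def max_car_size(max_tokens: int, hidden_size: int, dtype_bytes: int) -> int:
    """Upper bound on a single all-reduce payload, in bytes."""
    return max_tokens * hidden_size * dtype_bytes

# For example, 2048 tokens with hidden size 8192 in fp16 (2 bytes) equals
# exactly the new constant: 8192 * 1024 * 4 bytes (32 MiB).
print(max_car_size(2048, 8192, 2) == 8192 * 1024 * 4)  # True
```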
```python
# if not entry.use_cudagraph or skip_cuda_graphs:
#     return entry.runnable(*args)
if is_in_torch_compile():
    return entry.runnable(*args)
```

noob question: what is this for?

Basically, we should prevent replay from happening during capture, since the custom all-reduce buffer has not been allocated yet. Now that the capture function does warmup and capture only, we should avoid any compile operations being counted into the piecewise CUDA graph backend. So in the compile stage we skip all the later processing and directly return the eager run.

why do we need this protection since we already avoided replay in warmup?
https://github.com/sgl-project/sglang/pull/15356/changes#diff-b822ec9786c7a7d6d03d7187d6ad277435a67019576f4f7ef577d0ca2ee3c50eR512

Without this protection the warmup could happen inside the warmup_and_torch_compile function. For example, if we capture from 4 to 4096, then after the first run with 4096 in warmup_and_torch_compile, the other shapes from 4 to 3840 would be warmed up without this protection, since no recompile exists for a dense model.
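A minimal sketch of the control flow being discussed, assuming a flag that is set while torch.compile is tracing. The flag, the `Entry` class, and `call` are simplified stand-ins, not sglang's actual implementation:

```python
# Simplified stand-in for the guard discussed above: while torch.compile is
# tracing, fall back to the eager runnable so no CUDA-graph capture or replay
# is recorded into the compiled program. All names here are illustrative.
_in_torch_compile = False  # in sglang this would be a real compile-stage flag


def is_in_torch_compile() -> bool:
    return _in_torch_compile


class Entry:
    def __init__(self, runnable):
        self.runnable = runnable
        self.cudagraph = None  # captured CUDA graph, once it exists


def call(entry: "Entry", *args):
    if is_in_torch_compile():
        # Compile stage: skip capture/replay entirely and run eagerly.
        return entry.runnable(*args)
    # ... normal path: warmup, capture, or replay the CUDA graph (omitted) ...
    return entry.runnable(*args)
```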
could you test qwen vl as well? thanks

The results for the VL model are updated. Please check!
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
/tag-run-ci-label

/tag-and-rerun-ci
btw, can we remove `use_original_ca_comm` and `disable_ca_comm` from piecewise_cuda_graph_runner.py?

True
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
can you resolve the conflict?
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
solved
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
/rerun-failed-ci

/rerun-failed-ci
```python
# mind-exploding: carefully manage the reference and memory.
with torch.cuda.graph(cudagraph, pool=self.graph_pool):
    stream = get_pcg_capture_stream()
    assert stream is not None, "PCG capture stream is not set"
```

why can't we use `stream is None`?
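For what it's worth, `assert stream is not None` and a check on `stream is None` are logically equivalent (the assert fails exactly when `stream is None`); one practical difference is that bare asserts are stripped under `python -O`, so an explicit raise is a sturdier guard. A hypothetical alternative, where `require_stream` is an illustrative name and not sglang's actual code:

```python
# Illustrative alternative to the assert: an explicit check survives
# `python -O`, where assert statements are compiled away. This is a sketch,
# not sglang's actual code.
def require_stream(stream):
    if stream is None:
        raise RuntimeError("PCG capture stream is not set")
    return stream
```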
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Motivation

Enable Custom All Reduce in Piecewise Cuda Graph; equal contribution with @ByronHsu.

Modifications

- `graph_capture` during capture

Accuracy Tests

For VL Model:

Checklist