
Draft: [PD] NIXL Integration #5006

Closed
trevor-m wants to merge 6 commits into sgl-project:main from trevor-m:nixl

Conversation

@trevor-m
Collaborator

@trevor-m trevor-m commented Apr 2, 2025

Currently supports 1P+1D

Motivation

#4655

Modifications

Checklist

@trevor-m
Collaborator Author

trevor-m commented Apr 2, 2025

@ByronHsu Although I've implemented the transfer, the model output is still garbage. Do you know if there's any more dummy code for PD that needs to be implemented?

@ByronHsu
Collaborator

ByronHsu commented Apr 3, 2025

Cool! The change looks pretty neat! I suspect there is a misalignment between the KV tensors on the prefill and decode sides. Can you print the KV tensors on both sides and compare them?

@trevor-m
Collaborator Author

trevor-m commented Apr 7, 2025

@ByronHsu Good news, I fixed the model output issue.

@jokerwyt
Contributor

jokerwyt commented Apr 8, 2025

Neat work! Can we have a README for how to deploy a demo? I tried to run it with --disaggregation-mode, but my UCX doesn't seem to be working correctly.

[1744127532.843200] [TENCENT64:3124406:0]          tcp_ep.c:1258 UCX  ERROR tcp_ep 0x56393ca4b560 (state=CONNECTED): send(123) failed: Input/output error
[2025-04-08 15:52:12 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2011, in run_scheduler_process
    scheduler.event_loop_normal_disagg_prefill()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 663, in event_loop_normal_disagg_prefill
    self.process_batch_result_disagg_prefill(batch, result)
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 176, in process_batch_result_disagg_prefill
    self.send_kv_chunk(req, token_id=next_token_id)
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 249, in send_kv_chunk
    req.disagg_kv_sender.send(kv_indices)
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/conn.py", line 114, in send
    state = self.mgr.agent.transfer(self.xfer_handle)
  File "/usr/local/lib/python3.10/dist-packages/nixl/_api.py", line 268, in transfer
    status = self.agent.postXferReq(handle, notif_msg)
nixl._bindings.nixlBackendError: NIXL_ERR_BACKEND

error handling callback was invoked with status -25 (Connection reset by remote peer)

Guess: must we build UCX with --with-cuda or --with-gdrcopy? I didn't enable them.
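
In case those do turn out to be required, here is a minimal sketch of rebuilding UCX from source with CUDA (and optionally GDRCopy) enabled. The configure flags are standard UCX options, but the install prefixes and paths are assumptions for my environment:

git clone https://github.com/openucx/ucx.git
cd ucx
./autogen.sh
# --with-cuda builds the CUDA transports (cuda_copy/cuda_ipc); --with-gdrcopy is optional
./contrib/configure-release --prefix=/usr/local/ucx --with-cuda=/usr/local/cuda --with-gdrcopy=/usr/local
make -j && make install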

@hnyls2002
Collaborator

@trevor-m Great work, could you please share the installation guide and the commands to test the demo? Thanks a lot!

@trevor-m
Collaborator Author

trevor-m commented Apr 8, 2025

Hi @hnyls2002, the installation instructions for NIXL can be found here: https://github.com/ai-dynamo/nixl

I'm using a container with UCX already installed at /opt/hpcx/ucx, so these are the steps I used to build and install nixl. If you don't have UCX, the nixl README has instructions on how to build it.

git clone https://github.com/ai-dynamo/nixl.git
cd nixl
pip install meson
meson setup build
cd build
meson configure -Ducx_path=/opt/hpcx/ucx
ninja
ninja install
cd ..
pip install .
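
As a quick sanity check after pip install ., you can verify that the Python package resolves (this only confirms the import, it does not exercise any transfer):

python3 -c "import nixl; print(nixl.__file__)"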

To run the demo:

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill --port 30001

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30002 --base-gpu-id 1

python3 -m sglang.srt.disaggregation.mini_lb --prefill http://0.0.0.0:30001/ --decode http://0.0.0.0:30002/ --host 0.0.0.0 --port 30000

curl -X POST http://127.0.0.1:30000/generate -H "Content-Type: application/json" -d '{
  "text": "Let me tell you a lonnng story ",
  "sampling_params": {
    "temperature": 0
  }
}'

@jokerwyt
Contributor

jokerwyt commented Apr 9, 2025


Hi, I still cannot reproduce it following the steps above; it hits the same error as in my previous comment. Do you need GDRCopy enabled for NIXL?

@trevor-m
Collaborator Author

trevor-m commented Apr 9, 2025

Hi @jokerwyt, that error is coming from UCX. It could be a build configuration issue or a mismatch between UCX versions.
I don't think GDRCopy is required.

One suggestion I have is to use one of NVIDIA's containers that already has UCX installed, such as nvcr.io/nvidia/pytorch:25.03-py3. UCX is installed at /opt/hpcx/ucx, so you can follow the nixl build commands from my earlier comment: #5006 (comment)
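
One quick way to rule out a version mismatch is to compare the UCX build on the prefill and decode hosts (a sketch; /opt/hpcx/ucx is where that container installs UCX and is an assumption for other setups):

# run on both hosts and compare the output
/opt/hpcx/ucx/bin/ucx_info -v
# check whether CUDA transports were built in
/opt/hpcx/ucx/bin/ucx_info -d | grep -i cuda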

@jokerwyt
Contributor

jokerwyt commented Apr 12, 2025

Hi, NIXL is easy to use. I've extended this PR to fully support xPyD and tensor parallelism. I now have a 2P2D demo, with TP=2 for each instance, on a single node with 8 L20 GPUs:
trevor-m#1

@trevor-m
Collaborator Author

Closing in favor of #5477.

@trevor-m trevor-m closed this Apr 16, 2025