
Draft: [PD] NIXL Integration #5006

Closed
trevor-m wants to merge 6 commits into sgl-project:main from trevor-m:nixl

Conversation

@trevor-m
Collaborator

@trevor-m trevor-m commented Apr 2, 2025

Currently supports 1P+1D

Motivation

#4655

Modifications

Checklist

@trevor-m
Collaborator Author

trevor-m commented Apr 2, 2025

@ByronHsu Although I've implemented the transfer, the model output is still garbage. Do you know if there's any more dummy code for PD that needs to be implemented?

@ByronHsu
Collaborator

ByronHsu commented Apr 3, 2025

Cool! The change looks pretty neat! I suspect there is a misalignment between the KV tensors on the prefill and decode sides. Can you print the KV tensors on both sides and compare them?

@trevor-m
Collaborator Author

trevor-m commented Apr 7, 2025

@ByronHsu Good news, I fixed the model output issue.

@jokerwyt
Contributor

jokerwyt commented Apr 8, 2025

Neat work! Can we have a README for how to deploy a demo? I tried to run it with --disaggregation-mode, but my UCX doesn't seem to be working correctly.

[1744127532.843200] [TENCENT64:3124406:0]          tcp_ep.c:1258 UCX  ERROR tcp_ep 0x56393ca4b560 (state=CONNECTED): send(123) failed: Input/output error
[2025-04-08 15:52:12 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2011, in run_scheduler_process
    scheduler.event_loop_normal_disagg_prefill()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 663, in event_loop_normal_disagg_prefill
    self.process_batch_result_disagg_prefill(batch, result)
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 176, in process_batch_result_disagg_prefill
    self.send_kv_chunk(req, token_id=next_token_id)
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 249, in send_kv_chunk
    req.disagg_kv_sender.send(kv_indices)
  File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/conn.py", line 114, in send
    state = self.mgr.agent.transfer(self.xfer_handle)
  File "/usr/local/lib/python3.10/dist-packages/nixl/_api.py", line 268, in transfer
    status = self.agent.postXferReq(handle, notif_msg)
nixl._bindings.nixlBackendError: NIXL_ERR_BACKEND

error handling callback was invoked with status -25 (Connection reset by remote peer)

Guess: must we build UCX with --with-cuda or --with-gdrcopy? I didn't enable them.
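
In case those do turn out to be required, here is a minimal sketch of rebuilding UCX from source with CUDA (and optionally GDRCopy) enabled. The configure flags are standard UCX options, but the install prefixes and paths are assumptions for my environment:

git clone https://github.com/openucx/ucx.git
cd ucx
./autogen.sh
# --with-cuda builds the CUDA transports (cuda_copy/cuda_ipc); --with-gdrcopy is optional
./contrib/configure-release --prefix=/usr/local/ucx --with-cuda=/usr/local/cuda --with-gdrcopy=/usr/local
make -j && make install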

@hnyls2002
Collaborator

@trevor-m Great work, could you please share the installation guide and the commands to test the demo? Thanks a lot!

@trevor-m
Collaborator Author

trevor-m commented Apr 8, 2025

Hi @hnyls2002, the installation instructions for NIXL can be found here: https://github.com/ai-dynamo/nixl

I'm using a container with UCX already installed at /opt/hpcx/ucx, so these are the steps I used to build and install nixl. If you don't have UCX, the nixl README has instructions on how to build it.

git clone https://github.com/ai-dynamo/nixl.git
cd nixl
pip install meson
meson setup build
cd build
meson configure -Ducx_path=/opt/hpcx/ucx
ninja
ninja install
cd ..
pip install .
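
As a quick sanity check after pip install ., you can verify that the Python package resolves (this only confirms the import, it does not exercise any transfer):

python3 -c "import nixl; print(nixl.__file__)"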

To run the demo:

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill --port 30001

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30002 --base-gpu-id 1

python3 -m sglang.srt.disaggregation.mini_lb --prefill http://0.0.0.0:30001/ --decode http://0.0.0.0:30002/ --host 0.0.0.0 --port 30000

curl -X POST http://127.0.0.1:30000/generate -H "Content-Type: application/json" -d '{
  "text": "Let me tell you a lonnng story ",
  "sampling_params": {
    "temperature": 0
  }
}'

@jokerwyt
Contributor

jokerwyt commented Apr 9, 2025


Hi, I still cannot reproduce it following the steps above; it hits the same error as in my previous comment. Do you need GDRCopy enabled for NIXL?

@trevor-m
Collaborator Author

trevor-m commented Apr 9, 2025

Hi @jokerwyt, that error is coming from UCX. It could be a build configuration issue or a mismatch between UCX versions.
I don't think GDRCopy is required.

One suggestion I have is to use one of NVIDIA's containers that already has UCX installed, such as nvcr.io/nvidia/pytorch:25.03-py3. UCX is installed at /opt/hpcx/ucx, so you can follow the nixl build commands from my earlier comment: #5006 (comment)
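
One quick way to rule out a version mismatch is to compare the UCX build on the prefill and decode hosts (a sketch; /opt/hpcx/ucx is where that container installs UCX and is an assumption for other setups):

# run on both hosts and compare the output
/opt/hpcx/ucx/bin/ucx_info -v
# check whether CUDA transports were built in
/opt/hpcx/ucx/bin/ucx_info -d | grep -i cuda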

@jokerwyt
Contributor

jokerwyt commented Apr 12, 2025

Hi, NIXL is easy to use. I've extended this PR to fully support xPyD and tensor parallelism. I now have a 2P2D demo, with TP=2 for each instance, on a single node with 8 L20 GPUs:
trevor-m#1

@trevor-m
Collaborator Author

Closing in favor of #5477.

@trevor-m trevor-m closed this Apr 16, 2025