[PD] Support KV transfer with MORI-IO by maning00 · Pull Request #14626 · sgl-project/sglang

maning00 · 2025-12-08T07:47:47Z

Motivation

MORI-IO is AMD's high-performance, point-to-point communication library that leverages GDR (GPU Direct RDMA) to achieve ultra-low latency and high bandwidth for KV cache transfer in LLM inference. To enable efficient PD (Prefill-Decode) disaggregation on AMD hardware, this PR adopts the MORI-IO transfer engine as the transport layer for SGLang.

Modifications

Architecture Overview

The implementation follows a similar pattern to the mooncake transfer engine integration, with MORI-IO-specific optimizations:

1. MoriKVManager - Core Transfer Management

Initialization:
- Creates IOEngine with RDMA backend configuration
- Registers all GPU memory buffers with the engine
- Spawns dedicated threads for bootstrap or status polling
Configuration Options (via environment variables):
- SGLANG_MORI_QP_PER_TRANSFER: Number of queue pairs per transfer (default: 1)
- SGLANG_MORI_POST_BATCH_SIZE: RDMA post batch size (default: -1)
- SGLANG_MORI_NUM_WORKERS: Number of worker threads (default: 1)
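
A minimal sketch of how these three settings could be read, assuming plain integer parsing. The variable names and defaults are the ones listed above; the helper `read_mori_env` is hypothetical and is not taken from the PR's code.

```python
import os

# Defaults as listed in the PR description.
MORI_ENV_DEFAULTS = {
    "SGLANG_MORI_QP_PER_TRANSFER": 1,
    "SGLANG_MORI_POST_BATCH_SIZE": -1,
    "SGLANG_MORI_NUM_WORKERS": 1,
}


def read_mori_env(environ=None):
    """Hypothetical helper: return the three settings as ints.

    Unset or empty variables fall back to the documented defaults.
    """
    environ = os.environ if environ is None else environ
    settings = {}
    for name, default in MORI_ENV_DEFAULTS.items():
        raw = environ.get(name)
        settings[name] = default if raw is None or raw == "" else int(raw)
    return settings


if __name__ == "__main__":
    print(read_mori_env())
```

Passing a mapping instead of relying on `os.environ` keeps the helper easy to exercise without touching the process environment.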

2. MoriKVSender (Prefill Side)

Workflow:
1. Waits for decode instance registration via bootstrap thread
2. Receives transfer metadata (destination memory descriptors, indices) from decode
3. Issues RDMA writes using batch_write API for KV cache transfer
4. Sends auxiliary data via TCP (ZMQ)
5. Monitors transfer status and notifies decode instance upon completion

3. MoriKVReceiver (Decode Side)

Workflow:
1. Registers local engine descriptor and memory descriptors with prefill instance
2. Polls for transfer completion status
3. Receives auxiliary data via dedicated TCP handler
4. Updates request status based on prefill notifications
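
The sender and receiver workflows above can be pictured with a toy, in-process model. This is only an illustration of the ordering (register, write by index, send auxiliary data, notify); it does not use the MORI-IO API, and every name in it (`ToyReceiver`, `toy_send`, `PAGE`) is hypothetical. A `bytearray` stands in for registered GPU memory and a plain method call stands in for the TCP/ZMQ messages.

```python
from dataclasses import dataclass

PAGE = 4  # bytes per toy "KV page" (illustrative size only)


@dataclass
class ToyReceiver:
    """Decode side: owns the destination buffer and a request status."""
    kv_buffer: bytearray
    aux: bytes = b""
    status: str = "waiting"

    def register(self, dst_pages):
        # Decode step 1: tell the prefill side where it may write.
        return {"buffer": self.kv_buffer, "dst_pages": list(dst_pages)}

    def on_aux(self, payload):
        # Decode step 3: auxiliary data arrives on a separate channel.
        self.aux = payload

    def on_notify(self, ok):
        # Decode step 4: update status from the prefill notification.
        self.status = "success" if ok else "failed"


def toy_send(src, src_pages, registration, receiver, aux):
    """Prefill side: copy pages by index, send aux data, then notify."""
    dst = registration["buffer"]
    for s, d in zip(src_pages, registration["dst_pages"]):
        # Stand-in for one RDMA write of a single page.
        dst[d * PAGE:(d + 1) * PAGE] = src[s * PAGE:(s + 1) * PAGE]
    receiver.on_aux(aux)
    receiver.on_notify(True)


receiver = ToyReceiver(kv_buffer=bytearray(3 * PAGE))
registration = receiver.register(dst_pages=[2, 0])
prefill_kv = bytes(range(2 * PAGE))  # two source pages
toy_send(prefill_kv, [0, 1], registration, receiver, aux=b"first-token")
```

In the real backend the copy is a one-sided write issued by the prefill side, which is why the decode side has to publish its memory descriptors and indices before any data moves.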

4. Dockerfile

Adds the ENABLE_MORI and NIC_BACKEND build arguments to docker/rocm.Dockerfile so that MORI can be built with the dependencies for a specific network interface card (NIC).

Usage

Installation: Install MORI-IO library following the MORI installation guide:

cd mori 
pip install -r requirements-build.txt 
git submodule update --init --recursive 
pip3 install .

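SGLang PD Disaggregation with MORI-IO: Pass --disaggregation-transfer-backend mori to both the prefill and decode instances to enable the MORI-IO transfer engine. The benchmark commands under "Benchmarking and Profiling" show complete launch lines.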

Known Limitations

State data transfer not implemented: The MORI-IO implementation does not yet support state data transfer for hybrid models (Mamba, SWA, NSA).

Benchmarking and Profiling

End-to-End PD Disaggregation

Hardware Configuration:

GPUs: 8x AMD Instinct MI355X per node
CPUs: 2x AMD EPYC per node
Network: 8x AMD Pensando Pollara 400 AI-NIC per node
Model: DeepSeek-V3 with TP=8
Setup: 3-node configuration (1 prefill + 1 decode + 1 router)

Benchmark Command:

Prefill instance (node 1):

python -m sglang.launch_server \
  --model-path DeepSeek-V3 \
  --disaggregation-mode prefill \
  --host 0.0.0.0 --port 30002 \
  --tp-size 8 \
  --disaggregation-transfer-backend mori \
  --disable-radix-cache \
  --trust-remote-code

Decode instance (node 2):

python -m sglang.launch_server \
  --model-path DeepSeek-V3 \
  --disaggregation-mode decode \
  --host 0.0.0.0 --port 30003 \
  --tp-size 8 \
  --disaggregation-transfer-backend mori \
  --disable-radix-cache \
  --trust-remote-code

Router (node 3):

python -m sglang_router.launch_router \
  --pd-disaggregation --mini-lb \
  --prefill http://node1:30002 \
  --decode http://node2:30003 \
  --host 0.0.0.0 --port 8000

Benchmark client:

python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://127.0.0.1:8000 \
  --dataset-name random \
  --num-prompts 1 \
  --random-input <1024/2048/4096/8192> \
  --random-output 16

Accuracy test:

python3 -m sglang.test.few_shot_gsm8k \
    --host http://127.0.0.1 \
    --port 8000 \
    --num-questions 200 \
    --parallel 128 \
    --num-shots 5

Performance Results:

Comparison of STANDALONE (no PD disaggregation) vs MORI vs MOONCAKE backends. Each test was run 3 times and averaged.

Input Tokens	Metric	STANDALONE	MORI	MOONCAKE
1024	Throughput (tok/s)	21.91	21.22	21.24
1024	TTFT Mean (ms)	155.35	168.13	165.71
2048	Throughput (tok/s)	21.49	20.63	20.78
2048	TTFT Mean (ms)	162.24	176.57	173.54
4096	Throughput (tok/s)	21.51	20.68	20.74
4096	TTFT Mean (ms)	159.93	176.05	174.75
8192	Throughput (tok/s)	16.93	16.02	16.10
8192	TTFT Mean (ms)	294.41	326.19	322.70

FP8 KV Cache Results (--kv-cache-dtype fp8_e4m3):

Input Tokens	Metric	STANDALONE	MORI	MOONCAKE
1024	Throughput (tok/s)	56.47	51.79	51.37
1024	TTFT Mean (ms)	101.15	103.98	104.54
2048	Throughput (tok/s)	54.43	49.47	49.69
2048	TTFT Mean (ms)	104.72	109.89	108.40
4096	Throughput (tok/s)	54.47	49.72	49.16
4096	TTFT Mean (ms)	103.23	108.74	109.21
8192	Throughput (tok/s)	36.16	31.63	32.11
8192	TTFT Mean (ms)	208.80	229.15	223.93

gsm8k Accuracy Test Results:

Metric	STANDALONE	MORI	MOONCAKE
Accuracy	0.965	0.965	0.970
Invalid	0.000	0.000	0.000

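Both MORI and MOONCAKE leverage RDMA effectively, with near-identical performance profiles: at every measured input length, MORI's throughput and mean TTFT are within about 2.5% of MOONCAKE's, and GSM8K accuracy is comparable. These results support MORI-IO as a production-ready alternative for AMD hardware.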
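
As a quick check on how close the two backends are, the relative gap between MORI and MOONCAKE can be computed directly from the tables above. The numbers below are copied from those tables; the helper names (`rel_gap`, `max_gap_mori_vs_mooncake`) and the dictionary layout are illustrative only.

```python
# (STANDALONE, MORI, MOONCAKE) per input length, copied from the tables above.
DEFAULT_KV = {
    1024: {"throughput": (21.91, 21.22, 21.24), "ttft_ms": (155.35, 168.13, 165.71)},
    2048: {"throughput": (21.49, 20.63, 20.78), "ttft_ms": (162.24, 176.57, 173.54)},
    4096: {"throughput": (21.51, 20.68, 20.74), "ttft_ms": (159.93, 176.05, 174.75)},
    8192: {"throughput": (16.93, 16.02, 16.10), "ttft_ms": (294.41, 326.19, 322.70)},
}
FP8_KV = {
    1024: {"throughput": (56.47, 51.79, 51.37), "ttft_ms": (101.15, 103.98, 104.54)},
    2048: {"throughput": (54.43, 49.47, 49.69), "ttft_ms": (104.72, 109.89, 108.40)},
    4096: {"throughput": (54.47, 49.72, 49.16), "ttft_ms": (103.23, 108.74, 109.21)},
    8192: {"throughput": (36.16, 31.63, 32.11), "ttft_ms": (208.80, 229.15, 223.93)},
}


def rel_gap(value, reference):
    """Absolute difference as a fraction of the reference value."""
    return abs(value - reference) / reference


def max_gap_mori_vs_mooncake(table, metric):
    """Largest MORI-vs-MOONCAKE gap for one metric across input lengths."""
    return max(rel_gap(row[metric][1], row[metric][2]) for row in table.values())


if __name__ == "__main__":
    for label, table in (("default KV", DEFAULT_KV), ("fp8 KV", FP8_KV)):
        for metric in ("throughput", "ttft_ms"):
            gap = max_gap_mori_vs_mooncake(table, metric)
            print(f"{label:10s} {metric:10s} max gap: {gap:.2%}")
```

The same `rel_gap` helper can be pointed at the STANDALONE column to estimate the cost of disaggregation itself for this single-prompt workload.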

cc @inkcherry @TianDi101 @Duyi-Wang

ShangmingCai · 2025-12-23T10:40:00Z

python/sglang/srt/disaggregation/mori/conn.py

+        if self.kv_args.ib_device:
+            os.environ["MORI_RDMA_DEVICES"] = self.kv_args.ib_device
+
+        port = get_free_port()


This util is not robust and might cause port conflicts in some situations. Is it possible to borrow some ideas from get_zmq_socket_on_host?

Thank you! Fixed by:

- Using port=0 to let the OS atomically bind an available port
- Retrieving the actual bound port from Mori's TCP stack

This eliminates the race condition entirely. Thanks for catching this!
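
The idea behind the fix can be shown with the Python standard library. This sketch uses a plain `socket`, not Mori's TCP stack, and `bind_ephemeral` is a hypothetical helper name: binding to port 0 makes the OS choose a free port at bind time, and `getsockname()` reports which one it chose, so there is no window between "find a free port" and "bind it" in which another process could take the port.

```python
import socket


def bind_ephemeral(host="127.0.0.1"):
    """Bind a listening TCP socket to an OS-chosen port.

    Returns the open socket and the port the OS assigned. The socket stays
    open, so the port cannot be claimed by another process in the meantime.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0: let the OS pick a free port
    sock.listen()
    port = sock.getsockname()[1]  # the port that was actually bound
    return sock, port


if __name__ == "__main__":
    sock, port = bind_ephemeral()
    print(f"listening on port {port}")
    sock.close()
```

By contrast, a helper that probes for a free port, closes the probe socket, and returns the number leaves a gap in which the port can be reused before the real bind happens.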

ShangmingCai

This PR seems very complete. Would it be convenient to add a test to AMD's CI to verify the correctness?

ShangmingCai · 2025-12-24T07:30:40Z

/tag-and-rerun-ci

maning00 · 2025-12-25T06:10:57Z

This PR seems very complete. Would it be convenient to add a test to AMD's CI to verify the correctness?

@ShangmingCai I added a per-commit-amd correctness test for the MORI integration (protocol/serialization/message-handling) using a lightweight fake mori module, since the current AMD CI image (rocm/sgl-dev) doesn’t include mori.
Would you prefer we (1) keep this as-is, or (2) additionally add a real MORI E2E test after mori is available in the AMD CI image?

ShangmingCai · 2025-12-25T06:16:47Z

@maning00 I prefer 2, we can add the E2E test later. Do you mind moving this test to the manual dir? Most of the disaggregation tests are in the test/srt dir, adding a new disaggregation dir would bring some confusion. And also, this simulation test can not guarantee the accuracy of the mori backend (no need to waste the CI for it), so we should put it in the manual dir, and add a real E2E test when it is ready.

ShangmingCai · 2026-01-04T07:12:14Z

/tag-and-rerun-ci

maning00 · 2026-01-05T06:33:39Z

@ShangmingCai I have moved the tests to test/manual/test_mori_transfer_engine_e2e.py as suggested.
It includes:
- TestMoriTransferEngineE2E: A manual E2E smoke test.
- TestMoriTransferEngineTPMismatchE2E: A manual test for the TP mismatch scenario (requires >= 6 GPUs).
Both require SGLANG_MORI_MANUAL_E2E=1 to run, ensuring they don't affect standard CI.
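
One common way to gate tests on an environment variable like this is `unittest.skipUnless`. The sketch below assumes that pattern; the class `ToyManualE2E` and the helpers are hypothetical stand-ins and are not the contents of test/manual/test_mori_transfer_engine_e2e.py.

```python
import os
import unittest


def manual_e2e_enabled(environ=None):
    """True only when SGLANG_MORI_MANUAL_E2E is set to exactly "1"."""
    environ = os.environ if environ is None else environ
    return environ.get("SGLANG_MORI_MANUAL_E2E") == "1"


@unittest.skipUnless(manual_e2e_enabled(), "set SGLANG_MORI_MANUAL_E2E=1 to run")
class ToyManualE2E(unittest.TestCase):
    """Hypothetical placeholder for a manual end-to-end test class."""

    def test_smoke(self):
        # A real test would launch prefill/decode servers and send a request.
        self.assertTrue(True)


def run_suite():
    """Run the toy suite and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToyManualE2E)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    result = run_suite()
    print(f"skipped={len(result.skipped)} failures={len(result.failures)}")
```

With the variable unset the class is reported as skipped rather than failed, which is what keeps such tests from affecting a standard CI run.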

maning00 · 2026-01-05T06:39:00Z

Please incorporate mori.Dockerfile into existing rocm.Dockerfile.

Sure. mori.Dockerfile mainly adds AMD AINIC-related dependencies plus MORI build/config flags. I can incorporate it into the existing rocm.Dockerfile and make it an optional feature (e.g., gated by a build arg), so the default image stays unchanged.

Can think to add NIC options to support AMD AINIC, Broadcom Thor2, etc.

@HaiShaw Done. I have incorporated the MORI-related setup into docker/rocm.Dockerfile (and removed mori.Dockerfile).

To ensure the default image remains unaffected, I added two build arguments:

- ENABLE_MORI: Defaults to 0 (disabled). Set to 1 to enable the MORI build.
- NIC_BACKEND: Defaults to none. Supports ainic (for AMD AINIC) or bnxt (for Broadcom) to install specific NIC dependencies.

Example usage: docker build ... --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic ...

maning00 · 2026-01-05T06:39:52Z

@maning00 have a perf table as above with fp8 kv cache?

@HaiShaw Not yet — I haven’t run the FP8 KV cache perf sweep. I can benchmark it and add a perf table in a follow-up update to this PR (same format as the existing table), once the runs finish.

@HaiShaw Added FP8 KV cache perf results

ShangmingCai · 2026-01-06T06:32:53Z

/rerun-failed-ci

HaiShaw · 2026-01-16T15:11:21Z

@Lzy17 @yctseng0211 @bingxche Please help to enable MORI-IO PD/D CI tests.

HaiShaw

Is --page_size > 1 supported?
Please also add description for tuning following variables:

SGLANG_MORI_QP_PER_TRANSFER: Number of queue pairs per transfer (default: 1)
SGLANG_MORI_POST_BATCH_SIZE: RDMA post batch size (default: -1)
SGLANG_MORI_NUM_WORKERS: Number of worker threads (default: 1)

docker/rocm.Dockerfile

HaiShaw · 2026-01-16T15:27:01Z

docker/rocm.Dockerfile

+      export USE_IONIC="OFF"; \
+      export USE_BNXT="ON"; \
+      echo "[MORI] NIC_BACKEND=bnxt: USE_BNXT=ON. Add Broadcom bnxt packages/repos here when available."; \
+      ;; \


When can we have bnxt support added here?
cc @Lzy17

Update: To ensure full functionality for mori (io and ep), BRCM support will be integrated later once IBGDA support is fully available in the official library.

Do you mean mori does not yet support BRCM rdma-core?

No, that is not the case. This is mainly because mori-ep and mori-io are built simultaneously. mori-ep currently depends on a pre-release version of the BRCM library (for IBGDA).

python/sglang/srt/disaggregation/utils.py

HaiShaw · 2026-01-16T16:09:27Z

docker/rocm.Dockerfile

+  git clone "${MORI_REPO}" /sgl-workspace/mori; \
+  cd /sgl-workspace/mori; \
+  git checkout "${MORI_COMMIT}"; \
+  git submodule update --init --recursive; \


Need requirements.txt check?

No need to check here; the only dependency is torch.

@kkHuang-amd let's keep an eye on this going forward

HaiShaw · 2026-01-17T06:55:14Z

@maning00 Please add a basic accuracy test (gsm8k, etc.) on DPSK from 1P1D.
cc @kkHuang-amd @Lzy17

maning00 · 2026-01-19T06:12:01Z

Is --page_size > 1 supported? Please also add description for tuning following variables:

SGLANG_MORI_QP_PER_TRANSFER: Number of queue pairs per transfer (default: 1)
SGLANG_MORI_POST_BATCH_SIZE: RDMA post batch size (default: -1)
SGLANG_MORI_NUM_WORKERS: Number of worker threads (default: 1)

Sure, it is supported.
Added description for these variables

maning00 · 2026-01-19T06:13:32Z

@maning00 Please add a basic accuracy test (gsm8k, etc.) on DPSK from 1P1D. cc @kkHuang-amd @Lzy17

Added gsm8k test results

HaiShaw

Should add BNXT, etc. support later with respect to NIC_BACKEND

HaiShaw · 2026-01-23T08:17:35Z

docker/rocm.Dockerfile

+      rm -rf /var/lib/apt/lists/*; \
+      ;; \
+    *) \
+      echo "ERROR: unknown NIC_BACKEND=${NIC_BACKEND}. Use one of: none, ainic, bnxt"; \


Where is bnxt handled, or should we change this echo message?

maning00 and others added 11 commits December 23, 2025 17:32

implement MoriKVManager

d20e07b

add batch api

aa9ce47

format code

b3012f5

remove unused

9394a0f

use user specified nic

e716082

fix engine_key collision

fe53866

Disable unused notification

aa0f1fb

Use TCP to send AUX (#2)

1b2ec21

* add disable notif * send aux with tcp * remove unused log --------- Co-authored-by: cwortman-amd <cwortman@amd.com>

add mix tp support

cb7fd01

remove unused cond states && improve offset calc

8f29dd5

optim _issue_layer_transfers to utilize buffer merge

05bd4c5

maning00 force-pushed the v0.5.6.mori-io branch from b819d55 to 05bd4c5 Compare December 23, 2025 09:32

add mori dockerfile

99d86fe

maning00 marked this pull request as ready for review December 23, 2025 10:24

maning00 requested review from ByronHsu, Fridge003, HaiShaw, ShangmingCai, hnyls2002, ishandhanani and ispobock as code owners December 23, 2025 10:24

ShangmingCai reviewed Dec 23, 2025

remove use of get_free_port

d660c2d

maning00 force-pushed the v0.5.6.mori-io branch from aee7c7c to d660c2d Compare December 24, 2025 06:06

github-actions bot added the run-ci label Dec 24, 2025

test: add MORI correctness test

ce368a1

maning00 force-pushed the v0.5.6.mori-io branch from 4c23f30 to ce368a1 Compare December 26, 2025 05:24

github-actions bot added the amd label Jan 4, 2026

maning00 force-pushed the v0.5.6.mori-io branch from 91017ce to 4a1a400 Compare January 4, 2026 06:24

update rocm.Dockerfile

cef2141

maning00 force-pushed the v0.5.6.mori-io branch from 4a1a400 to cef2141 Compare January 4, 2026 06:55

maning00 added 2 commits January 8, 2026 15:39

Merge remote-tracking branch 'upstream/main' into v0.5.6.mori-io

4670de3

Merge branch 'main' into v0.5.6.mori-io

abbab1f

kkHuang-amd mentioned this pull request Jan 13, 2026

Integration mori backend for EP a2a data communication #17012

Merged

5 tasks

HaiShaw requested changes Jan 16, 2026

update

c0d7344

HaiShaw requested changes Jan 23, 2026

add comments

2ef8c4c

HaiShaw approved these changes Jan 23, 2026

HaiShaw and others added 2 commits January 23, 2026 01:14

Merge branch 'main' into v0.5.6.mori-io

7793219

Merge branch 'main' into v0.5.6.mori-io

fda7110

HaiShaw merged commit cbf90d7 into sgl-project:main Jan 29, 2026
173 of 188 checks passed

charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026

[PD] Support KV transfer with MORI-IO (sgl-project#14626)

1574201

Co-authored-by: cwortman-amd <cwortman@amd.com>

Chen-0210 pushed a commit to Chen-0210/sglang that referenced this pull request Jan 30, 2026

[PD] Support KV transfer with MORI-IO (sgl-project#14626)

71584bf

Co-authored-by: cwortman-amd <cwortman@amd.com>

sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026

[PD] Support KV transfer with MORI-IO (sgl-project#14626)

cf00e60

Co-authored-by: cwortman-amd <cwortman@amd.com>

Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

[PD] Support KV transfer with MORI-IO (sgl-project#14626)

15b9794

Co-authored-by: cwortman-amd <cwortman@amd.com>
