
[PD] Support KV transfer with MORI-IO #14626

Merged
HaiShaw merged 21 commits into sgl-project:main from maning00:v0.5.6.mori-io
Jan 29, 2026

maning00 (Contributor) commented Dec 8, 2025

Motivation

MORI-IO is AMD's high-performance point-to-point communication library. It uses GDR (GPU Direct RDMA) to provide low-latency, high-bandwidth KV cache transfer for LLM inference. To enable efficient PD (Prefill-Decode) disaggregation on AMD hardware, we adopt the MORI-IO transfer engine as the transport layer for SGLang.

Modifications

Architecture Overview

The implementation follows a pattern similar to the Mooncake transfer engine integration, with MORI-IO-specific optimizations:

1. MoriKVManager - Core Transfer Management

  • Initialization:

    • Creates IOEngine with RDMA backend configuration
    • Registers all GPU memory buffers with the engine
    • Spawns dedicated threads for bootstrap or status polling
  • Configuration Options (via environment variables):

    • SGLANG_MORI_QP_PER_TRANSFER: Number of queue pairs per transfer (default: 1)
    • SGLANG_MORI_POST_BATCH_SIZE: RDMA post batch size (default: -1)
    • SGLANG_MORI_NUM_WORKERS: Number of worker threads (default: 1)
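
As a minimal sketch of how the three variables above could be read with their documented defaults: the `MoriTransferConfig` class and `load_mori_config` helper are hypothetical names used for illustration and are not taken from the PR's code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MoriTransferConfig:
    """Hypothetical container for the three tuning knobs listed above."""
    qp_per_transfer: int
    post_batch_size: int
    num_workers: int


def load_mori_config(env=os.environ) -> MoriTransferConfig:
    # Defaults mirror the values documented in this PR description.
    return MoriTransferConfig(
        qp_per_transfer=int(env.get("SGLANG_MORI_QP_PER_TRANSFER", "1")),
        post_batch_size=int(env.get("SGLANG_MORI_POST_BATCH_SIZE", "-1")),
        num_workers=int(env.get("SGLANG_MORI_NUM_WORKERS", "1")),
    )
```

Passing a plain dict instead of `os.environ` (for example `load_mori_config({"SGLANG_MORI_NUM_WORKERS": "4"})`) makes the parsing easy to check without touching the real environment.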

2. MoriKVSender (Prefill Side)

  • Workflow:
    1. Waits for decode instance registration via bootstrap thread
    2. Receives transfer metadata (destination memory descriptors, indices) from decode
    3. Issues RDMA writes using batch_write API for KV cache transfer
    4. Sends auxiliary data via TCP (ZMQ)
    5. Monitors transfer status and notifies decode instance upon completion

3. MoriKVReceiver (Decode Side)

  • Workflow:
    1. Registers local engine descriptor and memory descriptors with prefill instance
    2. Polls for transfer completion status
    3. Receives auxiliary data via dedicated TCP handler
    4. Updates request status based on prefill notifications
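
The sender and receiver workflows above can be illustrated with a toy, in-process model. This is only a sketch of the message flow: it uses no MORI or SGLang API, a `bytearray` stands in for registered GPU memory, a slice copy stands in for an RDMA write, and every class and method name (`ToySender`, `ToyReceiver`, `registration`, and so on) is hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    WAITING = "waiting"
    SUCCESS = "success"


@dataclass
class ToyReceiver:
    """Decode side: owns the destination buffer and polls a status flag."""
    kv_buffer: bytearray
    status: Status = Status.WAITING

    def registration(self, dst_indices):
        # Decode step 1: describe where the prefill side should write.
        return {"buffer": self.kv_buffer, "dst_indices": list(dst_indices)}

    def on_notify(self):
        # Decode step 4: the prefill side reported completion.
        self.status = Status.SUCCESS

    def poll(self):
        # Decode step 2: check the transfer status.
        return self.status


@dataclass
class ToySender:
    """Prefill side: waits for registration, writes pages, then notifies."""
    page_size: int
    registrations: list = field(default_factory=list)

    def register(self, meta):
        # Prefill steps 1-2: receive the decode side's metadata.
        self.registrations.append(meta)

    def send(self, src, src_indices, receiver):
        meta = self.registrations.pop(0)
        for s, d in zip(src_indices, meta["dst_indices"]):
            # Prefill step 3: stand-in for an RDMA write of one page.
            chunk = src[s * self.page_size:(s + 1) * self.page_size]
            meta["buffer"][d * self.page_size:(d + 1) * self.page_size] = chunk
        # Prefill step 5: notify the decode side.
        receiver.on_notify()


receiver = ToyReceiver(bytearray(8))
sender = ToySender(page_size=2)
sender.register(receiver.registration([3, 0]))
sender.send(b"aabbccdd", [0, 1], receiver)
```

After the call, source pages 0 and 1 have landed in destination pages 3 and 0, and `receiver.poll()` reports `Status.SUCCESS`. The auxiliary-data path over TCP (ZMQ) is left out of the sketch.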

4. Dockerfile

Adds a NIC_BACKEND build option so that MORI support can be built for different network interface cards (NICs).

Usage

Installation: Install MORI-IO library following the MORI installation guide:

cd mori 
pip install -r requirements-build.txt 
git submodule update --init --recursive 
pip3 install .

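SGLang PD Disaggregation with MORI-IO: Pass --disaggregation-transfer-backend mori to both the prefill and decode servers to enable the MORI-IO transfer engine. Full launch commands are listed under Benchmark Command below.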

Known Limitations

State data transfer not implemented: The MORI-IO implementation does not yet support state data transfer for hybrid models (Mamba, SWA, NSA).

Benchmarking and Profiling

End-to-End PD Disaggregation

Hardware Configuration:

  • GPUs: 8x AMD Instinct MI355X per node
  • CPUs: 2x AMD EPYC per node
  • Network: 8x AMD Pensando Pollara 400 AI-NIC per node
  • Model: DeepSeek-V3 with TP=8
  • Setup: 3-node configuration (1 prefill + 1 decode + 1 router)

Benchmark Command:

Prefill instance (node 1):

python -m sglang.launch_server \
  --model-path DeepSeek-V3 \
  --disaggregation-mode prefill \
  --host 0.0.0.0 --port 30002 \
  --tp-size 8 \
  --disaggregation-transfer-backend mori \
  --disable-radix-cache \
  --trust-remote-code

Decode instance (node 2):

python -m sglang.launch_server \
  --model-path DeepSeek-V3 \
  --disaggregation-mode decode \
  --host 0.0.0.0 --port 30003 \
  --tp-size 8 \
  --disaggregation-transfer-backend mori \
  --disable-radix-cache \
  --trust-remote-code

Router (node 3):

python -m sglang_router.launch_router \
  --pd-disaggregation --mini-lb \
  --prefill http://node1:30002 \
  --decode http://node2:30003 \
  --host 0.0.0.0 --port 8000

Benchmark client:

python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://127.0.0.1:8000 \
  --dataset-name random \
  --num-prompts 1 \
  --random-input <1024/2048/4096/8192> \
  --random-output 16

Accuracy test:

python3 -m sglang.test.few_shot_gsm8k \
    --host http://127.0.0.1 \
    --port 8000 \
    --num-questions 200 \
    --parallel 128 \
    --num-shots 5

Performance Results:

Comparison of STANDALONE (no PD disaggregation) vs MORI vs MOONCAKE backends. Each test was run 3 times and averaged.

| Input Tokens | Metric | STANDALONE | MORI | MOONCAKE |
|---|---|---|---|---|
| 1024 | Throughput (tok/s) | 21.91 | 21.22 | 21.24 |
| 1024 | TTFT Mean (ms) | 155.35 | 168.13 | 165.71 |
| 2048 | Throughput (tok/s) | 21.49 | 20.63 | 20.78 |
| 2048 | TTFT Mean (ms) | 162.24 | 176.57 | 173.54 |
| 4096 | Throughput (tok/s) | 21.51 | 20.68 | 20.74 |
| 4096 | TTFT Mean (ms) | 159.93 | 176.05 | 174.75 |
| 8192 | Throughput (tok/s) | 16.93 | 16.02 | 16.10 |
| 8192 | TTFT Mean (ms) | 294.41 | 326.19 | 322.70 |

FP8 KV Cache Results (--kv-cache-dtype fp8_e4m3):

| Input Tokens | Metric | STANDALONE | MORI | MOONCAKE |
|---|---|---|---|---|
| 1024 | Throughput (tok/s) | 56.47 | 51.79 | 51.37 |
| 1024 | TTFT Mean (ms) | 101.15 | 103.98 | 104.54 |
| 2048 | Throughput (tok/s) | 54.43 | 49.47 | 49.69 |
| 2048 | TTFT Mean (ms) | 104.72 | 109.89 | 108.40 |
| 4096 | Throughput (tok/s) | 54.47 | 49.72 | 49.16 |
| 4096 | TTFT Mean (ms) | 103.23 | 108.74 | 109.21 |
| 8192 | Throughput (tok/s) | 36.16 | 31.63 | 32.11 |
| 8192 | TTFT Mean (ms) | 208.80 | 229.15 | 223.93 |

gsm8k Accuracy Test Results:

| Metric | STANDALONE | MORI | MOONCAKE |
|---|---|---|---|
| Accuracy | 0.965 | 0.965 | 0.970 |
| Invalid | 0.000 | 0.000 | 0.000 |

Both MORI and MOONCAKE leverage RDMA effectively, with near-identical performance profiles (throughput and mean TTFT within about 2.5% of each other in every configuration above), validating MORI-IO as a production-ready alternative for AMD hardware.
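
To make the comparison concrete, the following sketch computes relative TTFT differences from the first (non-FP8) table. The numbers are copied from that table; the `pct_over` helper is a name introduced here for illustration.

```python
# TTFT Mean (ms) from the first table above: (STANDALONE, MORI, MOONCAKE).
ttft_ms = {
    1024: (155.35, 168.13, 165.71),
    2048: (162.24, 176.57, 173.54),
    4096: (159.93, 176.05, 174.75),
    8192: (294.41, 326.19, 322.70),
}


def pct_over(value, baseline):
    """Relative difference of value over baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


for tokens, (standalone, mori, mooncake) in ttft_ms.items():
    print(
        f"{tokens:>5}: MORI vs STANDALONE {pct_over(mori, standalone):+.1f}%, "
        f"MORI vs MOONCAKE {pct_over(mori, mooncake):+.1f}%"
    )
```

For these four rows the MORI-to-MOONCAKE TTFT gap stays below 2%, while the added TTFT of PD disaggregation with MORI relative to STANDALONE is roughly 8% to 11%.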

cc @inkcherry @TianDi101 @Duyi-Wang


maning00 marked this pull request as ready for review December 23, 2025 10:24

if self.kv_args.ib_device:
    os.environ["MORI_RDMA_DEVICES"] = self.kv_args.ib_device

port = get_free_port()

Collaborator

This util is not robust, which might cause port conflict in some situations. Is it possible to borrow some idea from get_zmq_socket_on_host?

maning00 (Contributor Author)

Thank you! Fixed by:

  1. Using port=0 to let the OS atomically bind an available port
  2. Retrieving the actual bound port from Mori's TCP stack

This eliminates the race condition entirely. Thanks for catching this!
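
The general technique behind this fix can be sketched with Python's standard `socket` module. This is not Mori's TCP stack, only an illustration of binding to port 0 and reading back the assigned port; `bind_ephemeral_port` is a hypothetical helper name.

```python
import socket


def bind_ephemeral_port(host="127.0.0.1"):
    """Bind to port 0 so the OS picks a free port, then report which one."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    # The socket already owns the port, so no other process can take it
    # between "find a free port" and "bind to it".
    port = sock.getsockname()[1]
    return sock, port
```

The caller keeps the returned socket open for as long as it needs the port. A "find a free port, close the probe socket, bind later" helper leaves a window in which another process can claim the port.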

ShangmingCai (Collaborator) left a comment

This PR seems very complete. Would it be convenient to add a test to AMD's CI to verify the correctness?

ShangmingCai (Collaborator)

/tag-and-rerun-ci

maning00 (Contributor Author)

> This PR seems very complete. Would it be convenient to add a test to AMD's CI to verify the correctness?

@ShangmingCai I added a per-commit-amd correctness test for the MORI integration (protocol/serialization/message-handling) using a lightweight fake mori module, since the current AMD CI image (rocm/sgl-dev) doesn’t include mori.
Would you prefer we (1) keep this as-is, or (2) additionally add a real MORI E2E test after mori is available in the AMD CI image?

ShangmingCai (Collaborator) commented Dec 25, 2025

> This PR seems very complete. Would it be convenient to add a test to AMD's CI to verify the correctness?

> @ShangmingCai I added a per-commit-amd correctness test for the MORI integration (protocol/serialization/message-handling) using a lightweight fake mori module, since the current AMD CI image (rocm/sgl-dev) doesn’t include mori. Would you prefer we (1) keep this as-is, or (2) additionally add a real MORI E2E test after mori is available in the AMD CI image?

@maning00 I prefer 2, we can add the E2E test later. Do you mind moving this test to the manual dir? Most of the disaggregation tests are in the test/srt dir, adding a new disaggregation dir would bring some confusion. And also, this simulation test can not guarantee the accuracy of the mori backend (no need to waste the CI for it), so we should put it in the manual dir, and add a real E2E test when it is ready.

ShangmingCai (Collaborator)

/tag-and-rerun-ci

maning00 (Contributor Author) commented Jan 5, 2026

> This PR seems very complete. Would it be convenient to add a test to AMD's CI to verify the correctness?

> @ShangmingCai I added a per-commit-amd correctness test for the MORI integration (protocol/serialization/message-handling) using a lightweight fake mori module, since the current AMD CI image (rocm/sgl-dev) doesn’t include mori. Would you prefer we (1) keep this as-is, or (2) additionally add a real MORI E2E test after mori is available in the AMD CI image?

> @maning00 I prefer 2, we can add the E2E test later. Do you mind moving this test to the manual dir? Most of the disaggregation tests are in the test/srt dir, adding a new disaggregation dir would bring some confusion. And also, this simulation test can not guarantee the accuracy of the mori backend (no need to waste the CI for it), so we should put it in the manual dir, and add a real E2E test when it is ready.

@ShangmingCai I have moved the tests to test/manual/test_mori_transfer_engine_e2e.py as suggested. It includes:

  • TestMoriTransferEngineE2E: A manual E2E smoke test.
  • TestMoriTransferEngineTPMismatchE2E: A manual test for the TP mismatch scenario (requires >= 6 GPUs).

Both require SGLANG_MORI_MANUAL_E2E=1 to run, ensuring they don't affect standard CI.
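
One common way to gate tests behind an environment variable like this is `unittest.skipUnless`. The sketch below shows the pattern only. The class `ExampleManualE2E` is a hypothetical placeholder, not the contents of test/manual/test_mori_transfer_engine_e2e.py, and the actual file may implement the gate differently.

```python
import os
import unittest

# Opt-in switch: the test class is skipped unless the variable is exactly "1".
RUN_MANUAL = os.environ.get("SGLANG_MORI_MANUAL_E2E") == "1"


@unittest.skipUnless(RUN_MANUAL, "set SGLANG_MORI_MANUAL_E2E=1 to run")
class ExampleManualE2E(unittest.TestCase):
    def test_smoke(self):
        # A real E2E test would launch prefill/decode servers here.
        self.assertTrue(RUN_MANUAL)
```

With the variable unset, a test runner reports the class as skipped rather than failed, so standard CI results are unaffected.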

maning00 (Contributor Author) commented Jan 5, 2026

> Please incorporate mori.Dockerfile into existing rocm.Dockerfile.

Sure. mori.Dockerfile mainly adds AMD AINIC-related dependencies plus MORI build/config flags. I can incorporate it into the existing rocm.Dockerfile and make it an optional feature (e.g., gated by a build arg), so the default image stays unchanged.

> Can think to add NIC options to support AMD AINIC, Broadcom Thor2, etc.

@HaiShaw Done. I have incorporated the MORI-related setup into docker/rocm.Dockerfile (and removed mori.Dockerfile).

To ensure the default image remains unaffected, I added two build arguments:

  • ENABLE_MORI: Defaults to 0 (disabled). Set to 1 to enable MORI build.
  • NIC_BACKEND: Defaults to none. Supports ainic (for AMD AINIC) or bnxt (for Broadcom) to install specific NIC dependencies.

Example usage: docker build ... --build-arg ENABLE_MORI=1 --build-arg NIC_BACKEND=ainic ...

maning00 (Contributor Author) commented Jan 5, 2026

> @maning00 have a perf table as above with fp8 kv cache?

@HaiShaw Not yet — I haven’t run the FP8 KV cache perf sweep. I can benchmark it and add a perf table in a follow-up update to this PR (same format as the existing table), once the runs finish.

@HaiShaw Added FP8 KV cache perf results

ShangmingCai (Collaborator)

/rerun-failed-ci

HaiShaw (Collaborator) commented Jan 16, 2026

> This PR seems very complete. Would it be convenient to add a test to AMD's CI to verify the correctness?

> @ShangmingCai I added a per-commit-amd correctness test for the MORI integration (protocol/serialization/message-handling) using a lightweight fake mori module, since the current AMD CI image (rocm/sgl-dev) doesn’t include mori. Would you prefer we (1) keep this as-is, or (2) additionally add a real MORI E2E test after mori is available in the AMD CI image?

> @maning00 I prefer 2, we can add the E2E test later. Do you mind moving this test to the manual dir? Most of the disaggregation tests are in the test/srt dir, adding a new disaggregation dir would bring some confusion. And also, this simulation test can not guarantee the accuracy of the mori backend (no need to waste the CI for it), so we should put it in the manual dir, and add a real E2E test when it is ready.

@Lzy17 @yctseng0211 @bingxche Please help to enable MORI-IO PD/D CI tests.

HaiShaw (Collaborator) left a comment

Is --page_size > 1 supported?
Please also add description for tuning following variables:

SGLANG_MORI_QP_PER_TRANSFER: Number of queue pairs per transfer (default: 1)
SGLANG_MORI_POST_BATCH_SIZE: RDMA post batch size (default: -1)
SGLANG_MORI_NUM_WORKERS: Number of worker threads (default: 1)

export USE_IONIC="OFF"; \
export USE_BNXT="ON"; \
echo "[MORI] NIC_BACKEND=bnxt: USE_BNXT=ON. Add Broadcom bnxt packages/repos here when available."; \
;; \

Collaborator

When can we have bnxt support added here?
cc @Lzy17

maning00 (Contributor Author), Jan 19, 2026

Update: To ensure full functionality for mori (io and ep), BRCM support will be integrated later once IBGDA support is fully available in the official library.

Collaborator

Do you mean mori does not yet support BRCM rdma-core?

maning00 (Contributor Author), Jan 23, 2026

No, that is not the case. This is mainly because mori-ep and mori-io are built simultaneously. mori-ep currently depends on a pre-release version of the BRCM library (for IBGDA).

git clone "${MORI_REPO}" /sgl-workspace/mori; \
cd /sgl-workspace/mori; \
git checkout "${MORI_COMMIT}"; \
git submodule update --init --recursive; \

Collaborator

Need requirements.txt check?

maning00 (Contributor Author)

No need to check here; the only dependency is torch.

Collaborator

@kkHuang-amd let's keep an eye on this onwards

HaiShaw (Collaborator) commented Jan 17, 2026

@maning00 Please add a basic accuracy test (gsm8k, etc.) on DPSK from 1P1D.
cc @kkHuang-amd @Lzy17

maning00 (Contributor Author)

> Is --page_size > 1 supported? Please also add description for tuning following variables:
>
> SGLANG_MORI_QP_PER_TRANSFER: Number of queue pairs per transfer (default: 1)
> SGLANG_MORI_POST_BATCH_SIZE: RDMA post batch size (default: -1)
> SGLANG_MORI_NUM_WORKERS: Number of worker threads (default: 1)

Sure, it is supported.
Added description for these variables

maning00 (Contributor Author)

> @maning00 Please add a basic accuracy test (gsm8k, etc.) on DPSK from 1P1D. cc @kkHuang-amd @Lzy17

Added gsm8k test results

HaiShaw (Collaborator) left a comment

Should add BNXT, etc. support later - w.r.t. NIC_BACKEND

rm -rf /var/lib/apt/lists/*; \
;; \
*) \
echo "ERROR: unknown NIC_BACKEND=${NIC_BACKEND}. Use one of: none, ainic, bnxt"; \

Collaborator

Where is bnxt handled, or should we change this echo message?

HaiShaw merged commit cbf90d7 into sgl-project:main on Jan 29, 2026
173 of 188 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026
Co-authored-by: cwortman-amd <cwortman@amd.com>