[6/n] decouple quantization implementation from vLLM dependency #10750

Merged

FlamingoPg merged 31 commits into sgl-project:main from Hongbosherlock:compress_tensor on Oct 22, 2025

Conversation

@Hongbosherlock (Contributor) commented Sep 22, 2025

Motivation

Remove the vLLM-dependency test.

Remove the dependency on vLLM's compressed_tensors implementation.

Quantization methods now supported (a minimal dispatch sketch follows this list):

  • w8a8-fp8
  • w8a16fp8
  • wNa16
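
For reference, here is a minimal, hypothetical sketch of how a compressed-tensors config might be mapped to one of these methods once the vLLM import is gone. The `QuantizationArgs` dataclass, the `pick_scheme` helper, and the exact scheme class names returned are illustrative assumptions, not the sglang API.

```python
# Hypothetical sketch only: mapping a compressed-tensors quant config to a
# scheme name. The dataclass, helper, and class names are illustrative.
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantizationArgs:
    num_bits: int             # e.g. 4 or 8
    type: str                 # "int" or "float"
    strategy: str = "tensor"  # "tensor", "channel", "group", ...
    group_size: int = -1


def pick_scheme(weights: QuantizationArgs,
                activations: Optional[QuantizationArgs]) -> str:
    """Return the name of the scheme class matching the quant config."""
    if activations is None:
        # Weight-only quantization (activations stay in fp16/bf16).
        if weights.type == "int":
            return "CompressedTensorsWNA16"        # wNa16
        if weights.type == "float" and weights.num_bits == 8:
            return "CompressedTensorsW8A16Fp8"     # w8a16fp8
    elif weights.num_bits == 8 and activations.num_bits == 8:
        if weights.type == "float":
            return "CompressedTensorsW8A8Fp8"      # w8a8-fp8
        return "CompressedTensorsW8A8Int8"         # w8a8-int8
    raise ValueError("unsupported compressed-tensors configuration")


# Example: 8-bit float weights with 8-bit float activations -> w8a8-fp8.
print(pick_scheme(QuantizationArgs(8, "float"), QuantizationArgs(8, "float")))
```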

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @Hongbosherlock, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the quantization layer by removing its direct dependency on vLLM. It achieves this by internalizing several quantization schemes and utility functions, such as CompressedTensorsWNA16 and CompressedTensorsW8A8Int8, which were previously imported from vLLM. This change allows sglang to manage its quantization methods independently, expanding the supported methods to include w8a8-fp8, w8a8-int8, wNa16, and w8a16fp8, and streamlining the codebase for future development.

Highlights

  • Decoupling from vLLM: The core quantization implementation has been decoupled from direct vLLM dependencies, internalizing previously external components.
  • New Quantization Schemes: Introduced and internalized the CompressedTensorsWNA16 and CompressedTensorsW8A8Int8 schemes, which previously relied on vLLM (see the interface sketch after this list).
  • Supported Quantization Methods: The project now explicitly supports w8a8-fp8, w8a8-int8, wNa16, and w8a16fp8 quantization methods.
  • Marlin Utilities Internalized: Marlin quantization utility functions and related configurations have been moved into sglang's internal structure, removing external imports.
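
As a rough illustration of what internalizing such schemes entails, below is a hedged sketch of the per-layer scheme interface that classes like these typically implement once they live in-tree. The method names and signatures follow the common compressed-tensors pattern and are assumptions, not sglang's exact definitions.

```python
# Hedged sketch of a per-layer quantization scheme interface; method names and
# signatures are assumptions for illustration, not the exact sglang API.
from abc import ABC, abstractmethod
from typing import Optional

import torch


class CompressedTensorsScheme(ABC):
    """One quantization scheme applied to a single linear layer."""

    @abstractmethod
    def create_weights(self, layer: torch.nn.Module, input_size: int,
                       output_size: int, params_dtype: torch.dtype) -> None:
        """Register the quantized weight and scale parameters on the layer."""

    @abstractmethod
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        """Repack checkpoint tensors into the layout the kernel expects."""

    @abstractmethod
    def apply_weights(self, layer: torch.nn.Module, x: torch.Tensor,
                      bias: Optional[torch.Tensor] = None) -> torch.Tensor:
        """Run the quantized matmul for this layer's forward pass."""
```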

@gemini-code-assist (bot) left a comment:


Code Review

This pull request decouples the quantization implementation from the vLLM dependency by vendoring or re-implementing the necessary quantization schemes and utilities within the sglang repository. This is a positive step towards making the project more self-contained. However, the changes introduce a significant amount of commented-out code that should be removed for better maintainability. More critically, one of the new files (compressed_tensors_w8a8_int8.py) appears to be broken due to references to variables defined in commented-out code, which will cause runtime errors. I've left detailed comments on these issues.
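
To illustrate the failure mode the review describes (not the actual contents of compressed_tensors_w8a8_int8.py), here is a hypothetical example of a name that is referenced at call time but only defined in commented-out code; the constant and function names are invented for this sketch.

```python
# Illustrative only -- not the real compressed_tensors_w8a8_int8.py. The
# pattern: a constant is defined only inside a commented-out block, but a
# later reference to it survived the cleanup.

# INT8_KERNEL_AVAILABLE = check_int8_kernel()  # commented out with the vLLM import

def apply(x, weight, scale):
    # Raises NameError at call time, because INT8_KERNEL_AVAILABLE was only
    # defined in the commented-out line above.
    if INT8_KERNEL_AVAILABLE:
        return (x @ weight) * scale
    raise RuntimeError("int8 kernel unavailable")
```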

@AniZpZ self-assigned this Sep 23, 2025
@Hongbosherlock marked this pull request as ready for review September 23, 2025 08:07
@Hongbosherlock (Contributor, Author): @AniZpZ this PR is ready for review.

@AniZpZ added the run-ci label Sep 23, 2025
@Hongbosherlock changed the title from "[5/n]decouple quantization implementation from vLLM dependency" to "[6/n]decouple quantization implementation from vLLM dependency" Sep 23, 2025
@AniZpZ mentioned this pull request Sep 24, 2025 (15 tasks)
@ch-wan added the ready-to-merge label (The PR is ready to merge after the CI is green.) Oct 8, 2025
@FlamingoPg (Collaborator): Looks good, let's move on.

@FlamingoPg self-requested a review October 20, 2025 07:57
@ShangmingCai (Collaborator): Sorry to cancel the CI; unit-test-backend-1-gpu (2) won't succeed since it is flaky, and I am fixing it and need to verify it now. I will merge main and retrigger CI for this PR once it is fixed.

@FlamingoPg (Collaborator), replying to the comment above: Thanks!

@FlamingoPg merged commit d7e834d into sgl-project:main on Oct 22, 2025 (107 of 113 checks passed).
@hjlee1371 mentioned this pull request Nov 19, 2025 (2 tasks).

Labels

ready-to-merge (The PR is ready to merge after the CI is green.), run-ci
