[6/n] decouple quantization implementation from vLLM dependency #10750

Merged

FlamingoPg merged 31 commits into sgl-project:main from Hongbosherlock:compress_tensor on Oct 22, 2025

Conversation

@Hongbosherlock (Contributor) commented Sep 22, 2025

Motivation

Remove the vLLM-dependency test.

Remove the dependency on vLLM's compressed_tensors implementation.

Quantization methods now supported (a minimal dispatch sketch follows this list):

  • w8a8-fp8
  • w8a16fp8
  • wNa16
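
For reference, here is a minimal, hypothetical sketch of how a compressed-tensors config might be mapped to one of these methods once the vLLM import is gone. The `QuantizationArgs` dataclass, the `pick_scheme` helper, and the exact scheme class names returned are illustrative assumptions, not the sglang API.

```python
# Hypothetical sketch only: mapping a compressed-tensors quant config to a
# scheme name. The dataclass, helper, and class names are illustrative.
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantizationArgs:
    num_bits: int             # e.g. 4 or 8
    type: str                 # "int" or "float"
    strategy: str = "tensor"  # "tensor", "channel", "group", ...
    group_size: int = -1


def pick_scheme(weights: QuantizationArgs,
                activations: Optional[QuantizationArgs]) -> str:
    """Return the name of the scheme class matching the quant config."""
    if activations is None:
        # Weight-only quantization (activations stay in fp16/bf16).
        if weights.type == "int":
            return "CompressedTensorsWNA16"        # wNa16
        if weights.type == "float" and weights.num_bits == 8:
            return "CompressedTensorsW8A16Fp8"     # w8a16fp8
    elif weights.num_bits == 8 and activations.num_bits == 8:
        if weights.type == "float":
            return "CompressedTensorsW8A8Fp8"      # w8a8-fp8
        return "CompressedTensorsW8A8Int8"         # w8a8-int8
    raise ValueError("unsupported compressed-tensors configuration")


# Example: 8-bit float weights with 8-bit float activations -> w8a8-fp8.
print(pick_scheme(QuantizationArgs(8, "float"), QuantizationArgs(8, "float")))
```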

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @Hongbosherlock, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the quantization layer by removing its direct dependency on vLLM. It achieves this by internalizing several quantization schemes and utility functions, such as CompressedTensorsWNA16 and CompressedTensorsW8A8Int8, which were previously imported from vLLM. This change allows sglang to manage its quantization methods independently, expanding the supported methods to include w8a8-fp8, w8a8-int8, wNa16, and w8a16fp8, and streamlining the codebase for future development.

Highlights

  • Decoupling from vLLM: The core quantization implementation has been decoupled from direct vLLM dependencies, internalizing previously external components.
  • New Quantization Schemes: Introduced and internalized the CompressedTensorsWNA16 and CompressedTensorsW8A8Int8 schemes, which previously relied on vLLM (see the interface sketch after this list).
  • Supported Quantization Methods: The project now explicitly supports w8a8-fp8, w8a8-int8, wNa16, and w8a16fp8 quantization methods.
  • Marlin Utilities Internalized: Marlin quantization utility functions and related configurations have been moved into sglang's internal structure, removing external imports.
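
As a rough illustration of what internalizing such schemes entails, below is a hedged sketch of the per-layer scheme interface that classes like these typically implement once they live in-tree. The method names and signatures follow the common compressed-tensors pattern and are assumptions, not sglang's exact definitions.

```python
# Hedged sketch of a per-layer quantization scheme interface; method names and
# signatures are assumptions for illustration, not the exact sglang API.
from abc import ABC, abstractmethod
from typing import Optional

import torch


class CompressedTensorsScheme(ABC):
    """One quantization scheme applied to a single linear layer."""

    @abstractmethod
    def create_weights(self, layer: torch.nn.Module, input_size: int,
                       output_size: int, params_dtype: torch.dtype) -> None:
        """Register the quantized weight and scale parameters on the layer."""

    @abstractmethod
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        """Repack checkpoint tensors into the layout the kernel expects."""

    @abstractmethod
    def apply_weights(self, layer: torch.nn.Module, x: torch.Tensor,
                      bias: Optional[torch.Tensor] = None) -> torch.Tensor:
        """Run the quantized matmul for this layer's forward pass."""
```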

@gemini-code-assist (bot) left a comment:


Code Review

This pull request decouples the quantization implementation from the vLLM dependency by vendoring or re-implementing the necessary quantization schemes and utilities within the sglang repository. This is a positive step towards making the project more self-contained. However, the changes introduce a significant amount of commented-out code that should be removed for better maintainability. More critically, one of the new files (compressed_tensors_w8a8_int8.py) appears to be broken due to references to variables defined in commented-out code, which will cause runtime errors. I've left detailed comments on these issues.
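
To illustrate the failure mode the review describes (not the actual contents of compressed_tensors_w8a8_int8.py), here is a hypothetical example of a name that is referenced at call time but only defined in commented-out code; the constant and function names are invented for this sketch.

```python
# Illustrative only -- not the real compressed_tensors_w8a8_int8.py. The
# pattern: a constant is defined only inside a commented-out block, but a
# later reference to it survived the cleanup.

# INT8_KERNEL_AVAILABLE = check_int8_kernel()  # commented out with the vLLM import

def apply(x, weight, scale):
    # Raises NameError at call time, because INT8_KERNEL_AVAILABLE was only
    # defined in the commented-out line above.
    if INT8_KERNEL_AVAILABLE:
        return (x @ weight) * scale
    raise RuntimeError("int8 kernel unavailable")
```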

@AniZpZ self-assigned this Sep 23, 2025
@Hongbosherlock marked this pull request as ready for review September 23, 2025 08:07
@Hongbosherlock (Contributor, Author): @AniZpZ this PR is ready for review.

@AniZpZ added the run-ci label Sep 23, 2025
@Hongbosherlock changed the title from "[5/n]decouple quantization implementation from vLLM dependency" to "[6/n]decouple quantization implementation from vLLM dependency" Sep 23, 2025
@AniZpZ mentioned this pull request Sep 24, 2025 (15 tasks)
@ch-wan added the ready-to-merge label (The PR is ready to merge after the CI is green.) Oct 8, 2025
@FlamingoPg (Collaborator): Looks good, let's move on.

@FlamingoPg self-requested a review October 20, 2025 07:57
@ShangmingCai (Collaborator): Sorry to cancel the CI; unit-test-backend-1-gpu (2) won't succeed since it is flaky, and I am fixing it and need to verify it now. I will merge main and retrigger CI for this PR once it is fixed.

@FlamingoPg (Collaborator), replying to the comment above: Thanks!

@FlamingoPg merged commit d7e834d into sgl-project:main on Oct 22, 2025 (107 of 113 checks passed).
@hjlee1371 mentioned this pull request Nov 19, 2025 (2 tasks).

Labels

ready-to-merge (The PR is ready to merge after the CI is green.), run-ci
