* fix: update pr-test-sgl-kernel (sgl-project#5399)
* kernel: support slightly faster merge_state_v2 cuda kernel (sgl-project#5381)
* chore: bump sgl-kernel 0.0.9 (sgl-project#5400)
* chore: upgrade sgl-kernel 0.0.9 (sgl-project#5401)
* Tiny fix DeepseekScalingRotaryEmbedding always use forward_native (sgl-project#5406)
* Fix bench_serving with random-ids (sgl-project#5214)
* [misc] fix ci flaky case (sgl-project#5352)
* [FIX] Fix concatenation error in capture_bs when open --disable-cuda-graph-padding and without MTP (sgl-project#5412)
* Support dynamic connection and TP 16 (sgl-project#5351)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
* Fix broadcast use cuda device lead to memory capacity unbalanced (sgl-project#5416)
* [PD] Fix dynamic port support and MLA buffer for Mooncake (sgl-project#5415)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Co-authored-by: ybyang <ybyang7@iflytek.com>
* Distinguish bootstrap key only in decode server (sgl-project#5422)
* [PD] Remove unused bootstrap param and fix port table type (sgl-project#5423)
* [minor] cleanup cmakelists.txt (sgl-project#5420)
* bugfix: fix merge_state_v2 cuda graph (sgl-project#5419)
* chore: bump sgl-kernel v0.0.9.post1 (sgl-project#5430)
* fix: solve release issue (sgl-project#5434)
* BLackwell cutlass mla: Add check for bad page size/block num combinations (sgl-project#5431)
* feat: update model_specific_adjustment (sgl-project#5344)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
* chore: upgrade sgl-kernel 0.0.9.post1 (sgl-project#5436)
* Fix ignore_eos parameter when loading a chat template (sgl-project#5264)
* add attention backend supporting matrix in the doc (sgl-project#5211)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
* Support BNB quantization for llama/mllama (sgl-project#5038)
Co-authored-by: Yuhao Yang <yyh073@foxmail.com>
* [Docs] Update start/install.md (sgl-project#5398)
* [Minor] Move torch.compile patch to a better place (sgl-project#5397)
* [Bug fix] need record start time in pd mode (sgl-project#5425)
* Support MHA with chunked prefix cache for DeepSeek chunked prefill (sgl-project#5113)
* chore: bump v0.4.5.post1 (sgl-project#5445)
* Fix several minor issues in PD disaggregation (sgl-project#5444)
* [doc] Update benchmark_and_profiling.md (sgl-project#5449)
* Update cutlass dependency. (sgl-project#5447)
* add multi-lora feature in README.md (sgl-project#5463)
* Clean up imports (sgl-project#5467)
* [verl] Modify the update_weights func to align with verl's resharding (sgl-project#5345)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
* [Model Support] unsloth/Phi-4-mini bnb model (sgl-project#4982)
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
* Update attention_backend.md: plural form (sgl-project#5489)
* Add test for flash_attn_varlen_func kernel (sgl-project#5484)
* Deprecate disable-mla (sgl-project#5481)
* Deprecate enable-flashinfer-mla and enable-flashmla (sgl-project#5480)
* Feat/support encoder model (like bert) (sgl-project#4887)
* Enable local attention during decode (sgl-project#5479)
* Refactor DeepSeek decoder layer branches (sgl-project#5205)
* Fix a link in sgl-kernel/README.md (sgl-project#5493)
* [Bug fix] use correct func path in deepseek (sgl-project#5496)
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
* Doc: fix problems of the 'Execute Notebooks / run-all-notebooks' ci caused by the unstability of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B (sgl-project#5503)
* [Feat] Update sgl-kernel flashinfer to latest main version (sgl-project#5500)
Co-authored-by: zhyncs <me@zhyncs.com>
* Fix: Incorrect parameters passed to forward_batch_generation (sgl-project#5506) (sgl-project#5511)
* Fix: fix the exception 'the memory capacity is unbalanced. Some GPUs … (sgl-project#5426)
Co-authored-by: ocss884 <ocss.lin@gmail.com>
* [docs] Fix several consistency issues in sampling_params.md (sgl-project#5373)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
* Configuration qwen2_moe.py - qkv_bias now in transformers (sgl-project#5512)
* Introduce moe_dense_tp_size to fix dense layer errors in DeepSeek V3 + 4x8xH100 (sgl-project#4836)
* Sgl kernel fused_moe_gate support n_shared_experts (sgl-project#5440)
* chore: bump sgl-kernel 0.0.9.post2 (sgl-project#5518)
* use sglang_per_token_group_quant_fp8 from sgl-kernel instead of trion kernel (sgl-project#5473)
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
* fix kimi vl running bug after rebase main (sgl-project#5461)
* fix bug of VLLM_AVAILABLE not defined (sgl-project#5497)
* Avoid computing lse in Ragged Prefill when there's no prefix. (sgl-project#5476)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
* [Model] Adding Qwen3 and Qwen3MoE (sgl-project#4693)
* fix util import (sgl-project#5542)
* Revert "Avoid computing lse in Ragged Prefill when there's no prefix.… (sgl-project#5544)
* chore: upgrade sgl-kernel 0.0.9.post2 (sgl-project#5540)
* Fix DeepGEMM masked cannot be run on groups not being multiple or 4 (sgl-project#5340)
* Make profiler output file names consistent (sgl-project#5548)
* [PD] Tiny fix timeout error when generate (sgl-project#5545)
* [PD] Fix no cache connect for recevier (sgl-project#5534)
* feat: use flashinfer jit package (sgl-project#5547)
* [PD] Remove the requirement of config file for mooncake backend (sgl-project#5460)
* restruct compressed_tensors_w8a8_fp8 (sgl-project#5475)
* simplify the control logic for using shared experts fusion (sgl-project#5504)
* Remove one kernel in per_tensor_quant_mla_fp8 (sgl-project#5549)
* Fix sampler nan check when calling top_k_top_p_sampling_from_probs (sgl-project#5546)
* [PD] Support page size > 1 (sgl-project#5561)
* fix hicache write back (sgl-project#5543)
* Minor update for ROCm variable style (sgl-project#5562)
* Fix bench_one_batch producing unnatural results for expert parallel (sgl-project#5149)
* [perf] introduce deep gemm group_gemm_masked as bmm (sgl-project#5432)
* [PD] Fix DeepSeek cannot be run on latest master (sgl-project#5568)
* Fix BumpAllocator error when no input_ids (sgl-project#5564)
* enable DeepSeek V3 shared_experts_fusion in sm90 (sgl-project#5571)
* [Fix] fix outlines and xgrammar (sgl-project#4947)
* [Doc]Add instruction for profiling with bench_one_batch (sgl-project#5581)
* Release v0.4.5.post2 (sgl-project#5582)
* Fix bench_serving fail when zero warmup requests (sgl-project#5574)
* Fix DeepEP cannot run on latest master (sgl-project#5567)
* Fix torch memory saver not enabled in DP scenario (sgl-project#5560)
* Super tiny fix typo (sgl-project#5559)
* Add document for LoRA serving (sgl-project#5521)
* Tiny improve error message (sgl-project#5526)
* [PD] Fix server crash when using batch requests (sgl-project#5531)
* [Feat] upgrade pytorch2.6 (sgl-project#5417)
* Fix enable chunked prefill for Llama4 (sgl-project#5575)
* fix: use fa3 for gemma2 (sgl-project#5586)
* Fix ChatCompletionMessageGenericParam to allow for None content (sgl-project#5452)
* [PD] Fix large page size + chunk prefill (sgl-project#5588)
* Add test config yamls for Deepseek v3 (sgl-project#5433)
* [Feature] Prefill assistant response - add continue_final_message parameter (sgl-project#4226)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
* add function call parser for DeepSeek V3 (sgl-project#5224)
* smaller and non gated models for docs (sgl-project#5378)
* Feat: Implement JSON Mode (response_format.type="json_object") (sgl-project#4733)
Co-authored-by: Kyle Pena <kylepena@kyles-macbook-pro.turkey-marlin.ts.net>
* check marlin format before attempting conversion (sgl-project#4675)
* compressed_tensors: port w8a16 fp8 from vllm (sgl-project#4852)
* Fix one more issue reported by torchfix (sgl-project#4859)
* Add sanity check for max_running_requests (sgl-project#5016)
* Correct grafana heatmap. (sgl-project#5019)
* Perform Batch Tokenization. (sgl-project#5141)
* Speedup shared expert weight construction by avoid cloning (sgl-project#5188)
* Tiny add Engine.flush_cache API (sgl-project#5241)
* [misc] remove is_cuda_available (sgl-project#5319)
* Fix flush cache (sgl-project#5590)
* Add Speculative Decoding Eagle3 topk > 1 (sgl-project#5318)
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
* upstream hicache fixes (sgl-project#5570)
* Tiny add warning when cannot recognize bool env var (sgl-project#5348)
* Modify metrics service endpoint (sgl-project#3443)
* Update protocol.py to fix sgl-project#4589 (sgl-project#4590)
* [Feat.] Enable grafana to show metrics (sgl-project#4718)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
* [Fix] Enhance DP Attention for IPv6 Compatibility (sgl-project#4937)
* Support o1 model on Azure (sgl-project#4980)
Co-authored-by: Shan Yu <shanyu1@g.ucla.edu>
* Tiny remove duplicated code (sgl-project#5021)
* Tiny update error hint (sgl-project#5037)
* Support PD bootstrap fields on /v1/chat/completions endpoint (sgl-project#5488)
* [PD] Fix generate endpoint of min_lb for PD (sgl-project#5598)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* [PD] Fix edge case and simplify large page size + chunked prefill (sgl-project#5589)
* [PD] Add NIXL transfer backend (sgl-project#5477)
* [PD] Support decode overlap schedule (sgl-project#5608)
* [PD] Support prefill overlap + Ensure no race condition (sgl-project#5609)
* Enhance GPU memory settings (sgl-project#5604)
* [feature] enable pre compile jit deep_gemm (sgl-project#5580)
* Clean up mem settings (sgl-project#5610)
* Support aiter RMSNorm in AMD (sgl-project#5510)
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
* chore: bump v0.4.5.post3 (sgl-project#5611)
* Remove extra copy in deepseek forward absorb (sgl-project#5578)
Co-authored-by: saienduri <saimanas.enduri@amd.com>
* [Doc] Fix a 404 link to llama-405b (sgl-project#5615)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
* [fix] force use deepgemm in compile_deep_gemm (sgl-project#5618)
* [fix] fix compile_deep_gemm missing kv_b_proj (sgl-project#5620)
* fix: gemma 3 not use softcap (sgl-project#5622)
* Fix FA3 DeepSeek prefill performance regression (sgl-project#5624)
Co-authored-by: ispobock <ispobaoke@gmail.com>
* [NFC] Remove duplicate `compressed-tensors` (sgl-project#5640)
* Fix shared experts fusion error without quantization (sgl-project#5632)
* [feature] Add H20 fp8_w8a8 FusedMoE config for --n-share-experts-fusion=16 (sgl-project#5641)
Co-authored-by: yuethe <yuethe@tencent.com>
* fix flashmla bug (sgl-project#5272)
* [fix] reduce dp capture bs (sgl-project#5634)
Co-authored-by: alcanerian <alcanerian@gmail.com>
* Remove q concat in FA3 backend for DeepSeek decode (sgl-project#5638)
* Revert "Support aiter RMSNorm in AMD" (sgl-project#5646)
* fix: update bench_speculative (sgl-project#5649)
* Turn on DeepGemm By Default and Update Doc (sgl-project#5628)
* Fuse q_a_proj and kv_a_proj (sgl-project#5619)
* Remove unnecessary `torch.full` in DeepSeek (sgl-project#5601)
* [1/2] Add FP8 Blockscale MoE CUTLASS kernel for Blackwell (sgl-project#5281)
* fix sgl-kernel unit tests (sgl-project#5666)
* fix awq_dequantize import (sgl-project#5669)
* Integrating PD disaggregation with DP attention and DeepEP (sgl-project#5435)
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
* fix gemma3 unit test (sgl-project#5670)
* fix torchvision::nms not exist (sgl-project#5671)
* [PD] Add support for dp attention with mooncake (sgl-project#5530)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* tune the threshold of gemma-2-27b-it in test_nightly_gsm8k_eval.py (sgl-project#5677)
* [Doc] Fix two 404 links caused by sglang typo (sgl-project#5667)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
* fix: update truss bench_serving (sgl-project#5683)
* fix: only compile ApplyTokenBitmaskInplace cu124+ (sgl-project#5686)
* chore: bump sgl-kernel 0.1.0 (sgl-project#5688)
* vlm: enable radix cache for qwen-vl models (sgl-project#5349)
Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
* [BugFix] Fix combination of MTP and `--n-share-experts-fusion`with R1 (sgl-project#5707)
* Fix weight loading bug for Deepseek v3+nextn (sgl-project#5684)
* Add example to use sgl engine with fastapi (sgl-project#5648)
Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
* [Doc] Fix a link to Weilin Zhao (sgl-project#5706)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
* Add MMMU benchmark results (sgl-project#4491)
Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
* [Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct) (sgl-project#5078)
Co-authored-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>
* [PD] Better logs (sgl-project#5715)
* [PD] Add kvargs table and thread pool for kvcache sender of mooncake (sgl-project#5738)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* [PD]: Support Muti Prefill in one node (sgl-project#5704)
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
* Fix: deepseek forward absorb (sgl-project#5723)
Co-authored-by: ispobock <ispobaoke@163.com>
* Pin torch audio to 2.6.0 (sgl-project#5750)
* Revert "[Model] Support `ArcticForCausalLM` architecture (Snowflake/snowflake-arctic-instruct)" (sgl-project#5754)
* Disable flaky eagle tests (sgl-project#5753)
* update triton 3.2.0 h200 fused moe triton config and add warning about triton fused_moe_kernel performance degradation due to different Triton versions. (sgl-project#5740)
* [Docs] Update runtime/engine/readme.md (sgl-project#5737)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
* Reorder loop in shared expert weight loading (sgl-project#5719)
* fix: fix one more bug from merging mm_inputs (sgl-project#5718)
Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com>
* [Fix]: support deepseek-vl2-tiny model (sgl-project#5552)
Co-authored-by: bppps <zouyu.zzx@alibaba-inc.com>
* Bugfix for minicpmo vision test (sgl-project#5760)
* [Minor] fix documentations (sgl-project#5756)
* Add an assertion to enhance the robustness of the operator (sgl-project#5736)
* fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512 (sgl-project#5733)
* Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (sgl-project#5728)
* [fix] fix potential bumpy throughtput with deepgemm (sgl-project#5722)
* Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups (sgl-project#5720)
* perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (sgl-project#5716)
* we fix the non existent access of `decrypted_config_file` (sgl-project#5685)
* CI: rewrite test_vision_chunked_prefill to speedup (sgl-project#5682)
* Fuse MLA set kv cache kernel (sgl-project#5748)
* Update amd docker image to `sglang:v0.4.5.post3-rocm630`. (sgl-project#5697)
* [feature] support for roberta embedding models (sgl-project#5730)
* [fix] fix bench_one_batch_server (sgl-project#5607)
* support for the DeepSeek model by enabling streaming response parsing (sgl-project#5592)
* fix: Use `is not None` instead of `!= None` for None checks. (sgl-project#5687)
* Add Llama 4 to FA3 test (sgl-project#5509)
* [misc] more decode step log for batch_one_batch (sgl-project#5565)
* Handle JSONDecodeError while processing request data (sgl-project#5599)
* fix(srt): check if sample_indices is not None before usage. (sgl-project#5633)
* update llguidance to 0.7.11; adds StructTag (sgl-project#4870)
* Use sgl-kernel sgl_per_token_group_quant_int8 (sgl-project#4971)
* Add memory_saver check (sgl-project#4986)
Signed-off-by: Kebe <mail@kebe7jun.com>
* add switch to disable open api doc (sgl-project#3744)
Signed-off-by: congcongke <zhanweidu@163.com>
* Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (sgl-project#5772)
* Fix eagle test case (sgl-project#5776)
* Split local attention test from fa3 test (sgl-project#5774)
* Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (sgl-project#5777)
* Simplify FA3 tests (sgl-project#5779)
* Revert "[fix] fix bench_one_batch_server" (sgl-project#5785)
* Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (sgl-project#5786)
* [CI] Tune threshold (sgl-project#5787)
* [CI] fix port conflicts (sgl-project#5789)
* [CI] Fix ci tests (sgl-project#5769)
* [PD]Reduce kv transfer threads (sgl-project#5791)
* [CI] Fix test case (sgl-project#5790)
* Add 8-GPU Test for Deepseek-V3 (sgl-project#5691)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
* Release v0.4.6 (sgl-project#5795)
* Update nightly-test.yml (sgl-project#5797)
* [CI] Improve github summary & enable fa3 for more models (sgl-project#5796)
* [Docs] update grafana setup guide in production metrics (sgl-project#5643)
Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com>
* [Misc] add structure logging, write to file and log tracing for SGL Router
* Improve overlap scheduling (sgl-project#5788)
* Add Cutlass MLA attention backend (sgl-project#5390)
* chore: upgrade sgl-kernel 0.1.0 (sgl-project#5690)
* Dockerfile.dev pip scikit_build_core (sgl-project#5807)
* Add a doc to fix sgl-kernel build link error in py39 with ccache (sgl-project#5809)
* Turn on overlap scheduler for multimodal models (sgl-project#5771)
* Tiny refactor DefaultModelLoader.Source (sgl-project#5482)
* [Docs] Replace lists with tables for cleanup and readability in server_arguments (sgl-project#5276)
* Revert "Tiny refactor DefaultModelLoader.Source" (sgl-project#5825)
* Feat: add support for thinking mode via chat_template_kwargs.enable_t… (sgl-project#5551)
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
* fix: fix the error where the content is None when reasoning and tool … (sgl-project#5838)
* feat: Add fused moe triton config for qwen3 moe on h100 (sgl-project#5833)
* fused moe triton tuning script support qwen3 (sgl-project#5842)
* feat: Add fused moe triton config for qwen3bf16 moe on h20 (sgl-project#5839)
* [PD] support pd fake transfer for warmup (sgl-project#5726)
* [config] qwen3moe_tune_h20 fp8 tp4 (sgl-project#5846)
* [Doc] Recover history of server_arguments.md (sgl-project#5851)
* feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 (sgl-project#5850)
* [CI] test chunked prefill more (sgl-project#5798)
* ROCm: update AITER (sgl-project#5816)
* [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel (sgl-project#5847)
Co-authored-by: sighingnow <sighingnow@gmail.com>
* [Fix] Missing bootstrap_port field (sgl-project#5823)
* feat: update is_fa3_default_architecture (sgl-project#5854)
* add fused moe config for qwen3moe fp8/bf16 (sgl-project#5849)
* chore: bump v0.4.6.post1 (sgl-project#5845)
* fix for hpu backend in model runner and server args
Signed-off-by: Mohit Sinha <msinha@habana.ai>
* rebase formatting issue
Signed-off-by: Mohit Sinha <msinha@habana.ai>
* [SW-228218]: Fix device mismatch in frequency penalty.
Ensure tensors in BatchedFrequencyPenalizer are on the same device by
moving output_ids and frequency_penalties to the device of
cumulated_frequency_penalties. This resolves a RuntimeError
caused by tensors on cpu and hpu:0 during logits subtraction.
---------
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: congcongke <zhanweidu@163.com>
Signed-off-by: Mohit Sinha <msinha@habana.ai>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: Zhaoyang Hao <77828610+Muuuchen@users.noreply.github.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: lambert0312 <lambert80.ios@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: ybyang <ybyang7@iflytek.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: Trevor Morris <tmorris@nvidia.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: mRSun15 <3150105645@zju.edu.cn>
Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com>
Co-authored-by: Yuhao Yang <yyh073@foxmail.com>
Co-authored-by: Michael Yao <haifeng.yao@daocloud.io>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: BearBiscuit <55008898+BearBiscuit05@users.noreply.github.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Didier Durand <durand.didier@gmail.com>
Co-authored-by: woodx <124784234+woodx9@users.noreply.github.com>
Co-authored-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com>
Co-authored-by: PGFLMG <1106310035@qq.com>
Co-authored-by: u4lr451 <u4lr451@gmail.com>
Co-authored-by: ocss884 <ocss.lin@gmail.com>
Co-authored-by: Michael Feil <63565275+michaelfeil@users.noreply.github.com>
Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com>
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
Co-authored-by: liwenju0 <like4hub@gmail.com>
Co-authored-by: Wenxuan Tan <wtan45@wisc.edu>
Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: Zhaoyi Li <36555117+Lzy17@users.noreply.github.com>
Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com>
Co-authored-by: tarinkk <129432511+tarinkk@users.noreply.github.com>
Co-authored-by: AmadeusW <41280211+Amadeus-Winarto@users.noreply.github.com>
Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Co-authored-by: Yi Zhou <zhouyi920521@gmail.com>
Co-authored-by: simveit <69345428+simveit@users.noreply.github.com>
Co-authored-by: kyle-pena-kuzco <kyle.pena@kuzco.xyz>
Co-authored-by: Kyle Pena <kylepena@kyles-macbook-pro.turkey-marlin.ts.net>
Co-authored-by: Enrique Shockwave <33002121+qeternity@users.noreply.github.com>
Co-authored-by: Juwan Yoo <ryan@tmfi.us>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: mac0ne <mac0ne@users.noreply.github.com>
Co-authored-by: Sundara Raman Ramachandran <sundar24295@gmail.com>
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: moontidef <53668275+relic-yuexi@users.noreply.github.com>
Co-authored-by: Huapeng Zhou <73010314+PopSoda2002@users.noreply.github.com>
Co-authored-by: Lucius <souzou@foxmail.com>
Co-authored-by: Chuyue Sun <33578456+ChuyueSun@users.noreply.github.com>
Co-authored-by: Shan Yu <shanyu1@g.ucla.edu>
Co-authored-by: Yongtong Wu <914554688@qq.com>
Co-authored-by: michael-amd <Michael.Zhang@amd.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Connector Switch <c8ef@outlook.com>
Co-authored-by: saltyfish66 <38240284+saltyfish66@users.noreply.github.com>
Co-authored-by: yuethe <yuethe@tencent.com>
Co-authored-by: alcanerian <alcanerian@gmail.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
Co-authored-by: Ravi Theja <ravi03071991@gmail.com>
Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
Co-authored-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>
Co-authored-by: IAN <50618241+hcyz33@users.noreply.github.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: ZXN <44322223+bppps@users.noreply.github.com>
Co-authored-by: bppps <zouyu.zzx@alibaba-inc.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com>
Co-authored-by: vzed <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: DavidBao <121073073+DavidBao03@users.noreply.github.com>
Co-authored-by: Frankey_8080 <32973306+Frank-Jie@users.noreply.github.com>
Co-authored-by: yan97ao <580776+yan97ao@users.noreply.github.com>
Co-authored-by: aoshen524 <aoshen524@gmail.com>
Co-authored-by: Michał Moskal <michal@moskal.me>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: zhanweidu <zhanweidu@163.com>
Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com>
Co-authored-by: Simo Lin <linsimo.mark@gmail.com>
Co-authored-by: JiLi <leege233@gmail.com>
Co-authored-by: sighingnow <sighingnow@gmail.com>
Co-authored-by: XTY <xutianyi1999@live.com>
Co-authored-by: vikram singh shekhawat <vshekhawat@habana.ai>
This PR provides initial support for the Qwen3 and Qwen3MoE. Once the model file becomes available, I will further refine the integration.
ref huggingface/transformers#36878, vllm-project/vllm#15289
Motivation
#4682
Modifications