
[feat] add ascend readme and docker release #8700

Merged
zhyncs merged 12 commits into sgl-project:main from pkking:main
Aug 12, 2025

Conversation


@pkking pkking commented Aug 2, 2025

Motivation

Previously we only published images for NVIDIA GPU and AMD hardware; this PR builds and pushes a Docker image for NPU hardware as well.

Modifications

Add two new workflows and an NPU-related Dockerfile. Both Docker images will be published to the official registry:

  1. a daily dev image for users to try and for nightly test cases, named sglang:main-cann8.2.rc1.alpha003-a3
  2. a release image built when a new tag is pushed, named sglang:v0.4.10.post1-cann8.2.rc1.alpha003-a3

NOTE: we use metadata-action to generate image tags and labels, so the workflow relies on the push-tag event

FOR END USER

The image name schema is sglang:{ver|main}-{cann_ver}-{device_type}. The OS and Python versions are inherited from the CANN base image, defaulting to the latest Ubuntu and Python versions.
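As a minimal sketch, the schema above composes into concrete names like the two images listed in this PR (version strings taken from the examples above):

```shell
# Compose an image name per the schema sglang:{ver|main}-{cann_ver}-{device_type}.
ver="v0.4.10.post1"             # a release tag, or "main" for the daily dev image
cann_ver="cann8.2.rc1.alpha003" # CANN toolkit version of the base image
device_type="a3"                # Ascend device type
image="sglang:${ver}-${cann_ver}-${device_type}"
echo "$image"                   # -> sglang:v0.4.10.post1-cann8.2.rc1.alpha003-a3
```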

FOR DEVELOPER

The released image pins a specific sgl-kernel-npu tag; if a new sgl-kernel-npu tag is needed, please edit the release Docker workflow.
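A hypothetical local-build sketch of that pinning; the build-arg name `SGL_KERNEL_NPU_TAG` and the tag value `v0.1.0` are placeholders assumed for illustration, not names taken from the actual workflow:

```shell
# Hypothetical: SGL_KERNEL_NPU_TAG and v0.1.0 are assumed placeholder names,
# not the actual build args of docker/Dockerfile.npu.
kernel_tag="v0.1.0"
build_cmd="docker build -f docker/Dockerfile.npu --build-arg SGL_KERNEL_NPU_TAG=${kernel_tag} -t sglang:main-cann8.2.rc1.alpha003-a3 ."
echo "$build_cmd"
```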

Accuracy Test

NO NEED

Benchmark & Profiling

NO NEED

Checklist


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @pkking, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've initiated work on adding Docker release capabilities, specifically targeting NPU hardware. This pull request introduces a new Dockerfile that outlines the complete environment setup required to build and run applications on NPU devices, including the installation of various deep learning frameworks and custom kernels. The goal is to streamline the deployment process for NPU-accelerated workloads.

Highlights

  • New Dockerfile for NPU Support: I've introduced a new Dockerfile (docker/Dockerfile.npu) to enable the building and pushing of Docker images tailored for NPU (Neural Processing Unit) hardware. This is a foundational step towards supporting NPU-specific deployments.
  • Comprehensive NPU Environment Setup: The new Dockerfile sets up a comprehensive environment, including the installation of essential development tools, PyTorch with NPU adapters, vLLM, Triton-Ascend, and SGLang. It also integrates a custom SGLang kernel for NPU, ensuring all necessary dependencies are pre-configured within the image.
  • Integration of Custom NPU Kernel: The Dockerfile includes specific steps to clone and build sgl-kernel-npu and install deep-ep, which are crucial for leveraging NPU capabilities with SGLang. This ensures that the custom kernel is correctly compiled and linked within the Docker environment.

@pkking pkking marked this pull request as draft August 2, 2025 07:57

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new Dockerfile for NPU hardware. The Dockerfile is functional but has several areas for improvement. My review focuses on a critical security issue regarding hardcoded credentials in URLs, and several medium-severity issues related to Docker best practices for optimizing image size and build efficiency. Specifically, I've suggested removing sensitive credentials, combining multiple RUN instructions for apt and pip commands, and cleaning up cloned git repositories after use. These changes will result in a more secure and leaner Docker image.
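The layering suggestions in this review can be sketched as a generic pattern. This is an illustration of the suggested best practices, not the actual contents of docker/Dockerfile.npu; the package names and clone URL are assumptions:

```dockerfile
# Generic illustration of the review's suggestions, not the real Dockerfile.npu:
# combine related commands into one RUN layer and clean up in that same layer.
RUN apt-get update && \
    apt-get install -y --no-install-recommends git build-essential && \
    rm -rf /var/lib/apt/lists/*

# Clone, install, and remove the kernel sources in a single layer so the
# intermediate git checkout never bloats the final image.
RUN git clone --depth 1 https://github.com/sgl-project/sgl-kernel-npu.git /tmp/sgl-kernel-npu && \
    pip install --no-cache-dir /tmp/sgl-kernel-npu && \
    rm -rf /tmp/sgl-kernel-npu
```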

@pkking pkking marked this pull request as ready for review August 5, 2025 11:27
@pkking pkking changed the title [WIP]feat: add docker release [NPU]feat: add docker release Aug 6, 2025
@ping1jing2 ping1jing2 changed the title [NPU]feat: add docker release [feat] add ascend docker release Aug 6, 2025

pkking commented Aug 9, 2025

LGTM

pkking and others added 3 commits August 11, 2025 16:38
@pkking pkking force-pushed the main branch 3 times, most recently from d2b4b13 to d91192e Compare August 11, 2025 09:09
@ping1jing2 ping1jing2 changed the title [feat] add ascend docker release [feat] add ascend readme and docker release Aug 11, 2025
@ping1jing2

lgtm


thincal commented Aug 12, 2025

@iforgetmyname I am using 8 * Ascend 910B to deploy the GLM4.5-Air (106B) model, but it reports OOM. Could you help take a look? Thanks.

  • repro steps
# step1: launch docker
docker run -it --rm --shm-size 512g \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver  \
    -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons  \
    -v /usr/local/sbin/:/usr/local/sbin \
    -v /lib/modules:/lib/modules  \
    -v /data/models:/data/models \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    --device=/dev/davinci2 \
    --device=/dev/davinci3 \
    --device=/dev/davinci4 \
    --device=/dev/davinci5 \
    --device=/dev/davinci6 \
    --device=/dev/davinci7 \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    --privileged=true \
    sglang-ascend:latest bash

# step2: prepare env
source /usr/local/Ascend/driver/bin/setenv.bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
pip3 install -U transformers==4.53.3

# step3: launch sglang server
python3 -m sglang.launch_server --model-path=/data/models/glm4.5-air-hf/ --trust-remote-code --tp=8
  • logs
root@d8ac077bb820:/workspace# python3 -m sglang.launch_server --model-path=/data/models/glm4.5-air-hf/ --trust-remote-code --tp=8
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:42:56 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:42:56 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 08-11 16:42:57 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
[2025-08-11 16:43:07] server_args=ServerArgs(model_path='/data/models/glm4.5-air-hf/', tokenizer_path='/data/models/glm4.5-air-hf/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.779, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='npu', tp_size=8, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=909378885, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, served_model_name='/data/models/glm4.5-air-hf/', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, 
lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='pytorch', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend=None, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, 
torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, enable_triton_kernel_moe=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False)
[2025-08-11 16:43:08] Using default HuggingFace chat template with detected content format: string
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:43:19 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:19 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:43:19 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:19 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
INFO 08-11 16:43:19 [importing.py:53] Triton module has been replaced with a placeholder.
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:43:20 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
INFO 08-11 16:43:20 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:20 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:43:20 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:20 [importing.py:53] Triton module has been replaced with a placeholder.
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:43:20 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
INFO 08-11 16:43:20 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
INFO 08-11 16:43:20 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:20 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
INFO 08-11 16:43:20 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:20 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:43:20 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:20 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 08-11 16:43:21 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-11 16:43:21 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-11 16:43:21 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-11 16:43:21 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-11 16:43:21 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-11 16:43:21 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-11 16:43:21 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-11 16:43:22 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
WARNING 08-11 16:43:22 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
[2025-08-11 16:43:24 TP0] Attention backend not explicitly specified. Use ascend backend by default.
[2025-08-11 16:43:24 TP0] Init torch distributed begin.
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:43:39 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:39 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
INFO 08-11 16:43:39 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:39 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:43:40 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:40 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 08-11 16:43:40 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-11 16:43:40 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:43:41 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:42 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
WARNING 08-11 16:43:42 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
WARNING 08-11 16:43:43 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:43:43 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:43 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:43:43 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:44 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
[2025-08-11 16:43:44 TP1] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
INFO 08-11 16:43:44 [importing.py:53] Triton module has been replaced with a placeholder.
[2025-08-11 16:43:44 TP3] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
INFO 08-11 16:43:45 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 08-11 16:43:45 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
WARNING 08-11 16:43:45 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:43:46 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:46 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
[2025-08-11 16:43:46 TP0] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
WARNING 08-11 16:43:46 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
[2025-08-11 16:43:47 TP2] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
WARNING 08-11 16:43:47 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
[2025-08-11 16:43:49 TP6] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
[2025-08-11 16:43:50 TP5] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
[2025-08-11 16:43:50 TP7] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
[2025-08-11 16:43:52 TP4] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
[2025-08-11 16:43:52 TP0] Init torch distributed ends. mem usage=0.00 GB
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
[2025-08-11 16:43:53 TP1] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP3] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP1] Using Transformers backend.
[2025-08-11 16:43:53 TP3] Using Transformers backend.
[2025-08-11 16:43:53 TP0] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP0] Load weight begin. avail mem=60.63 GB
[2025-08-11 16:43:53 TP0] Using Transformers backend.
[2025-08-11 16:43:53 TP5] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP5] Using Transformers backend.
[2025-08-11 16:43:53 TP6] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP6] Using Transformers backend.
[2025-08-11 16:43:53 TP7] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP7] Using Transformers backend.
[2025-08-11 16:43:53 TP2] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP2] Using Transformers backend.
[2025-08-11 16:43:53 TP4] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP4] Using Transformers backend.
[2025-08-11 16:43:58 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 2421, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 312, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 67, in __init__
    self.worker = TpModelWorker(
                  ^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/managers/tp_worker.py", line 84, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 242, in __init__
    self.initialize(min_per_gpu_memory)
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 285, in initialize
    self.load_model()
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 643, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_loader/loader.py", line 432, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_loader/loader.py", line 174, in _initialize_model
    return model_class(
           ^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/models/transformers.py", line 158, in __init__
    self.model: PreTrainedModel = AutoModel.from_config(
                                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 453, in from_config
    return model_class._from_config(config, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 311, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2208, in _from_config
    model = cls(config, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 454, in __init__
    [GLM4MoEDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 454, in <listcomp>
    [GLM4MoEDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 152, in __init__
    self.mlp = GLM4MoESparseMoeBlock(config, layer_id=layer_idx)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 106, in __init__
    self.experts = nn.ModuleList([
                                 ^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 107, in <listcomp>
    GLM4MoEMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(config.n_routed_experts)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 40, in __init__
    self.gate_proj = nn.Linear(config.hidden_size, intermediate_size, bias=False)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 106, in __init__
    torch.empty((out_features, in_features), **factory_kwargs)
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NPU out of memory. Tried to allocate 12.00 MiB (NPU 1; 60.97 GiB total capacity; 60.59 GiB already allocated; 60.59 GiB current active; 28.28 MiB free; 60.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

[2025-08-11 16:43:58] Received sigquit from a child process. It usually means the child failed.
Killed
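A quick back-of-the-envelope check on the numbers in the traceback above (a sketch, assuming bf16 weights at 2 bytes per parameter; the 106B parameter count and the 60.97 GiB device capacity are taken from the log, activations and KV cache are ignored):

```python
# Rough memory estimate for a 106B-parameter model across 8 NPUs.
# Assumption: bf16 weights (2 bytes/param); activations and KV cache ignored.
PARAMS = 106e9
BYTES_PER_PARAM = 2
TP_RANKS = 8
DEVICE_CAPACITY_GIB = 60.97  # from the "NPU out of memory" message above

total_gib = PARAMS * BYTES_PER_PARAM / 2**30
per_rank_gib = total_gib / TP_RANKS

print(f"full model weights: {total_gib:.1f} GiB")      # ~197 GiB, far over one device
print(f"per rank at tp=8:   {per_rank_gib:.1f} GiB")   # ~25 GiB, fits if sharded
```

The per-rank figure fits comfortably under 61 GiB while the unsharded model does not, so the 60+ GiB already allocated on a single rank is at least consistent with the Transformers fallback path ("GLM4MoEForCausalLM has no SGLang implementation") materializing the MoE expert weights without tensor-parallel sharding. This is an inference from the log, not a confirmed root cause.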

@ping1jing2
Collaborator

@iforgetmyname I am using 8 × Ascend 910B to deploy the GLM4.5-Air (106B) model, but it reports OOM. Could you help take a look? Thanks.

  • Repro steps
# step1: launch docker
docker run -it --rm --shm-size 512g \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver  \
    -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons  \
    -v /usr/local/sbin/:/usr/local/sbin \
    -v /lib/modules:/lib/modules  \
    -v /data/models:/data/models \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    --device=/dev/davinci2 \
    --device=/dev/davinci3 \
    --device=/dev/davinci4 \
    --device=/dev/davinci5 \
    --device=/dev/davinci6 \
    --device=/dev/davinci7 \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    --privileged=true \
    sglang-ascend:latest bash

# step2: prepare env
source /usr/local/Ascend/driver/bin/setenv.bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
pip3 install -U transformers==4.53.3

# step3: launch sglang server
python3 -m sglang.launch_server --model-path=/data/models/glm4.5-air-hf/ --trust-remote-code --tp=8
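To rule out plain allocator pressure (as opposed to the sharding issue itself), a variant of the launch above with more headroom can be tried. This is a hypothetical mitigation, not a verified fix: the `server_args` in the log show the default `mem_fraction_static=0.779`, and the OOM message itself suggests `max_split_size_mb`; the assumption that `PYTORCH_NPU_ALLOC_CONF` mirrors `PYTORCH_CUDA_ALLOC_CONF` for torch_npu should be checked against the torch_npu docs.

```shell
# Assumption: torch_npu honors PYTORCH_NPU_ALLOC_CONF (NPU analogue of
# PYTORCH_CUDA_ALLOC_CONF), per the max_split_size_mb hint in the OOM message.
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

# Leave more headroom than the default mem_fraction_static of 0.779.
python3 -m sglang.launch_server --model-path=/data/models/glm4.5-air-hf/ \
    --trust-remote-code --tp=8 --mem-fraction-static 0.6
```

If the OOM still occurs during weight loading, the cause is likely the unsharded fallback path rather than fragmentation.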
  • Logs
root@d8ac077bb820:/workspace# python3 -m sglang.launch_server --model-path=/data/models/glm4.5-air-hf/ --trust-remote-code --tp=8
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:42:56 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:42:56 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 08-11 16:42:57 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
[2025-08-11 16:43:07] server_args=ServerArgs(model_path='/data/models/glm4.5-air-hf/', tokenizer_path='/data/models/glm4.5-air-hf/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.779, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='npu', tp_size=8, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=909378885, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, served_model_name='/data/models/glm4.5-air-hf/', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, 
lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='pytorch', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend=None, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, 
torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, enable_triton_kernel_moe=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False)
[2025-08-11 16:43:08] Using default HuggingFace chat template with detected content format: string
[... the same torchair / "Triton module has been replaced" / "No platform detected" / vllm._C / AWQ warnings shown above repeat for each of the 8 TP ranks; elided for brevity ...]
[2025-08-11 16:43:24 TP0] Attention backend not explicitly specified. Use ascend backend by default.
[2025-08-11 16:43:24 TP0] Init torch distributed begin.
[... repeated torchair / Triton / vLLM / AWQ warnings from worker startup elided ...]
[2025-08-11 16:43:44 TP1] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
[2025-08-11 16:43:44 TP3] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
[2025-08-11 16:43:46 TP0] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
[2025-08-11 16:43:47 TP2] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
[2025-08-11 16:43:49 TP6] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
[2025-08-11 16:43:50 TP5] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
[2025-08-11 16:43:50 TP7] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
[2025-08-11 16:43:52 TP4] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
[2025-08-11 16:43:52 TP0] Init torch distributed ends. mem usage=0.00 GB
[... the torchair compiler_config UserWarning above repeats many more times (once per importing process); repeats truncated ...]
[2025-08-11 16:43:53 TP1] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP3] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP1] Using Transformers backend.
[2025-08-11 16:43:53 TP3] Using Transformers backend.
[2025-08-11 16:43:53 TP0] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP0] Load weight begin. avail mem=60.63 GB
[2025-08-11 16:43:53 TP0] Using Transformers backend.
[2025-08-11 16:43:53 TP5] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP5] Using Transformers backend.
[2025-08-11 16:43:53 TP6] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP6] Using Transformers backend.
[2025-08-11 16:43:53 TP7] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP7] Using Transformers backend.
[2025-08-11 16:43:53 TP2] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP2] Using Transformers backend.
[2025-08-11 16:43:53 TP4] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP4] Using Transformers backend.
[2025-08-11 16:43:58 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 2421, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 312, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 67, in __init__
    self.worker = TpModelWorker(
                  ^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/managers/tp_worker.py", line 84, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 242, in __init__
    self.initialize(min_per_gpu_memory)
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 285, in initialize
    self.load_model()
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 643, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_loader/loader.py", line 432, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_loader/loader.py", line 174, in _initialize_model
    return model_class(
           ^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/models/transformers.py", line 158, in __init__
    self.model: PreTrainedModel = AutoModel.from_config(
                                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 453, in from_config
    return model_class._from_config(config, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 311, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2208, in _from_config
    model = cls(config, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 454, in __init__
    [GLM4MoEDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 454, in <listcomp>
    [GLM4MoEDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 152, in __init__
    self.mlp = GLM4MoESparseMoeBlock(config, layer_id=layer_idx)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 106, in __init__
    self.experts = nn.ModuleList([
                                 ^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 107, in <listcomp>
    GLM4MoEMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(config.n_routed_experts)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 40, in __init__
    self.gate_proj = nn.Linear(config.hidden_size, intermediate_size, bias=False)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 106, in __init__
    torch.empty((out_features, in_features), **factory_kwargs)
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NPU out of memory. Tried to allocate 12.00 MiB (NPU 1; 60.97 GiB total capacity; 60.59 GiB already allocated; 60.59 GiB current active; 28.28 MiB free; 60.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

[2025-08-11 16:43:58] Received sigquit from a child process. It usually means the child failed.
Killed

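The traceback above shows each TP rank building the full set of `nn.Linear` experts via the Transformers fallback before any sharding, so a back-of-envelope estimate makes the OOM plausible. This is a rough sketch, not a measurement: the ~106B parameter count is taken from the model name and bf16 (2 bytes/param) is an assumed dtype; only the ~61 GiB per-device capacity comes from the error message.

```python
# Rough memory estimate: why materializing an unsharded ~106B model
# on a single NPU fails, while an ideal tp=8 shard would fit.
# Assumptions (not from the logs): ~106e9 params, bf16 (2 bytes/param).
PARAMS = 106e9
BYTES_PER_PARAM = 2           # bf16
NPU_MEM_GIB = 60.97           # per-device capacity from the OOM message

full_model_gib = PARAMS * BYTES_PER_PARAM / 2**30   # whole model on one rank
sharded_gib = full_model_gib / 8                    # ideal tp=8 shard

print(f"full model: {full_model_gib:.1f} GiB")      # ~197.4 GiB >> 60.97 GiB -> OOM
print(f"tp=8 shard: {sharded_gib:.1f} GiB")         # ~24.7 GiB, leaves room for KV cache
```

Under these assumptions the unsharded weights alone are roughly three times one 910B's memory, which matches the "60.59 GiB already allocated" in the error: the fallback path runs out of memory during model construction, before tensor parallelism can help.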
Let me check it. If you don't use this image, do you also encounter the OOM?

@iforgetmyname
Collaborator

@iforgetmyname I am using 8 * Ascend 910B to deploy the GLM4.5-Air (106B) model, but it reports OOM. Could you help take a look? Thanks.

  • Repro steps
# step1: launch docker
docker run -it --rm --shm-size 512g \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver  \
    -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons  \
    -v /usr/local/sbin/:/usr/local/sbin \
    -v /lib/modules:/lib/modules  \
    -v /data/models:/data/models \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    --device=/dev/davinci2 \
    --device=/dev/davinci3 \
    --device=/dev/davinci4 \
    --device=/dev/davinci5 \
    --device=/dev/davinci6 \
    --device=/dev/davinci7 \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    --privileged=true \
    sglang-ascend:latest bash

# step2: prepare env
source /usr/local/Ascend/driver/bin/setenv.bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
pip3 install -U transformers==4.53.3

# step3: launch sglang server
python3 -m sglang.launch_server --model-path=/data/models/glm4.5-air-hf/ --trust-remote-code --tp=8
  • Logs
root@d8ac077bb820:/workspace# python3 -m sglang.launch_server --model-path=/data/models/glm4.5-air-hf/ --trust-remote-code --tp=8
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:42:56 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:42:56 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 08-11 16:42:57 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
[2025-08-11 16:43:07] server_args=ServerArgs(model_path='/data/models/glm4.5-air-hf/', tokenizer_path='/data/models/glm4.5-air-hf/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.779, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='npu', tp_size=8, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=909378885, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, served_model_name='/data/models/glm4.5-air-hf/', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, 
lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='pytorch', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend=None, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, 
torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, enable_triton_kernel_moe=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False)
[2025-08-11 16:43:08] Using default HuggingFace chat template with detected content format: string
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
INFO 08-11 16:43:19 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-11 16:43:19 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
WARNING 08-11 16:43:21 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:42: UserWarning: Using kernels directly from vllm. This might lead to performance degradation or missing functionalities as certain kernels may not be optimized. 
  warnings.warn(
/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/layers/quantization/awq.py:62: UserWarning: Only CUDA and HIP support AWQ currently.
  warnings.warn(f"Only CUDA and HIP support AWQ currently.")
[2025-08-11 16:43:24 TP0] Attention backend not explicitly specified. Use ascend backend by default.
[2025-08-11 16:43:24 TP0] Init torch distributed begin.
[2025-08-11 16:43:44 TP1] Failed to import from custom_ar with ModuleNotFoundError("No module named 'sgl_kernel'")
[2025-08-11 16:43:52 TP0] Init torch distributed ends. mem usage=0.00 GB
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/configs/compiler_config.py:74: UserWarning: The following torchair config or properties may not take effect or report error in max-autotune mode: 
  warnings.warn("The following torchair config or properties may not take effect or report " + \
[2025-08-11 16:43:53 TP1] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP3] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP1] Using Transformers backend.
[2025-08-11 16:43:53 TP3] Using Transformers backend.
[2025-08-11 16:43:53 TP0] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP0] Load weight begin. avail mem=60.63 GB
[2025-08-11 16:43:53 TP0] Using Transformers backend.
[2025-08-11 16:43:53 TP5] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP5] Using Transformers backend.
[2025-08-11 16:43:53 TP6] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP6] Using Transformers backend.
[2025-08-11 16:43:53 TP7] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP7] Using Transformers backend.
[2025-08-11 16:43:53 TP2] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP2] Using Transformers backend.
[2025-08-11 16:43:53 TP4] GLM4MoEForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-11 16:43:53 TP4] Using Transformers backend.
[2025-08-11 16:43:58 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 2421, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 312, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 67, in __init__
    self.worker = TpModelWorker(
                  ^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/managers/tp_worker.py", line 84, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 242, in __init__
    self.initialize(min_per_gpu_memory)
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 285, in initialize
    self.load_model()
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 643, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_loader/loader.py", line 432, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/model_loader/loader.py", line 174, in _initialize_model
    return model_class(
           ^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/models/transformers.py", line 158, in __init__
    self.model: PreTrainedModel = AutoModel.from_config(
                                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 453, in from_config
    return model_class._from_config(config, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 311, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2208, in _from_config
    model = cls(config, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 454, in __init__
    [GLM4MoEDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 454, in <listcomp>
    [GLM4MoEDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 152, in __init__
    self.mlp = GLM4MoESparseMoeBlock(config, layer_id=layer_idx)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 106, in __init__
    self.experts = nn.ModuleList([
                                 ^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 107, in <listcomp>
    GLM4MoEMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(config.n_routed_experts)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_glm4_moe.py", line 40, in __init__
    self.gate_proj = nn.Linear(config.hidden_size, intermediate_size, bias=False)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 106, in __init__
    torch.empty((out_features, in_features), **factory_kwargs)
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NPU out of memory. Tried to allocate 12.00 MiB (NPU 1; 60.97 GiB total capacity; 60.59 GiB already allocated; 60.59 GiB current active; 28.28 MiB free; 60.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

[2025-08-11 16:43:58] Received sigquit from a child process. It usually means the child failed.
Killed
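The error message above suggests its own mitigation (`max_split_size_mb`), and leaving more headroom at startup is another common workaround. A minimal sketch, assuming torch_npu honors `PYTORCH_NPU_ALLOC_CONF` (its analogue of `PYTORCH_CUDA_ALLOC_CONF`) on your version, with `<model-path>` as a placeholder for the GLM-4 MoE checkpoint being loaded:

```shell
# Mitigation sketch for the NPU OOM above — verify both knobs against
# your installed torch_npu / sglang versions before relying on them.

# 1) Reduce allocator fragmentation, as the error message itself suggests.
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:128

# 2) Leave more free memory during model load by lowering SGLang's
#    static memory fraction (default is higher; 0.8 is illustrative).
python3 -m sglang.launch_server \
  --model-path <model-path> \
  --tp 8 \
  --mem-fraction-static 0.8
```

Note that the Transformers fallback backend shown in the log materializes all routed experts eagerly, so a MoE model of this size may still not fit on 8 × 64 GB NPUs regardless of allocator tuning.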

Hey @thincal, could you please open an issue here so that we can track it?

@Alcanderian Alcanderian self-assigned this Aug 12, 2025
@Alcanderian Alcanderian added ready-to-merge The PR is ready to merge after the CI is green. npu labels Aug 12, 2025
@zhyncs zhyncs merged commit 2ecbd8b into sgl-project:main Aug 12, 2025
100 of 102 checks passed
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>
Signed-off-by: lichaoran <pkwarcraft@gmail.com>
Co-authored-by: Even Zhou <even.y.zhou@outlook.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>
Signed-off-by: lichaoran <pkwarcraft@gmail.com>
Co-authored-by: Even Zhou <even.y.zhou@outlook.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>