[Bug] Cannot use gguf #7404

@fireblade2534

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

When trying to run it, I get this error:

sglang            | [2025-06-20 23:51:48] server_args=ServerArgs(model_path='/models/qwen3-32b-gguf/Qwen3-32B-Q5_K_M.gguf', tokenizer_path='/models/qwen3-32b-gguf/Qwen3-32B-Q5_K_M.gguf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='gguf', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization='gguf', quantization_param_path=None, context_length=None, device='cuda', served_model_name='/models/qwen3-32b-gguf/Qwen3-32B-Q5_K_M.gguf', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, impl='auto', host='0.0.0.0', port=6060, mem_fraction_static=0.85, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=2, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=109096101, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser='qwen3', tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=8, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, disable_overlap_cg_plan=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, 
allow_auto_truncate=False, enable_custom_logit_processor=False, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None)
sglang            | Loading a GGUF checkpoint in PyTorch, requires both PyTorch and GGUF>=0.10.0 to be installed. Please see https://pytorch.org/ and https://github.com/ggerganov/llama.cpp/tree/master/gguf-py for installation instructions.
sglang            | Traceback (most recent call last):
sglang            |   File "<frozen runpy>", line 198, in _run_module_as_main
sglang            |   File "<frozen runpy>", line 88, in _run_code
sglang            |   File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 14, in <module>
sglang            |     launch_server(server_args)
sglang            |   File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/http_server.py", line 767, in launch_server
sglang            |     tokenizer_manager, scheduler_info = _launch_subprocesses(server_args=server_args)
sglang            |                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 760, in _launch_subprocesses
sglang            |     tokenizer_manager = TokenizerManager(server_args, port_args)
sglang            |                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/sgl-workspace/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 196, in __init__
sglang            |     self.model_config = ModelConfig.from_server_args(server_args)
sglang            |                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 257, in from_server_args
sglang            |     return ModelConfig(
sglang            |            ^^^^^^^^^^^^
sglang            |   File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 77, in __init__
sglang            |     self.hf_config = get_config(
sglang            |                      ^^^^^^^^^^^
sglang            |   File "/sgl-workspace/sglang/python/sglang/srt/hf_transformers_utils.py", line 118, in get_config
sglang            |     config = AutoConfig.from_pretrained(
sglang            |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/configuration_auto.py", line 1153, in from_pretrained
sglang            |     config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
sglang            |                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 595, in get_config_dict
sglang            |     config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
sglang            |                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
sglang            |     config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
sglang            |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_gguf_pytorch_utils.py", line 365, in load_gguf_checkpoint
sglang            |     raise ImportError("Please install torch and gguf>=0.10.0 to load a GGUF checkpoint in PyTorch.")
sglang            | ImportError: Please install torch and gguf>=0.10.0 to load a GGUF checkpoint in PyTorch.

I am running the official sglang image, so it doesn't make sense that these packages are not installed.
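
For reference, whether the package is actually present can be checked from inside the running container (the container name sglang comes from the compose file below; these commands are my own diagnostic, not from the sglang docs):

# Check whether the gguf package is importable and which version is installed:
docker exec sglang python3 -c "import gguf; print(gguf.__version__)"

# If the import fails or the version is below 0.10.0, installing it manually
# should confirm that the missing dependency is the cause (assumption, untested):
docker exec sglang pip install "gguf>=0.10.0"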

Reproduction

I am using the following docker compose file:

services:
  sglang:
    image: lmsysorg/sglang:latest
    container_name: sglang
    volumes:
      - ${HOME}/.cache/huggingface:/root/.cache/huggingface
      # If you use modelscope, you need to mount this directory
      # - ${HOME}/.cache/modelscope:/root/.cache/modelscope
      - /home/fireblade2534/models:/models
    restart: always
    network_mode: host # required by RDMA
    privileged: true # required by RDMA
    # Or you can only publish port 30000
    # ports:
    #   - 30000:30000
    environment:
      HF_TOKEN: <secret>
      # if you use modelscope to download the model, set this environment variable
      # SGLANG_USE_MODELSCOPE: true
    entrypoint: python3 -m sglang.launch_server
    command: --model-path /models/qwen3-32b-gguf/Qwen3-32B-Q5_K_M.gguf
      --host 0.0.0.0
      --port 6060
      --reasoning-parser qwen3
      --tp 2
      --load-format gguf
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:6060/health || exit 1"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
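
If a manual install of gguf fixes the import, a temporary workaround until the image ships the dependency might be to install it before launching the server. A rough sketch (untested; the flags mirror the compose file above):

# Install gguf first, then start the server with the same arguments
# as in the compose file. --entrypoint sh overrides the image entrypoint.
docker run --rm --gpus all --ipc=host --network host \
  -v /home/fireblade2534/models:/models \
  --entrypoint sh lmsysorg/sglang:latest \
  -c 'pip install "gguf>=0.10.0" && \
      python3 -m sglang.launch_server \
        --model-path /models/qwen3-32b-gguf/Qwen3-32B-Q5_K_M.gguf \
        --host 0.0.0.0 --port 6060 \
        --reasoning-parser qwen3 --tp 2 --load-format gguf'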

Environment

N/A
