[Bug] Cannot use gguf #7404

@fireblade2534

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

When trying to run it, I get this error:

sglang            | [2025-06-20 23:51:48] server_args=ServerArgs(model_path='/models/qwen3-32b-gguf/Qwen3-32B-Q5_K_M.gguf', tokenizer_path='/models/qwen3-32b-gguf/Qwen3-32B-Q5_K_M.gguf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='gguf', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization='gguf', quantization_param_path=None, context_length=None, device='cuda', served_model_name='/models/qwen3-32b-gguf/Qwen3-32B-Q5_K_M.gguf', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, impl='auto', host='0.0.0.0', port=6060, mem_fraction_static=0.85, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=2, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=109096101, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser='qwen3', tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=8, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, disable_overlap_cg_plan=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, 
allow_auto_truncate=False, enable_custom_logit_processor=False, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None)
sglang            | Loading a GGUF checkpoint in PyTorch, requires both PyTorch and GGUF>=0.10.0 to be installed. Please see https://pytorch.org/ and https://github.com/ggerganov/llama.cpp/tree/master/gguf-py for installation instructions.
sglang            | Traceback (most recent call last):
sglang            |   File "<frozen runpy>", line 198, in _run_module_as_main
sglang            |   File "<frozen runpy>", line 88, in _run_code
sglang            |   File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 14, in <module>
sglang            |     launch_server(server_args)
sglang            |   File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/http_server.py", line 767, in launch_server
sglang            |     tokenizer_manager, scheduler_info = _launch_subprocesses(server_args=server_args)
sglang            |                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 760, in _launch_subprocesses
sglang            |     tokenizer_manager = TokenizerManager(server_args, port_args)
sglang            |                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/sgl-workspace/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 196, in __init__
sglang            |     self.model_config = ModelConfig.from_server_args(server_args)
sglang            |                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 257, in from_server_args
sglang            |     return ModelConfig(
sglang            |            ^^^^^^^^^^^^
sglang            |   File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 77, in __init__
sglang            |     self.hf_config = get_config(
sglang            |                      ^^^^^^^^^^^
sglang            |   File "/sgl-workspace/sglang/python/sglang/srt/hf_transformers_utils.py", line 118, in get_config
sglang            |     config = AutoConfig.from_pretrained(
sglang            |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/configuration_auto.py", line 1153, in from_pretrained
sglang            |     config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
sglang            |                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 595, in get_config_dict
sglang            |     config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
sglang            |                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
sglang            |     config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
sglang            |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sglang            |   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_gguf_pytorch_utils.py", line 365, in load_gguf_checkpoint
sglang            |     raise ImportError("Please install torch and gguf>=0.10.0 to load a GGUF checkpoint in PyTorch.")
sglang            | ImportError: Please install torch and gguf>=0.10.0 to load a GGUF checkpoint in PyTorch.

I am running the official sglang image, so it doesn't make sense that these packages are not installed.
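
For reference, whether the package is actually present can be checked from inside the running container (the container name sglang comes from the compose file below; these commands are my own diagnostic, not from the sglang docs):

# Check whether the gguf package is importable and which version is installed:
docker exec sglang python3 -c "import gguf; print(gguf.__version__)"

# If the import fails or the version is below 0.10.0, installing it manually
# should confirm that the missing dependency is the cause (assumption, untested):
docker exec sglang pip install "gguf>=0.10.0"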

Reproduction

I am using the following docker compose file:

services:
  sglang:
    image: lmsysorg/sglang:latest
    container_name: sglang
    volumes:
      - ${HOME}/.cache/huggingface:/root/.cache/huggingface
      # If you use modelscope, you need to mount this directory
      # - ${HOME}/.cache/modelscope:/root/.cache/modelscope
      - /home/fireblade2534/models:/models
    restart: always
    network_mode: host # required by RDMA
    privileged: true # required by RDMA
    # Or you can only publish port 30000
    # ports:
    #   - 30000:30000
    environment:
      HF_TOKEN: <secret>
      # if you use modelscope to download the model, set this environment variable
      # SGLANG_USE_MODELSCOPE: true
    entrypoint: python3 -m sglang.launch_server
    command: --model-path /models/qwen3-32b-gguf/Qwen3-32B-Q5_K_M.gguf
      --host 0.0.0.0
      --port 6060
      --reasoning-parser qwen3
      --tp 2
      --load-format gguf
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:6060/health || exit 1"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
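
If a manual install of gguf fixes the import, a temporary workaround until the image ships the dependency might be to install it before launching the server. A rough sketch (untested; the flags mirror the compose file above):

# Install gguf first, then start the server with the same arguments
# as in the compose file. --entrypoint sh overrides the image entrypoint.
docker run --rm --gpus all --ipc=host --network host \
  -v /home/fireblade2534/models:/models \
  --entrypoint sh lmsysorg/sglang:latest \
  -c 'pip install "gguf>=0.10.0" && \
      python3 -m sglang.launch_server \
        --model-path /models/qwen3-32b-gguf/Qwen3-32B-Q5_K_M.gguf \
        --host 0.0.0.0 --port 6060 \
        --reasoning-parser qwen3 --tp 2 --load-format gguf'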

Environment

N/A
