
Fix model loading from S3 buckets #11865

Closed

ganochenkodg wants to merge 21 commits into sgl-project:main from ganochenkodg:main

Conversation

@ganochenkodg

@ganochenkodg ganochenkodg commented Oct 20, 2025

Motivation

The goal of this PR is to fix loading models from S3 buckets. The code contains a bug in the iterator function: https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/connector/s3.py#L104

file.removeprefix leaves a / at the beginning of the filename, which causes issues during os.path.join execution. The Python documentation for path.join says: "If any component is an absolute path, all previous components (on Windows, including the previous drive letter, if there was one) are thrown away, and joining continues." Because of this bug, all files from S3 end up saved in the root folder, and SGLang can't find them during model loading:

ValueError: Unrecognized model in /tmp/tmpgtn2cumr. Should have a model_type key in its config.json, or contain one of the following strings in its name: list of model types

Modifications

This PR fixes the os.path.join call and also adds the S3-related Python dependencies to the Docker image, because for some reason they are not present in the current images.

destination_file = os.path.join(local_dir, file.removeprefix(base_dir))
print(destination_file)
/config.json

destination_file = os.path.join(local_dir, file.removeprefix(f"{base_dir}/"))
print(destination_file)
/tmp/somefolder/config.json
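
The behaviour is easy to reproduce in isolation. The snippet below is a standalone illustration (the paths and prefix names are made up) of why the stray leading slash sends every file to the filesystem root:

import os

local_dir = "/tmp/somefolder"
base_dir = "models/llama"          # hypothetical S3 prefix, no trailing slash
file = "models/llama/config.json"  # key as returned by the S3 listing

# removeprefix(base_dir) leaves "/config.json", which is absolute,
# so os.path.join throws local_dir away.
print(os.path.join(local_dir, file.removeprefix(base_dir)))        # /config.json

# Stripping the prefix including the slash keeps the path relative,
# so it is joined under local_dir as intended.
print(os.path.join(local_dir, file.removeprefix(f"{base_dir}/")))  # /tmp/somefolder/config.json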

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @ganochenkodg, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue where models failed to load from S3 buckets due to incorrect local file path generation. The fix involves a precise adjustment to how file paths are handled when downloading from S3, preventing files from being inadvertently saved to the root directory. Additionally, it updates the Docker build process to include essential S3 client libraries, ensuring the application has all required components for S3 integration.

Highlights

  • S3 Model Loading Fix: Corrected a critical bug in the S3 connector that prevented models from loading correctly by fixing an issue with path construction during file download, ensuring files are saved to the intended local directories.
  • Docker Image Updates: Added necessary S3-related Python dependencies (boto3, runai-model-streamer, runai-model-streamer-s3) to the Dockerfile to support S3 operations within the containerized environment.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a critical bug in the SGLang framework that prevented models from being loaded correctly from S3 buckets. The issue stemmed from an incorrect path manipulation within the pull_files function, leading to files being saved in the wrong directory. The fix involves correcting the file.removeprefix call to properly construct the destination file path. Additionally, the PR adds necessary S3 dependencies to the Docker image to ensure the framework can interact with S3 storage. I have identified a critical issue in the pull_files function that needs to be addressed to ensure correct functionality.

return

for file in files:
    destination_file = os.path.join(self.local_dir, file.removeprefix(base_dir))
Contributor

critical

The original code file.removeprefix(base_dir) can lead to incorrect file paths if base_dir does not end with a /. This can cause files to be saved in the root directory instead of the intended subdirectory. The corrected code file.removeprefix(f"{base_dir}/") ensures that the prefix always includes the trailing slash, preventing this issue.

It's important to ensure that base_dir always ends with a / to avoid unexpected behavior. If base_dir is guaranteed to always end with a /, then the change is correct. However, if there's a possibility that base_dir might not end with a /, it would be safer to add a check to ensure that it does before calling removeprefix.

I'm marking this as critical because the bug causes all files to be saved in the root folder, which is a major issue.

Suggested change
destination_file = os.path.join(self.local_dir, file.removeprefix(base_dir))
destination_file = os.path.join(self.local_dir, file.removeprefix(f"{base_dir}/"))

@noa-neria

noa-neria commented Oct 20, 2025

To fully support loading from S3 buckets, the calls to create_remote_connector should also be fixed here and here.

This is because create_remote_connector now has an additional parameter (device) in its definition.

The device parameter is passed only when the model weights are loaded, but not in the other calls.
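
A minimal sketch of the mismatch described above (the stand-in function below only mirrors the shape of the problem, not the actual sglang code):

def create_remote_connector(url: str, device: str):
    # Stand-in for the real connector factory; the signature now requires device.
    return (url, device)

# The weight-loading path was updated and passes device:
create_remote_connector("s3://bucket/model", device="cuda")

# Other call sites (e.g. pulling config/tokenizer) still use the old one-argument form:
try:
    create_remote_connector("s3://bucket/model")
except TypeError as e:
    print(e)  # missing 1 required positional argument: 'device'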

@ganochenkodg
Author

@noa-neria is it really necessary to fix this particular bug? I tested this fix with sglang 0.5.2 and it works fine; you can find my image here: https://hub.docker.com/layers/dganochenko/sglang/v0.5.2/

@noa-neria

noa-neria commented Oct 23, 2025

Yes, it is necessary.
For the latest version 0.5.3, sglang.launch_server --model-path "s3://path/to/model" fails with the following error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ubuntu/sglang_env/lib/python3.12/site-packages/sglang/launch_server.py", line 11, in <module>
    server_args = prepare_server_args(sys.argv[1:])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sglang_env/lib/python3.12/site-packages/sglang/srt/server_args.py", line 3477, in prepare_server_args
    return ServerArgs.from_cli_args(raw_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sglang_env/lib/python3.12/site-packages/sglang/srt/server_args.py", line 3126, in from_cli_args
    return cls(**{attr: getattr(args, attr) for attr in attrs})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 258, in __init__
  File "/home/ubuntu/sglang_env/lib/python3.12/site-packages/sglang/srt/server_args.py", line 522, in __post_init__
    self._handle_gpu_memory_settings(gpu_mem)
  File "/home/ubuntu/sglang_env/lib/python3.12/site-packages/sglang/srt/server_args.py", line 763, in _handle_gpu_memory_settings
    model_config = ModelConfig.from_server_args(self)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sglang_env/lib/python3.12/site-packages/sglang/srt/configs/model_config.py", line 208, in from_server_args
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/home/ubuntu/sglang_env/lib/python3.12/site-packages/sglang/srt/configs/model_config.py", line 107, in __init__
    self._maybe_pull_model_tokenizer_from_remote()
  File "/home/ubuntu/sglang_env/lib/python3.12/site-packages/sglang/srt/configs/model_config.py", line 728, in _maybe_pull_model_tokenizer_from_remote
    client = create_remote_connector(self.model_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: create_remote_connector() missing 1 required positional argument: 'device'

@ganochenkodg
Author

> Yes, it is necessary. For the latest version 0.5.3, sglang.launch_server --model-path "s3://path/to/model" fails with the following error:

Got it, will fix that soon.

@ganochenkodg
Author

ganochenkodg commented Oct 27, 2025

@noa-neria done, I fixed it and tested with S3 and Hugging Face.

S3 logs

root@sglang-7946cfd5dc-488z4:/sgl-workspace/sglang# pip list| grep sglang
sglang                    0.5.4.post1   /sgl-workspace/sglang/python
sglang-router             0.2.1
root@sglang-7946cfd5dc-488z4:/sgl-workspace/sglang# python3 -m sglang.launch_server --model-path s3://epam-team-gpu-benchmark/Llama-3.1-8B-Instruct --served-model-name unsloth/Llama-3.1-8B-Instruct --load-format remote --host 0.0.0.0 --port 30000 --context-length 8192
[2025-10-27 22:37:09] INFO model_config.py:784: Pulling model configs from remote...
[2025-10-27 22:37:09] INFO credentials.py:1224: Found credentials in environment variables.
[2025-10-27 22:37:09] INFO configprovider.py:988: Found endpoint for s3 via: environment_global.
[2025-10-27 22:37:10] INFO configprovider.py:988: Found endpoint for s3 via: environment_global.
...
Loading safetensors using Runai Model Streamer: 100% Completed | 4/4 [00:30<00:00,  6.39s/it]
Loading safetensors using Runai Model Streamer: 100% Completed | 4/4 [00:30<00:00,  7.65s/it]
[RunAI Streamer] Overall time to stream 15.0 GiB of all files to cpu: 30.62s, 500.2 MiB/s
[2025-10-27 22:38:07] Loaded weights from remote storage in 31.24 seconds.
[2025-10-27 22:38:07] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=6.79 GB, mem usage=15.05 GB.
[2025-10-27 22:38:07] Using KV cache dtype: torch.bfloat16
[2025-10-27 22:38:07] KV Cache is allocated. #tokens: 26483, K size: 1.62 GB, V size: 1.62 GB
[2025-10-27 22:38:07] Memory pool end. avail mem=3.46 GB
[2025-10-27 22:38:07] Capture cuda graph begin. This can take up to several minutes. avail mem=2.95 GB
[2025-10-27 22:38:07] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24]
Capturing batches (bs=1 avail_mem=2.82 GB): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [01:57<00:00, 16.74s/it]
[2025-10-27 22:40:05] Capture cuda graph end. Time elapsed: 117.98 s. mem usage=0.14 GB. avail mem=2.81 GB.
[2025-10-27 22:40:05] Found endpoint for s3 via: environment_global.
[2025-10-27 22:40:07] max_total_num_tokens=26483, chunked_prefill_size=2048, max_prefill_tokens=16384, max_running_requests=2048, context_len=8192, available_gpu_mem=2.81 GB
[2025-10-27 22:40:08] INFO:     Started server process [83]
[2025-10-27 22:40:08] INFO:     Waiting for application startup.
[2025-10-27 22:40:08] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.9}
[2025-10-27 22:40:08] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.9}
[2025-10-27 22:40:08] INFO:     Application startup complete.
[2025-10-27 22:40:08] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-10-27 22:40:09] INFO:     127.0.0.1:42290 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-27 22:40:09] Prefill batch, #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-27 22:40:13] INFO:     127.0.0.1:42304 - "POST /generate HTTP/1.1" 200 OK
[2025-10-27 22:40:13] The server is fired up and ready to roll!

HF logs

root@sglang-7946cfd5dc-488z4:/sgl-workspace/sglang# python3 -m sglang.launch_server --model-path unsloth/Llama-3.1-8B-Instruct  --host 0.0.0.0 --port 30000 --context-length 8192
[2025-10-27 22:45:06] WARNING server_args.py:1104: Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-10-27 22:45:06] INFO trace.py:48: opentelemetry package is not installed, tracing disabled
...
[2025-10-27 22:45:27] Load weight begin. avail mem=21.84 GB
[2025-10-27 22:45:28] Using model weights format ['*.safetensors']
model-00001-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.98G/4.98G [00:28<00:00, 175MB/s]
model-00002-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.00G/5.00G [00:31<00:00, 161MB/s]
model-00003-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [00:30<00:00, 162MB/s]
model-00004-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.17G/1.17G [00:06<00:00, 178MB/s]
model.safetensors.index.json: 23.9kB [00:00, 45.8MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:28<01:24, 28.10s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:28<00:23, 11.86s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:56<00:19, 19.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:24<00:00, 22.76s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:24<00:00, 21.18s/it]
[2025-10-27 22:48:31] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=6.79 GB, mem usage=15.05 GB.
[2025-10-27 22:48:31] Using KV cache dtype: torch.bfloat16
[2025-10-27 22:48:31] KV Cache is allocated. #tokens: 26483, K size: 1.62 GB, V size: 1.62 GB
[2025-10-27 22:48:31] Memory pool end. avail mem=3.46 GB
[2025-10-27 22:48:31] Capture cuda graph begin. This can take up to several minutes. avail mem=2.95 GB
[2025-10-27 22:48:31] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24]
Capturing batches (bs=1 avail_mem=2.82 GB): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  2.57it/s]
[2025-10-27 22:48:35] Capture cuda graph end. Time elapsed: 3.46 s. mem usage=0.14 GB. avail mem=2.81 GB.
[2025-10-27 22:48:36] max_total_num_tokens=26483, chunked_prefill_size=2048, max_prefill_tokens=16384, max_running_requests=2048, context_len=8192, available_gpu_mem=2.81 GB
[2025-10-27 22:48:37] INFO:     Started server process [861]
[2025-10-27 22:48:37] INFO:     Waiting for application startup.
[2025-10-27 22:48:37] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.9}
[2025-10-27 22:48:37] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.9}
[2025-10-27 22:48:37] INFO:     Application startup complete.
[2025-10-27 22:48:37] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-10-27 22:48:38] INFO:     127.0.0.1:36104 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-27 22:48:38] Prefill batch, #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-27 22:48:39] INFO:     127.0.0.1:36106 - "POST /generate HTTP/1.1" 200 OK
[2025-10-27 22:48:39] The server is fired up and ready to roll!

@noa-neria

@merrymercy @hnyls2002 can you please help review?
This PR fixes a bug that was introduced in #8215 and broke the integration with the connector to object storage.
I am one of the maintainers of the model streamer, which is used here for fast loading.

model_impl: Union[str, ModelImpl] = ModelImpl.AUTO,
sampling_defaults: str = "openai",
quantize_and_serve: bool = False,
device: Optional[str] = None,
Collaborator

@b8zhong b8zhong Nov 3, 2025

Is there a way to avoid adding this to model config? @ganochenkodg

Edit: disregard this comment.

Collaborator

@ganochenkodg, why would you need this device config?

Author

@ganochenkodg ganochenkodg Nov 7, 2025

@hnyls2002 In the ModelConfig init function we run self._maybe_pull_model_tokenizer_from_remote() (line 116), which executes create_remote_connector, and that requires two params: model_path and device (line 812). I can take this device variable in the from_server_args method (line 240), but I have to store it somewhere in order to pass it to create_remote_connector.

I can stop storing device as a model parameter and pass it to the _maybe_pull_model_tokenizer_from_remote function directly, like this:
- self.device = device
- self._maybe_pull_model_tokenizer_from_remote()
+ self._maybe_pull_model_tokenizer_from_remote(device)
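
Spelled out, that approach looks roughly like the sketch below (an illustration of the shape only; the real sglang signatures have many more parameters):

def create_remote_connector(url, device):
    # Stand-in for the real factory, which now needs the device argument.
    print(f"connecting to {url} on device {device}")

class ModelConfig:
    def __init__(self, model_path, device=None):
        self.model_path = model_path
        # device is threaded through during construction instead of being
        # stored on the config object.
        self._maybe_pull_model_tokenizer_from_remote(device)

    def _maybe_pull_model_tokenizer_from_remote(self, device):
        if self.model_path.startswith(("s3://", "gs://")):
            create_remote_connector(self.model_path, device)

ModelConfig("s3://bucket/model", device="cuda")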

Collaborator

please do it, the device parameter should not be stored in model config.

Author

@hnyls2002 done, check it now please

Collaborator

@ganochenkodg It still seems to be in ModelConfig?

Image

Author

@b8zhong I left it in the parameters because I have no choice. We don't store it in self.device anymore, we only pass it on. ModelConfig's init function calls self._maybe_pull_model_tokenizer_from_remote (here), which calls create_remote_connector (here), which requires device as a parameter. So I added passing the device parameter to the init function in from_server_args (here); I can't find any other way to obtain device than passing it here. Without device, create_remote_connector obviously crashes.

&& rm -rf /root/.cargo /root/.rustup target dist ~/.cargo

# Install S3 dependencies
RUN python3 -m pip install --no-cache-dir --break-system-packages \
Collaborator

Why break system packages?

Author

I always add this to avoid warnings from pip. Looks like it's not needed here; fixed.


@ganochenkodg
Author

UPD:
I also added support for gs:// prefixes in model_path. Now you can use gs://bucket_name/model/path without setting specific environment variables for Google Cloud Storage buckets; they will be set automatically.

root@sglang-66b468b458-mjrxg:/sgl-workspace/sglang# env| grep ENDPOINT
root@sglang-66b468b458-mjrxg:/sgl-workspace/sglang# python3 -m sglang.launch_server --model-path gs://epam-team-gpu-benchmark/Llama-3.1-8B-Instruct --served-model-name unsloth/Llama-3.1-8B-Instruct --load-format remote --host 0.0.0.0 --port 30000 --context-length 8192
[2025-11-07 15:56:26] INFO model_config.py:808: Pulling model configs from remote...
[2025-11-07 15:56:26] INFO credentials.py:1224: Found credentials in environment variables.
[2025-11-07 15:56:27] INFO configprovider.py:988: Found endpoint for s3 via: environment_global.
[2025-11-07 15:56:27] INFO configprovider.py:988: Found endpoint for s3 via: environment_global.
...
[2025-11-07 15:59:09] INFO:     127.0.0.1:33484 - "POST /generate HTTP/1.1" 200 OK
[2025-11-07 15:59:09] The server is fired up and ready to roll!
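
One plausible way to map a gs:// path onto the S3-compatible interface the streamer already speaks (an assumption for illustration only; the helper name, the environment variable choice, and the exact mechanics may differ from what this PR does) is to point the S3 client at GCS's interoperability endpoint when a gs:// scheme is detected:

import os

def configure_gcs_via_s3_compat(model_path: str) -> str:
    # Hypothetical helper: GCS exposes an S3-compatible XML API, so a gs://
    # path can be served through the S3 code path once the endpoint is set
    # and HMAC credentials are provided.
    if model_path.startswith("gs://"):
        os.environ.setdefault("AWS_ENDPOINT_URL", "https://storage.googleapis.com")
        return "s3://" + model_path.removeprefix("gs://")
    return model_path

print(configure_gcs_via_s3_compat("gs://bucket_name/model/path"))  # s3://bucket_name/model/path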

@noa-neria @hnyls2002 ^


@b8zhong
Collaborator

b8zhong commented Dec 2, 2025

/tag-and-rerun-ci

@ganochenkodg ganochenkodg marked this pull request as draft January 29, 2026 01:28
@noa-neria

@b8zhong @hnyls2002 @ganochenkodg The Model Streamer team just opened PR #17948 for a solid integration with the runai model streamer, as was done in vLLM.
The new PR solves the issues in the current (broken) integration and is native to both S3 and GCS.
The new integration also supports faster loading of multiple files and distributed streaming.

@b8zhong
Collaborator

b8zhong commented Jan 30, 2026

Closing because of the updated PR. But thanks @ganochenkodg

@b8zhong b8zhong closed this Jan 30, 2026