Support http redirection download by myandpr · Pull Request #59384 · ray-project/ray

myandpr · 2025-12-11T11:53:38Z

Description

When using runtime_env.working_dir with a remote zip archive URL (for example, https://gitee.com/whaozi/kuberay/repository/archive/master.zip), Ray downloads an HTML page instead of the actual zip file. This causes the Ray job to fail when it accesses files from the working directory.

Downloading the same URL with standard tools such as wget works as expected and returns the correct zip archive. This PR addresses the inconsistency in how runtime_env.working_dir handles remote archive downloads.

For example:

import ray

ray.init(include_dashboard=False, ignore_reinit_error=True)
@ray.remote(
    runtime_env={"working_dir": "https://gitee.com/whaozi/kuberay/repository/archive/master.zip"}
)
def list_repo_files():
    import pathlib
    return sorted(p.name for p in pathlib.Path(".").iterdir())

print(ray.get(list_repo_files.remote()))
ray.shutdown()

After the download, the directory https_gitee_com_whaozi_kuberay_repository_archive_master is empty, and https_gitee_com_whaozi_kuberay_repository_archive_master.zip is an HTML file rather than a zip archive.
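A quick way to confirm this symptom is to check whether the downloaded file is really a zip archive. The following is a minimal sketch using only the standard library; `describe_download` is a hypothetical helper name, not part of Ray.

```python
import zipfile
from pathlib import Path


def describe_download(path):
    """Classify a downloaded file as 'zip', 'html', or 'unknown'.

    Hypothetical helper for debugging; not part of Ray.
    """
    path = Path(path)
    # is_zipfile looks for the zip end-of-central-directory record.
    if zipfile.is_zipfile(path):
        return "zip"
    # Otherwise peek at the first bytes to see whether a web page came back.
    with path.open("rb") as f:
        head = f.read(512).lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```

Running it on the cached `.zip` file from the failing job would be expected to report `html` instead of `zip`.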

Testing the same URL with wget:

wget https://gitee.com/whaozi/kuberay/repository/archive/master.zip
--2025-08-05 14:28:52--  https://gitee.com/whaozi/kuberay/repository/archive/master.zip
Resolving gitee.com (gitee.com)... 180.76.198.77, 180.76.199.13, 180.76.198.225
Connecting to gitee.com (gitee.com)|180.76.198.77|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D [following]
--2025-08-05 14:28:54--  https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D
Reusing existing connection to gitee.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip                                 [                                                  <=>                        ]  10.37M  1.23MB/s    in 13s

I think we are not handling HTTP redirection here. If I use the redirected URL directly, it works:

from smart_open import open as open_file

with open_file("https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D", "rb") as fin:
    with open_file("/tmp/jjyao_test.zip", "wb") as fout:
        fout.write(fin.read())

Problem

When using runtime_env.working_dir with a remote zip URL (e.g. gitee archives), Ray’s HTTPS downloader sends the default Python-urllib User-Agent, and some hosts respond with an HTML page rather than the archive. The working directory then contains HTML and the Ray job fails, while wget succeeds because it does not present the Python-urllib User-Agent.
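To make the failure mode concrete, here is a small self-contained sketch that reproduces it against a local test server. The server is hypothetical and only imitates the kind of User-Agent filtering described above; it is not a claim about how gitee is implemented. `UASensitiveHandler`, `fetch_content_type`, and `run_demo` are illustrative names.

```python
import http.server
import threading
import urllib.request


class UASensitiveHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical host: HTML for Python-urllib clients, a zip for everyone else."""

    def do_GET(self):
        user_agent = self.headers.get("User-Agent", "")
        if user_agent.startswith("Python-urllib"):
            body, content_type = b"<html>please log in</html>", "text/html"
        else:
            # 22-byte end-of-central-directory record: a valid empty zip.
            body, content_type = b"PK\x05\x06" + b"\x00" * 18, "application/zip"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_content_type(url, headers=None):
    """Return the Content-Type the server sends for the given request headers."""
    # An empty ProxyHandler keeps local proxy settings out of the demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers=headers or {})
    with opener.open(request, timeout=5) as response:
        return response.headers.get_content_type()


def run_demo():
    server = http.server.HTTPServer(("127.0.0.1", 0), UASensitiveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/master.zip"
        default_type = fetch_content_type(url)
        custom_type = fetch_content_type(
            url, {"User-Agent": "ray-runtime-env-curl/1.0", "Accept": "*/*"}
        )
        return default_type, custom_type
    finally:
        server.shutdown()
        server.server_close()
```

With the default headers the sketch receives `text/html`; with the custom User-Agent it receives `application/zip`, which mirrors the difference between Ray's download and wget reported above.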

Solution

_download_https_uri() now sets curl-like headers (User-Agent: ray-runtime-env-curl/1.0 and Accept: */*; the User-Agent is configurable via RAY_RUNTIME_ENV_HTTP_USER_AGENT). This keeps Ray’s behavior consistent with curl/wget, allowing gitee and similar hosts to return the proper zip file. A regression test verifies that the headers are set.
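The header construction described here could look roughly like the sketch below. The default value and the environment variable name come from the description; the function name `build_http_headers` and the `environ` parameter are assumptions made for illustration and are not the names used in the PR.

```python
import os

# Values taken from the PR description.
DEFAULT_USER_AGENT = "ray-runtime-env-curl/1.0"
USER_AGENT_ENV_VAR = "RAY_RUNTIME_ENV_HTTP_USER_AGENT"


def build_http_headers(environ=None):
    """Return curl-like request headers, honoring the env var override.

    Hypothetical helper; `environ` is injectable only to make testing easy.
    """
    environ = os.environ if environ is None else environ
    # Fall back to the default when the variable is unset or empty.
    user_agent = environ.get(USER_AGENT_ENV_VAR) or DEFAULT_USER_AGENT
    return {"User-Agent": user_agent, "Accept": "*/*"}
```

Setting `RAY_RUNTIME_ENV_HTTP_USER_AGENT` before starting Ray would then change only the User-Agent value; the Accept header stays `*/*`.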

Related issues

Fixes #52233

Additional information

Signed-off-by: yaommen <myanstu@163.com>

gemini-code-assist

Code Review

This pull request correctly addresses an issue with downloading files over HTTPS by using urllib.request and setting a curl-like User-Agent header. This prevents some servers from returning incorrect content. The changes are well-structured, with a new helper function for downloading and a corresponding unit test. My main feedback is to add a timeout to the network request to prevent potential hangs. I've also suggested a small change to the test to support this.

gemini-code-assist · 2025-12-11T11:54:48Z

python/ray/_private/runtime_env/protocol.py

+
+        request = urllib.request.Request(source_uri, headers=cls._http_headers())
+        try:
+            with urllib.request.urlopen(request) as response:


It's good practice to include a timeout for network requests to prevent the process from hanging indefinitely, as the default timeout can be very long or infinite. Consider adding a reasonable timeout, for example, 60 seconds. Note that the corresponding test mock for urlopen will also need to be updated to accept a timeout parameter.

Suggested change

with urllib.request.urlopen(request) as response:

with urllib.request.urlopen(request, timeout=60) as response:
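For context, a download helper that applies the suggested timeout might be sketched as follows. This is an illustration of the suggestion under stated assumptions, not the code in the PR: `download_with_timeout` and its parameters are hypothetical names.

```python
import shutil
import urllib.request


def download_with_timeout(source_uri, dest_file, headers=None, timeout=60):
    """Stream `source_uri` to `dest_file`, failing instead of hanging forever.

    Hypothetical helper; the 60-second default follows the review suggestion.
    """
    request = urllib.request.Request(source_uri, headers=headers or {})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            with open(dest_file, "wb") as fout:
                # Copy in chunks so large archives are not held in memory.
                shutil.copyfileobj(response, fout)
    except OSError as exc:
        # URLError and socket timeouts are both OSError subclasses.
        raise RuntimeError(f"Failed to download {source_uri}: {exc}") from exc
```

The timeout applies to blocking socket operations such as connecting and reading, so a stalled server surfaces as an error rather than an indefinitely blocked runtime_env setup.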

gemini-code-assist · 2025-12-11T11:54:48Z

python/ray/tests/test_runtime_env_packaging.py

+        def __exit__(self, exc_type, exc, tb):
+            self.close()
+
+    def fake_urlopen(request):


To support the addition of a timeout to the urlopen call in _download_https_uri, this mock function's signature should be updated to accept a timeout argument.

Suggested change

def fake_urlopen(request):

def fake_urlopen(request, timeout=None):
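A test built around this mock signature could look like the sketch below. It is a standalone illustration, not the test in test_runtime_env_packaging.py: `fetch_bytes` stands in for the code under test, and `run_header_check` is a hypothetical name.

```python
import io
import urllib.request
from unittest import mock


def fetch_bytes(url, headers, timeout=60):
    """Stand-in for the code under test: request with headers and a timeout."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def run_header_check():
    """Patch urlopen and report what the code under test passed to it."""
    captured = {}

    def fake_urlopen(request, timeout=None):
        # Request stores header names capitalized, e.g. "User-agent".
        captured["user_agent"] = request.get_header("User-agent")
        captured["timeout"] = timeout
        # BytesIO supports the context-manager protocol, like a real response.
        return io.BytesIO(b"zip-bytes")

    with mock.patch("urllib.request.urlopen", fake_urlopen):
        body = fetch_bytes(
            "https://example.com/archive.zip",
            {"User-Agent": "ray-runtime-env-curl/1.0"},
        )
    return body, captured
```

Because the mock accepts `timeout`, the same test can assert both that the headers were set and that a timeout was actually passed through.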


edoakes · 2025-12-12T16:36:17Z

@myandpr can you please add some more detail about the specific problem and how this solves it to the PR description?

And is there any way to configure smart_open properly so it handles this for us?


myandpr · 2025-12-17T02:37:57Z

> @myandpr can you please add some more detail about the specific problem and how this solves it to the PR description?
>
> And is there any way to configure smart_open properly so it handles this for us?

@edoakes Thanks for the suggestion. I have updated the PR description, and I have also updated the PR so that HTTPS downloads now go through smart_open whenever it is available, passing our curl-style User-Agent/Accept headers. Please take a look when you have time. Thanks very much.


myandpr · 2025-12-18T18:16:42Z

@edoakes PTAL. Thanks a lot

edoakes · 2025-12-18T22:57:44Z

python/ray/_private/runtime_env/protocol.py

+        # Prefer smart_open so we get consistent redirect/cert handling with the
+        # rest of our remote protocols.  Fall back to urllib if it is not
+        # available so HTTPS downloads keep working without extra deps.


I believe https downloads already depend on smart_open today, so there's no need for this special handling

I've removed the urllib fallback and introduced _handle_https_protocol, so HTTPS now always uses smart_open

edoakes · 2025-12-18T22:58:55Z

python/ray/_private/runtime_env/protocol.py

+        elif protocol == "https":
+            cls._download_https_uri(source_uri=source_uri, dest_file=dest_file)
+            return


instead of this code divergence with the early return, let's just have it return an open function that wraps the header settings

Done; the HTTPS branch now just calls _handle_https_protocol()
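The idea of returning an open function that wraps the header settings can be sketched as below. This is an assumption-laden illustration rather than the body of `_handle_https_protocol`: `make_https_open` is a hypothetical name, and the sketch relies on smart_open's HTTP transport accepting a `headers` entry in `transport_params`, which is my understanding of its API.

```python
import functools


def make_https_open(open_fn, headers):
    """Return an open function that always sends `headers` on HTTP(S) requests.

    `open_fn` is expected to behave like smart_open.open, i.e. to accept a
    `transport_params` keyword argument. Hypothetical helper, not Ray's code.
    """
    # Copy the headers so later mutation by the caller has no effect.
    transport_params = {"headers": dict(headers)}
    return functools.partial(open_fn, transport_params=transport_params)
```

A caller could then build the function once, for example `open_file = make_https_open(smart_open.open, {"User-Agent": "ray-runtime-env-curl/1.0", "Accept": "*/*"})`, and use `open_file(uri, "rb")` on the same code path as the other remote protocols, without an early return.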


edoakes

Thanks!

edoakes · 2025-12-22T22:14:19Z

CI tests triggered. PR will auto-merge if tests pass. If not, please ping me once tests are passing.


myandpr · 2025-12-23T09:30:48Z

> CI tests triggered. PR will auto-merge if tests pass. If not, please ping me once tests are passing.

Hi @edoakes, CI is green now. Thanks again for the review! Could you help merge the PR?


support http download

c296b83

Signed-off-by: yaommen <myanstu@163.com>

myandpr requested a review from a team as a code owner December 11, 2025 11:53

gemini-code-assist bot reviewed Dec 11, 2025


cursor bot reviewed Dec 11, 2025


support http download

8c0192c

Signed-off-by: yaommen <myanstu@163.com>

myandpr mentioned this pull request Dec 11, 2025

ray download https url from working_dir get html page not zip file #52233

Closed

ray-gardener bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Dec 11, 2025

edoakes self-assigned this Dec 12, 2025

support http download

e7a9a00

Signed-off-by: yaommen <myanstu@163.com>

myandpr force-pushed the support-http-download branch from 1e23d4e to e7a9a00 Compare December 17, 2025 12:47

cursor bot reviewed Dec 17, 2025


myandpr force-pushed the support-http-download branch from b83ed88 to e7a9a00 Compare December 17, 2025 13:07

chore: retrigger ci

0791897

Signed-off-by: yaommen <myanstu@163.com>

myandpr force-pushed the support-http-download branch from 3746858 to 0791897 Compare December 17, 2025 14:29

edoakes reviewed Dec 18, 2025


fix

4ae85b0

Signed-off-by: yaommen <myanstu@163.com>

myandpr requested a review from edoakes December 22, 2025 19:43

edoakes added the go label (add ONLY when ready to merge, run all tests) Dec 22, 2025

edoakes approved these changes Dec 22, 2025


edoakes enabled auto-merge (squash) December 22, 2025 22:14

chore: retrigger ci

b13fa1d

Signed-off-by: yaommen <myanstu@163.com>

github-actions bot disabled auto-merge December 23, 2025 06:08

edoakes merged commit 2709187 into ray-project:master Dec 23, 2025
6 checks passed
