Support http redirection download #59384

Merged
edoakes merged 6 commits into ray-project:master from myandpr:support-http-download
Dec 23, 2025

Conversation

@myandpr
Member

@myandpr myandpr commented Dec 11, 2025

Description

When using runtime_env.working_dir with a remote zip archive URL (for example, https://gitee.com/whaozi/kuberay/repository/archive/master.zip), Ray downloads an HTML page instead of the actual zip file. This causes the Ray job to fail when accessing files from the working directory.

Downloading the same URL with standard tools such as wget works as expected and returns the correct zip archive. This PR addresses the inconsistency in how runtime_env.working_dir handles remote archive downloads.

For example:

import ray

ray.init(include_dashboard=False, ignore_reinit_error=True)
@ray.remote(
    runtime_env={"working_dir": "https://gitee.com/whaozi/kuberay/repository/archive/master.zip"}
)
def list_repo_files():
    import pathlib
    return sorted(p.name for p in pathlib.Path(".").iterdir())

print(ray.get(list_repo_files.remote()))
ray.shutdown()

After running this, the directory https_gitee_com_whaozi_kuberay_repository_archive_master is empty, and https_gitee_com_whaozi_kuberay_repository_archive_master.zip is an HTML file.
[image]

Testing the same URL with wget:

wget https://gitee.com/whaozi/kuberay/repository/archive/master.zip
--2025-08-05 14:28:52--  https://gitee.com/whaozi/kuberay/repository/archive/master.zip
Resolving gitee.com (gitee.com)... 180.76.198.77, 180.76.199.13, 180.76.198.225
Connecting to gitee.com (gitee.com)|180.76.198.77|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D [following]
--2025-08-05 14:28:54--  https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D
Reusing existing connection to gitee.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip                                 [                                                  <=>                        ]  10.37M  1.23MB/s    in 13s

I think we are not handling HTTP redirection here. If I use the redirected URL directly, the download works:

from smart_open import open as open_file

with open_file("https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D", "rb") as fin:
    with open_file("/tmp/jjyao_test.zip", "wb") as fout:
        fout.write(fin.read())

Problem

When using runtime_env.working_dir with a remote zip URL (e.g. Gitee archives), Ray's HTTPS downloader sends the default Python-urllib User-Agent, and some hosts respond with an HTML page rather than the archive. The downloaded package is then HTML instead of a zip archive and the Ray job fails, while wget succeeds because it sends its own User-Agent, which these hosts accept.
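
One way to spot this failure mode early is to check whether the downloaded bytes are a zip archive at all before unpacking them. The following is a minimal sketch using only the standard library; `looks_like_zip` is a hypothetical helper for illustration and is not part of Ray.

```python
import io
import zipfile


def looks_like_zip(payload: bytes) -> bool:
    """Return True if the payload is a readable zip archive (hypothetical helper)."""
    return zipfile.is_zipfile(io.BytesIO(payload))


# Build a tiny in-memory zip to stand in for a real archive download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("README.md", "hello")
zip_payload = buffer.getvalue()

# An HTML page, like the one a host may return instead of the archive.
html_payload = b"<!DOCTYPE html><html><body>Sign in</body></html>"

print(looks_like_zip(zip_payload))   # True
print(looks_like_zip(html_payload))  # False
```

A check like this would turn a silent "empty working directory" into an explicit error at download time.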

Solution

_download_https_uri() now sets curl-like headers: a ray-runtime-env-curl/1.0 User-Agent, configurable via RAY_RUNTIME_ENV_HTTP_USER_AGENT, plus Accept: */*. This keeps Ray's behavior consistent with curl/wget, allowing Gitee and similar hosts to return the proper zip file. A regression test verifies the headers are set.
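
As a rough illustration of the header logic described above, the sketch below builds a request with a default User-Agent that can be overridden through the environment variable. The header values and the variable name come from this description; the helper names `http_headers` and `build_request` are hypothetical, and the merged implementation may be structured differently.

```python
import os
import urllib.request

DEFAULT_USER_AGENT = "ray-runtime-env-curl/1.0"
USER_AGENT_ENV_VAR = "RAY_RUNTIME_ENV_HTTP_USER_AGENT"


def http_headers() -> dict:
    """Return curl-like headers; the User-Agent can be overridden via env var."""
    user_agent = os.environ.get(USER_AGENT_ENV_VAR, DEFAULT_USER_AGENT)
    return {"User-Agent": user_agent, "Accept": "*/*"}


def build_request(source_uri: str) -> urllib.request.Request:
    """Attach the headers to a urllib request (hypothetical helper, no network I/O)."""
    return urllib.request.Request(source_uri, headers=http_headers())


request = build_request("https://example.com/archive.zip")
# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
```

Reading the environment variable on every call, rather than once at import time, lets the override take effect without restarting the process.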

Related issues

Fixes #52233

Additional information

Signed-off-by: yaommen <myanstu@163.com>
@myandpr myandpr requested a review from a team as a code owner December 11, 2025 11:53
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses an issue with downloading files over HTTPS by using urllib.request and setting a curl-like User-Agent header. This prevents some servers from returning incorrect content. The changes are well-structured, with a new helper function for downloading and a corresponding unit test. My main feedback is to add a timeout to the network request to prevent potential hangs. I've also suggested a small change to the test to support this.


request = urllib.request.Request(source_uri, headers=cls._http_headers())
try:
    with urllib.request.urlopen(request) as response:
Contributor

medium

It's good practice to include a timeout for network requests to prevent the process from hanging indefinitely, as the default timeout can be very long or infinite. Consider adding a reasonable timeout, for example, 60 seconds. Note that the corresponding test mock for urlopen will also need to be updated to accept a timeout parameter.

Suggested change
- with urllib.request.urlopen(request) as response:
+ with urllib.request.urlopen(request, timeout=60) as response:

def __exit__(self, exc_type, exc, tb):
    self.close()

def fake_urlopen(request):
Contributor

medium

To support the addition of a timeout to the urlopen call in _download_https_uri, this mock function's signature should be updated to accept a timeout argument.

Suggested change
- def fake_urlopen(request):
+ def fake_urlopen(request, timeout=None):
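
To make the two review suggestions concrete, here is a small self-contained sketch of a download function that passes a timeout to urlopen, together with a test double that accepts it. The 60-second value comes from the suggestion above; `download`, `FakeResponse`, and `captured` are hypothetical names used only for illustration, and no network access takes place.

```python
import io
import urllib.request
from unittest import mock

DOWNLOAD_TIMEOUT_S = 60  # value suggested in the review comment above


def download(source_uri: str, headers: dict) -> bytes:
    """Fetch a URI with explicit headers and a timeout (hypothetical helper)."""
    request = urllib.request.Request(source_uri, headers=headers)
    with urllib.request.urlopen(request, timeout=DOWNLOAD_TIMEOUT_S) as response:
        return response.read()


class FakeResponse(io.BytesIO):
    """In-memory stand-in for an HTTP response; BytesIO already supports 'with'."""


captured = {}


def fake_urlopen(request, timeout=None):
    # Record what the code under test passed in, then return canned bytes.
    captured["user_agent"] = request.get_header("User-agent")
    captured["timeout"] = timeout
    return FakeResponse(b"zip-bytes")


# Patch urlopen so the call never leaves the process.
with mock.patch("urllib.request.urlopen", fake_urlopen):
    body = download(
        "https://example.com/archive.zip",
        {"User-Agent": "ray-runtime-env-curl/1.0"},
    )

print(body, captured)
```

Because the fake accepts `timeout` as a keyword argument with a default, it works whether or not the caller passes one, which is why the signature change in the mock is needed once the production call adds `timeout=60`.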

Signed-off-by: yaommen <myanstu@163.com>
@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Dec 11, 2025
@edoakes
Collaborator

edoakes commented Dec 12, 2025

@myandpr can you please add some more detail about the specific problem and how this solves it to the PR description?

And is there any way to configure smart_open properly so it handles this for us?

@edoakes edoakes self-assigned this Dec 12, 2025
Signed-off-by: yaommen <myanstu@163.com>
@myandpr
Member Author

myandpr commented Dec 17, 2025

> @myandpr can you please add some more detail about the specific problem and how this solves it to the PR description?
>
> And is there any way to configure smart_open properly so it handles this for us?

@edoakes Thanks for the suggestion. I have updated the PR description. I've also updated the PR so that HTTPS downloads now go through smart_open whenever it's available, passing our curl-style User-Agent/Accept headers. Please take a look when you have time. Thanks very much.

@myandpr myandpr force-pushed the support-http-download branch from 1e23d4e to e7a9a00 Compare December 17, 2025 12:47
@myandpr myandpr force-pushed the support-http-download branch from b83ed88 to e7a9a00 Compare December 17, 2025 13:07
Signed-off-by: yaommen <myanstu@163.com>
@myandpr myandpr force-pushed the support-http-download branch from 3746858 to 0791897 Compare December 17, 2025 14:29
@myandpr
Member Author

myandpr commented Dec 18, 2025

@edoakes PTAL. Thanks a lot

Comment on lines +226 to +228
# Prefer smart_open so we get consistent redirect/cert handling with the
# rest of our remote protocols. Fall back to urllib if it is not
# available so HTTPS downloads keep working without extra deps.
Collaborator

I believe https downloads already depend on smart_open today, so there's no need for this special handling

Member Author

I've removed the urllib fallback and introduced _handle_https_protocol, so HTTPS now always uses smart_open

Comment on lines +277 to +279
elif protocol == "https":
    cls._download_https_uri(source_uri=source_uri, dest_file=dest_file)
    return
Collaborator

instead of this code divergence with the early return, let's just have it return an open function that wraps the header settings

Member Author

Done; the HTTPS branch now just calls _handle_https_protocol()
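
The reviewer's idea of returning an open function that wraps the header settings could look roughly like the sketch below. It is not the code from this PR: `make_https_open` and `recording_open` are hypothetical names, and the underlying open function is passed in as a parameter so the example runs without smart_open installed.

```python
from typing import Any, Callable, Dict, Optional


def make_https_open(
    base_open: Callable[..., Any], headers: Dict[str, str]
) -> Callable[..., Any]:
    """Return an open function that always sends the given headers (sketch)."""

    def open_with_headers(
        uri: str, mode: str = "rb", *, transport_params: Optional[dict] = None
    ) -> Any:
        merged = dict(transport_params or {})
        # Defaults first, so headers supplied by the caller take precedence.
        merged_headers = dict(headers)
        merged_headers.update(merged.get("headers", {}))
        merged["headers"] = merged_headers
        return base_open(uri, mode, transport_params=merged)

    return open_with_headers


# Stand-in for the real open function; it only records its arguments.
calls = []


def recording_open(uri, mode, transport_params=None):
    calls.append((uri, mode, transport_params))
    return None


open_file = make_https_open(
    recording_open, {"User-Agent": "ray-runtime-env-curl/1.0", "Accept": "*/*"}
)
open_file("https://example.com/archive.zip", "rb")
print(calls[0][2]["headers"])
```

With this shape, the HTTPS branch can hand back an open function like every other protocol handler instead of returning early. In real use `base_open` would presumably be `smart_open.open`; the assumption that its HTTP(S) transport accepts a `headers` entry in `transport_params` should be checked against the smart_open version in use.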

Signed-off-by: yaommen <myanstu@163.com>
@myandpr myandpr requested a review from edoakes December 22, 2025 19:43
@edoakes edoakes added the go (add ONLY when ready to merge, run all tests) label Dec 22, 2025
Collaborator

@edoakes edoakes left a comment

Thanks!

@edoakes edoakes enabled auto-merge (squash) December 22, 2025 22:14
@edoakes
Collaborator

edoakes commented Dec 22, 2025

CI tests triggered. PR will auto-merge if tests pass. If not, please ping me once tests are passing.

Signed-off-by: yaommen <myanstu@163.com>
@github-actions github-actions bot disabled auto-merge December 23, 2025 06:08
@myandpr
Member Author

myandpr commented Dec 23, 2025

> CI tests triggered. PR will auto-merge if tests pass. If not, please ping me once tests are passing.

Hi @edoakes, CI is green now. Thanks again for the review! Could you help merge the PR?

@edoakes edoakes merged commit 2709187 into ray-project:master Dec 23, 2025
6 checks passed
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
Signed-off-by: yaommen <myanstu@163.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
Signed-off-by: yaommen <myanstu@163.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Signed-off-by: yaommen <myanstu@163.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Labels

community-contribution: Contributed by the community
core: Issues that should be addressed in Ray Core
go: add ONLY when ready to merge, run all tests

Development

Successfully merging this pull request may close these issues.

ray download https url from working_dir get html page not zip file

2 participants