Support http redirection download #59384
Signed-off-by: yaommen <myanstu@163.com>
Code Review
This pull request correctly addresses an issue with downloading files over HTTPS by using urllib.request and setting a curl-like User-Agent header. This prevents some servers from returning incorrect content. The changes are well-structured, with a new helper function for downloading and a corresponding unit test. My main feedback is to add a timeout to the network request to prevent potential hangs. I've also suggested a small change to the test to support this.
request = urllib.request.Request(source_uri, headers=cls._http_headers())
try:
    with urllib.request.urlopen(request) as response:
It's good practice to include a timeout for network requests to prevent the process from hanging indefinitely, as the default timeout can be very long or infinite. Consider adding a reasonable timeout, for example, 60 seconds. Note that the corresponding test mock for urlopen will also need to be updated to accept a timeout parameter.
- with urllib.request.urlopen(request) as response:
+ with urllib.request.urlopen(request, timeout=60) as response:
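To make the timeout suggestion concrete, here is a minimal sketch of a download helper that always passes an explicit timeout. The name `download_with_timeout` and the injectable `opener` parameter are assumptions made for illustration; they are not taken from the PR.

```python
import shutil
import urllib.request


def download_with_timeout(url, dest_path, headers=None, timeout=60,
                          opener=urllib.request.urlopen):
    """Stream `url` into `dest_path`, passing an explicit socket timeout.

    Hypothetical helper: `opener` is injectable so tests can avoid the network.
    """
    request = urllib.request.Request(url, headers=headers or {})
    # The timeout bounds each blocking socket operation (connect, read),
    # not the total duration of the download.
    with opener(request, timeout=timeout) as response:
        with open(dest_path, "wb") as fout:
            shutil.copyfileobj(response, fout)
    return dest_path
```

Because the timeout applies per blocking operation, a slow but steadily progressing download is not interrupted; only a stalled connection raises an error.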
    def __exit__(self, exc_type, exc, tb):
        self.close()

def fake_urlopen(request):
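As the review notes, once the production code passes `timeout=`, the test double has to accept it too. The following is a rough sketch of such a test double, not the PR's actual test; `fetch`, `FakeResponse`, and `run_fetch_with_fake` are hypothetical names.

```python
import io
import urllib.request
from unittest import mock


class FakeResponse(io.BytesIO):
    """In-memory stand-in for a urlopen response (hypothetical helper).

    io.BytesIO already supports read() and the context-manager protocol.
    """


def fetch(url, timeout=60):
    # Stand-in for the code under test: it passes an explicit timeout.
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-agent/1.0"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def run_fetch_with_fake():
    captured = {}

    def fake_urlopen(request, timeout=None):
        # Accepting `timeout` keeps the fake compatible with the real call.
        captured["user_agent"] = request.get_header("User-agent")
        captured["timeout"] = timeout
        return FakeResponse(b"zip-bytes")

    with mock.patch("urllib.request.urlopen", fake_urlopen):
        body = fetch("https://example.com/archive.zip")
    return body, captured
```

Recording the arguments in `captured` lets the test assert on both the headers and the timeout without touching the network.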
@myandpr can you please add some more detail about the specific problem and how this solves it to the PR description? And is there any way to configure
@edoakes PTAL. Thanks a lot.
# Prefer smart_open so we get consistent redirect/cert handling with the
# rest of our remote protocols. Fall back to urllib if it is not
# available so HTTPS downloads keep working without extra deps.
I believe https downloads already depend on smart_open today, so there's no need for this special handling
I've removed the urllib fallback and introduced _handle_https_protocol, so HTTPS now always uses smart_open
elif protocol == "https":
    cls._download_https_uri(source_uri=source_uri, dest_file=dest_file)
    return
Instead of this code divergence with the early return, let's just have it return an open function that wraps the header settings.
Done; the HTTPS branch now just calls _handle_https_protocol()
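One way to read the suggestion above is sketched below: the HTTPS branch selects a pre-configured open callable instead of performing the download itself and returning early. This is an illustration only. `make_https_open` and `select_open` are hypothetical names, and the sketch assumes that the base open function accepts a `transport_params` keyword carrying a `headers` dict, which is my understanding of how `smart_open.open` passes options to its HTTP transport.

```python
import functools


def make_https_open(base_open, headers):
    """Return an open() callable that always sends `headers`.

    Assumption: `base_open` accepts `transport_params={"headers": ...}`,
    as smart_open.open is understood to do for HTTP(S) URLs.
    """
    return functools.partial(
        base_open, transport_params={"headers": dict(headers)}
    )


def select_open(protocol, base_open, headers):
    # Every protocol flows through the same download path afterwards;
    # only the callable used to open the source differs.
    if protocol == "https":
        return make_https_open(base_open, headers)
    return base_open
```

With this shape, the caller would do something like `open_file = select_open(protocol, smart_open_open, headers)` and then run the shared copy loop, so the HTTPS case needs no separate code path.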
CI tests triggered. PR will auto-merge if tests pass. If not, please ping me once tests are passing.
## Description

When using `runtime_env.working_dir` with a remote zip archive URL (for example, `https://gitee.com/whaozi/kuberay/repository/archive/master.zip`), Ray downloads an HTML page instead of the actual zip file. This causes the Ray job to fail when accessing files from the working directory.

Downloading the same URL with standard tools such as `wget` works as expected and returns the correct zip archive. This PR addresses the inconsistency in how `runtime_env.working_dir` handles remote archive downloads.

#### For example

```
import ray

ray.init(include_dashboard=False, ignore_reinit_error=True)

@ray.remote(
    runtime_env={"working_dir": "https://gitee.com/whaozi/kuberay/repository/archive/master.zip"}
)
def list_repo_files():
    import pathlib
    return sorted(p.name for p in pathlib.Path(".").iterdir())

print(ray.get(list_repo_files.remote()))
ray.shutdown()
```

`https_gitee_com_whaozi_kuberay_repository_archive_master` is empty, and `https_gitee_com_whaozi_kuberay_repository_archive_master.zip` is an HTML file.

<img width="1438" height="550" alt="image" src="https://github.com/user-attachments/assets/ec330c99-3bf7-431a-8f3e-6c1789e257ab" />

#### We test

```
wget https://gitee.com/whaozi/kuberay/repository/archive/master.zip
--2025-08-05 14:28:52--  https://gitee.com/whaozi/kuberay/repository/archive/master.zip
Resolving gitee.com (gitee.com)... 180.76.198.77, 180.76.199.13, 180.76.198.225
Connecting to gitee.com (gitee.com)|180.76.198.77|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D [following]
--2025-08-05 14:28:54--  https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D
Reusing existing connection to gitee.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip [ <=> ] 10.37M 1.23MB/s in 13s
```

I think we are not handling HTTP redirection here. If I directly use the redirected URL, it works:

```
from smart_open import open as open_file

with open_file("https://gitee.com/whaozi/kuberay/repository/blazearchive/master.zip?Expires=1754430533&Signature=8EMfEVLuEJRLPHsJPQnqkwoSfWYTon6sdYUD7VrHZcM%3D", "rb") as fin:
    with open_file("/tmp/jjyao_test.zip", "wb") as fout:
        fout.write(fin.read())
```

#### Problem

When using `runtime_env.working_dir` with a remote zip URL (e.g. gitee archives), Ray’s HTTPS downloader uses the default Python-urllib user-agent, and some hosts respond with HTML rather than the archive. The working directory then contains HTML and the Ray job fails, while wget succeeds because it presents a curl-like user-agent.

#### Solution

`_download_https_uri()` now sets curl-like headers (`ray-runtime-env-curl/1.0` UA + `Accept: */*`, configurable via `RAY_RUNTIME_ENV_HTTP_USER_AGENT`). This keeps Ray’s behavior consistent with curl/wget, allowing gitee and similar hosts to return the proper zip file. A regression test verifies the headers are set.

## Related issues

Fixes ray-project#52233

## Additional information

---------

Signed-off-by: yaommen <myanstu@163.com>
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
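The header behavior described in the Solution section could look roughly like the sketch below. The default user-agent string and the environment variable name come from the description above; the function name `http_headers` and the injectable `environ` argument are assumptions for illustration and may differ from the merged code.

```python
import os

# Values quoted in the PR description.
DEFAULT_USER_AGENT = "ray-runtime-env-curl/1.0"
USER_AGENT_ENV_VAR = "RAY_RUNTIME_ENV_HTTP_USER_AGENT"


def http_headers(environ=None):
    """Build curl-like request headers (hypothetical helper).

    `environ` defaults to os.environ; passing a dict makes testing easy.
    """
    environ = os.environ if environ is None else environ
    # An unset or empty variable falls back to the default user agent.
    user_agent = environ.get(USER_AGENT_ENV_VAR) or DEFAULT_USER_AGENT
    return {"User-Agent": user_agent, "Accept": "*/*"}
```

A regression test in the spirit of the one mentioned above would call the helper with and without the environment variable set and compare the resulting dictionaries.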