
[serve][llm] Ray LLM Cloud Filesystem Restructuring: Provider-Specific Implementations#58469

Merged
kouroshHakha merged 13 commits into ray-project:master from hao-aaron:cloud-downloader-refactor
Nov 19, 2025
Conversation

@hao-aaron (Contributor) commented Nov 8, 2025

Ray LLM Cloud Filesystem Restructuring: Provider-Specific Implementations

Summary

This PR restructures the Ray LLM cloud filesystem architecture from a monolithic PyArrow-based implementation to a modular, provider-specific design. The refactor introduces:

  • New modular architecture: A cloud_filesystem/ module with provider-specific implementations (S3FileSystem, GCSFileSystem, AzureFileSystem) that inherit from an abstract base class
  • S3 optimization: Native boto3-based implementation for S3 operations, replacing PyArrow for significantly improved performance
  • Improved test organization: Test suite refactored from a single large file into multiple focused test modules organized by functionality

Motivation

Performance Bottleneck

The previous PyArrow-based implementation suffered from significant performance issues when handling large ML model files. It provided a unified framework for all cloud providers, using PyArrow's filesystem implementations as the download interface for every provider. Benchmarking on meta-Llama-3.2-1B-Instruct showed that the AWS CLI (which is built on boto) downloaded the model in 7.97 seconds versus 27.78 seconds for PyArrow.

Future-Proof Design

Beyond addressing immediate performance issues, disaggregating each cloud provider into separate filesystem classes enables:

  1. Provider-specific optimizations: Each provider can leverage its own native SDKs
  2. Incremental improvements: Optimizations can be introduced per provider without affecting others
  3. Flexible implementation strategies: Different providers can use different transfer mechanisms based on what works best

Detailed Changes

1. New Modular Architecture

Directory Structure

Created a new cloud_filesystem/ module under ray/llm/_internal/common/utils/:

ray/llm/_internal/common/utils/cloud_filesystem/
├── __init__.py                 # Exports public API
├── base.py                     # Abstract base class
├── s3_filesystem.py            # S3-specific boto implementation
├── gcs_filesystem.py           # GCS implementation (delegates to PyArrow)
├── azure_filesystem.py         # Azure implementation (delegates to PyArrow)
└── pyarrow_filesystem.py       # PyArrow-based implementation

Abstract Base Class (base.py)

Introduced BaseCloudFileSystem abstract class defining the interface all providers must implement:

class BaseCloudFileSystem(ABC):
    @staticmethod
    @abstractmethod
    def get_file(object_uri: str, decode_as_utf_8: bool = True) -> Optional[Union[str, bytes]]:
        """Download a file from cloud storage into memory."""
        pass
    
    @staticmethod
    @abstractmethod
    def list_subfolders(folder_uri: str) -> List[str]:
        """List the immediate subfolders in a cloud directory."""
        pass
    
    @staticmethod
    @abstractmethod
    def download_files(
        path: str,
        bucket_uri: str,
        substrings_to_include: Optional[List[str]] = None,
        suffixes_to_exclude: Optional[List[str]] = None,
    ) -> None:
        """Download files from cloud storage to a local directory."""
        pass
    
    @staticmethod
    @abstractmethod
    def upload_files(local_path: str, bucket_uri: str) -> None:
        """Upload files to cloud storage."""
        pass

2. S3FileSystem: Boto3-Based Implementation

The S3FileSystem class implements all operations with boto3 API calls for optimal performance, including dynamic worker-count selection for parallel transfers.

3. PyArrowFileSystem: Legacy Implementation

Extracted all PyArrow-based logic into pyarrow_filesystem.py (431 lines), preserving the original implementation.
This class serves as the implementation for GCS and Azure providers during the transition period.

4. GCS and Azure FileSystem Classes

Both GCSFileSystem and AzureFileSystem currently delegate to PyArrowFileSystem and can be optimized in future PRs:

class GCSFileSystem(BaseCloudFileSystem):
    @staticmethod
    def get_file(object_uri: str, decode_as_utf_8: bool = True):
        return PyArrowFileSystem.get_file(object_uri, decode_as_utf_8)
    
    # ... similar delegation for other methods

5. CloudFileSystem

The refactored CloudFileSystem maintains the same public API as before, ensuring zero breaking changes for existing code. However, it now acts as a router that detects the cloud provider based on the URI scheme (s3://, gs://, abfss://, azure://) and delegates all operations to the appropriate provider-specific implementation.
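A minimal sketch of that scheme-based routing, assuming a simple scheme-to-class mapping (the PR's actual dispatch table and return types may differ):

```python
from urllib.parse import urlparse

# Illustrative scheme-to-provider table; the PR's actual dispatch may differ.
_PROVIDERS = {
    "s3": "S3FileSystem",
    "gs": "GCSFileSystem",
    "abfss": "AzureFileSystem",
    "azure": "AzureFileSystem",
}


def resolve_provider(uri: str) -> str:
    """Return the provider class name responsible for the given cloud URI."""
    scheme = urlparse(uri).scheme
    try:
        return _PROVIDERS[scheme]
    except KeyError:
        raise ValueError(f"Unsupported cloud URI scheme: {scheme!r}") from None
```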

6. Removed Parallel Download Method

The standalone download_files_parallel() method was removed; download_files() now performs multithreaded transfers by default, so a separate parallel variant is no longer needed.


Testing

New Test Structure

The test suite was refactored from a single 1,105-line file (test_cloud_utils.py) into five focused test modules totaling 1,465 lines, organized by functionality:

  • test_cloud_filesystem.py
  • test_mirror_config.py
  • test_pyarrow_filesystem.py
  • test_s3_filesystem.py [new tests]
  • test_utils.py

Maintained Test Coverage

All original tests for PyArrow-based functionality were preserved and moved to appropriate modules:

  • PyArrow filesystem operations
  • Azure/ABFSS URI parsing and validation
  • File filtering logic
  • Cloud mirror configuration validation

Future Work

This refactor sets the foundation for additional optimizations:

  1. GCS optimization: Implement native gsutil or google-cloud-storage SDK-based operations
  2. Azure optimization: Implement native azcopy or azure-storage-blob SDK-based operations
  3. Additional providers: Add support for other cloud storage providers (Oracle Cloud, IBM Cloud, etc.)
  4. Fine-grained performance tuning: Provider-specific connection pooling, retry logic, and timeout configurations
  5. Telemetry and monitoring: Add per-provider metrics for transfer speeds and success rates

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@hao-aaron (Contributor, Author)

failing release tests appear to be failing on main as well and are unrelated to this PR

@kouroshHakha (Contributor)

/gemini review

@gemini-code-assist bot (Contributor) left a comment


Code Review

This is an excellent pull request that significantly improves the cloud filesystem architecture by moving to a modular, provider-specific design. The refactoring to introduce a BaseCloudFileSystem and the performance optimization for S3 using the AWS CLI are great enhancements. The test suite restructuring is also a valuable improvement for maintainability. My review focuses on improving consistency in the new S3FileSystem, enhancing logging practices, and correcting a few tests that don't accurately reflect the implementation's behavior. Overall, these are high-quality changes, and my feedback aims to further increase their robustness.

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@hao-aaron (Contributor, Author)

/gemini review

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request does an excellent job of restructuring the cloud filesystem access into a modular, provider-specific architecture. The introduction of BaseCloudFileSystem and separate implementations for S3, GCS, and Azure is a solid design that will allow for future provider-specific optimizations. The refactoring of the main CloudFileSystem class to act as a router simplifies the logic and maintains a consistent API. The test suite refactoring is also a welcome improvement.

However, I've found a critical issue regarding the S3 implementation and its tests, where the tests do not match the implementation. Additionally, there are some high-severity issues related to error handling in both the new S3 and the legacy PyArrow filesystems that could mask important failures. Please see my detailed comments for suggestions on how to address these points.

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@hao-aaron (Contributor, Author)

/gemini review

@hao-aaron (Contributor, Author)

What do we think is a good way to handle errors for boto operations? Right now I am just logging them with logger.error and not raising exceptions.

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request is a significant and well-executed refactoring of the cloud filesystem access layer. The move to a modular, provider-specific architecture with a clear abstract base class is an excellent design choice that will improve performance, maintainability, and future extensibility. The new S3 implementation using boto3 is a great step towards better performance, and the comprehensive test suite restructuring is commendable. My review comments focus on enhancing the robustness and consistency of the new filesystem implementations, particularly around error handling, logging, and concurrency patterns to ensure the new architecture is as reliable as it is performant.

Comment on lines +325 to +328
if failed_downloads:
logger.error(
f"Failed to download {len(failed_downloads)} files: {failed_downloads[:5]}..."
)

Severity: high

The current implementation logs failed downloads but doesn't raise an exception. This can lead to silent failures where the caller is unaware that the download operation was incomplete. To ensure callers can react to partial or complete failures, and for consistency with other filesystem implementations, it's better to raise an exception after logging the failed downloads.

Suggested change
if failed_downloads:
logger.error(
f"Failed to download {len(failed_downloads)} files: {failed_downloads[:5]}..."
)
if failed_downloads:
error_message = (
f"Failed to download {len(failed_downloads)} files: {failed_downloads[:5]}..."
)
logger.error(error_message)
raise IOError(error_message)


return subfolders
except Exception as e:
logger.info(f"Error listing subfolders in {folder_uri}: {e}")

Severity: medium

Logging exceptions with INFO level can lead to missed alerts in production environments where the logging level is typically WARNING or higher. To ensure that errors in listing subfolders are visible, it's better to use logger.warning or logger.error. Using logger.warning would also be consistent with the get_file method in this same file.

Suggested change
logger.info(f"Error listing subfolders in {folder_uri}: {e}")
logger.warning(f"Error listing subfolders in {folder_uri}: {e}")

Comment on lines +111 to +112
except Exception as e:
logger.error(f"Error reading {object_uri}: {e}")

Severity: medium

While implicitly returning None from the except block is functionally correct, explicitly adding return None would improve code clarity and make it consistent with the error handling pattern in PyArrowFileSystem.get_file.

Suggested change
except Exception as e:
logger.error(f"Error reading {object_uri}: {e}")
except Exception as e:
logger.error(f"Error reading {object_uri}: {e}")
return None

@hao-aaron (Contributor, Author)

failing microcheck/release not related

@kouroshHakha (Contributor) left a comment


LGTM. just waiting for the release tests to pass. lmcache test is jailed anyways. The other one is getting investigated.

@kouroshHakha kouroshHakha marked this pull request as ready for review November 18, 2025 18:06
@kouroshHakha kouroshHakha requested a review from a team as a code owner November 18, 2025 18:06
@kouroshHakha kouroshHakha added the go add ONLY when ready to merge, run all tests label Nov 18, 2025
@kouroshHakha kouroshHakha enabled auto-merge (squash) November 18, 2025 18:06
return body.decode("utf-8")
return body
except Exception as e:
logger.error(f"Error reading {object_uri}: {e}")

Bug: S3FileSystem: Improve Error Handling Consistency

The get_file method in S3FileSystem does not explicitly return None when an exception is caught. While Python implicitly returns None, this violates the explicit contract documented in the docstring and is inconsistent with similar methods in other implementations like PyArrowFileSystem. The method should include an explicit return None after the logger.error call to properly handle error cases.


@ray-gardener ray-gardener bot added serve Ray Serve Related Issue llm labels Nov 18, 2025
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@github-actions github-actions bot disabled auto-merge November 18, 2025 23:53
return body.decode("utf-8")
return body
except Exception as e:
logger.error(f"Error reading {object_uri}: {e}")

Bug: Standardize File System Exception Returns

The S3FileSystem.get_file() method's exception handler is missing an explicit return None statement. While Python implicitly returns None, this is inconsistent with PyArrowFileSystem.get_file() which has an explicit return statement. This inconsistency violates the interface contract and creates a code quality issue. The method should explicitly return None in the exception handler to match the documented behavior and maintain consistency with other implementations.


@hao-aaron (Contributor, Author)

failing tests unrelated

@kouroshHakha kouroshHakha enabled auto-merge (squash) November 19, 2025 18:28
@kouroshHakha kouroshHakha merged commit e42d621 into ray-project:master Nov 19, 2025
6 of 7 checks passed
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…c Implementations (ray-project#58469)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
400Ping pushed a commit to 400Ping/ray that referenced this pull request Nov 21, 2025
…c Implementations (ray-project#58469)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
…c Implementations (ray-project#58469)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…c Implementations (ray-project#58469)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…c Implementations (ray-project#58469)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>