
[serve][llm] Ray LLM Cloud Filesystem Restructuring: Provider-Specific Implementations#58469

Merged
kouroshHakha merged 13 commits into ray-project:master from hao-aaron:cloud-downloader-refactor
Nov 19, 2025
Conversation

@hao-aaron (Contributor) commented Nov 8, 2025

Ray LLM Cloud Filesystem Restructuring: Provider-Specific Implementations

Summary

This PR restructures the Ray LLM cloud filesystem architecture from a monolithic PyArrow-based implementation to a modular, provider-specific design. The refactor introduces:

  • New modular architecture: A cloud_filesystem/ module with provider-specific implementations (S3FileSystem, GCSFileSystem, AzureFileSystem) that inherit from an abstract base class
  • S3 optimization: Native boto3-based implementation for S3 operations, replacing PyArrow for significantly improved performance
  • Improved test organization: Test suite refactored from a single large file into multiple focused test modules organized by functionality

Motivation

Performance Bottleneck

The previous PyArrow-based implementation suffered from significant performance issues when handling large ML model files. It provided a unified framework for all cloud providers, using PyArrow's filesystem implementations as the download interface for every provider. Benchmarking on meta-Llama-3.2-1B-Instruct showed that the AWS CLI (which is built on boto) downloaded the model in 7.97 seconds versus 27.78 seconds for PyArrow.

Future-Proof Design

Beyond addressing immediate performance issues, disaggregating each cloud provider into separate filesystem classes enables:

  1. Provider-specific optimizations: Each provider can leverage its own native SDKs
  2. Incremental improvements: Optimizations can be introduced per provider without affecting others
  3. Flexible implementation strategies: Different providers can use different transfer mechanisms based on what works best

Detailed Changes

1. New Modular Architecture

Directory Structure

Created a new cloud_filesystem/ module under ray/llm/_internal/common/utils/:

ray/llm/_internal/common/utils/cloud_filesystem/
├── __init__.py                 # Exports public API
├── base.py                     # Abstract base class
├── s3_filesystem.py            # S3-specific boto implementation
├── gcs_filesystem.py           # GCS implementation (delegates to PyArrow)
├── azure_filesystem.py         # Azure implementation (delegates to PyArrow)
└── pyarrow_filesystem.py       # PyArrow-based implementation

Abstract Base Class (base.py)

Introduced BaseCloudFileSystem abstract class defining the interface all providers must implement:

class BaseCloudFileSystem(ABC):
    @staticmethod
    @abstractmethod
    def get_file(object_uri: str, decode_as_utf_8: bool = True) -> Optional[Union[str, bytes]]:
        """Download a file from cloud storage into memory."""
        pass
    
    @staticmethod
    @abstractmethod
    def list_subfolders(folder_uri: str) -> List[str]:
        """List the immediate subfolders in a cloud directory."""
        pass
    
    @staticmethod
    @abstractmethod
    def download_files(
        path: str,
        bucket_uri: str,
        substrings_to_include: Optional[List[str]] = None,
        suffixes_to_exclude: Optional[List[str]] = None,
    ) -> None:
        """Download files from cloud storage to a local directory."""
        pass
    
    @staticmethod
    @abstractmethod
    def upload_files(local_path: str, bucket_uri: str) -> None:
        """Upload files to cloud storage."""
        pass

2. S3FileSystem: Boto3-Based Implementation

The S3FileSystem class implements all operations with boto3 API calls for optimal performance, including dynamic worker-count selection for parallel transfers.

3. PyArrowFileSystem: Legacy Implementation

Extracted all PyArrow-based logic into pyarrow_filesystem.py (431 lines), preserving the original implementation.
This class serves as the implementation for GCS and Azure providers during the transition period.

4. GCS and Azure FileSystem Classes

Both GCSFileSystem and AzureFileSystem currently delegate to PyArrowFileSystem and can be optimized in future PRs:

class GCSFileSystem(BaseCloudFileSystem):
    @staticmethod
    def get_file(object_uri: str, decode_as_utf_8: bool = True):
        return PyArrowFileSystem.get_file(object_uri, decode_as_utf_8)
    
    # ... similar delegation for other methods

5. CloudFileSystem

The refactored CloudFileSystem maintains the same public API as before, ensuring zero breaking changes for existing code. However, it now acts as a router that detects the cloud provider based on the URI scheme (s3://, gs://, abfss://, azure://) and delegates all operations to the appropriate provider-specific implementation.
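A minimal sketch of that scheme-based routing, assuming a simple scheme-to-class mapping (the PR's actual dispatch table and return types may differ):

```python
from urllib.parse import urlparse

# Illustrative scheme-to-provider table; the PR's actual dispatch may differ.
_PROVIDERS = {
    "s3": "S3FileSystem",
    "gs": "GCSFileSystem",
    "abfss": "AzureFileSystem",
    "azure": "AzureFileSystem",
}


def resolve_provider(uri: str) -> str:
    """Return the provider class name responsible for the given cloud URI."""
    scheme = urlparse(uri).scheme
    try:
        return _PROVIDERS[scheme]
    except KeyError:
        raise ValueError(f"Unsupported cloud URI scheme: {scheme!r}") from None
```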

6. Removed Parallel Download Method

The standalone download_files_parallel() method was removed; download_files() now performs multithreaded transfers by default, so a separate parallel variant is no longer needed.


Testing

New Test Structure

The test suite was refactored from a single 1,105-line file (test_cloud_utils.py) into five focused test modules totaling 1,465 lines, organized by functionality:

  • test_cloud_filesystem.py
  • test_mirror_config.py
  • test_pyarrow_filesystem.py
  • test_s3_filesystem.py [new tests]
  • test_utils.py

Maintained Test Coverage

All original tests for PyArrow-based functionality were preserved and moved to appropriate modules:

  • PyArrow filesystem operations
  • Azure/ABFSS URI parsing and validation
  • File filtering logic
  • Cloud mirror configuration validation

Future Work

This refactor sets the foundation for additional optimizations:

  1. GCS optimization: Implement native gsutil or google-cloud-storage SDK-based operations
  2. Azure optimization: Implement native azcopy or azure-storage-blob SDK-based operations
  3. Additional providers: Add support for other cloud storage providers (Oracle Cloud, IBM Cloud, etc.)
  4. Fine-grained performance tuning: Provider-specific connection pooling, retry logic, and timeout configurations
  5. Telemetry and monitoring: Add per-provider metrics for transfer speeds and success rates

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@hao-aaron (Contributor, Author)

failing release tests appear to be failing on main as well and are unrelated to this PR

@kouroshHakha (Contributor)

/gemini review

@gemini-code-assist bot (Contributor) left a comment


Code Review

This is an excellent pull request that significantly improves the cloud filesystem architecture by moving to a modular, provider-specific design. The refactoring to introduce a BaseCloudFileSystem and the performance optimization for S3 using the AWS CLI are great enhancements. The test suite restructuring is also a valuable improvement for maintainability. My review focuses on improving consistency in the new S3FileSystem, enhancing logging practices, and correcting a few tests that don't accurately reflect the implementation's behavior. Overall, these are high-quality changes, and my feedback aims to further increase their robustness.

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@hao-aaron (Contributor, Author)

/gemini review

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request does an excellent job of restructuring the cloud filesystem access into a modular, provider-specific architecture. The introduction of BaseCloudFileSystem and separate implementations for S3, GCS, and Azure is a solid design that will allow for future provider-specific optimizations. The refactoring of the main CloudFileSystem class to act as a router simplifies the logic and maintains a consistent API. The test suite refactoring is also a welcome improvement.

However, I've found a critical issue regarding the S3 implementation and its tests, where the tests do not match the implementation. Additionally, there are some high-severity issues related to error handling in both the new S3 and the legacy PyArrow filesystems that could mask important failures. Please see my detailed comments for suggestions on how to address these points.

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@hao-aaron (Contributor, Author)

/gemini review

@hao-aaron (Contributor, Author)

What do we think is a good way to handle errors for boto operations? Right now I am just logging them with logger.error and not raising exceptions.

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request is a significant and well-executed refactoring of the cloud filesystem access layer. The move to a modular, provider-specific architecture with a clear abstract base class is an excellent design choice that will improve performance, maintainability, and future extensibility. The new S3 implementation using boto3 is a great step towards better performance, and the comprehensive test suite restructuring is commendable. My review comments focus on enhancing the robustness and consistency of the new filesystem implementations, particularly around error handling, logging, and concurrency patterns to ensure the new architecture is as reliable as it is performant.

Comment on lines +325 to +328
if failed_downloads:
logger.error(
f"Failed to download {len(failed_downloads)} files: {failed_downloads[:5]}..."
)

Severity: high

The current implementation logs failed downloads but doesn't raise an exception. This can lead to silent failures where the caller is unaware that the download operation was incomplete. To ensure callers can react to partial or complete failures, and for consistency with other filesystem implementations, it's better to raise an exception after logging the failed downloads.

Suggested change
if failed_downloads:
logger.error(
f"Failed to download {len(failed_downloads)} files: {failed_downloads[:5]}..."
)
if failed_downloads:
error_message = (
f"Failed to download {len(failed_downloads)} files: {failed_downloads[:5]}..."
)
logger.error(error_message)
raise IOError(error_message)


return subfolders
except Exception as e:
logger.info(f"Error listing subfolders in {folder_uri}: {e}")

Severity: medium

Logging exceptions with INFO level can lead to missed alerts in production environments where the logging level is typically WARNING or higher. To ensure that errors in listing subfolders are visible, it's better to use logger.warning or logger.error. Using logger.warning would also be consistent with the get_file method in this same file.

Suggested change
logger.info(f"Error listing subfolders in {folder_uri}: {e}")
logger.warning(f"Error listing subfolders in {folder_uri}: {e}")

Comment on lines +111 to +112
except Exception as e:
logger.error(f"Error reading {object_uri}: {e}")

Severity: medium

While implicitly returning None from the except block is functionally correct, explicitly adding return None would improve code clarity and make it consistent with the error handling pattern in PyArrowFileSystem.get_file.

Suggested change
except Exception as e:
logger.error(f"Error reading {object_uri}: {e}")
except Exception as e:
logger.error(f"Error reading {object_uri}: {e}")
return None

@hao-aaron (Contributor, Author)

failing microcheck/release not related

@kouroshHakha (Contributor) left a comment


LGTM. just waiting for the release tests to pass. lmcache test is jailed anyways. The other one is getting investigated.

@kouroshHakha kouroshHakha marked this pull request as ready for review November 18, 2025 18:06
@kouroshHakha kouroshHakha requested a review from a team as a code owner November 18, 2025 18:06
@kouroshHakha kouroshHakha added the go add ONLY when ready to merge, run all tests label Nov 18, 2025
@kouroshHakha kouroshHakha enabled auto-merge (squash) November 18, 2025 18:06
return body.decode("utf-8")
return body
except Exception as e:
logger.error(f"Error reading {object_uri}: {e}")

Bug: S3FileSystem: Improve Error Handling Consistency

The get_file method in S3FileSystem does not explicitly return None when an exception is caught. While Python implicitly returns None, this violates the explicit contract documented in the docstring and is inconsistent with similar methods in other implementations like PyArrowFileSystem. The method should include an explicit return None after the logger.error call to properly handle error cases.


@ray-gardener ray-gardener bot added serve Ray Serve Related Issue llm labels Nov 18, 2025
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@github-actions github-actions bot disabled auto-merge November 18, 2025 23:53
return body.decode("utf-8")
return body
except Exception as e:
logger.error(f"Error reading {object_uri}: {e}")

Bug: Standardize File System Exception Returns

The S3FileSystem.get_file() method's exception handler is missing an explicit return None statement. While Python implicitly returns None, this is inconsistent with PyArrowFileSystem.get_file() which has an explicit return statement. This inconsistency violates the interface contract and creates a code quality issue. The method should explicitly return None in the exception handler to match the documented behavior and maintain consistency with other implementations.


@hao-aaron (Contributor, Author)

failing tests unrelated

@kouroshHakha kouroshHakha enabled auto-merge (squash) November 19, 2025 18:28
@kouroshHakha kouroshHakha merged commit e42d621 into ray-project:master Nov 19, 2025
6 of 7 checks passed
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…c Implementations (ray-project#58469)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
400Ping pushed a commit to 400Ping/ray that referenced this pull request Nov 21, 2025
…c Implementations (ray-project#58469)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
…c Implementations (ray-project#58469)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…c Implementations (ray-project#58469)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…c Implementations (ray-project#58469)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>