[serve][llm] Ray LLM Cloud Filesystem Restructuring: Provider-Specific Implementations#58469
Conversation
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
failing release tests appear to be failing on main as well and are unrelated to this PR

/gemini review
Code Review
This is an excellent pull request that significantly improves the cloud filesystem architecture by moving to a modular, provider-specific design. The refactoring to introduce a BaseCloudFileSystem and the performance optimization for S3 using the AWS CLI are great enhancements. The test suite restructuring is also a valuable improvement for maintainability. My review focuses on improving consistency in the new S3FileSystem, enhancing logging practices, and correcting a few tests that don't accurately reflect the implementation's behavior. Overall, these are high-quality changes, and my feedback aims to further increase their robustness.
Resolved review comments (outdated):
python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py (5 threads)
python/ray/llm/_internal/common/utils/cloud_filesystem/pyarrow_filesystem.py (1 thread)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
…ray into cloud-downloader-refactor
/gemini review
Code Review
This pull request does an excellent job of restructuring the cloud filesystem access into a modular, provider-specific architecture. The introduction of BaseCloudFileSystem and separate implementations for S3, GCS, and Azure is a solid design that will allow for future provider-specific optimizations. The refactoring of the main CloudFileSystem class to act as a router simplifies the logic and maintains a consistent API. The test suite refactoring is also a welcome improvement.
However, I've found a critical issue regarding the S3 implementation and its tests, where the tests do not match the implementation. Additionally, there are some high-severity issues related to error handling in both the new S3 and the legacy PyArrow filesystems that could mask important failures. Please see my detailed comments for suggestions on how to address these points.
Resolved review comments:
python/ray/llm/_internal/common/utils/cloud_filesystem/pyarrow_filesystem.py
python/ray/llm/_internal/common/utils/cloud_filesystem/s3_filesystem.py (outdated)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
/gemini review
What do we think is a good way to handle errors for boto operations? Right now I am just logging them with logger.error and not raising exceptions.
Code Review
This pull request is a significant and well-executed refactoring of the cloud filesystem access layer. The move to a modular, provider-specific architecture with a clear abstract base class is an excellent design choice that will improve performance, maintainability, and future extensibility. The new S3 implementation using boto3 is a great step towards better performance, and the comprehensive test suite restructuring is commendable. My review comments focus on enhancing the robustness and consistency of the new filesystem implementations, particularly around error handling, logging, and concurrency patterns to ensure the new architecture is as reliable as it is performant.
Resolved review comment:
python/ray/llm/_internal/common/utils/cloud_filesystem/pyarrow_filesystem.py
```python
if failed_downloads:
    logger.error(
        f"Failed to download {len(failed_downloads)} files: {failed_downloads[:5]}..."
    )
```
The current implementation logs failed downloads but doesn't raise an exception. This can lead to silent failures where the caller is unaware that the download operation was incomplete. To ensure callers can react to partial or complete failures, and for consistency with other filesystem implementations, it's better to raise an exception after logging the failed downloads.
Suggested change:

```python
if failed_downloads:
    error_message = (
        f"Failed to download {len(failed_downloads)} files: {failed_downloads[:5]}..."
    )
    logger.error(error_message)
    raise IOError(error_message)
```
```python
    return subfolders
except Exception as e:
    logger.info(f"Error listing subfolders in {folder_uri}: {e}")
```
Logging exceptions with INFO level can lead to missed alerts in production environments where the logging level is typically WARNING or higher. To ensure that errors in listing subfolders are visible, it's better to use logger.warning or logger.error. Using logger.warning would also be consistent with the get_file method in this same file.
Suggested change:

```python
logger.warning(f"Error listing subfolders in {folder_uri}: {e}")
```
```python
except Exception as e:
    logger.error(f"Error reading {object_uri}: {e}")
```
While implicitly returning None from the except block is functionally correct, explicitly adding return None would improve code clarity and make it consistent with the error handling pattern in PyArrowFileSystem.get_file.
Suggested change:

```python
except Exception as e:
    logger.error(f"Error reading {object_uri}: {e}")
    return None
```
failing microcheck/release not related |
```python
        return body.decode("utf-8")
    return body
except Exception as e:
    logger.error(f"Error reading {object_uri}: {e}")
```
Bug: S3FileSystem: Improve Error Handling Consistency
The get_file method in S3FileSystem does not explicitly return None when an exception is caught. While Python implicitly returns None, this violates the explicit contract documented in the docstring and is inconsistent with similar methods in other implementations like PyArrowFileSystem. The method should include an explicit return None after the logger.error call to properly handle error cases.
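A minimal sketch of the fix under discussion. The client object, URI formatting, and parameter names below are stand-ins for illustration, not the PR's exact code:

```python
import logging
from typing import Optional, Union

logger = logging.getLogger(__name__)


def get_file(client, bucket: str, key: str, decode_as_utf_8: bool = True) -> Optional[Union[str, bytes]]:
    """Read an object's body, returning None on any error.

    `client` is anything exposing a boto3-style get_object(); a stub
    works for testing without AWS access.
    """
    try:
        body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
        return body.decode("utf-8") if decode_as_utf_8 else body
    except Exception as e:
        logger.error(f"Error reading s3://{bucket}/{key}: {e}")
        return None  # explicit, matching the documented contract
```

The explicit `return None` costs nothing at runtime but makes the error path match the docstring and the PyArrow implementation.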
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
…ray into cloud-downloader-refactor
```python
        return body.decode("utf-8")
    return body
except Exception as e:
    logger.error(f"Error reading {object_uri}: {e}")
```
Bug: Standardize File System Exception Returns
The S3FileSystem.get_file() method's exception handler is missing an explicit return None statement. While Python implicitly returns None, this is inconsistent with PyArrowFileSystem.get_file() which has an explicit return statement. This inconsistency violates the interface contract and creates a code quality issue. The method should explicitly return None in the exception handler to match the documented behavior and maintain consistency with other implementations.
failing tests unrelated |
…c Implementations (ray-project#58469) Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Ray LLM Cloud Filesystem Restructuring: Provider-Specific Implementations
Summary
This PR restructures the Ray LLM cloud filesystem architecture from a monolithic PyArrow-based implementation to a modular, provider-specific design. The refactor introduces:
A new cloud_filesystem/ module with provider-specific implementations (S3FileSystem, GCSFileSystem, AzureFileSystem) that inherit from an abstract base class.

Motivation
Performance Bottleneck
The previous PyArrow-based implementation suffered from significant performance issues when handling large ML model files. It provides a unified framework for all cloud providers, using PyArrow's implementation of each filesystem type as the interface for downloads. Benchmarking on meta-Llama-3.2-1B-Instruct showed that the AWS CLI (which is built on boto) took 7.97 seconds versus PyArrow's 27.78 seconds to download.

Future-Proof Design
Beyond addressing immediate performance issues, disaggregating each cloud provider into separate filesystem classes enables:
Detailed Changes
1. New Modular Architecture
Directory Structure
Created a new cloud_filesystem/ module under ray/llm/_internal/common/utils/:

Abstract Base Class (base.py)

Introduced a BaseCloudFileSystem abstract class defining the interface all providers must implement.

2. S3FileSystem: Boto3-Based Implementation
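The core pattern of this section — per-object boto3 transfers fanned out across a thread pool, with the worker count chosen dynamically — might look roughly like the sketch below. The heuristic, helper names, and failure handling are assumptions for illustration, not the PR's actual code:

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Tuple


def choose_worker_count(num_files: int, hard_cap: int = 32) -> int:
    """Assumed heuristic: S3 transfers are I/O-bound, so oversubscribe the
    CPUs, but never exceed the file count or a hard ceiling."""
    cpu = os.cpu_count() or 1
    return max(1, min(num_files, hard_cap, cpu * 4))


def download_all(
    pairs: List[Tuple[str, str]],
    download_one: Callable[[str, str], None],
) -> List[Tuple[str, str]]:
    """Download (key, local_path) pairs in parallel; return the pairs that failed.

    With a real boto3 client, `download_one` would be something like
    `lambda key, path: s3.download_file(bucket, key, path)`; it is injected
    here so the sketch runs without AWS access.
    """
    failed = []
    with ThreadPoolExecutor(max_workers=choose_worker_count(len(pairs))) as pool:
        futures = {pool.submit(download_one, k, p): (k, p) for k, p in pairs}
        for fut in as_completed(futures):
            if fut.exception() is not None:
                failed.append(futures[fut])
    return failed
```

Returning the failed pairs (rather than swallowing errors) lets the caller decide whether to log, retry, or raise, which ties into the error-handling discussion above.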
The S3FileSystem class implements all operations with boto3 calls for optimal performance, including dynamic worker count selection.

3. PyArrowFileSystem: Legacy Implementation
Extracted all PyArrow-based logic into pyarrow_filesystem.py (431 lines), preserving the original implementation. This class serves as the implementation for GCS and Azure providers during the transition period.
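The transition pattern — thin provider classes that forward to the PyArrow implementation until native ones land — can be illustrated as follows. The class names come from the PR, but the stub method body is invented so the sketch runs standalone:

```python
class PyArrowFileSystem:
    """Stand-in for the legacy implementation; the real class wraps pyarrow.fs."""

    @staticmethod
    def get_file(object_uri: str) -> bytes:
        # Invented body so the sketch is runnable; real code reads via PyArrow.
        return b"<contents of " + object_uri.encode() + b">"


class GCSFileSystem:
    """Forwards to the PyArrow path until a gsutil/SDK-native version lands."""

    @staticmethod
    def get_file(object_uri: str) -> bytes:
        return PyArrowFileSystem.get_file(object_uri)


class AzureFileSystem:
    """Same delegation pattern for abfss:// and azure:// URIs."""

    @staticmethod
    def get_file(object_uri: str) -> bytes:
        return PyArrowFileSystem.get_file(object_uri)
```

Because the delegation is one method call deep, swapping in a native implementation later only changes the provider class, not its callers.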
4. GCS and Azure FileSystem Classes
Both GCSFileSystem and AzureFileSystem currently delegate to PyArrowFileSystem and can be optimized in future PRs.

5. CloudFileSystem
The refactored CloudFileSystem maintains the same public API as before, ensuring zero breaking changes for existing code. However, it now acts as a router that detects the cloud provider from the URI scheme (s3://, gs://, abfss://, azure://) and delegates all operations to the appropriate provider-specific implementation.

6. Removed Parallel Download Method
The download_files_parallel() method introduced in this PR replaces download_files(), using multithreading by default.

Testing
New Test Structure
The test suite was comprehensively refactored from a single 1,105-line file (test_cloud_utils.py) into five focused test modules totaling 1,465 lines with improved organization:

test_cloud_filesystem.py
test_mirror_config.py
test_pyarrow_filesystem.py
test_s3_filesystem.py [new tests]
test_utils.py

Maintained Test Coverage
All original tests for PyArrow-based functionality were preserved and moved to appropriate modules:
Future Work
This refactor sets the foundation for additional optimizations:
GCS: gsutil or google-cloud-storage SDK-based operations
Azure: azcopy or azure-storage-blob SDK-based operations