[data] Fix high memory usage with FileBasedDatasource & ParquetDatasource when using a large number of files#55978
Conversation
…source to reduce memory usage Signed-off-by: Jack Gammack <jgammack@etsy.com>
Signed-off-by: Jack Gammack <jgammack@etsy.com>
|
Thank you for the contribution. And good catch on high mem for large number of blocks. On closer look, |
Yeah it is not used anywhere and I don't see any reason to keep it. The only argument for keeping it would be to ensure that we don't break any assumptions for downstream consumers (e.g. users who extend |
|
Yes, I'm happy to update this to just remove it. The unresolved paths aren't used anywhere, and aren't accessible by users using the |
Signed-off-by: Jack Gammack <jgammack@etsy.com>
Signed-off-by: Jack Gammack <jgammack@etsy.com>
|
@gvspraveen thanks for pinging me. I'm using this for lineage tracking purpose which was not yet open-sourced. The relation is not that clear - I'm looking at the object through Maybe those unresolved_paths could be put into GCS, similarly to resolved paths? |
|
I think keeping it in the object store is okay, the size of the many file paths isn't so bad if it's just stored once. I only run into issues when I increase the number of blocks over 2,000 or so |
Thanks! I'd really love to avoid losing that metadata. |
Signed-off-by: Jack Gammack <jgammack@etsy.com>
Signed-off-by: Jack Gammack <jgammack@etsy.com>
omatthew98
left a comment
There was a problem hiding this comment.
Thanks @JDarDagran for the input, thanks @JackGammack for the change and iteration. LGTM.
|
A couple things to note after updating:
|
alexeykudinkin
left a comment
There was a problem hiding this comment.
Thank you for your contribution @JackGammack!
| ray.get_runtime_context().get_node_id(), soft=False | ||
| ) | ||
|
|
||
| self._unresolved_paths = paths |
There was a problem hiding this comment.
Let's actually just delete this field
There was a problem hiding this comment.
Are you saying to remove self._unresolved_paths_ref for ParquetDatasource, but keep it for FileBasedDatasource? Or is this an older comment
There was a problem hiding this comment.
I see no good reason for us not to nuke it in both places
There was a problem hiding this comment.
I do agree if it's not used anywhere currently especially with the big performance issue, but I'm not aware of the other ongoing work. It should be very easy to add back.
I have updated to just remove self._unresolved_paths now
Signed-off-by: Jack Gammack <jgammack@etsy.com>
Signed-off-by: Jack Gammack <jgammack@etsy.com>
…urce when using a large number of files (ray-project#55978) ## Why are these changes needed? Using `FileBasedDatasource` or `ParquetDatasource` with a very large number of files causes OOM when creating read tasks. The full list of file paths is stored in `self`, causing it to persist to every read task, leading to this warning: ``` The serialized size of your read function named 'read_task_fn' is 49.8MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` When using a small number of blocks, OOM does not occur because the large file list is not repeated so many times. But when setting high parallelism with `override_num_blocks`, OOM occurs. This is because the full list of paths is added to `self._unresolved_paths`. This attribute isn't currently used anywhere in ray. This PR removes `self._unresolved_paths` to alleviate unexpected high memory usage with very large numbers of files. ## Related issue number Similar to this issue with Iceberg: ray-project#49054 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Jack Gammack <jgammack@etsy.com> Signed-off-by: sampan <sampan@anyscale.com>
…urce when using a large number of files (ray-project#55978) ## Why are these changes needed? Using `FileBasedDatasource` or `ParquetDatasource` with a very large number of files causes OOM when creating read tasks. The full list of file paths is stored in `self`, causing it to persist to every read task, leading to this warning: ``` The serialized size of your read function named 'read_task_fn' is 49.8MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` When using a small number of blocks, OOM does not occur because the large file list is not repeated so many times. But when setting high parallelism with `override_num_blocks`, OOM occurs. This is because the full list of paths is added to `self._unresolved_paths`. This attribute isn't currently used anywhere in ray. This PR removes `self._unresolved_paths` to alleviate unexpected high memory usage with very large numbers of files. ## Related issue number Similar to this issue with Iceberg: ray-project#49054 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Jack Gammack <jgammack@etsy.com> Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
…urce when using a large number of files (ray-project#55978) ## Why are these changes needed? Using `FileBasedDatasource` or `ParquetDatasource` with a very large number of files causes OOM when creating read tasks. The full list of file paths is stored in `self`, causing it to persist to every read task, leading to this warning: ``` The serialized size of your read function named 'read_task_fn' is 49.8MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` When using a small number of blocks, OOM does not occur because the large file list is not repeated so many times. But when setting high parallelism with `override_num_blocks`, OOM occurs. This is because the full list of paths is added to `self._unresolved_paths`. This attribute isn't currently used anywhere in ray. This PR removes `self._unresolved_paths` to alleviate unexpected high memory usage with very large numbers of files. ## Related issue number Similar to this issue with Iceberg: ray-project#49054 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Jack Gammack <jgammack@etsy.com> Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…urce when using a large number of files (ray-project#55978) ## Why are these changes needed? Using `FileBasedDatasource` or `ParquetDatasource` with a very large number of files causes OOM when creating read tasks. The full list of file paths is stored in `self`, causing it to persist to every read task, leading to this warning: ``` The serialized size of your read function named 'read_task_fn' is 49.8MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` When using a small number of blocks, OOM does not occur because the large file list is not repeated so many times. But when setting high parallelism with `override_num_blocks`, OOM occurs. This is because the full list of paths is added to `self._unresolved_paths`. This attribute isn't currently used anywhere in ray. This PR removes `self._unresolved_paths` to alleviate unexpected high memory usage with very large numbers of files. ## Related issue number Similar to this issue with Iceberg: ray-project#49054 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Jack Gammack <jgammack@etsy.com>
…urce when using a large number of files (#55978) ## Why are these changes needed? Using `FileBasedDatasource` or `ParquetDatasource` with a very large number of files causes OOM when creating read tasks. The full list of file paths is stored in `self`, causing it to persist to every read task, leading to this warning: ``` The serialized size of your read function named 'read_task_fn' is 49.8MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` When using a small number of blocks, OOM does not occur because the large file list is not repeated so many times. But when setting high parallelism with `override_num_blocks`, OOM occurs. This is because the full list of paths is added to `self._unresolved_paths`. This attribute isn't currently used anywhere in ray. This PR removes `self._unresolved_paths` to alleviate unexpected high memory usage with very large numbers of files. ## Related issue number Similar to this issue with Iceberg: #49054 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Jack Gammack <jgammack@etsy.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
…urce when using a large number of files (ray-project#55978) ## Why are these changes needed? Using `FileBasedDatasource` or `ParquetDatasource` with a very large number of files causes OOM when creating read tasks. The full list of file paths is stored in `self`, causing it to persist to every read task, leading to this warning: ``` The serialized size of your read function named 'read_task_fn' is 49.8MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` When using a small number of blocks, OOM does not occur because the large file list is not repeated so many times. But when setting high parallelism with `override_num_blocks`, OOM occurs. This is because the full list of paths is added to `self._unresolved_paths`. This attribute isn't currently used anywhere in ray. This PR removes `self._unresolved_paths` to alleviate unexpected high memory usage with very large numbers of files. ## Related issue number Similar to this issue with Iceberg: ray-project#49054 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Jack Gammack <jgammack@etsy.com>
…t excessive spilling during read task serialization (#59999) ## Description This PR reintroducing the issue that A previous PR trying to solve: #55978 Specifically, this add a full list of paths to `self` and for Datasource self is captured every read_task_fn during serialization and causing this data being duplicated and cause excessive spilling. We face the similar warning and have spilling behavior: ``` The serialized size of your read function named 'read_task_fn' is 9.7MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` Revert can solve the issue but this attribute seems to have other dependancy. This PR move the _source_paths to object store during init time and only keep the object ref in self. and introduce a property function to materialize raw path data when needed ## Related issues #55978 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Raymond Lee <lee1258561@gmail.com> Signed-off-by: lee1258561 <lee1258561@gmail.com>
…t excessive spilling during read task serialization (ray-project#59999) ## Description This PR reintroducing the issue that A previous PR trying to solve: ray-project#55978 Specifically, this add a full list of paths to `self` and for Datasource self is captured every read_task_fn during serialization and causing this data being duplicated and cause excessive spilling. We face the similar warning and have spilling behavior: ``` The serialized size of your read function named 'read_task_fn' is 9.7MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` Revert can solve the issue but this attribute seems to have other dependancy. This PR move the _source_paths to object store during init time and only keep the object ref in self. and introduce a property function to materialize raw path data when needed ## Related issues ray-project#55978 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Raymond Lee <lee1258561@gmail.com> Signed-off-by: lee1258561 <lee1258561@gmail.com> Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
…t excessive spilling during read task serialization (ray-project#59999) ## Description This PR reintroducing the issue that A previous PR trying to solve: ray-project#55978 Specifically, this add a full list of paths to `self` and for Datasource self is captured every read_task_fn during serialization and causing this data being duplicated and cause excessive spilling. We face the similar warning and have spilling behavior: ``` The serialized size of your read function named 'read_task_fn' is 9.7MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` Revert can solve the issue but this attribute seems to have other dependancy. This PR move the _source_paths to object store during init time and only keep the object ref in self. and introduce a property function to materialize raw path data when needed ## Related issues ray-project#55978 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Raymond Lee <lee1258561@gmail.com> Signed-off-by: lee1258561 <lee1258561@gmail.com>
…t excessive spilling during read task serialization (#59999) ## Description This PR reintroducing the issue that A previous PR trying to solve: #55978 Specifically, this add a full list of paths to `self` and for Datasource self is captured every read_task_fn during serialization and causing this data being duplicated and cause excessive spilling. We face the similar warning and have spilling behavior: ``` The serialized size of your read function named 'read_task_fn' is 9.7MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` Revert can solve the issue but this attribute seems to have other dependancy. This PR move the _source_paths to object store during init time and only keep the object ref in self. and introduce a property function to materialize raw path data when needed ## Related issues #55978 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Raymond Lee <lee1258561@gmail.com> Signed-off-by: lee1258561 <lee1258561@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…t excessive spilling during read task serialization (#59999) ## Description This PR reintroducing the issue that A previous PR trying to solve: #55978 Specifically, this add a full list of paths to `self` and for Datasource self is captured every read_task_fn during serialization and causing this data being duplicated and cause excessive spilling. We face the similar warning and have spilling behavior: ``` The serialized size of your read function named 'read_task_fn' is 9.7MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` Revert can solve the issue but this attribute seems to have other dependancy. This PR move the _source_paths to object store during init time and only keep the object ref in self. and introduce a property function to materialize raw path data when needed ## Related issues #55978 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Raymond Lee <lee1258561@gmail.com> Signed-off-by: lee1258561 <lee1258561@gmail.com>
…t excessive spilling during read task serialization (ray-project#59999) ## Description This PR reintroducing the issue that A previous PR trying to solve: ray-project#55978 Specifically, this add a full list of paths to `self` and for Datasource self is captured every read_task_fn during serialization and causing this data being duplicated and cause excessive spilling. We face the similar warning and have spilling behavior: ``` The serialized size of your read function named 'read_task_fn' is 9.7MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` Revert can solve the issue but this attribute seems to have other dependancy. This PR move the _source_paths to object store during init time and only keep the object ref in self. and introduce a property function to materialize raw path data when needed ## Related issues ray-project#55978 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Raymond Lee <lee1258561@gmail.com> Signed-off-by: lee1258561 <lee1258561@gmail.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>
…t excessive spilling during read task serialization (ray-project#59999) ## Description This PR reintroducing the issue that A previous PR trying to solve: ray-project#55978 Specifically, this add a full list of paths to `self` and for Datasource self is captured every read_task_fn during serialization and causing this data being duplicated and cause excessive spilling. We face the similar warning and have spilling behavior: ``` The serialized size of your read function named 'read_task_fn' is 9.7MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` Revert can solve the issue but this attribute seems to have other dependancy. This PR move the _source_paths to object store during init time and only keep the object ref in self. and introduce a property function to materialize raw path data when needed ## Related issues ray-project#55978 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Raymond Lee <lee1258561@gmail.com> Signed-off-by: lee1258561 <lee1258561@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
…t excessive spilling during read task serialization (ray-project#59999) ## Description This PR reintroducing the issue that A previous PR trying to solve: ray-project#55978 Specifically, this add a full list of paths to `self` and for Datasource self is captured every read_task_fn during serialization and causing this data being duplicated and cause excessive spilling. We face the similar warning and have spilling behavior: ``` The serialized size of your read function named 'read_task_fn' is 9.7MB. This size relatively large. As a result, Ray might excessively spill objects during execution. To fix this issue, avoid accessing `self` or other large objects in 'read_task_fn'. ``` Revert can solve the issue but this attribute seems to have other dependancy. This PR move the _source_paths to object store during init time and only keep the object ref in self. and introduce a property function to materialize raw path data when needed ## Related issues ray-project#55978 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Raymond Lee <lee1258561@gmail.com> Signed-off-by: lee1258561 <lee1258561@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Why are these changes needed?
Using
FileBasedDatasourceorParquetDatasourcewith a very large number of files causes OOM when creating read tasks. The full list of file paths is stored inself, causing it to persist to every read task, leading to this warning:When using a small number of blocks, OOM does not occur because the large file list is not repeated so many times. But when setting high parallelism with
override_num_blocks, OOM occurs.This is because the full list of paths is added to
self._unresolved_paths. This attribute isn't currently used anywhere in ray. This PR removesself._unresolved_pathsto alleviate unexpected high memory usage with very large numbers of files.Related issue number
Similar to this issue with Iceberg: #49054
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.