[Data] DefaultClusterAutoscalerV2 raises KeyError: 'CPU' Fix#60208
[Data] DefaultClusterAutoscalerV2 raises KeyError: 'CPU' Fix#60208bveeramani merged 2 commits intoray-project:masterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request effectively resolves a KeyError by safely accessing the 'CPU' resource using dict.get() with a default value. This change is crucial for preventing crashes on nodes that may not explicitly define 'CPU' resources, such as dedicated GPU nodes. For enhanced robustness and consistency, a similar safe access pattern is recommended for the 'memory' resource.
python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py
Outdated
Show resolved
Hide resolved
| node_resource_spec = _NodeResourceSpec.of( | ||
| cpu=r["CPU"], gpu=r.get("GPU", 0), mem=r["memory"] | ||
| cpu=r.get("CPU", 0), gpu=r.get("GPU", 0), mem=r.get("memory", 0) | ||
| ) |
There was a problem hiding this comment.
Can you add a regression test? e.g., a test where ray.nodes() returns nodes with no logical resources?
There was a problem hiding this comment.
Hey yes that makes sense , sorry for missing that , I will add that first thing tomorrow.
There was a problem hiding this comment.
Also I am sorry I know this is the right platform for all these questions,
- Is there a specific Python version (e.g., 3.9, 3.10) that is most stable/recommended for development right now?
- I noticed the dev guide covers Linux, macOS, and Windows. Given that Ray has a complex C++ backend with OS-level dependencies, how does the project ensure changes on one OS don't break others? Is there anything specific I should watch out for on macOS?
There was a problem hiding this comment.
Ah, no need to apologize for asking questions!
I'd recommend using 3.10, because it's the lowest supported version.
I think we test against different operating systems as part of our release process. AFAIK most Ray Data devs use Mac, so you should be good on that front
There was a problem hiding this comment.
I have added the test as needed please review again, sorry for the delay was facing issues with python 3.13 for deps installation for testing but switching to 3.10 helped. Thanks
There was a problem hiding this comment.
My bad for using 2 github accounts will be using this one from now on
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
|
@bveeramani I have fixed all the issues but I don't seem to have privileges to merge to master ,can you please take care of it Thanks |
|
Done, ty for the contribution! |
…ject#60208) This PR updates DefaultClusterAutoscalerV2 to safely handle nodes with 0 logical CPUs by replacing direct dictionary access (r["CPU"]) with r.get("CPU", 0), preventing crashes on dedicated GPU nodes. This fix has been discussed firsthand with @bveeramani. "Fixes ray-project#60166" --------- Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com> Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
…ject#60208) This PR updates DefaultClusterAutoscalerV2 to safely handle nodes with 0 logical CPUs by replacing direct dictionary access (r["CPU"]) with r.get("CPU", 0), preventing crashes on dedicated GPU nodes. This fix has been discussed firsthand with @bveeramani. "Fixes ray-project#60166" --------- Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
…ject#60208) This PR updates DefaultClusterAutoscalerV2 to safely handle nodes with 0 logical CPUs by replacing direct dictionary access (r["CPU"]) with r.get("CPU", 0), preventing crashes on dedicated GPU nodes. This fix has been discussed firsthand with @bveeramani. "Fixes ray-project#60166" --------- Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
…ject#60208) This PR updates DefaultClusterAutoscalerV2 to safely handle nodes with 0 logical CPUs by replacing direct dictionary access (r["CPU"]) with r.get("CPU", 0), preventing crashes on dedicated GPU nodes. This fix has been discussed firsthand with @bveeramani. "Fixes ray-project#60166" --------- Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
…ject#60208) This PR updates DefaultClusterAutoscalerV2 to safely handle nodes with 0 logical CPUs by replacing direct dictionary access (r["CPU"]) with r.get("CPU", 0), preventing crashes on dedicated GPU nodes. This fix has been discussed firsthand with @bveeramani. "Fixes ray-project#60166" --------- Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
…ject#60208) This PR updates DefaultClusterAutoscalerV2 to safely handle nodes with 0 logical CPUs by replacing direct dictionary access (r["CPU"]) with r.get("CPU", 0), preventing crashes on dedicated GPU nodes. This fix has been discussed firsthand with @bveeramani. "Fixes ray-project#60166" --------- Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
This PR updates DefaultClusterAutoscalerV2 to safely handle nodes with 0 logical CPUs by replacing direct dictionary access (r["CPU"]) with r.get("CPU", 0), preventing crashes on dedicated GPU nodes.
This fix has been discussed firsthand with @bveeramani.
"Fixes #60166"