[Autoscaler] Fix v1 autoscaler TypeError when using bundle_label_selectors #59850
…ctors PR 54843 broke v1 autoscaler by writing new format with nested dicts to KV. Fix by extracting 'resources' field for v1 (v2 uses GCS RPC with full format). Signed-off-by: dragongu <andrewgu@vip.qq.com>
Code Review
This pull request effectively resolves a TypeError in the v1 autoscaler when request_resources is called with bundle_label_selectors. The issue stemmed from the v1 autoscaler's inability to process the newer resource request format. The implemented solution correctly introduces version-specific logic, using GCS RPC for v2 and writing a backward-compatible format to the KV store for v1. The changes are clear, logical, and well-contained. The accompanying test modifications are excellent, as they not only assert the correct data format for v1 but also verify that the operation which previously caused the crash now executes successfully. The code quality is high, and I have no further recommendations.
@dragongu what are the issues you've run into? We'd like to fully migrate to v2 and deprecate/remove v1 soon, so definitely want to sort those out!
@edoakes Yes, we’re planning to move to v2 soon. The main issue is that our KubeRay is on a very old version (0.5.6). We hit some problems during the upgrade and had to roll back in a hurry. I’ll dig into it more today, but the v1 fix is still pretty important since it’s blocking my Ray upgrade.
…ctors (ray-project#59850) Signed-off-by: dragongu <andrewgu@vip.qq.com>
…ctors (ray-project#59850) Signed-off-by: dragongu <andrewgu@vip.qq.com> Signed-off-by: jeffery4011 <jefferyshen1015@gmail.com>
…ctors (ray-project#59850) Signed-off-by: dragongu <andrewgu@vip.qq.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Fix autoscaler v1 TypeError when request_resources uses label selectors
Summary
Fix a TypeError in autoscaler v1 when request_resources() is called with bundle_label_selectors.
Problem
PR #54843 introduced the bundle_label_selectors parameter to request_resources(). When this parameter is used, the function writes resources to GCS KV in the new format:
{"resources": {"CPU": 1}, "label_selector": {"region": "us-west1"}}
For autoscaler v2, this format is handled correctly via GCS RPC. However, autoscaler v1 reads from KV and extracts only the resources field for backward compatibility (commands.py:237). But when monitor.load_metrics.summary() is called, it invokes freq_of_dicts(), which attempts to hash these resource dictionaries.
If the resource dictionary contains nested structures (e.g., when label selectors are present in v2 format), the default serializer fails with:
TypeError: unhashable type: 'dict'
This happens because Python dicts are mutable and therefore unhashable, so they cannot be used as dictionary keys or set elements directly.
Reproduction
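A minimal stand-alone sketch of the failure, runnable without a Ray cluster. It assumes a simplified stand-in for freq_of_dicts() that counts dicts by hashing a frozenset of their items; the name freq_of_dicts_sketch is hypothetical, and the real helper in Ray's autoscaler utilities may differ in detail.

```python
from collections import Counter


def freq_of_dicts_sketch(dicts):
    """Count identical dicts by hashing a frozenset of their items.

    Hypothetical stand-in for the autoscaler's freq_of_dicts(); it only
    works when every value in every dict is itself hashable.
    """
    counts = Counter(frozenset(d.items()) for d in dicts)
    return [(dict(key), n) for key, n in counts.items()]


# Flat resource dicts (the format v1 expects) hash without problems.
flat = [{"CPU": 1}, {"CPU": 1}, {"GPU": 2}]
flat_counts = freq_of_dicts_sketch(flat)

# The new format nests dicts as values, so building the frozenset fails.
nested = [{"resources": {"CPU": 1}, "label_selector": {"region": "us-west1"}}]
try:
    freq_of_dicts_sketch(nested)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(flat_counts)
print(error_message)
```

The flat input is counted normally, while the nested input raises a TypeError as soon as one of its items has to be hashed.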
Solution
Update commands.py to ensure autoscaler v1 compatibility by always extracting only the resources field when writing to the KV store, regardless of format.
This ensures the v1 autoscaler only sees the simple ResourceDict format, like {"CPU": 1}, without nested structures, preventing the TypeError in freq_of_dicts().
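The write-side extraction could look roughly like the sketch below. This is not the code from commands.py: to_v1_resource_dicts is a hypothetical helper name, and the payload layout is assumed from the example format quoted in the Problem section.

```python
import json
from typing import Any, Dict, List


def to_v1_resource_dicts(bundles: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten bundles to plain resource dicts for autoscaler v1.

    Hypothetical helper. A bundle in the new format carries its resources
    under a "resources" key whose value is a dict; a legacy bundle is
    already a flat resource dict and is copied through unchanged.
    """
    flat = []
    for bundle in bundles:
        resources = bundle.get("resources")
        if isinstance(resources, dict):
            # New format: keep only the resources, drop label selectors.
            flat.append(dict(resources))
        else:
            # Legacy format: already {"CPU": 1}-style.
            flat.append(dict(bundle))
    return flat


bundles = [
    {"resources": {"CPU": 1}, "label_selector": {"region": "us-west1"}},
    {"GPU": 2},
]
v1_bundles = to_v1_resource_dicts(bundles)

# The value written to the KV store would then be the serialized flat list.
kv_payload = json.dumps(v1_bundles)
print(kv_payload)
```

Checking that the value under "resources" is a dict, rather than only that the key exists, keeps a legacy bundle from being misread if it ever uses a custom resource literally named "resources".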