[core][autoscaler] Retry GCP project metadata updates on HTTP 412 errors #60429
edoakes merged 4 commits into ray-project:master
Code Review
This pull request introduces a retry mechanism to handle HTTP 412 precondition failures when updating GCP project metadata for SSH keys. This is a good improvement for robustness in concurrent environments. My main feedback is to make the retry loop bounded to prevent potential infinite loops and to add a small backoff delay. This will make the retry logic safer and more robust.
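The bounded retry with backoff suggested above could look roughly like the following. This is a minimal sketch, not the code merged in this PR: `PreconditionFailed`, `retry_on_conflict`, and the default values for `max_attempts` and `base_delay` are hypothetical names and numbers chosen for illustration.

```python
import time


class PreconditionFailed(Exception):
    """Hypothetical stand-in for an HTTP 412 error raised by the API client."""


def retry_on_conflict(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    Only PreconditionFailed triggers a retry; any other exception propagates
    immediately. The delay doubles after each failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except PreconditionFailed:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays. A cap on attempts means a persistently failing update surfaces as an error instead of looping forever.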
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b722b8dcd1
b722b8d to 71a0e8c
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
71a0e8c to e9418e0
a8677a0 to 9495ec7
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
9495ec7 to ab909bd
Hi @edoakes, please review it again 🙏
…ors (ray-project#60429)

When the autoscaler launches a Ray cluster on GCP, it adds a new SSH key to the project metadata if necessary. The update may result in an HTTP 412 precondition failure if the metadata is being updated concurrently. The error looks like this:

```python
googleapiclient.errors.HttpError: <HttpError 412 when requesting https://compute.googleapis.com/compute/v1/projects/my_gcp_project/setCommonInstanceMetadata?alt=json returned "Supplied fingerprint does not match current metadata fingerprint.". Details: "[{'message': 'Supplied fingerprint does not match current metadata fingerprint.', 'domain': 'global', 'reason': 'conditionNotMet', 'location': 'If-Match', 'locationType': 'header'}]">
```

The error can only be resolved by retrying. Therefore, to provide a better user experience, this PR retries automatically on the user's behalf:

1. Catch the error.
2. Reload the metadata and update it again.

---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
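The two steps in the description above (catch the error, then reload the metadata and update it again) can be sketched with an in-memory stand-in for the project metadata. Everything here is an illustrative assumption rather than the PR's actual code: `FakeProject`, `RacingProject`, `PreconditionFailed`, and `add_ssh_key` are hypothetical names, and the metadata is simplified to a plain dict guarded by a fingerprint.

```python
class PreconditionFailed(Exception):
    """Hypothetical stand-in for an HttpError with status 412."""


class FakeProject:
    """In-memory stand-in for project metadata guarded by a fingerprint."""

    def __init__(self):
        self.items = {}
        self.version = 0

    def get(self):
        # Return a copy so callers cannot mutate the stored state directly.
        return {"fingerprint": str(self.version), "items": dict(self.items)}

    def set(self, metadata):
        # Reject writes based on a stale read, like the If-Match check.
        if metadata["fingerprint"] != str(self.version):
            raise PreconditionFailed("Supplied fingerprint does not match.")
        self.items = dict(metadata["items"])
        self.version += 1


class RacingProject(FakeProject):
    """Simulates one concurrent writer slipping in before our first write."""

    def __init__(self):
        super().__init__()
        self._raced = False

    def set(self, metadata):
        if not self._raced:
            self._raced = True
            self.items["ssh-keys"] = "other:ssh-rsa BBBB"
            self.version += 1
        super().set(metadata)


def add_ssh_key(project, user, public_key, max_attempts=5):
    """Add `user:public_key`, reloading the metadata after each conflict.

    Returns the number of write attempts that were made.
    """
    entry = f"{user}:{public_key}"
    for attempt in range(1, max_attempts + 1):
        metadata = project.get()  # step 2: reload to get a fresh fingerprint
        existing = metadata["items"].get("ssh-keys", "")
        if entry in existing.splitlines():
            return attempt - 1  # key already present, nothing to write
        metadata["items"]["ssh-keys"] = "\n".join(filter(None, [existing, entry]))
        try:
            project.set(metadata)
            return attempt
        except PreconditionFailed:  # step 1: catch the conflict and loop
            continue
    raise RuntimeError(f"metadata update still conflicting after {max_attempts} attempts")
```

Reloading inside the loop matters: the second attempt is built on top of the other writer's change, so neither SSH key is lost.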