
Commit 1db331e

rueian authored and peterxcli committed
[core][autoscaler] Retry GCP project metadata updates on HTTP 412 errors (ray-project#60429)
When the autoscaler launches a Ray cluster on GCP, it adds a new SSH key to the project metadata if necessary. The update may fail with an HTTP 412 precondition failure if other clients update the metadata concurrently. The error looks like this:

```python
googleapiclient.errors.HttpError: <HttpError 412 when requesting https://compute.googleapis.com/compute/v1/projects/my_gcp_project/setCommonInstanceMetadata?alt=json returned "Supplied fingerprint does not match current metadata fingerprint.". Details: "[{'message': 'Supplied fingerprint does not match current metadata fingerprint.', 'domain': 'global', 'reason': 'conditionNotMet', 'location': 'If-Match', 'locationType': 'header'}]">
```

The error can only be resolved by retrying. Therefore, to provide a better user experience, this PR retries automatically on the user's behalf:

1. Catch the error.
2. Reload the metadata and update it again.

---------

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
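To illustrate why reloading before the retry matters, here is a minimal in-memory sketch of a fingerprint-guarded metadata store. Everything in it (`FakeProjectMetadata`, `PreconditionFailed`, `add_ssh_key`, and the SHA-256 fingerprint) is hypothetical and only models the behaviour described in the error message; it is not the GCP client API or Ray's implementation.

```python
import hashlib
import json


class PreconditionFailed(Exception):
    """Hypothetical stand-in for an HTTP 412 response."""

    status = 412


class FakeProjectMetadata:
    """In-memory model of fingerprint-guarded metadata (illustration only)."""

    def __init__(self):
        self.items = {}

    def _fingerprint(self):
        # Assumption: the fingerprint is a hash of the current contents.
        payload = json.dumps(self.items, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def get(self):
        # Return a snapshot together with the fingerprint it was read at.
        return {"items": dict(self.items), "fingerprint": self._fingerprint()}

    def set(self, items, fingerprint):
        # Reject the write if the caller's snapshot is stale.
        if fingerprint != self._fingerprint():
            raise PreconditionFailed("Supplied fingerprint does not match")
        self.items = dict(items)


def add_ssh_key(store, snapshot, user, key):
    """Write the snapshot plus one new key, guarded by the snapshot's fingerprint."""
    items = dict(snapshot["items"])
    items[user] = key
    store.set(items, snapshot["fingerprint"])


store = FakeProjectMetadata()
stale = store.get()                             # writer A reads the metadata
add_ssh_key(store, store.get(), "b", "key-b")   # writer B updates it first

try:
    add_ssh_key(store, stale, "a", "key-a")     # writer A's stale write is rejected
    conflict = False
except PreconditionFailed:
    conflict = True
    add_ssh_key(store, store.get(), "a", "key-a")  # reload, then retry

result = store.items
```

In this model, retrying with the stale snapshot would fail again no matter how often it is repeated, which is why step 2 reloads the metadata first.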
1 parent da5a1e3 commit 1db331e

1 file changed: +18 -1 lines changed

python/ray/autoscaler/_private/gcp/config.py

Lines changed: 18 additions & 1 deletion
@@ -549,7 +549,24 @@ def _configure_key_pair(config, compute):
             )
             public_key, private_key = generate_rsa_key_pair()
 
-            _create_project_ssh_key_pair(project, public_key, ssh_user, compute)
+            for attempt in range(MAX_POLLS):
+                try:
+                    _create_project_ssh_key_pair(project, public_key, ssh_user, compute)
+                    break
+                except errors.HttpError as e:
+                    if e.resp.status != 412 or attempt == MAX_POLLS - 1:
+                        raise
+                    logger.warning(
+                        "GCP project metadata update conflict for %s (%s); retrying",
+                        config["provider"]["project_id"],
+                        e,
+                    )
+                    time.sleep(POLL_INTERVAL)
+                    project = (
+                        compute.projects()
+                        .get(project=config["provider"]["project_id"])
+                        .execute()
+                    )
 
             # Create the directory if it doesn't exists
             private_key_dir = os.path.dirname(private_key_path)
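
The control flow of the retry loop in this diff can be sketched as a standalone helper. This is a simplified illustration under assumptions, not the code in `config.py`: `HttpStatusError`, `retry_on_conflict`, and its parameters are hypothetical names, and the `sleep` function is injectable so the example runs without waiting.

```python
import time


class HttpStatusError(Exception):
    """Hypothetical exception carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def retry_on_conflict(update, reload, state, max_attempts=3, interval=0.0,
                      sleep=time.sleep):
    """Call update(state); on a 412, wait, reload the state, and try again.

    Any other status, or a 412 on the final attempt, is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return update(state)
        except HttpStatusError as e:
            if e.status != 412 or attempt == max_attempts - 1:
                raise
            sleep(interval)
            state = reload()  # fetch fresh state before the next attempt


# Usage: an update that conflicts twice before succeeding.
calls = []


def flaky_update(state):
    calls.append(state)
    if len(calls) < 3:
        raise HttpStatusError(412)
    return f"updated with {state}"


versions = iter(["v2", "v3"])
outcome = retry_on_conflict(flaky_update, lambda: next(versions), "v1")


# A non-412 error is not retried.
def forbidden_update(state):
    raise HttpStatusError(403)


try:
    retry_on_conflict(forbidden_update, lambda: "unused", "v1")
    forbidden_status = None
except HttpStatusError as e:
    forbidden_status = e.status
```

Bounding the number of attempts keeps a persistent conflict from looping forever, and re-raising on the last attempt surfaces the original error to the caller.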
