Description
What would you like to be added:
A way for JobSet's failure policy rules to match pod.status.reason.
Why is this needed:
Currently, JobSet's failure policy rules (jobset.spec.failurePolicy.rules) only act on Job failure reasons (job.status.conditions[].reason). While a Job's Pod failure policy (job.spec.podFailurePolicy) can propagate Pod failures to the Job level, it is limited to matching on:
- pod.status.conditions[].type
- pod.status.conditions[].status
- pod.status.containerStatuses[].state.terminated.name
- pod.status.containerStatuses[].state.terminated.exitCode
- pod.status.initContainerStatuses[].state.terminated.name
- pod.status.initContainerStatuses[].state.terminated.exitCode
This becomes a problem when Pods fail with a minimal status like the following (observed in production):
status:
  phase: "Failed"
  reason: "UnexpectedAdmissionError"
  message: ...
  qosClass: ...
  startTime: ...
Since this status contains neither pod.status.conditions[] nor pod.status.containerStatuses[], the Job's Pod failure policy has nothing to match on, so it cannot bubble the failure reason up to JobSet's failure policy. This rules out behavior such as restarting the JobSet without counting the restart towards maxRestarts.
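To make the limitation concrete, here is a minimal sketch of a Job's Pod failure policy. The container name `main` and the exit code are illustrative assumptions, not values from the report. Each rule matches on either container exit codes or Pod conditions, so a Pod whose status carries only `reason: "UnexpectedAdmissionError"` matches neither rule.

```yaml
# Fragment of a batch/v1 Job spec; not a complete manifest.
spec:
  podFailurePolicy:
    rules:
    # Matches only when a container terminated with a listed exit code.
    # "main" and 42 are placeholder values for illustration.
    - action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
    # Matches only when the Pod has a condition of the given type.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
    # There is no rule type that matches on pod.status.reason,
    # so "UnexpectedAdmissionError" cannot be selected here.
```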
Possible solutions:
- (1) Make kubelet add a Pod condition for UnexpectedAdmissionError
  - This is probably the ideal solution, but it is also the most complex and the slowest to deliver
- (2) Make the Job API support matching pod.status.reason
  - This is simpler than changing kubelet and would be faster to deliver, but it is probably still too slow (besides implementation, it would require a KEP and 2 to 3 Kubernetes release cycles)
- (3) Make the JobSet API support matching pod.status.reason
  - This is probably not the ideal solution, but pragmatically it would be the fastest to deliver (besides implementation, it would require updating the failure policy KEP and one JobSet release cycle)
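As a rough sketch of what option (3) might look like from a user's point of view, the fragment below adds a rule that matches on the Pod's status reason. The field name `onPodFailureReasons` is purely hypothetical and does not exist in the JobSet API today. The rule name and the `maxRestarts` value are also illustrative, and the actual shape would be decided in the failure policy KEP update.

```yaml
# Fragment of a JobSet spec; not a complete manifest.
spec:
  failurePolicy:
    maxRestarts: 3  # illustrative value
    rules:
    - name: restart-on-admission-error  # illustrative name
      # Restart the JobSet without counting towards maxRestarts.
      action: RestartJobSetAndIgnoreMaxRestarts
      # HYPOTHETICAL field: would match pod.status.reason of a failed Pod.
      onPodFailureReasons:
      - UnexpectedAdmissionError
```

Under this assumption, the rule would sit alongside the existing matching on Job failure reasons, so a Pod that fails with only a status reason could still trigger a restart that leaves the maxRestarts budget untouched.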
This enhancement requires the following artifacts:
- Design doc (technically only update the failure policy KEP)
- API change
- Docs update