Add a way for JobSet's failure policy rules to match pod.status.reason #1091

@GiuseppeTT

What would you like to be added:

A way for JobSet's failure policy rules to match pod.status.reason.

Why is this needed:

Currently, JobSet's failure policy rules (jobset.spec.failurePolicy.rules) only act on Job failure reasons (job.status.conditions[].reason). While a Job's Pod failure policy (job.spec.podFailurePolicy) can propagate Pod failures to the Job level, it is limited to matching on:

  • pod.status.conditions[].type
  • pod.status.conditions[].status
  • pod.status.containerStatuses[].name
  • pod.status.containerStatuses[].state.terminated.exitCode
  • pod.status.initContainerStatuses[].name
  • pod.status.initContainerStatuses[].state.terminated.exitCode

This becomes a problem when Pods fail with a minimal status like the following (observed in production):

status:
  phase: "Failed"
  reason: "UnexpectedAdmissionError"
  message: ...
  qosClass: ...
  startTime: ...

Since this status has neither pod.status.conditions[] nor pod.status.containerStatuses[], the Job's Pod failure policy has nothing to match on and cannot bubble the failure reason up to JobSet's failure policy. As a result, JobSet cannot react to this failure specifically, for example by restarting the JobSet without counting towards maxRestarts.
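To make the gap concrete, here is a minimal Python sketch of how a Pod failure policy rule is evaluated against a Pod status. It is a simplified model written for this issue, not the Job controller's actual code: the function name rule_matches and the dict-based representation are assumptions, and only the onExitCodes and onPodConditions requirements are modelled.

```python
from typing import Any


def rule_matches(rule: dict[str, Any], pod_status: dict[str, Any]) -> bool:
    """Simplified model of one podFailurePolicy rule (hypothetical helper)."""
    on_exit = rule.get("onExitCodes")
    if on_exit is not None:
        statuses = (pod_status.get("initContainerStatuses", [])
                    + pod_status.get("containerStatuses", []))
        for cs in statuses:
            terminated = cs.get("state", {}).get("terminated")
            # Only terminated containers with a non-zero exit code are considered.
            if terminated is None or terminated.get("exitCode") == 0:
                continue
            name = on_exit.get("containerName")
            if name is not None and cs.get("name") != name:
                continue
            in_values = terminated.get("exitCode") in on_exit["values"]
            if on_exit["operator"] == "In" and in_values:
                return True
            if on_exit["operator"] == "NotIn" and not in_values:
                return True
        return False

    # onPodConditions: any (type, status) pattern matching any Pod condition.
    for pattern in rule.get("onPodConditions", []):
        for cond in pod_status.get("conditions", []):
            if (cond.get("type") == pattern["type"]
                    and cond.get("status") == pattern.get("status", "True")):
                return True
    return False


# The minimal status from this issue: no conditions, no container statuses.
minimal_status = {"phase": "Failed", "reason": "UnexpectedAdmissionError"}

rules = [
    {"action": "Ignore",
     "onPodConditions": [{"type": "DisruptionTarget", "status": "True"}]},
    {"action": "FailJob",
     "onExitCodes": {"operator": "NotIn", "values": [0]}},
]

# No rule can match, because neither requirement ever looks at pod.status.reason.
matched = [r["action"] for r in rules if rule_matches(r, minimal_status)]
```

In this model, matched is empty for the minimal status, so the failure is handled as a generic Pod failure and the reason UnexpectedAdmissionError never reaches the Job's conditions.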

Possible solutions:

  • (1) Make kubelet add a Pod condition for UnexpectedAdmissionError
    • This is probably the ideal solution, but it is also the most complex and the slowest to deliver
  • (2) Make the Job API support matching pod.status.reason
    • This is simpler than changing kubelet and would be faster to deliver, but it is probably still too slow (besides implementation, it would require a KEP and 2 to 3 Kubernetes release cycles)
  • (3) Make the JobSet API support matching pod.status.reason
    • This is probably not the ideal solution, but pragmatically it would be the fastest to deliver (besides implementation, it would require updating the failure policy KEP and one JobSet release cycle)
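As a rough illustration of option (3), the Python sketch below models how a JobSet failure policy rule could additionally match on pod.status.reason. Everything here is an assumption for discussion: the on_pod_status_reasons field does not exist in the JobSet API, the Rule class and choose_action function are hypothetical, and the first-match-wins evaluation with a RestartJobSet fallback is a simplification that the KEP update would need to confirm.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Simplified stand-in for a JobSet failure policy rule."""
    name: str
    action: str
    # Existing idea: match on the failed Job's failure reason.
    on_job_failure_reasons: list[str] = field(default_factory=list)
    # HYPOTHETICAL field proposed here: match on pod.status.reason.
    on_pod_status_reasons: list[str] = field(default_factory=list)


def choose_action(rules: list[Rule],
                  job_failure_reason: str,
                  pod_status_reasons: list[str],
                  default: str = "RestartJobSet") -> str:
    """Return the action of the first rule whose requirements all match.

    An empty requirement list is treated as "match anything" (an assumption).
    """
    for rule in rules:
        if (rule.on_job_failure_reasons
                and job_failure_reason not in rule.on_job_failure_reasons):
            continue
        if (rule.on_pod_status_reasons
                and not set(rule.on_pod_status_reasons) & set(pod_status_reasons)):
            continue
        return rule.action
    return default


rules = [
    Rule(name="ignore-admission-errors",
         action="RestartJobSetAndIgnoreMaxRestarts",
         on_pod_status_reasons=["UnexpectedAdmissionError"]),
]

# A Job failed and one of its Pods reported the reason from this issue.
action = choose_action(rules, "BackoffLimitExceeded", ["UnexpectedAdmissionError"])
```

With such a rule, a failure caused by UnexpectedAdmissionError would restart the JobSet without counting towards maxRestarts, while any other failure would fall through to the default action.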

This enhancement requires the following artifacts:

  • Design doc (technically, only an update to the failure policy KEP)
  • API change
  • Docs update

Metadata

  • Assignees: no one assigned
  • Labels: lifecycle/stale (denotes an issue or PR that has remained open with no activity and has become stale)
  • Status: Untriaged
  • Milestone: no milestone