Skip to content

Configurable graceful shutdown timeout for controller manager #121

@garyleungsky

Description

@garyleungsky

What problem are you facing?

When my Kubernetes cluster moves pods to different nodes (e.g., during node maintenance, scaling, or resource rebalancing), the OpenTofu provider pod can be terminated while it's running a tofu apply operation. These operations often take longer than 30 seconds to complete.

Currently, I'm seeing the following error in the pod logs:

crossplane-opentofu-provider: error: Cannot start controller manager: failed waiting for all runnables to end within grace period of 30s: context deadline exceeded

Current situation:

  • I've set terminationGracePeriodSeconds to a larger value (e.g., 300 seconds) in the pod spec, which gives Kubernetes more time before forcefully killing the pod
  • However, the controller manager itself has a hardcoded GracefulShutdownTimeout that defaults to 30 seconds
  • This means even though Kubernetes is willing to wait 5 minutes, the controller manager only waits 30 seconds for its workers (which run tofu apply commands) to finish gracefully
  • After 30 seconds, the controller manager gives up and exits, potentially leaving Terraform/OpenTofu operations in an inconsistent state

Why this is a problem:

  • Long-running tofu apply operations (>30s) get interrupted during pod migrations
  • This can leave infrastructure in an inconsistent state
  • The terminationGracePeriodSeconds setting alone doesn't solve the problem because the controller manager has its own internal timeout
  • There's currently no way to configure the controller manager's GracefulShutdownTimeout to match the pod's termination grace period

How could Upbound help solve your problem?

Add a new command-line flag (similar to --poll and --max-reconcile-rate) to configure the controller manager's GracefulShutdownTimeout. This would allow users to set a timeout that matches their terminationGracePeriodSeconds and the expected duration of their tofu apply operations.

Benefits:

  • Users can configure the graceful shutdown timeout to match their workload requirements
  • Prevents premature termination of long-running tofu apply operations
  • Aligns with Kubernetes pod lifecycle management best practices
  • Follows the existing pattern of other configurable timeouts in the provider (like --timeout, etc.)

Related configuration:
This would work in conjunction with:

  • terminationGracePeriodSeconds in the pod spec (Kubernetes-level timeout)
  • --timeout flag (controls how long individual tofu processes may run)

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions