What problem are you facing?
When my Kubernetes cluster moves pods to different nodes (e.g., during node maintenance, scaling, or resource rebalancing), the OpenTofu provider pod can be terminated while it's running a `tofu apply` operation. These operations often take longer than 30 seconds to complete.
Currently, I'm seeing the following error in the pod logs:
crossplane-opentofu-provider: error: Cannot start controller manager: failed waiting for all runnables to end within grace period of 30s: context deadline exceeded
Current situation:
- I've set `terminationGracePeriodSeconds` to a larger value (e.g., 300 seconds) in the pod spec, which gives Kubernetes more time before forcefully killing the pod
- However, the controller manager's `GracefulShutdownTimeout` is not exposed by the provider, so it is effectively hardcoded to its 30-second default
- This means even though Kubernetes is willing to wait 5 minutes, the controller manager only waits 30 seconds for its workers (which run `tofu apply` commands) to finish gracefully
- After 30 seconds, the controller manager gives up and exits, potentially leaving Terraform/OpenTofu operations in an inconsistent state (see the sketch below)
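For context, this timeout comes from controller-runtime's manager options rather than from the provider itself. Below is a minimal sketch (not the provider's actual `main.go`) of how controller-runtime applies `GracefulShutdownTimeout`: when the option is left unset it falls back to a 30-second default, regardless of the pod's `terminationGracePeriodSeconds`.

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// If GracefulShutdownTimeout is left nil, controller-runtime falls back to its
	// 30s default -- the "grace period of 30s" in the error message above.
	shutdown := 300 * time.Second // illustrative value matching terminationGracePeriodSeconds

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		GracefulShutdownTimeout: &shutdown,
	})
	if err != nil {
		panic(err)
	}

	// SIGTERM from the kubelet cancels this context; the manager then waits at most
	// GracefulShutdownTimeout for running reconciles (the tofu apply workers) to finish.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```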
Why this is a problem:
- Long-running `tofu apply` operations (>30s) get interrupted during pod migrations
- This can leave infrastructure in an inconsistent state
- The `terminationGracePeriodSeconds` setting alone doesn't solve the problem because the controller manager has its own internal timeout
- There's currently no way to configure the controller manager's `GracefulShutdownTimeout` to match the pod's termination grace period
How could Upbound help solve your problem?
Add a new command-line flag (similar to `--poll` and `--max-reconcile-rate`) to configure the controller manager's `GracefulShutdownTimeout`. This would allow users to set a timeout that matches their `terminationGracePeriodSeconds` and the expected duration of their `tofu apply` operations.
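As a rough illustration, the wiring could look something like the sketch below. The flag name `--graceful-shutdown-timeout` is only a suggestion, and the sketch assumes the provider parses flags with kingpin the way other Upbound providers do:

```go
package main

import (
	"os"
	"path/filepath"

	"gopkg.in/alecthomas/kingpin.v2"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	app := kingpin.New(filepath.Base(os.Args[0]), "OpenTofu provider (sketch)")

	// Hypothetical new flag, following the pattern of --poll and --max-reconcile-rate.
	gracefulShutdownTimeout := app.Flag("graceful-shutdown-timeout",
		"How long the controller manager waits for in-flight reconciles (e.g. tofu apply) to finish after receiving SIGTERM.").
		Default("30s").Duration()

	kingpin.MustParse(app.Parse(os.Args[1:]))

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Pass the flag through to controller-runtime instead of relying on its 30s default.
		GracefulShutdownTimeout: gracefulShutdownTimeout,
	})
	kingpin.FatalIfError(err, "cannot create controller manager")

	kingpin.FatalIfError(mgr.Start(ctrl.SetupSignalHandler()), "cannot start controller manager")
}
```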
Benefits:
- Users can configure the graceful shutdown timeout to match their workload requirements
- Prevents premature termination of long-running `tofu apply` operations
- Aligns with Kubernetes pod lifecycle management best practices
- Follows the existing pattern of other configurable timeouts in the provider (like `--timeout`, etc.)
Related configuration:
This would work in conjunction with:
- `terminationGracePeriodSeconds` in the pod spec (Kubernetes-level timeout)
- `--timeout` flag (controls how long individual tofu processes may run; see the sketch below for how the three values relate)
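As a rule of thumb (my assumption, not documented guidance), the three values would need to be ordered so that each layer has room to finish before the layer above it gives up. A trivial sketch of that relationship, with illustrative values and hypothetical variable names:

```go
package main

import "time"

func main() {
	// Illustrative values only.
	tofuTimeout := 20 * time.Minute            // --timeout: longest a single tofu process may run
	gracefulShutdown := 25 * time.Minute       // proposed flag: how long the manager waits on shutdown
	terminationGracePeriod := 30 * time.Minute // terminationGracePeriodSeconds in the pod spec

	// tofuTimeout <= gracefulShutdown <= terminationGracePeriod, so an in-flight
	// tofu apply can finish before the manager stops waiting, and the manager can
	// exit before the kubelet sends SIGKILL.
	if tofuTimeout > gracefulShutdown || gracefulShutdown > terminationGracePeriod {
		panic("timeouts are ordered inconsistently")
	}
}
```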