estuary-cdk: enforce deadline for graceful shutdown #3921
Merged
When the `stopping.event` is set for any reason (ex: a stream encountered an exception, it's been 24 hours since the capture restarted), it's possible for other streams to block the graceful shutdown indefinitely. Captures have stalled in these situations and required manual intervention to get unstuck. This commit adds `enforce_shutdown` that triggers after `stopping.event` is set. It waits 30 minutes for the graceful shutdown to complete. If those 30 minutes elapse and the connector hasn't exited, the `TaskGroup`'s running tasks are signaled to cancel. If the connector is _still_ running 5 minutes after attempting to cancel the tasks, then the CDK forces an exit with `os._exit(1)` as a last resort.
Alex-Bair
commented
Feb 21, 2026
Comment on lines +200 to +216
    try:
        async with asyncio.TaskGroup() as tg:
            # Start enforce_shutdown after tg is available
            enforce_shutdown_task = asyncio.create_task(enforce_shutdown(tg))

            task = Task(
                log.getChild("capture"),
                ConnectorStatus(log, stopping),
                "capture",
                self.output,
                stopping,
                tg,
            )
            log.event.status("Capture started")
            await capture(task)
    except* TerminateTaskGroup:
        pass  # Expected when enforce_shutdown terminates the task group
Member
Author
Note: Terminating a task group isn't natively supported by the standard library, so I used a `TerminateTaskGroup` exception to make the task group terminate itself. This use of a `TerminateTaskGroup` exception was copied from the Python docs.
nicolaslazo
approved these changes
Feb 23, 2026
Contributor
nicolaslazo
left a comment
I don't recall seeing any connectors that try to handle cancellation signals, but I still think the `os._exit(1)` call is a reasonable fallback measure. Looks good, thanks Alex.
Description:
When the `stopping.event` is set for any reason (ex: a stream encountered an exception, it's been 24 hours since the capture restarted), it's possible for other streams to block the graceful shutdown indefinitely. Captures have stalled in these situations and required manual intervention to get unstuck.
This commit adds `enforce_shutdown` that triggers after `stopping.event` is set. It waits 30 minutes for the graceful shutdown to complete. If those 30 minutes elapse and the connector hasn't exited, the `TaskGroup`'s running tasks are signaled to cancel. If the connector is still running 5 minutes after attempting to cancel the tasks, then the CDK forces an exit with `os._exit(1)` as a last resort.
Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
Tested on a development stack. Confirmed that: