Skip to content

Fix deadlock issues by implementing timeout handling for git operations#3420

Merged
MoralCode merged 5 commits intoaugurlabs:mainfrom
shlokgilda:fix/git-deadlock-issue
Dec 11, 2025
Merged

Fix deadlock issues by implementing timeout handling for git operations#3420
MoralCode merged 5 commits intoaugurlabs:mainfrom
shlokgilda:fix/git-deadlock-issue

Conversation

@shlokgilda
Copy link
Copy Markdown
Collaborator

Description
Fixes subprocess pipe deadlock in facade git operations that caused tasks to hang indefinitely on large repositories.

The Problem:
Facade tasks were getting stuck for 12+ hours on git checkout operations when processing large repos (600+ MB pack files). The issue was a subprocess pipe buffer deadlock - git would write more than 64KB of output to stdout/stderr, fill the OS pipe buffer, and block waiting for someone to read it, while Python blocked waiting for git to finish. Neither could proceed.

Additional details in comments of issue #3319.

The Fix:

  1. Replaced Popen().wait() with subprocess.run() - Modern API that handles process lifecycle correctly
  2. Added stdout=DEVNULL, stderr=DEVNULL - Prevents pipe creation entirely by redirecting to /dev/null. Git output is immediately discarded by the kernel, so no buffer can fill up.
  3. Added conservative timeouts - Protects against truly hung operations:
    • 2 hours for git clone (large repos)
    • 10 minutes for git pull and git checkout
    • 5 minutes for git reset and git clean
    • 1 minute for branch/remote queries
  4. Used check=False - Manual error handling for retry logic and custom error messages

Why subprocess.run() is sufficient:

  • Internally calls .wait() for us (blocking call)
  • With capture_output=True, internally calls .communicate()
  • With stdout=DEVNULL, no communication needed (no pipes to read)
  • Cleaner, safer API that prevents the deadlock pattern entirely

Testing:

  • Verified all subprocess calls now have appropriate timeouts
  • Confirmed no orphaned exception handlers
  • Pattern follows Python subprocess best practices

Notes for Reviewers
The key insight is that subprocess.run() with stdout=DEVNULL creates no pipes at all - output goes directly to the kernel's null device. This makes deadlock impossible regardless of output size. The timeouts are conservative to allow legitimate long operations while preventing infinite hangs.

Signed commits

  • Yes, I signed my commits.

GenAI Disclosure: Used Claude Code to draft this PR.

Signed-off-by: Shlok Gilda <gildashlok@hotmail.com>
@MoralCode
Copy link
Copy Markdown
Collaborator

holy git ops batman - i didnt think there were so many places we did git subprocesses lol

@MoralCode
Copy link
Copy Markdown
Collaborator

GenAI Disclosure: Used Claude Code to draft this PR.
Do you mean just the PR description? or the pr diff contents itself?

@MoralCode MoralCode added disclosed-ai Label for contributions that contain disclosed, reviewed, or responsibly-submitted AI content. discussion Seeking active feedback, usually for items under active development labels Nov 19, 2025
@shlokgilda
Copy link
Copy Markdown
Collaborator Author

shlokgilda commented Nov 19, 2025

GenAI Disclosure: Used Claude Code to draft this PR.
Do you mean just the PR description? or the pr diff contents itself?

Aah, my bad! I meant just the PR description.

@MoralCode
Copy link
Copy Markdown
Collaborator

Aah, my bad! I meant just the PR description.

no worries, just checking! cant wait to find time to finish the policy part of #3371 so we can have a better template for people to use

@sgoggins
Copy link
Copy Markdown
Collaborator

Outstanding contribution! Reviewing .

This comment was marked as spam.

Copy link
Copy Markdown
Collaborator

@MoralCode MoralCode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall i love this change, just figure we should deduplicate all this process-running infrastructure into a single function that gets reused

Comment thread augur/tasks/git/util/facade_worker/facade_worker/repofetch.py Outdated
… facade operation

Signed-off-by: Shlok Gilda <gildashlok@hotmail.com>
Copy link
Copy Markdown
Collaborator

@MoralCode MoralCode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good! just a small adjustment to further deduplicate the various code branches within the run_git_command helper

Comment thread augur/tasks/git/util/facade_worker/facade_worker/config.py Outdated
Signed-off-by: Shlok Gilda <gildashlok@hotmail.com>
Comment thread augur/tasks/git/util/facade_worker/facade_worker/config.py Outdated
Co-authored-by: Adrian Edwards <17362949+MoralCode@users.noreply.github.com>
Signed-off-by: Shlok Gilda <gildashlok@hotmail.com>
@MoralCode
Copy link
Copy Markdown
Collaborator

@shlokgilda has been thoroughly testing this and has confirmed that the facade workers are flowing.

I trust this as far as testing goes and am going to mark this as ready

@MoralCode MoralCode added the ready Items tested and seeking additional approvals or a merge. Usually for items under active development label Dec 3, 2025
@sgoggins
Copy link
Copy Markdown
Collaborator

sgoggins commented Dec 6, 2025

@MoralCode : Re0running podman end to end, which failed.

@MoralCode
Copy link
Copy Markdown
Collaborator

rerunning again. the test passed (all expected log lines seen in the run) but was marked failed because its shutdown took too long and it got killed

@sgoggins
Copy link
Copy Markdown
Collaborator

sgoggins commented Dec 9, 2025

rerunning again. the test passed (all expected log lines seen in the run) but was marked failed because its shutdown took too long and it got killed

I think your prior comment and the ready tag signal its ready to merge, but since there were commits since your last review @MoralCode , I am double checking.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +120 to +123
if return_code == 0:
parent_commits = set(stdout.split(os.linesep))
else:
parent_commits = set() # Return empty set on timeout or error
Copy link

Copilot AI Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning an empty set on timeout or error could lead to incorrect behavior in downstream code. When get_parent_commits_set returns an empty set, the calling functions (trim_commits_post_analysis_facade_task and analyze_commits_in_parallel) will compute missing_commits = parent_commits - existing_commits or trimmed_commits = existing_commits - parent_commits, which could incorrectly classify all existing commits as trimmed.

Consider either:

  1. Raising an exception on timeout/error so the calling code can handle it appropriately
  2. Returning None to distinguish between "no commits" (valid empty set) and "error occurred" (None)
  3. Adding a check in the calling code to verify the git log succeeded before proceeding

The error is already logged in run_git_command, but silently returning an empty set masks the failure from the calling code.

Copilot uses AI. Check for mistakes.
stdout=subprocess.PIPE, shell=True)

parent_commits = set(parents.stdout.read().decode("utf-8",errors="ignore").split(os.linesep))
def get_parent_commits_set(absolute_repo_path, facade_helper, logger=None):
Copy link

Copilot AI Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The logger parameter is defined but never used in the function. Consider removing it if it's not needed, or use it to log errors when return_code != 0 instead of silently returning an empty set.

Copilot uses AI. Check for mistakes.
@MoralCode
Copy link
Copy Markdown
Collaborator

I havent re-reviewed to mark it approved because I havent explicitly tested it personally, but i consider all my review comments to be resolved and I trust @shlokgilda's testing

Copy link
Copy Markdown
Collaborator

@Ulincsys Ulincsys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Collaborator

@IsaacMilarky IsaacMilarky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@MoralCode MoralCode merged commit 7dcc282 into augurlabs:main Dec 11, 2025
10 checks passed
@shlokgilda shlokgilda deleted the fix/git-deadlock-issue branch December 11, 2025 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

disclosed-ai Label for contributions that contain disclosed, reviewed, or responsibly-submitted AI content. discussion Seeking active feedback, usually for items under active development ready Items tested and seeking additional approvals or a merge. Usually for items under active development

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants