Fix deadlock issues by implementing timeout handling for git operations#3420
Fix deadlock issues by implementing timeout handling for git operations#3420MoralCode merged 5 commits intoaugurlabs:mainfrom
Conversation
Signed-off-by: Shlok Gilda <gildashlok@hotmail.com>
|
holy git ops batman - i didnt think there were so many places we did git subprocesses lol |
|
Aah, my bad! I meant just the PR description. |
no worries, just checking! cant wait to find time to finish the policy part of #3371 so we can have a better template for people to use |
|
Outstanding contribution! Reviewing . |
MoralCode
left a comment
There was a problem hiding this comment.
Overall i love this change, just figure we should deduplicate all this process-running infrastructure into a single function that gets reused
… facade operation Signed-off-by: Shlok Gilda <gildashlok@hotmail.com>
MoralCode
left a comment
There was a problem hiding this comment.
Overall this looks good! just a small adjustment to further deduplicate the various code branches within the run_git_command helper
Signed-off-by: Shlok Gilda <gildashlok@hotmail.com>
Co-authored-by: Adrian Edwards <17362949+MoralCode@users.noreply.github.com> Signed-off-by: Shlok Gilda <gildashlok@hotmail.com>
|
@shlokgilda has been thoroughly testing this and has confirmed that the facade workers are flowing. I trust this as far as testing goes and am going to mark this as ready |
|
@MoralCode : Re0running podman end to end, which failed. |
|
rerunning again. the test passed (all expected log lines seen in the run) but was marked failed because its shutdown took too long and it got killed |
I think your prior comment and the ready tag signal its ready to merge, but since there were commits since your last review @MoralCode , I am double checking. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if return_code == 0: | ||
| parent_commits = set(stdout.split(os.linesep)) | ||
| else: | ||
| parent_commits = set() # Return empty set on timeout or error |
There was a problem hiding this comment.
Returning an empty set on timeout or error could lead to incorrect behavior in downstream code. When get_parent_commits_set returns an empty set, the calling functions (trim_commits_post_analysis_facade_task and analyze_commits_in_parallel) will compute missing_commits = parent_commits - existing_commits or trimmed_commits = existing_commits - parent_commits, which could incorrectly classify all existing commits as trimmed.
Consider either:
- Raising an exception on timeout/error so the calling code can handle it appropriately
- Returning None to distinguish between "no commits" (valid empty set) and "error occurred" (None)
- Adding a check in the calling code to verify the git log succeeded before proceeding
The error is already logged in run_git_command, but silently returning an empty set masks the failure from the calling code.
| stdout=subprocess.PIPE, shell=True) | ||
|
|
||
| parent_commits = set(parents.stdout.read().decode("utf-8",errors="ignore").split(os.linesep)) | ||
| def get_parent_commits_set(absolute_repo_path, facade_helper, logger=None): |
There was a problem hiding this comment.
[nitpick] The logger parameter is defined but never used in the function. Consider removing it if it's not needed, or use it to log errors when return_code != 0 instead of silently returning an empty set.
|
I havent re-reviewed to mark it approved because I havent explicitly tested it personally, but i consider all my review comments to be resolved and I trust @shlokgilda's testing |
Description
Fixes subprocess pipe deadlock in facade git operations that caused tasks to hang indefinitely on large repositories.
The Problem:
Facade tasks were getting stuck for 12+ hours on
git checkoutoperations when processing large repos (600+ MB pack files). The issue was a subprocess pipe buffer deadlock - git would write more than 64KB of output to stdout/stderr, fill the OS pipe buffer, and block waiting for someone to read it, while Python blocked waiting for git to finish. Neither could proceed.Additional details in comments of issue #3319.
The Fix:
Popen().wait()withsubprocess.run()- Modern API that handles process lifecycle correctlystdout=DEVNULL, stderr=DEVNULL- Prevents pipe creation entirely by redirecting to/dev/null. Git output is immediately discarded by the kernel, so no buffer can fill up.git clone(large repos)git pullandgit checkoutgit resetandgit cleancheck=False- Manual error handling for retry logic and custom error messagesWhy
subprocess.run()is sufficient:.wait()for us (blocking call)capture_output=True, internally calls.communicate()stdout=DEVNULL, no communication needed (no pipes to read)Testing:
Notes for Reviewers
The key insight is that
subprocess.run()withstdout=DEVNULLcreates no pipes at all - output goes directly to the kernel's null device. This makes deadlock impossible regardless of output size. The timeouts are conservative to allow legitimate long operations while preventing infinite hangs.Signed commits
GenAI Disclosure: Used Claude Code to draft this PR.