Skip to content

Fix race condition timeout decorator (bugfix)#1442

Merged
Hook25 merged 6 commits intomainfrom
fix_race_condition_timeout
Sep 3, 2024
Merged

Fix race condition timeout decorator (bugfix)#1442
Hook25 merged 6 commits intomainfrom
fix_race_condition_timeout

Conversation

@Hook25
Copy link
Copy Markdown
Collaborator

@Hook25 Hook25 commented Aug 29, 2024

Description

The timeout decorator seldomly fails to receive either a value or an exception from the subprocess. I am currently unable to reproduce this issue in any way but I've seen it happen on a very slow device running core20. Either way this improves the implementation in several key ways:

  1. Don't check empty before getting from the queue: this is an anti-pattern and not guaranteed to work
  2. Don't put a value we can determine is un-picklable in the queue: this crashes the writer thread of the queue with no error for Queue.
  3. Handle the case where the value is not picklable: this was handled the same as a normal exception, but it is a special case where we should really crash the program with a clear message
  4. Bounded wait for gets: If the parent fails to receive anything after the child has died, we can't block the process on get. Rather what we do is raise a SystemExit to crash the parent and explain what happened
  5. Best effort retry on get: I'm not sure if this is what is happening but given that the decoding is done in a separate thread, it could be that the first get is simply being called before it is done, so lets retry. This costs us (in those situations) at most 200ms, we can afford that
  6. Close the queue in child process: This should also make the flush of the child thread more deterministic in some situations (or so some claim, from the history of the queue implementation, it is unclear to me how this could be the case)

This also contains a driveby fix to the issue described in this PR (removal of the -- before the pid): #1441

And a new test to prove that the solution in the decorator works (while using the builtin kill or any signal wont, to the best of my knowledge)

Resolved issues

Fixes: CHECKBOX-1544

Documentation

New comments to explain why some things are done (including why subprocess.run is used)

Tests

New test that proves the kill behaviour works. Also new coverage for all the new lines

Note: some lines appear uncovered because the test runs them in a separate process

@codecov
Copy link
Copy Markdown

codecov bot commented Aug 29, 2024

Codecov Report

Attention: Patch coverage is 57.89474% with 16 lines in your changes missing coverage. Please review.

Project coverage is 45.70%. Comparing base (d725523) to head (e0cdab7).
Report is 123 commits behind head on main.

Files with missing lines Patch % Lines
...eckbox-support/checkbox_support/helpers/timeout.py 57.89% 16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1442      +/-   ##
==========================================
- Coverage   45.72%   45.70%   -0.02%     
==========================================
  Files         367      367              
  Lines       39134    39158      +24     
  Branches     6618     6622       +4     
==========================================
+ Hits        17895    17899       +4     
- Misses      20565    20584      +19     
- Partials      674      675       +1     
Flag Coverage Δ
checkbox-support 59.51% <57.89%> (-0.21%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Hook25 Hook25 force-pushed the fix_race_condition_timeout branch from bf61458 to 8ce0d84 Compare August 29, 2024 12:34
@eugene-yujinwu
Copy link
Copy Markdown
Contributor

eugene-yujinwu commented Aug 30, 2024

I test the timeout on my side with this fix, the -- issue is gone. The test log is as below:

> 1. Press continue 
> 2. Wait until the message "INSERT NOW" is printed on the screen
> 3. Connect USB 2.0 storage device
> 
> Pick an action
>     => press ENTER to continue
>   c => add a comment
>   s => skip this job
>   q => save the session and quit
> [csq]: 
> ... 8< -------------------------------------------------------------------------
> 
> 
> INSERT NOW
> 
> 
> Timeout: 30 seconds
> Traceback (most recent call last):
>   File "/usr/bin/checkbox-support-run_watcher", line 8, in <module>
>     sys.exit(main())
>   File "/usr/lib/python3/dist-packages/checkbox_support/helpers/timeout.py", line 137, in _f
>     return run_with_timeout(f, timeout_s, *args, **kwargs)
>   File "/usr/lib/python3/dist-packages/checkbox_support/helpers/timeout.py", line 107, in run_with_timeout
>     raise TimeoutError("Task unable to finish in {}s".format(timeout_s))
> TimeoutError: Task unable to finish in 30s
> ------------------------------------------------------------------------- >8 ---
> Outcome: job failed
> Finalizing session that hasn't been submitted anywhere: checkbox-run-2024-08-30T02.55.38
> ==================================[ Results ]===================================
>  ☒ : USB 2.0 storage device insertion detected

@eugene-yujinwu
Copy link
Copy Markdown
Contributor

eugene-yujinwu commented Aug 30, 2024

Some more test :

Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> ppid=683324
>>> subprocess.run("kill -9  -{}".format(ppid), shell=True)
/bin/sh: 1: kill: No such process

CompletedProcess(args='kill -9  -683324', returncode=1)

Copy link
Copy Markdown
Collaborator

@pieqq pieqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to review this, because I'm not familiar with the multiprocessing module (nor queuing, nor threading... :)).

I did some tests locally, adding a @timeout decorator to a function spawning other long running processes, and it seems to work as expected.

This looks OK to me, with some documentation improvements and tiny typos.

A few things that are related to this PR, but not part of it:

  1. Usage for this decorator is not documented anywhere. At the very least, there should be a very simple example at the top of the timeout.py file.
  2. Using @timeout decorator (without passing the duration wanted) results in some weird outcome depending when you use it. Maybe there should be a check to make sure a duration is passed (and it's a number)
  3. Nice stuff offered by checkbox_support should be documented in the official doc, otherwise there is very little chance outside users and test authors use them.

Hook25 and others added 2 commits September 3, 2024 09:39
Co-authored-by: Pierre Equoy <pierre.equoy@canonical.com>
Copy link
Copy Markdown
Collaborator

@pieqq pieqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback and the fixes!

@Hook25 Hook25 merged commit d3078dd into main Sep 3, 2024
@Hook25 Hook25 deleted the fix_race_condition_timeout branch September 3, 2024 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants