ktesting: tone down warning about leaked test goroutine#401
ktesting: tone down warning about leaked test goroutine#401k8s-ci-robot merged 1 commit intokubernetes:mainfrom
Conversation
|
This issue is currently awaiting triage. If klog contributors determine this is a relevant issue, they will accept it by applying the The DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Let's fix the linter issues first, see #402 |
| logger.WithCallDepth(1).Error(nil, "WARNING: test kept at least one goroutine running after test completion", "callstack", string(dbg.Stacks(false))) | ||
| l.shared.goroutineWarningDone = true | ||
| duration := time.Since(l.shared.stopTime) | ||
| if duration > stopGracePeriod { |
There was a problem hiding this comment.
This seems will bury the real goroutine leakage if the duration is less than stopGracePeriod, do I understand right here?
Another point I can't get here is if the test completes, why will it wait for another 10s?
There was a problem hiding this comment.
If the "leaked" goroutine stops after less than stopGracePeriod, then it wasn't really a leak. It was "just" a test that didn't block and wait for goroutines at the end. This is very common.
There was a problem hiding this comment.
Totally misunderstood the meaning here. I'm not quite understand how it works but I still have a silly question like: Will we wait and detect the goroutine leakage after test completes? Or why the duration from stop will be greater than stopGracePeriod.
There was a problem hiding this comment.
No, we don't wait. That is something that uber/goleak does, but not this code. This code is merely reporting on leaked goroutines if and when it detects them, which is when they keep logging.
The stopGracePeriod is the grace period that goroutines have before their log output triggers this warning. I can rename it if you have a better suggestion.
There was a problem hiding this comment.
I recalled goleak has maxRetries, then can't we just configure the goleak? It's configurable now. Both the retry times and sleep duration. I think somehow the same as stopGracePeriod. Will this lead to other problems?
There was a problem hiding this comment.
keep the warning for goroutines that are more likely to be real leaks while not printing it for goroutines which terminate quickly after test completion.
Definitely agree.
We cannot depend on goleak here
Yes for klog... I mean for callers, like test framework. I suggested in this way is just because the goroutine leakage detection duration is mostly decided by goleak, the maxRetries(default to 20), maxSleep(default to 100ms) specifically, the duration ~= maxRetries * (maxSleep + filterStacks), have no idea whether 10s fits here. However, the feature is reverted... uber-go/goleak#116, so no way to configure now.
As a user of klog, happy with either way to avoid confusion.
There was a problem hiding this comment.
Definitely agree.
So LGTM?
I mean for callers, like test framework.
Leak detection with goleak only works when tests run sequentially or in TestMain. I'm not sure how useful it is in an E2E test suite, because the actual code under test runs elsewhere. For integration and unit tests, individual test owners have to integrate goleak. Some apiserver-based tests have it.
have no idea whether 10s fits here
So this is about how klog will interact with goleak. I don't have a strong opinion here. I picked 10 seconds because it should be long enough to no produce "false positives" for the "I am shutting down after test completion" goroutines and short enough to give feedback for others. Because it's not causing a test failure, it can be lower than the goleak timeouts.
There was a problem hiding this comment.
are these 10s added to the test duration? 10s sounds like a lot to me
There was a problem hiding this comment.
No, they are not added. At the time that this gets called, the test has already completed.
|
/assign @dims For approval. A new release for Kubernetes 1.31 also makes sense. |
|
Hmm, we want a few more things for a new release, like #404 |
|
/retest Does that work for GitHub actions? The linter wasn't happy, for some reason. |
|
Linter failures fixed in main. |
| if !l.shared.goroutineWarningDone { | ||
| logger.WithCallDepth(1).Error(nil, "WARNING: test kept at least one goroutine running after test completion", "callstack", string(dbg.Stacks(false))) | ||
| l.shared.goroutineWarningDone = true | ||
| duration := time.Since(l.shared.stopTime) |
There was a problem hiding this comment.
does this access need to be locked?
There was a problem hiding this comment.
Yes! Good catch, thanks.
/lgtm cancel
There was a problem hiding this comment.
Hrrm, no. fallbackLogger gets called while the mutex is locked. This wasn't clear here, so I've added a comment explaining that (https://github.com/kubernetes/klog/compare/83145e470293dc67bede46353012cededd97e031..ed62027e604261af3e9c42c7a50092a0db074f19).
83145e4 to
ed62027
Compare
|
/hold For second approval. |
Not terminating immediately is fairly normal: many controllers log "terminating" messages while they shut down, which often is right after test completion, if that is when the test cancels the context and then doesn't wait for goroutines (which is often not possible). Therefore the warning now only gets printed if that happens after more than 10 seconds since test completion, and then the warning is only printed as an info message, not an error.
ed62027 to
cd24012
Compare
|
/lgtm |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: aojea, dims, kerthcet The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/hold cancel |
[](https://renovatebot.com) This PR contains the following updates: | Package | Change | Age | Adoption | Passing | Confidence | |---|---|---|---|---|---| | [k8s.io/klog/v2](https://togithub.com/kubernetes/klog) | `v2.120.1` -> `v2.130.0` | [](https://docs.renovatebot.com/merge-confidence/) | [](https://docs.renovatebot.com/merge-confidence/) | [](https://docs.renovatebot.com/merge-confidence/) | [](https://docs.renovatebot.com/merge-confidence/) | --- ### Release Notes <details> <summary>kubernetes/klog (k8s.io/klog/v2)</summary> ### [`v2.130.0`](https://togithub.com/kubernetes/klog/releases/tag/v2.130.0): Prepare klog release for Kubernetes v1.31 (Take 1) [Compare Source](https://togithub.com/kubernetes/klog/compare/v2.120.1...v2.130.0) #### What's Changed - chore(\*): fix spelling of Intel Corporation by [@​jsoref](https://togithub.com/jsoref) in [https://github.com/kubernetes/klog/pull/399](https://togithub.com/kubernetes/klog/pull/399) - build: fix some linter warnings by [@​pohly](https://togithub.com/pohly) in [https://github.com/kubernetes/klog/pull/402](https://togithub.com/kubernetes/klog/pull/402) - examples: fix linter warning by [@​pohly](https://togithub.com/pohly) in [https://github.com/kubernetes/klog/pull/406](https://togithub.com/kubernetes/klog/pull/406) - Do not acquire lock for file.Sync() fsync call by [@​1978629634](https://togithub.com/1978629634) in [https://github.com/kubernetes/klog/pull/404](https://togithub.com/kubernetes/klog/pull/404) - ktesting: tone down warning about leaked test goroutine by [@​pohly](https://togithub.com/pohly) in [https://github.com/kubernetes/klog/pull/401](https://togithub.com/kubernetes/klog/pull/401) #### New Contributors - [@​jsoref](https://togithub.com/jsoref) made their first contribution in [https://github.com/kubernetes/klog/pull/399](https://togithub.com/kubernetes/klog/pull/399) - [@​1978629634](https://togithub.com/1978629634) made their first contribution in [https://github.com/kubernetes/klog/pull/404](https://togithub.com/kubernetes/klog/pull/404) **Full Changelog**: kubernetes/klog@v2.120.1...v2.130.0 </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/open-telemetry/opentelemetry-go-contrib). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy40MTAuMSIsInVwZGF0ZWRJblZlciI6IjM3LjQxMC4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJTa2lwIENoYW5nZWxvZyIsImRlcGVuZGVuY2llcyJdfQ==--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
[](https://renovatebot.com) This PR contains the following updates: | Package | Change | Age | Adoption | Passing | Confidence | |---|---|---|---|---|---| | [k8s.io/klog/v2](https://togithub.com/kubernetes/klog) | `v2.120.1` -> `v2.130.0` | [](https://docs.renovatebot.com/merge-confidence/) | [](https://docs.renovatebot.com/merge-confidence/) | [](https://docs.renovatebot.com/merge-confidence/) | [](https://docs.renovatebot.com/merge-confidence/) | --- > [!WARNING] > Some dependencies could not be looked up. Check the Dependency Dashboard for more information. --- ### Release Notes <details> <summary>kubernetes/klog (k8s.io/klog/v2)</summary> ### [`v2.130.0`](https://togithub.com/kubernetes/klog/releases/tag/v2.130.0): Prepare klog release for Kubernetes v1.31 (Take 1) [Compare Source](https://togithub.com/kubernetes/klog/compare/v2.120.1...v2.130.0) #### What's Changed - chore(\*): fix spelling of Intel Corporation by [@​jsoref](https://togithub.com/jsoref) in [https://github.com/kubernetes/klog/pull/399](https://togithub.com/kubernetes/klog/pull/399) - build: fix some linter warnings by [@​pohly](https://togithub.com/pohly) in [https://github.com/kubernetes/klog/pull/402](https://togithub.com/kubernetes/klog/pull/402) - examples: fix linter warning by [@​pohly](https://togithub.com/pohly) in [https://github.com/kubernetes/klog/pull/406](https://togithub.com/kubernetes/klog/pull/406) - Do not acquire lock for file.Sync() fsync call by [@​1978629634](https://togithub.com/1978629634) in [https://github.com/kubernetes/klog/pull/404](https://togithub.com/kubernetes/klog/pull/404) - ktesting: tone down warning about leaked test goroutine by [@​pohly](https://togithub.com/pohly) in [https://github.com/kubernetes/klog/pull/401](https://togithub.com/kubernetes/klog/pull/401) #### New Contributors - [@​jsoref](https://togithub.com/jsoref) made their first contribution in [https://github.com/kubernetes/klog/pull/399](https://togithub.com/kubernetes/klog/pull/399) - [@​1978629634](https://togithub.com/1978629634) made their first contribution in [https://github.com/kubernetes/klog/pull/404](https://togithub.com/kubernetes/klog/pull/404) **Full Changelog**: kubernetes/klog@v2.120.1...v2.130.0 </details> --- ### Configuration 📅 **Schedule**: Branch creation - "on tuesday" (UTC), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/open-telemetry/opentelemetry-collector-contrib). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy40MTAuMSIsInVwZGF0ZWRJblZlciI6IjM3LjQxMC4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJkZXBlbmRlbmNpZXMiLCJyZW5vdmF0ZWJvdCJdfQ==--> --------- Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> Co-authored-by: opentelemetrybot <107717825+opentelemetrybot@users.noreply.github.com>
What this PR does / why we need it:
Not terminating immediately is fairly normal: many controllers log "terminating" messages while they shut down, which often is right after test completion, if that is when the test cancels the context and then doesn't wait for goroutines (which is often not possible).
Therefore the warning now only gets printed if that happens after more than 10 seconds since test completion, and then the warning is only printed as an info message, not an error.
Release note:
/cc @aojea @soltysh @kerthcet