[9.4](backport #50137) filestream: Fix shutdown logic and improve benchmark #50184
Open
mergify[bot] wants to merge 2 commits into 9.4 from
Conversation
Fix filestream benchmark correctness and shutdown, add high-file-count sub-benchmarks

**Shutdown fix (`notifyObserver`):** During shutdown the watcher goroutine that drains `notifyChan` exits before harvesters finish. The old blocking send in `notifyObserver` stalled every closing harvester until the task group's 1-minute timeout expired. Replace the blocking send with a `select` on `canceler.Done()` so harvesters unblock immediately when the input is cancelled.

**Benchmark fixes (correctness):** The inode-mode benchmarks were silently broken: `file_identity` defaulted to `fingerprint` even though `prospector.scanner.fingerprint.enabled` was `false`, so every file received the same empty-fingerprint identity and only one harvester was started out of N files. Explicitly set `file_identity.native` / `file_identity.fingerprint` to match the scanner mode so each file gets its own identity and harvester.

**Benchmark fixes (hangs / timeouts):** Without `close.reader.on_eof: true` harvesters waited for more data after EOF, preventing the pipeline from closing until the 60-second `task.Group.Stop` timeout expired. Combined with a 1-second `check_interval` this made multi-file benchmarks extremely slow. Set `close.reader.on_eof: true` and lower `check_interval` to 100 ms so harvesters close promptly.

**Benchmark refactoring:**

- Consolidate duplicated single-file / multi-file / inode / fingerprint sub-benchmarks into a table-driven loop.
- Add 1 000-file and 10 000-file fingerprint sub-benchmarks to stress per-file overhead (logger cloning, reader pipeline setup, fingerprint I/O).
- Replace deprecated logging setup with local loggers.
- Only buffer the event channel when events are actually collected, avoiding a 10 000-slot channel allocation in benchmarks that discard events.

(cherry picked from commit 2977528)
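The two configuration fixes translate into benchmark input settings along these lines — a hedged sketch using the filestream option names mentioned above; the exact surrounding benchmark config is not shown in this PR page:

```yaml
# Inode-mode sub-benchmarks: pin the identity to match the scanner mode.
# Otherwise file_identity defaults to fingerprint while fingerprinting is
# disabled, and every file collapses onto one empty-fingerprint identity.
file_identity.native: ~
prospector.scanner.fingerprint.enabled: false

# Close harvesters promptly instead of waiting past EOF for more data,
# and rescan quickly so multi-file benchmarks finish fast.
close.reader.on_eof: true
prospector.scanner.check_interval: 100ms
```

Fingerprint-mode sub-benchmarks would instead set `file_identity.fingerprint` together with `prospector.scanner.fingerprint.enabled: true`.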
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
This pull request has not been merged yet. Could you please review and merge it @orestisfl? 🙏
Checklist

- I have made corresponding changes to the documentation
- I have made corresponding change to the default configuration files
- … `stresstest.sh` script to run them under stress conditions and race detector to verify their stability.
- I have added an entry in `./changelog/fragments` using the changelog tool.

How to test this PR locally