Fork the sending of file chunks during recovery #74164

DaveCTurner merged 1 commit into elastic:master
Conversation
Today if sending file chunks is CPU-bound (e.g. when using compression) then we tend to concentrate all that work onto relatively few threads, even if `indices.recovery.max_concurrent_file_chunks` is increased. With this commit we fork the transmission of each chunk onto its own thread so that the CPU-bound work can happen in parallel.
Pinging @elastic/es-distributed (Team:Distributed)
I benchmarked this with a 2.7GB (2757508192B) single-shard index being recovered between two nodes both running on my localhost, so network performance was effectively infinite. Here are the recovery times in milliseconds; "no fork" being today's behaviour and "fork" being the behaviour with this PR applied.
The cases without compression aren't CPU-bound so it didn't make much difference, but when compression is enabled it has a pretty significant effect. I also increased the max for the concurrent chunks setting to 8 since I was still seeing improvements at today's max of 5.
original-brownbear left a comment:
LGTM, but I do wonder if compression has to be this expensive, this seems broken somehow.
```java
// Fork the actual sending onto a separate thread so we can send them concurrently even if CPU-bound (e.g. using compression).
// The AsyncIOProcessor and MultiFileWriter both concentrate their work onto fewer threads if possible, but once we have
// chunks to send we want to increase parallelism again.
threadPool.generic().execute(ActionRunnable.wrap(listener, l ->
```
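The fork-per-chunk idea can be sketched in isolation. The following is a hypothetical, simplified illustration, not the actual Elasticsearch code: `ChunkSender`, `compressedSize`, and `sendAll` are invented names. Each chunk's CPU-bound DEFLATE work is submitted as its own task to a shared pool, so different chunks compress in parallel instead of serializing on one thread.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.stream.Collectors;
import java.util.zip.Deflater;

// Hypothetical sketch, not the actual RecoverySourceHandler code: each file
// chunk is compressed ("sent") in its own task on a shared pool, so the
// CPU-bound DEFLATE work for different chunks runs in parallel.
public class ChunkSender {

    // Compress one chunk with DEFLATE and return the compressed size.
    static int compressedSize(byte[] chunk) {
        Deflater deflater = new Deflater();
        deflater.setInput(chunk);
        deflater.finish();
        byte[] out = new byte[chunk.length * 2 + 64];
        int n = 0;
        while (!deflater.finished()) {
            n += deflater.deflate(out, n, out.length - n);
        }
        deflater.end();
        return n;
    }

    // Fork one task per chunk, mirroring the PR's "fork the transmission of
    // each chunk onto its own thread"; returns total compressed bytes.
    static int sendAll(List<byte[]> chunks, ExecutorService pool) {
        List<CompletableFuture<Integer>> futures = chunks.stream()
            .map(c -> CompletableFuture.supplyAsync(() -> compressedSize(c), pool))
            .collect(Collectors.toList());
        return futures.stream().mapToInt(CompletableFuture::join).sum();
    }
}
```

With a fixed pool of N threads this gives up to N chunks compressing concurrently, which is the effect `indices.recovery.max_concurrent_file_chunks` is meant to allow.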
Do you know why compression is so expensive (i.e. which part of it?). I wonder if we could get an improvement out of increasing the chunk size if most of the cost is the per compression start/end overhead (maybe there's something to improve there as well).
I think it's just that DEFLATE is expensive, especially if the data is already deflated (as it is for stored fields, which make up the bulk of this index). `pv _1t.fdt | gzip > /dev/null` indicates my machine can only sustain 25MiB/s on a single thread, which is about the performance we see with no parallelism.
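The already-deflated point can be demonstrated with `java.util.zip.Deflater`. This is a small standalone demo (not from the PR) that approximates already-compressed stored fields with random bytes, which are likewise incompressible: DEFLATE spends the same CPU but recovers essentially nothing.

```java
import java.util.zip.Deflater;

// Standalone demo: DEFLATE barely shrinks incompressible input (random bytes,
// a stand-in for already-deflated stored fields) yet still pays the full CPU
// cost, while highly redundant input (zeros) shrinks dramatically.
public class DeflateDemo {

    static int deflatedSize(byte[] input) {
        Deflater d = new Deflater();
        d.setInput(input);
        d.finish();
        byte[] out = new byte[input.length * 2 + 64];
        int n = 0;
        while (!d.finished()) {
            n += d.deflate(out, n, out.length - n);
        }
        d.end();
        return n;
    }
}
```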
Right now I remember us benchmarking DEFLATE before => nothing we can do but parallelize I guess :)
Thanks Armin!

