Restore from Individual Shard Snapshot Files in Parallel#48110
Merged: original-brownbear merged 49 commits into elastic:master from original-brownbear:async-restore on Oct 30, 2019
Conversation
Collaborator
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
The code here was needlessly complicated when it enqueued all file uploads up-front. Instead, we can go with a cleaner worker + queue pattern here by taking the max parallelism from the threadpool info. Also, I slightly simplified the rethrow and listener handling (a step listener is pointless when you add the callback in the next line), since I noticed that we were needlessly rethrowing in the same code path and that wasn't worth a separate PR.
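A minimal sketch of the worker + queue pattern described above, assuming a shared queue of file tasks and a worker count bounded by the pool size (in Elasticsearch the bound would come from the threadpool info; all names here are illustrative, not the actual restore code):

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkerQueueSketch {
    public static void main(String[] args) throws InterruptedException {
        // Bounded parallelism: rather than enqueuing one task per file
        // up-front, spawn at most maxParallelism workers that drain a
        // shared queue until it is empty.
        final int maxParallelism = 4; // stand-in for the threadpool size
        final Queue<String> files = new ConcurrentLinkedQueue<>(
            List.of("file-0", "file-1", "file-2", "file-3", "file-4", "file-5"));
        final AtomicInteger processed = new AtomicInteger();

        final ExecutorService executor = Executors.newFixedThreadPool(maxParallelism);
        for (int i = 0; i < maxParallelism; i++) {
            executor.execute(() -> {
                String file;
                // Each worker keeps pulling work until the queue is drained.
                while ((file = files.poll()) != null) {
                    processed.incrementAndGet(); // stand-in for the actual file transfer
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("processed=" + processed.get()); // prints "processed=6"
    }
}
```

The key property is that the number of in-flight transfers never exceeds the worker count, which is what bounds concurrency without tracking per-task state up-front.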
dnhatn (Member) reviewed on Oct 29, 2019 and left a comment:
I understand the patch, and it looks great to me. However, I am not familiar enough with the codebase to LGTM. Thanks Armin.
Contributor
Author
Jenkins run elasticsearch-ci/2 (unrelated ML failure)
Contributor
Author
Thanks all!
original-brownbear added a commit that referenced this pull request on Oct 30, 2019
This was referenced Nov 1, 2019
original-brownbear added a commit that referenced this pull request on Nov 1, 2019
With the changes in #48110 there is no longer any need to block a generic thread while waiting for the multi-file transfer in `CcrRepository`.
original-brownbear added a commit that referenced this pull request on Nov 1, 2019
original-brownbear added a commit that referenced this pull request on Nov 2, 2019
Follow-up to #48110, cleaning up the redundant future uses that were left over from that change.
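A hedged sketch of what removing a redundant future typically looks like, using `CompletableFuture` for illustration (the hypothetical `blockingStyle`/`listenerStyle` names and shapes are not Elasticsearch's actual APIs): instead of blocking a thread on `get()`, the caller registers a callback and returns immediately.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public class FutureCleanupSketch {
    // Before: the caller blocks a thread until the async operation completes.
    static String blockingStyle(CompletableFuture<String> future) throws Exception {
        return future.get(); // ties up a thread while waiting
    }

    // After: the caller registers a listener and returns immediately.
    static void listenerStyle(CompletableFuture<String> future, Consumer<String> onResponse) {
        future.thenAccept(onResponse);
    }

    public static void main(String[] args) {
        CompletableFuture<String> future = new CompletableFuture<>();
        StringBuilder result = new StringBuilder();
        listenerStyle(future, result::append); // no thread is parked here
        future.complete("restored");           // callback fires on completion
        System.out.println(result);            // prints "restored"
    }
}
```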
original-brownbear added a commit that referenced this pull request on Nov 2, 2019
This was referenced Feb 3, 2020
The code in this PR is intended to illustrate the amount of change necessary to allow for faster restores and to demonstrate the required code changes, rather than for review, as it does not limit concurrency in any way.
In #42791 we fixed the order in which files are uploaded to snapshots, making snapshots upload the individual files of each shard in parallel while proceeding shard by shard when ordering the uploads across the snapshot's shards.
For restores from snapshots, however, we currently run all shards in parallel, using only a single thread per shard for downloading files. This is needlessly inefficient and significantly slows down restores from cloud repositories.
I think we should move to the same ordering for restores: parallelize by files and order by shards.
This should significantly speed up the restore of each shard (especially shards with many files), and it should also speed up the restore process end-to-end: if we order by shards, the first primaries complete more quickly, so replica recovery can run in parallel with the remainder of the restore more efficiently.
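The "parallelize by files, order by shards" scheme can be sketched as follows. This is an illustrative toy, not Elasticsearch's restore code: shards are restored one after another, but within each shard all files are fetched concurrently, so the first primaries finish early.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OrderedShardRestoreSketch {
    public static void main(String[] args) throws InterruptedException {
        // Shards are visited in a fixed order (LinkedHashMap preserves it).
        Map<String, List<String>> shards = new LinkedHashMap<>();
        shards.put("shard-0", List.of("seg-a", "seg-b", "seg-c"));
        shards.put("shard-1", List.of("seg-d", "seg-e"));

        List<String> restoredShards = new ArrayList<>();
        for (Map.Entry<String, List<String>> shard : shards.entrySet()) {
            // All files of the current shard are fetched in parallel ...
            ExecutorService pool = Executors.newFixedThreadPool(shard.getValue().size());
            List<String> downloaded = new CopyOnWriteArrayList<>();
            for (String file : shard.getValue()) {
                pool.execute(() -> downloaded.add(file)); // stand-in for a download
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
            // ... and only then does the next shard start, so early shards
            // complete quickly and replica recovery can begin sooner.
            restoredShards.add(shard.getKey());
        }
        System.out.println("restored=" + restoredShards);
    }
}
```

The contrast with the status quo described above is that the old scheme interleaves all shards with one thread each, so every shard finishes late; ordering by shards front-loads completions while keeping total download parallelism the same.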