Keep snapshot restore state and routing table in sync #20836

Merged
ywelsch merged 3 commits into elastic:master from ywelsch:fix/snap-restore-update-entries
Oct 12, 2016

Conversation

Contributor

@ywelsch ywelsch commented Oct 10, 2016

The snapshot restore state tracks information about shards being restored from a snapshot in the cluster state. For example, it records whether a shard has been successfully restored or whether restoring it was impossible due to a corruption of the snapshot. Recording these events is usually based on changes to the shard routing table, i.e., when a shard is started after a successful restore or failed after an unsuccessful one. Until now, there were two separate communication channels for transmitting recovery failure / success: one to update the routing table and one to update the restore state. This led to issues where a shard was failed but the restore state was not updated because of connection issues between the data and master nodes. In some rare situations, the restore state could then no longer be properly cleaned up by the master, making it impossible to start new restore operations. This change updates the routing table and the restore state in the same cluster state update so that both always stay in sync. It also eliminates the extra communication channel for restore operations and uses the standard cluster state listener mechanism to update restore listeners upon successful completion of a snapshot restore.

Closes #19774
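The core idea of the fix — deriving the restore-state change from the routing change and publishing both in one atomic cluster state update, so observers can never see one without the other — can be sketched in plain Java. All classes and fields below are illustrative stand-ins, not the actual Elasticsearch types:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: an immutable "cluster state" holding both a routing
// table and a restore state. A shard-started event produces a single new
// state in which both entries have been updated together.
final class ClusterStateSketch {
    final Map<String, String> routingTable;   // shardId -> routing status
    final Map<String, String> restoreState;   // shardId -> restore progress

    ClusterStateSketch(Map<String, String> routing, Map<String, String> restore) {
        this.routingTable = Map.copyOf(routing);
        this.restoreState = Map.copyOf(restore);
    }

    // One atomic state transition: the restore entry is derived from the
    // routing change and published in the same step, instead of being sent
    // over a second, independent communication channel.
    ClusterStateSketch shardStarted(String shardId) {
        Map<String, String> newRouting = new HashMap<>(routingTable);
        newRouting.put(shardId, "STARTED");
        Map<String, String> newRestore = new HashMap<>(restoreState);
        if (newRestore.containsKey(shardId)) {
            newRestore.put(shardId, "SUCCESS");
        }
        return new ClusterStateSketch(newRouting, newRestore);
    }
}
```

Because both maps are replaced in the same transition, a lost message can no longer leave the routing table saying "failed" while the restore state still says "in progress" — the failure mode described above.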

@ywelsch ywelsch added the >bug, review, :Distributed/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), and v6.0.0-alpha1 labels on Oct 10, 2016
Contributor

@imotov imotov left a comment


Left a couple of minor comments. Otherwise, LGTM.

assert newRoutingTable.validate(newMetaData); // validates the routing table is coherent with the cluster state metadata
final ClusterState newState = ClusterState.builder(oldState).routingTable(newRoutingTable).metaData(newMetaData).build();
final RestoreInProgress restoreInProgress = allocation.custom(RestoreInProgress.TYPE);
RestoreInProgress updatedRestoreInProgress = allocation.updateRestoreInfoWithRoutingChanges(restoreInProgress);
Contributor


That doesn't seem to cause any issues, but I think moving this into the if statement below might help clarify the logic. This method can be called when no restore takes place, and we kind of plunge head-on into updating restore info without even checking whether a restore is actually taking place. We check it inside updateRestoreInfoWithRoutingChanges, but I think it might make the logic clearer if we checked it here.
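The suggested restructuring might look roughly like this — a minimal, self-contained sketch where the types and names are simplified stand-ins for the real Elasticsearch API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the reviewer's suggestion: guard the restore-state
// update with an explicit "is a restore in progress?" check at the call site,
// rather than relying only on the check buried inside the update method.
final class RestoreGuardSketch {
    // Stand-in for RestoreInProgress: shardId -> restore status
    static Map<String, String> updateRestoreInfoWithRoutingChanges(Map<String, String> restoreInProgress) {
        Map<String, String> updated = new HashMap<>(restoreInProgress);
        updated.replaceAll((shard, status) -> "SUCCESS");
        return updated;
    }

    static Map<String, String> applyRoutingChanges(Map<String, String> restoreInProgress) {
        if (restoreInProgress == null || restoreInProgress.isEmpty()) {
            // No restore taking place: skip the restore-state update entirely,
            // making the "no-op" path visible at the call site.
            return restoreInProgress;
        }
        return updateRestoreInfoWithRoutingChanges(restoreInProgress);
    }
}
```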

Contributor Author


agree

RecoverySource recoverySource = failedShard.recoverySource();
if (recoverySource.getType() == RecoverySource.Type.SNAPSHOT) {
Snapshot snapshot = ((SnapshotRecoverySource) recoverySource).snapshot();
if (Lucene.isCorruptionException(unassignedInfo.getFailure().getCause())) {
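For context, a check like the `Lucene.isCorruptionException(...)` call above typically means inspecting the failure's cause chain for a corruption marker. A minimal self-contained sketch — `CorruptStoreException` here is a hypothetical stand-in, not the real Lucene exception class:

```java
// Hedged sketch of a corruption check: walk the exception cause chain and
// look for a corruption marker exception. CorruptStoreException is a made-up
// stand-in for Lucene's CorruptIndexException family.
final class CorruptionCheckSketch {
    static final class CorruptStoreException extends RuntimeException {
        CorruptStoreException(String message) {
            super(message);
        }
    }

    static boolean isCorruptionException(Throwable failure) {
        // Unwrap nested causes so a corruption buried under wrapper
        // exceptions is still detected.
        for (Throwable cause = failure; cause != null; cause = cause.getCause()) {
            if (cause instanceof CorruptStoreException) {
                return true;
            }
        }
        return false;
    }
}
```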
Contributor


Could you add a comment explaining why we only fail in case of lucene corruption?

Contributor Author


sure

@ywelsch ywelsch merged commit 0750470 into elastic:master Oct 12, 2016
Contributor Author

ywelsch commented Oct 12, 2016

thanks for the review @imotov

ywelsch added a commit to ywelsch/elasticsearch that referenced this pull request Oct 31, 2016
@ILMostro

Is there any hope of this being backported to 1.7 version?

@jasontedor
Member

No, sorry, 1.7 is end-of-life.


4 participants