Distinguish between unresponsive node and unreachable node #72968
Closed
Labels
:Distributed/Distributed, >enhancement, Team:Distributed, team-discuss
Description
Today, the Elasticsearch logs contain very similar messages for an unresponsive node and an unreachable node.
As an end user, it is not easy to tell whether the problem lies in the network (platform) layer, where the destination is completely unreachable, or in Elasticsearch itself, where a node is overwhelmed with requests and has become slow to respond.
Some of the relevant logs look like:
[2021-05-07T15:02:28,704][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [elastic-05] collector [cluster_stats] timed out when collecting data
[2021-05-07T15:02:57,757][ERROR][o.e.x.m.c.e.EnrichStatsCollector] [elastic-05] collector [enrich_coordinator_stats] timed out when collecting data
[2021-05-07T15:03:07,786][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [elastic-05] collector [index_recovery] timed out when collecting data
[2021-05-07T15:03:17,801][ERROR][o.e.x.m.c.i.IndexStatsCollector] [elastic-05] collector [index-stats] timed out when collecting data
[2021-05-07T15:16:43,101][WARN ][o.e.c.c.Coordinator ] [elastic-05] failed to validate incoming join request from node [{elastic-04}{uaH7bAt2TgaLhcKCkxpu6Q}{r3IPD7o4SXarFHKmjlNy0Q}{xx.xx.xx.xx}{10.10.10.5:9300}{dim}{xpack.installed=true}] org.elasticsearch.transport.NodeDisconnectedException: [elastic-04][10.10.10.5:9300][internal:cluster/coordination/join/validate] disconnected
[2021-05-07T15:18:12,085][INFO ][o.e.c.c.C.CoordinatorPublication] [elastic-05] after [10s] publication of cluster state version [536546] is still waiting for {elastic-04}{uaH7bAt2TgaLhcKCkxpu6Q}{r3IPD7o4SXarFHKmjlNy0Q}{xx.xx.xx.xx}{10.10.10.5:9300}{dim}{xpack.installed=true} [SENT_APPLY_COMMIT]
[2021-05-07T15:18:28,006][WARN ][o.e.c.r.a.AllocationService] [elastic-05] failing shard [failed shard, shard [index_v1][0], node[uaH7bAt2TgaLhcKCkxpu6Q], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=eOGoLSllTG6UfYJPYg6cNg], unassigned_info[[reason=ALLOCATION_FAILED], at[2021-05-07T05:18:12.222Z], failed_attempts[4], failed_nodes[[uaH7bAt2TgaLhcKCkxpu6Q]], delayed=false, details[failed shard on node [uaH7bAt2TgaLhcKCkxpu6Q]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[index_v1][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_valid_shard_copy]], message [failed to create shard], failure [IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[index_v1][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], markAsStale [true]]
[2021-05-07T15:49:33,607][INFO ][o.e.c.s.MasterService ] [elastic-05] node-left[{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{moK_JcYnS1efIYWdNrlBkA}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true} reason: disconnected], term: 219, version: 538374, delta: removed {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{moK_JcYnS1efIYWdNrlBkA}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}
[2021-05-07T15:49:33,661][INFO ][o.e.c.s.ClusterApplierService] [elastic-05] removed {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{moK_JcYnS1efIYWdNrlBkA}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}, term: 219, version: 538374, reason: Publication{term=219, version=538374}
[2021-05-07T15:50:41,662][INFO ][o.e.c.s.MasterService ] [elastic-05] node-join[{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{l2U2E0UbR96U_8ykelwcDw}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true} join existing leader], term: 219, version: 538375, delta: added {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{l2U2E0UbR96U_8ykelwcDw}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}
[2021-05-07T15:50:42,445][INFO ][o.e.c.s.ClusterApplierService] [elastic-05] added {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{l2U2E0UbR96U_8ykelwcDw}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}, term: 219, version: 538375, reason: Publication{term=219, version=538375}
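As a rough triage aid for logs like the ones above, the sketch below buckets each line by its wording alone. The regular expression, the bucket names, and the keyword lists are assumptions made for illustration, derived only from the sample lines in this issue; they are not an Elasticsearch feature, and the wording of a message does not prove the root cause, which is exactly the problem described here.

```python
import re
from collections import Counter

# Assumed layout, inferred from the sample lines above:
# [timestamp][LEVEL][logger] [node] message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]"
    r"\[(?P<level>[A-Z]+)\s*\]"
    r"\[(?P<logger>[^\]]+?)\s*\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*"
    r"(?P<msg>.*)$"
)


def classify(line):
    """Bucket a log line by keywords (a heuristic, not a diagnosis)."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return "unparsed"
    msg = match.group("msg")
    # "disconnected" hints that a connection was closed or lost.
    if "disconnected" in msg:
        return "disconnected"
    # "timed out" / "still waiting" hint that a reply was slow or missing.
    if "timed out" in msg or "still waiting" in msg:
        return "timed_out"
    return "other"


def summarize(lines):
    """Count how many lines fall into each bucket."""
    return Counter(classify(line) for line in lines)
```

Running `summarize` over a log file gives a quick count per bucket, which can show whether an incident is dominated by timeouts or by disconnects before reading the lines one by one.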
It would be great if Elasticsearch could detect early that a node is unreachable (for example, via a ping), report that plainly, and stop running some of these checks against that node until a later retry.
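To make the requested distinction concrete, here is a minimal sketch of an external probe that separates "cannot connect at all" from "connected but got no timely reply". It is not how Elasticsearch performs its own checks; `NodeState`, `probe_node`, and the timeout values are hypothetical names and defaults chosen for illustration. It treats a refused or timed-out TCP connect as unreachable, which is an assumption: a refused connection can also mean the host is up but the process is not listening.

```python
import socket
from enum import Enum


class NodeState(Enum):
    UNREACHABLE = "unreachable"    # TCP connect failed, timed out, or peer closed
    UNRESPONSIVE = "unresponsive"  # connected, but no reply within read_timeout
    RESPONSIVE = "responsive"      # connected and received at least one byte


def probe_node(host, port, request=b"", connect_timeout=2.0, read_timeout=5.0):
    """Classify a node using a plain TCP connect followed by a short read."""
    try:
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except OSError:
        # Covers refused connections, unroutable hosts, and connect timeouts.
        return NodeState.UNREACHABLE
    with sock:
        sock.settimeout(read_timeout)
        try:
            if request:
                sock.sendall(request)
            data = sock.recv(1)
        except socket.timeout:
            # The network path works, but the peer did not answer in time.
            return NodeState.UNRESPONSIVE
        except OSError:
            # The connection was reset or broke while waiting.
            return NodeState.UNREACHABLE
        # An empty read means the peer closed the connection without replying.
        return NodeState.RESPONSIVE if data else NodeState.UNREACHABLE
```

A caller might try something like `probe_node("10.10.10.5", 9200, request=b"GET / HTTP/1.0\r\n\r\n")`, assuming the node serves plain HTTP on the default HTTP port; with TLS enabled the request bytes would need to change. Without a `request`, most servers send nothing on connect, so the probe would report `UNRESPONSIVE` even for a healthy node.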