Distinguish between unresponsive node and unreachable node

Today, Elasticsearch logs emit very similar messages for an _unresponsive_ node and an _unreachable_ node.

As an end user, it is hard to tell whether the problem lies in the network (platform) layer, where the destination is completely unreachable, or in Elasticsearch itself, where a node is overwhelmed with requests and has become slow to respond.

Some of the relevant logs look like:
```
[2021-05-07T15:02:28,704][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [elastic-05] collector [cluster_stats] timed out when collecting data
[2021-05-07T15:02:57,757][ERROR][o.e.x.m.c.e.EnrichStatsCollector] [elastic-05] collector [enrich_coordinator_stats] timed out when collecting data
[2021-05-07T15:03:07,786][ERROR][o.e.x.m.c.i.IndexRecoveryCollector] [elastic-05] collector [index_recovery] timed out when collecting data
[2021-05-07T15:03:17,801][ERROR][o.e.x.m.c.i.IndexStatsCollector] [elastic-05] collector [index-stats] timed out when collecting data
[2021-05-07T15:16:43,101][WARN ][o.e.c.c.Coordinator      ] [elastic-05] failed to validate incoming join request from node [{elastic-04}{uaH7bAt2TgaLhcKCkxpu6Q}{r3IPD7o4SXarFHKmjlNy0Q}{xx.xx.xx.xx}{10.10.10.5:9300}{dim}{xpack.installed=true}] org.elasticsearch.transport.NodeDisconnectedException: [elastic-04][10.10.10.5:9300][internal:cluster/coordination/join/validate] disconnected
[2021-05-07T15:18:12,085][INFO ][o.e.c.c.C.CoordinatorPublication] [elastic-05] after [10s] publication of cluster state version [536546] is still waiting for {elastic-04}{uaH7bAt2TgaLhcKCkxpu6Q}{r3IPD7o4SXarFHKmjlNy0Q}{xx.xx.xx.xx}{10.10.10.5:9300}{dim}{xpack.installed=true} [SENT_APPLY_COMMIT]
[2021-05-07T15:18:28,006][WARN ][o.e.c.r.a.AllocationService] [elastic-05] failing shard [failed shard, shard [index_v1][0], node[uaH7bAt2TgaLhcKCkxpu6Q], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=eOGoLSllTG6UfYJPYg6cNg], unassigned_info[[reason=ALLOCATION_FAILED], at[2021-05-07T05:18:12.222Z], failed_attempts[4], failed_nodes[[uaH7bAt2TgaLhcKCkxpu6Q]], delayed=false, details[failed shard on node [uaH7bAt2TgaLhcKCkxpu6Q]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[index_v1][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_valid_shard_copy]], message [failed to create shard], failure [IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[index_v1][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], markAsStale [true]]
[2021-05-07T15:49:33,607][INFO ][o.e.c.s.MasterService    ] [elastic-05] node-left[{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{moK_JcYnS1efIYWdNrlBkA}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true} reason: disconnected], term: 219, version: 538374, delta: removed {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{moK_JcYnS1efIYWdNrlBkA}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}
[2021-05-07T15:49:33,661][INFO ][o.e.c.s.ClusterApplierService] [elastic-05] removed {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{moK_JcYnS1efIYWdNrlBkA}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}, term: 219, version: 538374, reason: Publication{term=219, version=538374}
[2021-05-07T15:50:41,662][INFO ][o.e.c.s.MasterService    ] [elastic-05] node-join[{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{l2U2E0UbR96U_8ykelwcDw}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true} join existing leader], term: 219, version: 538375, delta: added {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{l2U2E0UbR96U_8ykelwcDw}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}
[2021-05-07T15:50:42,445][INFO ][o.e.c.s.ClusterApplierService] [elastic-05] added {{elastic-02}{4aWwRxagRAyG_WkUHhV2qg}{l2U2E0UbR96U_8ykelwcDw}{xx.xx.xx.xx}{10.10.10.11:9300}{xpack.installed=true}}, term: 219, version: 538375, reason: Publication{term=219, version=538375}
```

It would be great if Elasticsearch could detect this early: when a node fails a basic reachability check (such as a ping), stop running these checkup services against it, report the node as unreachable, and retry later.
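To make the requested distinction concrete, here is a minimal sketch of a two-step probe, written as a standalone Python script rather than as Elasticsearch code. It is not how Elasticsearch works today. The function name `classify_node`, the timeout values, and the use of a plain HTTP request as the application-level probe are all assumptions made for illustration. A failed TCP connect is treated as "unreachable" (network or platform layer), while a successful connect followed by no reply within the timeout is treated as "unresponsive" (the node is up but too slow to answer).

```python
import socket


def classify_node(host, port, connect_timeout=2.0, response_timeout=5.0):
    """Return 'unreachable', 'unresponsive', or 'responsive' for host:port.

    Hypothetical helper: names and timeouts are illustrative assumptions.
    """
    try:
        # Step 1: network-level check. A refused or timed-out connect
        # suggests the problem is below the application.
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except OSError:
        return "unreachable"

    with sock:
        try:
            # Step 2: application-level check. The connection exists, so
            # silence from here on suggests an overloaded process.
            sock.settimeout(response_timeout)
            sock.sendall(b"GET / HTTP/1.0\r\n\r\n")
            data = sock.recv(1)
        except socket.timeout:
            return "unresponsive"
        except OSError:
            # The connection dropped mid-probe (for example, a reset).
            return "unreachable"

    # An empty read means the peer closed the connection without replying.
    return "responsive" if data else "unresponsive"
```

For example, `classify_node("10.10.10.5", 9200)` could be run before the heavier collectors; a result of `"unreachable"` would justify skipping them and retrying later. Note that a connect timeout can also occur when a heavily loaded host stops accepting connections, so this heuristic narrows the diagnosis rather than settling it.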

Distinguish between unresponsive node and unreachable node #72968
