Description
🐛 Bug Report
Grid Hub Console (/grid/console) and /wd/hub/status unavailable (times out) for a long time (minutes) after a node is killed/terminated.
Note that the length of unavailability differs by infrastructure; with hub and nodes running on the same host in Docker, this lasts for approximately 30 seconds (reproduction scenario below). With hub and nodes running via Docker images but on separate AWS EC2 instances, this lasts for 2-4 minutes.
To Reproduce
- Run a Hub instance, preferably using the 3.141.59 Docker image (though this is not specific to Docker).
- Run a Node instance, either Chrome or Firefox, using the 3.141.59 Docker image (though once again, this is not specific to Docker). Ensure that the Node registers to the Hub and shows up on the Console.
- Terminate the Node instance.
- Hub console (/grid/console) and status (/wd/hub/status) hang/time out all connections/requests for ~30 seconds. Note that static pages, such as the 404 page for the / endpoint, are served normally during this time.
Running the same Docker images, but without docker-compose and on separate AWS EC2 Instances, results in a 120-second outage, a short (~5-second) period of responsiveness, and then another 120-second outage.
Expected behavior
Nodes disappearing from the grid should not affect the Hub, in particular not the Console, and definitely not the /wd/hub/status endpoint (which is used in the official Docker health check script).
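The health check in question can be approximated with a short curl-based probe. This is a minimal stand-in, not the actual docker-selenium script, which may perform additional checks; the point is that `-m` makes the probe fail fast instead of hanging during the outage described above:

```shell
# Minimal stand-in for a /wd/hub/status health check (an approximation;
# the actual docker-selenium script may perform additional checks).
check_hub() {
  hub_url="$1"
  # -s: quiet, -f: treat non-2xx as failure, -m 5: give up after 5 seconds
  if curl -sf -m 5 "${hub_url}/wd/hub/status" >/dev/null; then
    echo "hub healthy"
  else
    echo "hub unhealthy"
    return 1
  fi
}

# Usage (against a local hub):
# check_hub "http://127.0.0.1:4444"
```

During the outage reported here, such a probe fails even though the hub process itself is still running, which is why container orchestration may wrongly restart a healthy hub.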
Alternatively, as a workaround, a hub API endpoint could exist to forcibly and immediately remove a node from the Grid, to be called by a node just before it terminates.
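A hypothetical sketch of that workaround, invoked from a node's shutdown hook. The /grid/unregister path and the id parameter are assumptions (patterned after community scripts), not a documented, stable Grid 3 API:

```shell
# Hypothetical workaround sketch: ask the hub to drop this node immediately,
# just before the node process exits. The /grid/unregister path and ?id=
# parameter are assumptions, not a confirmed Grid 3 API.
unregister_url() {
  hub="$1"   # hub base URL, e.g. http://selenium-hub:4444
  node="$2"  # node URL as registered with the hub, e.g. http://172.18.0.3:5555
  echo "${hub}/grid/unregister?id=${node}"
}

# Example call from a node shutdown hook:
# curl -s -m 5 "$(unregister_url http://selenium-hub:4444 http://172.18.0.3:5555)"
```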
Test script or set of commands reproducing this issue
I've used a docker-compose file for this for ease of reproduction, but this issue does not appear to be Docker-specific. However, I happen to be deploying Selenium via Docker and that also seems like the best way to reproduce the issue and eliminate as much variability in system configuration as possible.
To reproduce the issue:
- Save the docker-compose file below somewhere locally as docker-compose.yml.
- Run docker-compose up to start the containers. Ensure that both containers start and that the node registers; verify via http://127.0.0.1:4444/grid/console and http://127.0.0.1:4444/wd/hub/status.
- In another shell/terminal, run:

```shell
while true; do if curl -s -m 5 -o /dev/null http://127.0.0.1:4444/wd/hub/status; then echo "OK: $(date)"; else echo "FAIL: $(date)"; fi; sleep 5; done
```

- In yet another shell, run:

```shell
docker stop node-chrome
```
The output of the curl loop will look like the following; the first FAIL occurs immediately after the node is stopped:
OK: Tue 25 Feb 2020 09:10:38 AM EST
OK: Tue 25 Feb 2020 09:10:43 AM EST
OK: Tue 25 Feb 2020 09:10:48 AM EST
FAIL: Tue 25 Feb 2020 09:10:58 AM EST
FAIL: Tue 25 Feb 2020 09:11:08 AM EST
FAIL: Tue 25 Feb 2020 09:11:18 AM EST
FAIL: Tue 25 Feb 2020 09:11:28 AM EST
OK: Tue 25 Feb 2020 09:11:37 AM EST
OK: Tue 25 Feb 2020 09:11:45 AM EST
OK: Tue 25 Feb 2020 09:11:53 AM EST
OK: Tue 25 Feb 2020 09:12:01 AM EST
OK: Tue 25 Feb 2020 09:12:09 AM EST
The corresponding log output from docker-compose:
selenium-hub | 14:02:52.932 INFO [DefaultGridRegistry.add] - Registered a node http://172.18.0.3:5555
chrome_1 | 14:02:52.933 INFO [SelfRegisteringRemote.registerToHub] - The node is registered to the hub and ready to use
selenium-hub | 14:02:52.940 DEBUG [RegistrationServlet.lambda$process$0] - proxy added http://172.18.0.3:5555
chrome_1 | Trapped SIGTERM/SIGINT/x so shutting down supervisord...
chrome_1 | 2020-02-25 14:10:52,303 WARN received SIGTERM indicating exit request
chrome_1 | 2020-02-25 14:10:52,303 INFO waiting for xvfb, selenium-node to die
chrome_1 | 2020-02-25 14:10:52,304 INFO stopped: selenium-node (terminated by SIGTERM)
chrome_1 | 2020-02-25 14:10:52,305 INFO stopped: xvfb (terminated by SIGTERM)
chrome_1 | Shutdown complete
grid-ng_chrome_1 exited with code 143
selenium-hub | 14:12:01.203 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:09.257 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:09.257 INFO [DefaultRemoteProxy.onEvent] - Marking the node http://172.18.0.3:5555 as down: cannot reach the node for 2 tries
selenium-hub | 14:12:17.310 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:25.363 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:33.416 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:41.470 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:49.523 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:57.577 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:13:05.630 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:13:13.683 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:13:13.683 INFO [DefaultRemoteProxy.onEvent] - Unregistering the node http://172.18.0.3:5555 because it's been down for 64426 milliseconds
selenium-hub | 14:13:13.684 WARN [DefaultGridRegistry.removeIfPresent] - Cleaning up stale test sessions on the unregistered node http://172.18.0.3:5555
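The timeline in the log above appears to be governed by the hub's polling configuration. Assuming the Grid 3 hub options -nodePolling, -downPollingLimit, and -unregisterIfStillDownAfter with their usual defaults, a rough lower bound on how long a dead node lingers in the registry can be sketched as:

```shell
# Assumed Grid 3 hub defaults (different values can be passed via SE_OPTS):
node_polling_ms=5000        # interval between isAlive checks
down_polling_limit=2        # failed checks before a node is marked down
unregister_after_ms=60000   # how long a node may stay down before removal

# Rough lower bound on how long a dead node lingers in the registry;
# the per-check connect timeout (which dominates on EC2) is not included.
linger_ms=$(( down_polling_limit * node_polling_ms + unregister_after_ms ))
echo "node lingers for at least ${linger_ms} ms"
```

The "been down for 64426 milliseconds" line above is consistent with the assumed 60000 ms default, while the much longer EC2 outage suggests each failed check additionally blocks on the OS TCP connect timeout. Lowering these values (e.g. SE_OPTS="-nodePolling 1000 -unregisterIfStillDownAfter 10000" on the hub) might shorten the window, though it would not by itself stop status requests from blocking.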
docker-compose file

```yaml
version: "3"
services:
  selenium-hub:
    image: selenium/hub:3.141.59
    container_name: selenium-hub
    ports:
      - "4444:4444"
    environment:
      - SE_OPTS=-debug
  chrome:
    image: selenium/node-chrome:3.141.59
    container_name: node-chrome
    volumes:
      - /dev/shm:/dev/shm
    depends_on:
      - selenium-hub
    environment:
      - HUB_HOST=selenium-hub
      - HUB_PORT=4444
      - SE_OPTS=-servlets org.openqa.grid.web.servlet.LifecycleServlet
```

Alternate Issue Reproduction in AWS EC2
- Start an EC2 instance from the Amazon Linux 2 AMI, in a private subnet of a VPC. We'll call this "HUB_INSTANCE". Apply security groups to allow access to port 4444 from the VPC.
- On this instance, run:

```shell
docker run --net bridge -m 0b -e "SE_OPTS=-debug" -p 4444:4444 selenium/hub:3.141.59
```

- Verify that http://HUB_INSTANCE_IP:4444/grid/console and http://HUB_INSTANCE_IP:4444/wd/hub/status are available.
- Start a second EC2 instance from the Amazon Linux 2 AMI, in the same private subnet of a VPC. We'll call this "NODE_INSTANCE". Apply security groups to allow access to port 5555 from the VPC.
- On this instance, run:

```shell
docker run --net bridge -m 0b -e "REMOTE_HOST=http://<NODE_INSTANCE IP>:5555" -e "HUB_HOST=<HUB_INSTANCE IP>" -e "HUB_PORT=4444" -p 5555:5555 -v /dev/shm:/dev/shm selenium/node-chrome:3.141.59
```

- Verify that http://HUB_INSTANCE_IP:4444/grid/console shows the node instance connected and that http://HUB_INSTANCE_IP:4444/wd/hub/status is available.
- On a third instance (or locally, if you have a VPN into your VPC), run:

```shell
while true; do if curl -s -m 5 -o /dev/null http://<HUB_INSTANCE IP>:4444/wd/hub/status; then echo "OK: $(date)"; else echo "FAIL: $(date)"; fi; sleep 5; done
```

- Terminate NODE_INSTANCE:

```shell
aws ec2 terminate-instances --instance-ids <NODE_INSTANCE ID>
```
The /wd/hub/status endpoint becomes unreachable for XXX seconds, as shown below (output from the curl loop above). Requests to /grid/console similarly time out.
OK: Tue 25 Feb 2020 11:54:43 AM EST
OK: Tue 25 Feb 2020 11:54:48 AM EST
OK: Tue 25 Feb 2020 11:54:53 AM EST
FAIL: Tue 25 Feb 2020 11:55:03 AM EST
FAIL: Tue 25 Feb 2020 11:55:13 AM EST
FAIL: Tue 25 Feb 2020 11:55:23 AM EST
FAIL: Tue 25 Feb 2020 11:55:33 AM EST
FAIL: Tue 25 Feb 2020 11:55:43 AM EST
FAIL: Tue 25 Feb 2020 11:55:53 AM EST
FAIL: Tue 25 Feb 2020 11:56:03 AM EST
FAIL: Tue 25 Feb 2020 11:56:13 AM EST
FAIL: Tue 25 Feb 2020 11:56:23 AM EST
FAIL: Tue 25 Feb 2020 11:56:33 AM EST
FAIL: Tue 25 Feb 2020 11:56:43 AM EST
FAIL: Tue 25 Feb 2020 11:56:53 AM EST
FAIL: Tue 25 Feb 2020 11:57:03 AM EST
FAIL: Tue 25 Feb 2020 11:57:13 AM EST
FAIL: Tue 25 Feb 2020 11:57:23 AM EST
FAIL: Tue 25 Feb 2020 11:57:33 AM EST
FAIL: Tue 25 Feb 2020 11:57:43 AM EST
FAIL: Tue 25 Feb 2020 11:57:53 AM EST
FAIL: Tue 25 Feb 2020 11:58:03 AM EST
FAIL: Tue 25 Feb 2020 11:58:13 AM EST
FAIL: Tue 25 Feb 2020 11:58:23 AM EST
FAIL: Tue 25 Feb 2020 11:58:33 AM EST
FAIL: Tue 25 Feb 2020 11:58:43 AM EST
FAIL: Tue 25 Feb 2020 11:58:53 AM EST
FAIL: Tue 25 Feb 2020 11:59:03 AM EST
FAIL: Tue 25 Feb 2020 11:59:13 AM EST
FAIL: Tue 25 Feb 2020 11:59:23 AM EST
OK: Tue 25 Feb 2020 11:59:28 AM EST
OK: Tue 25 Feb 2020 11:59:33 AM EST
OK: Tue 25 Feb 2020 11:59:38 AM EST
The hub's logs are as follows:
11:47:46.541 INFO [DefaultGridRegistry.add] - Registered a node http://10.230.92.112:5555
11:47:46.542 DEBUG [RegistrationServlet.lambda$process$0] - proxy added http://10.230.92.112:5555
# Terminated instance here
11:54:51.851 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: Failed to connect to /10.230.92.112:5555
11:57:06.811 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: Failed to connect to /10.230.92.112:5555
11:57:06.811 INFO [DefaultRemoteProxy.onEvent] - Marking the node http://10.230.92.112:5555 as down: cannot reach the node for 2 tries
11:59:21.979 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: Failed to connect to /10.230.92.112:5555
11:59:21.979 INFO [DefaultRemoteProxy.onEvent] - Unregistering the node http://10.230.92.112:5555 because it's been down for 135168 milliseconds
11:59:21.979 WARN [DefaultGridRegistry.removeIfPresent] - Cleaning up stale test sessions on the unregistered node http://10.230.92.112:5555
Environment
OS: Amazon Linux 2
Browser: N/A
Browser version: N/A
Browser Driver version: N/A
Language Bindings version: N/A
Selenium Grid version (if applicable): 3.141.59
PS
I think this issue might be a duplicate of SeleniumHQ/docker-selenium#113, which says in this comment that it was caused by "a regression introduced in selenium > 3.8, should be fixed in 3.13". But 3.13 was released quite a while before 3.141.59.