Description
🐛 Bug Report
Grid Hub Console (/grid/console) and /wd/hub/status unavailable (times out) for a long time (minutes) after a node is killed/terminated.
Note that the length of unavailability differs by infrastructure; with hub and nodes running on the same host in Docker, this lasts for approximately 30 seconds (reproduction scenario below). With hub and nodes running via Docker images but on separate AWS EC2 instances, this lasts for 2-4 minutes.
To Reproduce
- Run a Hub instance, preferably using the 3.141.59 Docker image (though this is not specific to Docker).
- Run a Node instance, either Chrome or Firefox, using the 3.141.59 Docker image (though once again, this is not specific to Docker). Ensure that the Node registers to the Hub and shows up on the Console.
- Terminate the Node instance.
- Hub console (/grid/console) and status (/wd/hub/status) hang/time out all connections/requests for ~30 seconds. Note that static pages, such as the 404 page for the / endpoint, are served normally during this time.
Running the same Docker images, but without docker-compose and on separate AWS EC2 Instances, results in a 120-second outage, a short (~5-second) period of responsiveness, and then another 120-second outage.
Expected behavior
Nodes disappearing from the grid should not affect the Hub, in particular not the Console, and definitely not the /wd/hub/status endpoint (which is used in the official Docker health check script).
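The health check in question can be approximated with a short curl-based probe. This is a minimal stand-in, not the actual docker-selenium script, which may perform additional checks; the point is that `-m` makes the probe fail fast instead of hanging during the outage described above:

```shell
# Minimal stand-in for a /wd/hub/status health check (an approximation;
# the actual docker-selenium script may perform additional checks).
check_hub() {
  hub_url="$1"
  # -s: quiet, -f: treat non-2xx as failure, -m 5: give up after 5 seconds
  if curl -sf -m 5 "${hub_url}/wd/hub/status" >/dev/null; then
    echo "hub healthy"
  else
    echo "hub unhealthy"
    return 1
  fi
}

# Usage (against a local hub):
# check_hub "http://127.0.0.1:4444"
```

During the outage reported here, such a probe fails even though the hub process itself is still running, which is why container orchestration may wrongly restart a healthy hub.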
Alternatively, as a workaround, a hub API endpoint could exist to forcibly and immediately remove a node from the Grid, to be called by a node just before it terminates.
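A hypothetical sketch of that workaround, invoked from a node's shutdown hook. The /grid/unregister path and the id parameter are assumptions (patterned after community scripts), not a documented, stable Grid 3 API:

```shell
# Hypothetical workaround sketch: ask the hub to drop this node immediately,
# just before the node process exits. The /grid/unregister path and ?id=
# parameter are assumptions, not a confirmed Grid 3 API.
unregister_url() {
  hub="$1"   # hub base URL, e.g. http://selenium-hub:4444
  node="$2"  # node URL as registered with the hub, e.g. http://172.18.0.3:5555
  echo "${hub}/grid/unregister?id=${node}"
}

# Example call from a node shutdown hook:
# curl -s -m 5 "$(unregister_url http://selenium-hub:4444 http://172.18.0.3:5555)"
```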
Test script or set of commands reproducing this issue
I've used a docker-compose file for this for ease of reproduction, but this issue does not appear to be Docker-specific. However, I happen to be deploying Selenium via Docker and that also seems like the best way to reproduce the issue and eliminate as much variability in system configuration as possible.
To reproduce the issue:
- Save the docker-compose file below somewhere locally as docker-compose.yml.
- Run docker-compose up to start the containers. Ensure that both containers start and that the node registers; verify via http://127.0.0.1:4444/grid/console and http://127.0.0.1:4444/wd/hub/status.
- In another shell/terminal, run:

```shell
while true; do if curl -s -m 5 -o /dev/null http://127.0.0.1:4444/wd/hub/status; then echo "OK: $(date)"; else echo "FAIL: $(date)"; fi; sleep 5; done
```

- In yet another shell, run:

```shell
docker stop node-chrome
```
The output of the curl loop will look like the following; the first FAIL occurs immediately after the node is stopped:
OK: Tue 25 Feb 2020 09:10:38 AM EST
OK: Tue 25 Feb 2020 09:10:43 AM EST
OK: Tue 25 Feb 2020 09:10:48 AM EST
FAIL: Tue 25 Feb 2020 09:10:58 AM EST
FAIL: Tue 25 Feb 2020 09:11:08 AM EST
FAIL: Tue 25 Feb 2020 09:11:18 AM EST
FAIL: Tue 25 Feb 2020 09:11:28 AM EST
OK: Tue 25 Feb 2020 09:11:37 AM EST
OK: Tue 25 Feb 2020 09:11:45 AM EST
OK: Tue 25 Feb 2020 09:11:53 AM EST
OK: Tue 25 Feb 2020 09:12:01 AM EST
OK: Tue 25 Feb 2020 09:12:09 AM EST
The corresponding log output from docker-compose:
selenium-hub | 14:02:52.932 INFO [DefaultGridRegistry.add] - Registered a node http://172.18.0.3:5555
chrome_1 | 14:02:52.933 INFO [SelfRegisteringRemote.registerToHub] - The node is registered to the hub and ready to use
selenium-hub | 14:02:52.940 DEBUG [RegistrationServlet.lambda$process$0] - proxy added http://172.18.0.3:5555
chrome_1 | Trapped SIGTERM/SIGINT/x so shutting down supervisord...
chrome_1 | 2020-02-25 14:10:52,303 WARN received SIGTERM indicating exit request
chrome_1 | 2020-02-25 14:10:52,303 INFO waiting for xvfb, selenium-node to die
chrome_1 | 2020-02-25 14:10:52,304 INFO stopped: selenium-node (terminated by SIGTERM)
chrome_1 | 2020-02-25 14:10:52,305 INFO stopped: xvfb (terminated by SIGTERM)
chrome_1 | Shutdown complete
grid-ng_chrome_1 exited with code 143
selenium-hub | 14:12:01.203 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:09.257 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:09.257 INFO [DefaultRemoteProxy.onEvent] - Marking the node http://172.18.0.3:5555 as down: cannot reach the node for 2 tries
selenium-hub | 14:12:17.310 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:25.363 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:33.416 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:41.470 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:49.523 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:12:57.577 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:13:05.630 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:13:13.683 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub | 14:13:13.683 INFO [DefaultRemoteProxy.onEvent] - Unregistering the node http://172.18.0.3:5555 because it's been down for 64426 milliseconds
selenium-hub | 14:13:13.684 WARN [DefaultGridRegistry.removeIfPresent] - Cleaning up stale test sessions on the unregistered node http://172.18.0.3:5555
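The timeline in the log above appears to be governed by the hub's polling configuration. Assuming the Grid 3 hub options -nodePolling, -downPollingLimit, and -unregisterIfStillDownAfter with their usual defaults, a rough lower bound on how long a dead node lingers in the registry can be sketched as:

```shell
# Assumed Grid 3 hub defaults (different values can be passed via SE_OPTS):
node_polling_ms=5000        # interval between isAlive checks
down_polling_limit=2        # failed checks before a node is marked down
unregister_after_ms=60000   # how long a node may stay down before removal

# Rough lower bound on how long a dead node lingers in the registry;
# the per-check connect timeout (which dominates on EC2) is not included.
linger_ms=$(( down_polling_limit * node_polling_ms + unregister_after_ms ))
echo "node lingers for at least ${linger_ms} ms"
```

The "been down for 64426 milliseconds" line above is consistent with the assumed 60000 ms default, while the much longer EC2 outage suggests each failed check additionally blocks on the OS TCP connect timeout. Lowering these values (e.g. SE_OPTS="-nodePolling 1000 -unregisterIfStillDownAfter 10000" on the hub) might shorten the window, though it would not by itself stop status requests from blocking.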
docker-compose file

```yaml
version: "3"
services:
  selenium-hub:
    image: selenium/hub:3.141.59
    container_name: selenium-hub
    ports:
      - "4444:4444"
    environment:
      - SE_OPTS=-debug
  chrome:
    image: selenium/node-chrome:3.141.59
    container_name: node-chrome
    volumes:
      - /dev/shm:/dev/shm
    depends_on:
      - selenium-hub
    environment:
      - HUB_HOST=selenium-hub
      - HUB_PORT=4444
      - SE_OPTS=-servlets org.openqa.grid.web.servlet.LifecycleServlet
```

Alternate Issue Reproduction in AWS EC2
- Start an EC2 instance from the Amazon Linux 2 AMI, in a private subnet of a VPC. We'll call this "HUB_INSTANCE". Apply security groups to allow access to port 4444 from the VPC.
- On this instance, run:

```shell
docker run --net bridge -m 0b -e "SE_OPTS=-debug" -p 4444:4444 selenium/hub:3.141.59
```

- Verify that http://HUB_INSTANCE_IP:4444/grid/console and http://HUB_INSTANCE_IP:4444/wd/hub/status are available.
- Start a second EC2 instance from the Amazon Linux 2 AMI, in the same private subnet of a VPC. We'll call this "NODE_INSTANCE". Apply security groups to allow access to port 5555 from the VPC.
- On this instance, run:

```shell
docker run --net bridge -m 0b -e "REMOTE_HOST=http://<NODE_INSTANCE IP>:5555" -e "HUB_HOST=<HUB_INSTANCE IP>" -e "HUB_PORT=4444" -p 5555:5555 -v /dev/shm:/dev/shm selenium/node-chrome:3.141.59
```

- Verify that http://HUB_INSTANCE_IP:4444/grid/console shows the node instance connected and that http://HUB_INSTANCE_IP:4444/wd/hub/status is available.
- On a third instance (or locally, if you have a VPN into your VPC), run:

```shell
while true; do if curl -s -m 5 -o /dev/null http://<HUB_INSTANCE IP>:4444/wd/hub/status; then echo "OK: $(date)"; else echo "FAIL: $(date)"; fi; sleep 5; done
```

- Terminate NODE_INSTANCE:

```shell
aws ec2 terminate-instances --instance-ids <NODE_INSTANCE ID>
```
The /wd/hub/status endpoint becomes unreachable for XXX seconds, as shown below (output from the curl loop above). Requests to /grid/console similarly time out.
OK: Tue 25 Feb 2020 11:54:43 AM EST
OK: Tue 25 Feb 2020 11:54:48 AM EST
OK: Tue 25 Feb 2020 11:54:53 AM EST
FAIL: Tue 25 Feb 2020 11:55:03 AM EST
FAIL: Tue 25 Feb 2020 11:55:13 AM EST
FAIL: Tue 25 Feb 2020 11:55:23 AM EST
FAIL: Tue 25 Feb 2020 11:55:33 AM EST
FAIL: Tue 25 Feb 2020 11:55:43 AM EST
FAIL: Tue 25 Feb 2020 11:55:53 AM EST
FAIL: Tue 25 Feb 2020 11:56:03 AM EST
FAIL: Tue 25 Feb 2020 11:56:13 AM EST
FAIL: Tue 25 Feb 2020 11:56:23 AM EST
FAIL: Tue 25 Feb 2020 11:56:33 AM EST
FAIL: Tue 25 Feb 2020 11:56:43 AM EST
FAIL: Tue 25 Feb 2020 11:56:53 AM EST
FAIL: Tue 25 Feb 2020 11:57:03 AM EST
FAIL: Tue 25 Feb 2020 11:57:13 AM EST
FAIL: Tue 25 Feb 2020 11:57:23 AM EST
FAIL: Tue 25 Feb 2020 11:57:33 AM EST
FAIL: Tue 25 Feb 2020 11:57:43 AM EST
FAIL: Tue 25 Feb 2020 11:57:53 AM EST
FAIL: Tue 25 Feb 2020 11:58:03 AM EST
FAIL: Tue 25 Feb 2020 11:58:13 AM EST
FAIL: Tue 25 Feb 2020 11:58:23 AM EST
FAIL: Tue 25 Feb 2020 11:58:33 AM EST
FAIL: Tue 25 Feb 2020 11:58:43 AM EST
FAIL: Tue 25 Feb 2020 11:58:53 AM EST
FAIL: Tue 25 Feb 2020 11:59:03 AM EST
FAIL: Tue 25 Feb 2020 11:59:13 AM EST
FAIL: Tue 25 Feb 2020 11:59:23 AM EST
OK: Tue 25 Feb 2020 11:59:28 AM EST
OK: Tue 25 Feb 2020 11:59:33 AM EST
OK: Tue 25 Feb 2020 11:59:38 AM EST
The hub's logs are as follows:
11:47:46.541 INFO [DefaultGridRegistry.add] - Registered a node http://10.230.92.112:5555
11:47:46.542 DEBUG [RegistrationServlet.lambda$process$0] - proxy added http://10.230.92.112:5555
# Terminated instance here
11:54:51.851 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: Failed to connect to /10.230.92.112:5555
11:57:06.811 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: Failed to connect to /10.230.92.112:5555
11:57:06.811 INFO [DefaultRemoteProxy.onEvent] - Marking the node http://10.230.92.112:5555 as down: cannot reach the node for 2 tries
11:59:21.979 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: Failed to connect to /10.230.92.112:5555
11:59:21.979 INFO [DefaultRemoteProxy.onEvent] - Unregistering the node http://10.230.92.112:5555 because it's been down for 135168 milliseconds
11:59:21.979 WARN [DefaultGridRegistry.removeIfPresent] - Cleaning up stale test sessions on the unregistered node http://10.230.92.112:5555
Environment
OS: Amazon Linux 2
Browser: N/A
Browser version: N/A
Browser Driver version: N/A
Language Bindings version: N/A
Selenium Grid version (if applicable): 3.141.59
PS
I think this issue might be a duplicate of SeleniumHQ/docker-selenium#113, which says in this comment that it was caused by "a regression introduced in selenium > 3.8, should be fixed in 3.13". But 3.13 was released quite a while before 3.141.59.