Description
Observed behavior
I have a NATS client running at a remote site that occasionally experiences network unreliability. During these periods of unreliability we see unstable TCP connections. Generally, the NATS client will detect the problem and reconnect. But on some rare occasions I've found that the client will hang while attempting to reconnect and will never recover until the process is restarted. I've only seen this happen around 2-3 times in the last year.
The last time this happened I captured a stack trace with eu-stack. Most of the threads are just running natsCondition_Wait waiting for messages to come in. But the nats.c reconnect thread is interesting:
TID 3471564:
#0 0x00007f5d080fd9ec read
#1 0x00007f5d088dd091 sock_read
#2 0x00007f5d088cd44b bread_conv
#3 0x00007f5d088d0445 bio_read_intern
#4 0x00007f5d088d05c7 BIO_read
#5 0x00007f5d08d96e9c ssl3_read_n.part.0
#6 0x00007f5d08d98e4a ssl3_read_bytes
#7 0x00007f5d08da85d0 state_machine.part.0
#8 0x00000000004a24ae _makeTLSConn
#9 0x00000000004a4978 _processConnInit
#10 0x00000000004a8ed9 _doReconnect
#11 0x00000000004de0d0 _threadStart
#12 0x00007f5d08089c02 start_thread
#13 0x00007f5d0810ec40 __clone3
The thread never advances out of this state. I think this is what is causing the hang. It appears that several function calls were optimized out, so I couldn't trace the code perfectly, but I think this is what is happening:
- The TCP connection drops; nats.c detects that and begins reconnecting.
- The new TCP connection is established in _doReconnect.
- Really poor network conditions cause the established connection to be dropped almost immediately.
- nats.c moves into _processConnInit and _makeTLSConn.
- _makeTLSConn sets the connection to blocking mode.
- The OpenSSL handshake begins and read() is called.
Here we are stuck permanently. No data is ever read because the connection has been lost. Because the socket is in blocking mode and no timeout is set, the read() never times out. It does not appear that TCP keepalives are enabled, so the kernel can't tell us that the connection has dropped either.
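To illustrate the two missing safeguards mentioned above, here is a minimal sketch using plain POSIX socket options. The helper names (set_read_timeout, enable_keepalive, read_timed_out) are hypothetical and are not part of nats.c; this is not how the library is implemented, only an illustration of how a blocking read() can be bounded. My understanding is that with SO_RCVTIMEO set, a blocked read() fails with EAGAIN/EWOULDBLOCK once the timeout expires, which OpenSSL would then surface to the caller as a retryable condition rather than blocking forever.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not nats.c code): bound every blocking read on fd.
 * Returns 0 on success, -1 on failure (errno set by setsockopt). */
static int set_read_timeout(int fd, int timeout_ms)
{
    struct timeval tv;

    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

/* Hypothetical helper: ask the kernel to probe an idle connection so a
 * dead peer is eventually reported as an error on the socket.
 * Probe timing is left at the system defaults here. */
static int enable_keepalive(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}

/* Performs one read and classifies the result:
 * 1 = the read timed out, 0 = data or EOF arrived, -1 = other error. */
static int read_timed_out(int fd)
{
    char    buf[64];
    ssize_t n = read(fd, buf, sizeof(buf));

    if (n >= 0)
        return 0;
    return (errno == EAGAIN || errno == EWOULDBLOCK) ? 1 : -1;
}
```

With a receive timeout in place, a read on a silent connection returns after the configured interval instead of hanging, which gives the caller a chance to abandon the attempt and retry the reconnect.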
Expected behavior
_makeTLSConn should return with an error when the TCP connection has been lost and no data is ever read, instead of blocking forever.
Server and client version
nats.c 3.10.1
nats-server 2.9.19
nats-streaming-server 0.25.5
Host environment
RHEL 9.4 x86_64
Steps to reproduce
I have not found a reliable way to reproduce this issue.