[c++] handle socket connection-closed errors gracefully in distributed training#7178
Open
wagner-austin wants to merge 2 commits into microsoft:master
…d training

During distributed training shutdown, workers can close sockets while peers are still communicating, causing Log::Fatal to crash the process with ECONNRESET or EPIPE. Return SOCKET_ERROR for connection-closed errors instead of crashing, and throw a catchable std::runtime_error from callers.
Summary
- Add `IsConnectionClosedError()` helper to identify expected shutdown errors
- Throw `std::runtime_error` instead of looping forever on a closed connection

Problem

During distributed training (e.g. via Dask), workers finish training and call `free_network()` independently. When one worker closes its sockets while another is still communicating, `TcpSocket::Send()` or `TcpSocket::Recv()` receives ECONNRESET (code 54 on macOS, 104 on Linux) or EPIPE (code 32). The current code calls `Log::Fatal()` for any socket error, which prints to stderr and throws `std::runtime_error` deep in the socket layer, before the caller can handle it. This kills the worker process, and Dask reports it was "killed by signal 11".

This fix returns `SOCKET_ERROR` from the socket layer for connection-closed errors, letting the caller in `Linkers::Send()` / `Linkers::Recv()` throw an exception that propagates up through the Python bindings, where Dask's existing exception handling can catch it gracefully.

CI error messages:
Validation on fork CI: across 24 CI runs on the `fix-socket-error-crash` branch, the macOS gcc regular job (where the Dask crash occurs) passed 22 times and was cancelled twice, with 0 Dask crashes (vs. a ~30% baseline crash rate on upstream).

Changes
`src/network/socket_wrapper.hpp`:
- Add `IsConnectionClosedError()` helper to identify connection-closed errors (ECONNRESET, EPIPE, ENOTCONN, ESHUTDOWN and Windows equivalents)
- `Send()` and `Recv()`: return `SOCKET_ERROR` for connection-closed errors instead of calling `Log::Fatal()`, letting callers handle them gracefully
- Other socket errors still call `Log::Fatal()` as before

`src/network/linkers.h`:
- `Linkers::Send()` and `Linkers::Recv()`: check return value and throw `std::runtime_error` instead of looping forever on a closed connection

`src/network/linkers_socket.cpp`:
- `ListenThread()`: check return value of `Recv()` during initialization

Test Plan
Related Issue
Contributes to: #4074
Contributes to: #6197
Contributes to: #5963
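
For readers who want the shape of the change without opening the diff, the pattern this PR describes (classify connection-closed errors in the socket layer, then have the caller throw instead of retrying) might look roughly like the sketch below. It is not the PR's code: `IsConnectionClosedSketch`, `SendAllSketch`, `SendFn`, and `kSocketError` are hypothetical names, the `-1` sentinel is an assumption, and only POSIX `errno` values are covered (no Windows equivalents).

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the SOCKET_ERROR sentinel; the value -1 is an assumption.
constexpr int kSocketError = -1;

// Sketch of a classifier for "the peer went away" errors (POSIX errno values only).
inline bool IsConnectionClosedSketch(int err) {
  if (err == ECONNRESET || err == EPIPE || err == ENOTCONN) {
    return true;
  }
#ifdef ESHUTDOWN  // not defined on every platform
  if (err == ESHUTDOWN) {
    return true;
  }
#endif
  return false;
}

// Hypothetical low-level send primitive: returns bytes written, or kSocketError.
using SendFn = std::function<int(const char* data, int len)>;

// Caller-side loop: keep sending until every byte is written, but throw a
// catchable exception as soon as the socket layer reports an error, rather
// than retrying forever on a closed connection.
inline void SendAllSketch(const SendFn& send, const char* data, int len) {
  assert(len >= 0);
  int sent = 0;
  while (sent < len) {
    const int n = send(data + sent, len - sent);
    if (n == kSocketError) {
      throw std::runtime_error("socket send failed: connection closed by peer");
    }
    sent += n;
  }
}
```

The low-level send is passed in as a callback only so the sketch can be exercised without a real socket. A caller would wrap `SendAllSketch` in a `try`/`catch` for `std::runtime_error` at whatever layer is able to shut down cleanly.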