
[c++] handle socket connection-closed errors gracefully in distributed training #7178

Open
wagner-austin wants to merge 2 commits into microsoft:master from wagner-austin:fix-socket-error-crash-v2

@wagner-austin
Contributor

Summary

  • Handle socket connection-closed errors gracefully instead of crashing during distributed training shutdown
  • Add IsConnectionClosedError() helper to identify expected shutdown errors
  • Callers throw a catchable std::runtime_error instead of looping forever on a closed connection

Problem

During distributed training (e.g. via Dask), workers finish training and call free_network() independently. When one worker closes its sockets while another is still communicating, TcpSocket::Send() or TcpSocket::Recv() fails with ECONNRESET (code 54 on macOS, 104 on Linux) or EPIPE (code 32). The current code calls Log::Fatal() for any socket error, which prints to stderr and throws std::runtime_error deep in the socket layer, before the caller can handle it. This kills the worker process, and Dask reports it was "killed by signal 11".

This fix returns SOCKET_ERROR from the socket layer for connection-closed errors, letting the caller in Linkers::Send()/Linkers::Recv() throw an exception that propagates up through the Python bindings where Dask's existing exception handling can catch it gracefully.

CI error messages:

lightgbm.basic.LightGBMError: Socket recv error, Connection reset by peer (code: 54)
distributed.nanny:nanny.py:761 Worker process 52290 was killed by signal 11

Validation on fork CI: across 24 CI runs on the fix-socket-error-crash branch, the macOS gcc regular job (where the Dask crash occurs) passed 22 times and was cancelled twice, with 0 Dask crashes (vs. a ~30% baseline crash rate on upstream).

Changes

src/network/socket_wrapper.hpp:

  • Add IsConnectionClosedError() helper to identify connection-closed errors (ECONNRESET, EPIPE, ENOTCONN, ESHUTDOWN and Windows equivalents)
  • In Send() and Recv(): return SOCKET_ERROR for connection-closed errors instead of calling Log::Fatal(), letting callers handle gracefully
  • Other socket errors still call Log::Fatal() as before
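For readers who want a concrete picture of the helper described above, here is a minimal sketch of what an error classifier along these lines could look like. It is not the code from this PR: the signature (taking the error code as an `int`) and the exact set of Windows codes are assumptions made for illustration.

```cpp
#include <cerrno>
#if defined(_WIN32)
#include <winsock2.h>
#endif

// Sketch only: the signature and the Windows code list are assumptions,
// not taken from the PR's diff.
// Returns true when `err` means the peer has gone away, which is expected
// while distributed workers shut down independently.
inline bool IsConnectionClosedError(int err) {
#if defined(_WIN32)
  return err == WSAECONNRESET || err == WSAECONNABORTED ||
         err == WSAENOTCONN || err == WSAESHUTDOWN;
#else
  return err == ECONNRESET || err == EPIPE ||
         err == ENOTCONN || err == ESHUTDOWN;
#endif
}
```

With a classifier like this, Send() and Recv() can return SOCKET_ERROR when it reports true and keep calling Log::Fatal() for every other error code.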

src/network/linkers.h:

  • In Linkers::Send() and Linkers::Recv(): check return value and throw std::runtime_error instead of looping forever on a closed connection
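The "looping forever" failure mode is easiest to see in a send loop that accumulates the return value: if the socket layer returns a negative error code, the byte counter never reaches the target. The sketch below illustrates the checked version under stated assumptions. `SendAll`, `SendFn`, and `kSocketError` are hypothetical names standing in for the real Linkers code and the platform's SOCKET_ERROR constant (assumed here to be -1, as on POSIX).

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for SOCKET_ERROR (assumed to be -1, as on POSIX).
constexpr int kSocketError = -1;

// Hypothetical stand-in for a socket's Send(): returns the number of bytes
// sent, or kSocketError when the connection is closed.
using SendFn = std::function<int(const char*, int)>;

// Sends all `len` bytes, throwing a catchable exception instead of spinning
// when the underlying send reports a closed connection.
inline void SendAll(const SendFn& send, const char* data, int len) {
  int sent = 0;
  while (sent < len) {
    const int n = send(data + sent, len - sent);
    if (n == kSocketError) {
      throw std::runtime_error("Socket send failed: connection closed");
    }
    sent += n;
  }
}
```

Without the check, adding -1 to `sent` on every iteration would keep `sent < len` true indefinitely. Throwing here lets the error propagate to the caller.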

src/network/linkers_socket.cpp:

  • In ListenThread(): check return value of Recv() during initialization
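The same pattern applies to the fixed-size read performed while connections are being set up. The following is a rough sketch, not the PR's implementation: `RecvRank`, `RecvFn`, and `kSocketError` are hypothetical names, and it is an added assumption that a return value of 0 (orderly close by the peer) is treated as failure and that failure is reported by returning `false` rather than throwing.

```cpp
#include <cstring>
#include <functional>

// Hypothetical stand-in for SOCKET_ERROR (assumed to be -1, as on POSIX).
constexpr int kSocketError = -1;

// Hypothetical stand-in for a socket's Recv(): returns bytes read,
// 0 on orderly close, or kSocketError on a closed connection.
using RecvFn = std::function<int(char*, int)>;

// Reads exactly sizeof(int) bytes into *out_rank.
// Returns false if the peer goes away before the value is complete.
inline bool RecvRank(const RecvFn& recv, int* out_rank) {
  char buffer[sizeof(int)];
  const int want = static_cast<int>(sizeof(int));
  int got = 0;
  while (got < want) {
    const int n = recv(buffer + got, want - got);
    if (n == kSocketError || n == 0) {
      return false;  // assumption: the caller decides how to react
    }
    got += n;
  }
  std::memcpy(out_rank, buffer, sizeof(int));
  return true;
}
```

Returning a status instead of throwing is only one possible design. It suits a listener that may want to drop a single bad connection and keep accepting others.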

Test Plan

  • Pre-commit passes (cpplint, typos, whitespace)
  • Windows build (MSVC) compiles
  • C++ unit tests pass (31/31)
  • Linux build (GCC via WSL) compiles
  • Fork CI: 24 runs, macOS Dask test job passed 22/24 (2 cancelled), 0 Dask crashes

Related Issue

Contributes to: #4074
Contributes to: #6197
Contributes to: #5963

…d training

During distributed training shutdown, workers can close sockets while peers
are still communicating, causing Log::Fatal to crash the process with
ECONNRESET or EPIPE. Return SOCKET_ERROR for connection-closed errors
instead of crashing, and throw a catchable std::runtime_error from callers.