Fix WebRTC light client sync stalls #2222

Open
timwu20 wants to merge 4 commits into smol-dot:main from ChainSafe:tim/haiko-webrtc-deadlock-fix

timwu20 commented Mar 20, 2026

Summary

Fixes several issues that cause smoldot light clients connecting over WebRTC to stall during or after warp sync. The stalls were observed against a litep2p-based Polkadot node: the light client would connect and complete the warp sync handshake, but then fail to receive block announcements or Grandpa messages.

Root causes identified and fixed

  1. Missed wakeup in WebRTC multi-stream task loop — When the coordinator accepts an inbound notification substream (AcceptInNotifications), the handshake response bytes are queued internally, but the platform's wait_read_write_again only wakes on incoming network data or timers. On WebRTC, each substream is a separate data channel, so if the remote peer is waiting for the handshake response before sending data on that channel, neither side makes progress. Fixed by adding an event_listener::Event that wakes all substream futures when inject_coordinator_message queues write data.

  2. Warp sync source starts with finalized_block_height = 0 — When a new peer is added as a warp sync source, its finalized_block_height was initialized to 0 instead of the best_block_number reported in the gossip handshake. Warp sync only triggers when source.finalized_block_height > warped_header_number + 32, so a source at height 0 would never trigger warp sync until a Grandpa neighbor packet arrived (which requires established notification substreams — a chicken-and-egg problem on first connect). Fixed by threading best_block_number through add_source().

  3. UnknownTargetBlock justification error causes 40-second ban — During initial sync, the finality target block hasn't been imported yet, so justification verification returns UnknownTargetBlock. This was treated as a ban-worthy error, preventing the only connected peer from being used for 40 seconds. Fixed by excluding this specific error from the ban logic in the standalone sync service.

  4. Tx/Grandpa outbound substream retry hammering — When outbound Transactions or Grandpa notification substreams are refused by the peer, smoldot retried immediately with zero delay in a tight loop (~30 retries/second). On WebRTC, this starves the connection and prevents other traffic from flowing. Additionally, litep2p requires these outbound substreams to be negotiated before it considers the notification protocols established — without them, it won't send block announcements even though the block announce substream itself was successfully negotiated. Fixed by replacing the immediate retry with a deferred retry queue: failed attempts are stored with a 2-second delay, processed at the top of next_event(), and an async timer branch in the event loop ensures retries fire at the correct time rather than piggy-backing on unrelated events.

Changes by commit

1. Fix WebRTC notification handshake stall in multi-stream task loop

  • Add coordinator_write_ready event to wake substream futures when inject_coordinator_message queues write data
  • Substream wait futures now race wait_read_write_again against the write-ready notification
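The race described in these two bullets can be sketched roughly as follows. This is a minimal, std-only illustration, not the code in this PR: the actual change uses event_listener::Event, and the names `WriteReady`, `Race`, `Woken`, and `CountingWaker` are hypothetical stand-ins. The `network` field plays the role of wait_read_write_again, and `write_ready` plays the role of the coordinator_write_ready notification.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

// State shared between the notifier and the waiting future.
#[derive(Default)]
struct Shared {
    ready: bool,
    waker: Option<Waker>,
}

/// Hypothetical stand-in for the write-ready notification.
#[derive(Clone, Default)]
pub struct WriteReady(Arc<Mutex<Shared>>);

impl WriteReady {
    /// Called when write data has been queued: marks the flag and wakes the waiter.
    pub fn notify(&self) {
        let mut shared = self.0.lock().unwrap();
        shared.ready = true;
        let waker = shared.waker.take();
        drop(shared); // release the lock before waking
        if let Some(waker) = waker {
            waker.wake();
        }
    }
}

impl Future for WriteReady {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut shared = self.0.lock().unwrap();
        if shared.ready {
            shared.ready = false;
            Poll::Ready(())
        } else {
            // Remember who to wake once `notify` is called.
            shared.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

/// Which of the two sources ended the wait.
#[derive(Debug, PartialEq, Eq)]
pub enum Woken {
    Network,
    CoordinatorWrite,
}

/// Resolves as soon as either the network wait or the write-ready signal fires.
pub struct Race<A, B> {
    pub network: A,
    pub write_ready: B,
}

impl<A, B> Future for Race<A, B>
where
    A: Future<Output = ()> + Unpin,
    B: Future<Output = ()> + Unpin,
{
    type Output = Woken;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Woken> {
        if Pin::new(&mut self.network).poll(cx).is_ready() {
            return Poll::Ready(Woken::Network);
        }
        if Pin::new(&mut self.write_ready).poll(cx).is_ready() {
            return Poll::Ready(Woken::CoordinatorWrite);
        }
        Poll::Pending
    }
}

/// Test helper: a waker that only counts how often it was woken.
pub struct CountingWaker(pub AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}
```

With a network future that never resolves, the waiter still makes progress once `notify` is called, which is the property the missed-wakeup fix relies on.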

2. Fix warp sync initialization and UnknownTargetBlock ban during initial sync

  • warp_sync::AddSource now takes best_block_number parameter, used as initial finalized_block_height
  • Thread best_block_number through all::AddSource* structs
  • Exclude UnknownTargetBlock justification errors from peer ban in standalone sync service
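To illustrate why the initial height matters, here is a small sketch of the trigger condition and the ban exclusion. It assumes only what the description above states (the `+ 32` threshold and the UnknownTargetBlock exclusion); `Source`, `add_source_before`, `add_source_after`, `should_ban`, and the `InvalidSignature` variant are hypothetical names for illustration and are not smoldot's actual types.

```rust
/// Hypothetical stand-in for a warp sync source.
#[derive(Debug, Clone, Copy)]
pub struct Source {
    pub finalized_block_height: u64,
}

/// Gap quoted in the PR description: warp sync only triggers beyond it.
pub const WARP_SYNC_MINIMUM_GAP: u64 = 32;

/// Trigger condition: source.finalized_block_height > warped_header_number + 32.
pub fn should_warp_sync(source: &Source, warped_header_number: u64) -> bool {
    source.finalized_block_height > warped_header_number.saturating_add(WARP_SYNC_MINIMUM_GAP)
}

/// Behavior before the fix: every new source starts at height 0.
pub fn add_source_before() -> Source {
    Source { finalized_block_height: 0 }
}

/// Behavior after the fix: seed the height from the gossip handshake.
pub fn add_source_after(best_block_number: u64) -> Source {
    Source { finalized_block_height: best_block_number }
}

/// Hypothetical error type; only `UnknownTargetBlock` is named in the PR.
#[derive(Debug, PartialEq, Eq)]
pub enum JustificationError {
    UnknownTargetBlock,
    InvalidSignature, // placeholder for any other, ban-worthy error
}

/// Ban on every justification error except `UnknownTargetBlock`.
pub fn should_ban(error: &JustificationError) -> bool {
    !matches!(error, JustificationError::UnknownTargetBlock)
}
```

A source seeded with height 0 can never satisfy the condition, whereas one seeded from the handshake satisfies it as soon as the peer is more than 32 blocks ahead of the warped header.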

3. Add diagnostic logging and deferred notification retry mechanism

  • Add substream_id to connection-activity log line in multi-stream task loop
  • Add GossipInboundResult event to surface inbound notification substream outcomes
  • Replace immediate Tx/Grandpa retry with PendingNotificationOutRetry queue (2-second delay)
  • Add next_notification_retry_time() method to ChainNetwork for caller timer integration
  • Add async timer branches in both light-base and full-node event loops
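The deferred retry mechanism can be pictured with the sketch below. It is a simplified, std-only model under stated assumptions: `RetryQueue`, `PendingRetry`, `Protocol`, and the `u32` peer id are hypothetical and do not mirror the PendingNotificationOutRetry type or the ChainNetwork API. `next_retry_time` corresponds in spirit to next_notification_retry_time(), and `take_due` to the processing step at the top of next_event().

```rust
use std::time::{Duration, Instant};

/// Delay quoted in the PR description.
pub const RETRY_DELAY: Duration = Duration::from_secs(2);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Protocol {
    Transactions,
    Grandpa,
}

/// One refused outbound substream waiting to be retried (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PendingRetry {
    pub peer: u32,
    pub protocol: Protocol,
    pub retry_at: Instant,
}

#[derive(Debug, Default)]
pub struct RetryQueue {
    pending: Vec<PendingRetry>,
}

impl RetryQueue {
    /// Record a refusal instead of retrying immediately.
    pub fn on_refused(&mut self, peer: u32, protocol: Protocol, now: Instant) {
        self.pending.push(PendingRetry { peer, protocol, retry_at: now + RETRY_DELAY });
    }

    /// Earliest deadline, so the caller's event loop can arm a timer for it.
    pub fn next_retry_time(&self) -> Option<Instant> {
        self.pending.iter().map(|retry| retry.retry_at).min()
    }

    /// Remove and return every retry whose deadline has passed.
    pub fn take_due(&mut self, now: Instant) -> Vec<PendingRetry> {
        let (due, later): (Vec<_>, Vec<_>) = std::mem::take(&mut self.pending)
            .into_iter()
            .partition(|retry| retry.retry_at <= now);
        self.pending = later;
        due
    }
}
```

Exposing the earliest deadline lets the event loop sleep until a retry is actually due, rather than relying on unrelated events to drive the queue.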

4. Add WebRTC diagnostic logging and fix inbound data channel handling

  • Route browser-side WebRTC diagnostics through smoldot's log system via logCallback
  • Add hasNegotiated guard to prevent unnecessary SDP re-negotiation
  • Handle already-open inbound data channels whose onopen event was missed

haikoschol and others added 4 commits February 18, 2026 19:58
…l sync

Initialize warp sync source finalized_block_height with best_block_number
from gossip handshake instead of 0, so warp sync triggers immediately
without waiting for a GrandPa neighbor packet. Also exclude
UnknownTargetBlock justification errors from the 40s ban, as they are
benign during initial catchup sync.

Add substream_id to connection-activity log in multi-stream task loop
to identify which WebRTC data channel each read/write belongs to.

Add GossipInboundResult event to surface inbound notification substream
outcomes (accepted, rejected duplicate, rejected cold-open) for debugging.

Replace immediate Tx/Grandpa substream retry-on-failure with a 2-second
deferred retry to avoid starving WebRTC connections with rapid substream
open attempts.

Route browser-side WebRTC diagnostics through smoldot's log system via
a new logCallback field on ConnectionConfig. Add a hasNegotiated guard
to prevent unnecessary SDP re-negotiation in webrtc-direct mode, and
handle already-open inbound data channels whose onopen event was missed.
timwu20 marked this pull request as ready for review March 20, 2026 20:17
timwu20 requested a review from tomaka as a code owner March 20, 2026 20:17