Fix DRSM nil pointer crashes in distributed deployments#197
gab-arrobo merged 4 commits into omec-project:main from
@midwell, thank you for your contribution. We will review your PR as soon as we fix an issue we are having with the GitHub Actions in this repository.
@midwell please rebase your PR. Thanks!
Add defensive nil checks to prevent crashes when processing MongoDB change stream events in distributed deployments.
Fixes three crash scenarios:
1. Empty owner field - skip to prevent resource leaks
2. Nil chunk pointer - chunk not yet in global table (out-of-order events)
3. Nil pod pointer - pod not yet registered locally
Skip invalid updates with warning logs rather than crashing. Eventual consistency maintained by periodic checkAllChunks() resync (3 seconds).
Tested with multiple instances during pod failures and network partitions.
Signed-off-by: Edvin Lindqvist <edvin.lindqvist@forsway.com>
5e281f6 to
98ef42f
Rebase done.
Pull Request Overview
This PR adds critical defensive nil checks to the DRSM change stream handler to prevent nil pointer crashes in distributed deployments during pod failovers, network partitions, and out-of-order event processing.
Key Changes:
- Validates owner field is non-empty before processing chunk ownership updates
- Adds nil checks for chunk and pod lookups before dereferencing pointers
- Includes defensive map initialization to prevent assignment panics
@midwell,
diff --git a/VERSION b/VERSION
index 86a2e95..f01291b 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-1.5.7-dev
+1.5.7
…ization
Co-authored-by: Arrobo, Gabriel <gabriel.arrobo@intel.com>
Signed-off-by: Edvin Lindqvist <edvin.lindqvist@forsway.com>
Signed-off-by: Edvin Lindqvist <edvin.lindqvist@forsway.com>
Summary
Adds defensive nil checks in drsm/updates.go to prevent production crashes when processing MongoDB change stream events in distributed deployments. These fixes address critical stability issues that occur during pod failovers, network partitions, and out-of-order event delivery.
Problem
The DRSM (Distributed Resource Sharing Module) MongoDB change stream handler crashes with nil pointer panics in production when:
I've seen multiple crashes related to these problems.
Current Crash Points
Line 156: No validation of owner field
Line 163: No nil check after global table lookup
Line 165: No nil check before map access
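The referenced lines are not reproduced here, so the following is only a minimal sketch of why these three spots can panic in Go. The types `podID`, `chunk`, and `podData`, the two maps, and the helper `tryUpdate` are hypothetical stand-ins, not the actual DRSM definitions; only the two statements `cp.Owner.PodName = owner` and `podD.podChunks[c] = cp` are taken from the description. The sketch recovers from each panic purely so that all cases can be shown in one run.

```go
package main

import "fmt"

// Hypothetical stand-ins for the DRSM types (assumptions, not the real definitions).
type podID struct{ PodName string }
type chunk struct{ Owner podID }
type podData struct{ podChunks map[int32]*chunk }

// tryUpdate runs the unguarded update and converts a runtime panic into an error
// so that the failure modes can be observed side by side.
func tryUpdate(chunks map[int32]*chunk, pods map[string]*podData, c int32, owner string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v", r)
		}
	}()
	cp := chunks[c]          // a missing key yields the zero value, i.e. a nil *chunk
	cp.Owner.PodName = owner // panics if cp is nil
	podD := pods[owner]      // a missing key yields a nil *podData
	podD.podChunks[c] = cp   // panics if podD is nil or podChunks is a nil map
	return nil
}

func main() {
	chunks := map[int32]*chunk{1: {}}
	pods := map[string]*podData{
		"pod-a": {podChunks: map[int32]*chunk{}},
		"pod-b": {}, // registered, but podChunks was never initialized
	}
	fmt.Println(tryUpdate(chunks, pods, 2, "pod-a")) // chunk not in the table
	fmt.Println(tryUpdate(chunks, pods, 1, "pod-c")) // pod not registered
	fmt.Println(tryUpdate(chunks, pods, 1, "pod-b")) // nil podChunks map
	fmt.Println(tryUpdate(chunks, pods, 1, "pod-a")) // valid update
}
```

Note that in the second and third calls the chunk's owner has already been overwritten by the time the panic occurs, so an unguarded update can also leave partially applied state behind.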
Impact Without Fix
Changes
Prevents: Assigning chunks to empty owner "", causing resource leaks and making chunks unrecoverable.
Scenario: Database corruption or manual MongoDB operations create incomplete documents.
Prevents: The most common crash, where cp.Owner.PodName = owner panics because the chunk is not yet in memory.
Scenario: Update event arrives before insert event, or chunk was deleted but update event is stale. The periodic checkAllChunks() task (runs every 3 seconds) maintains eventual consistency.
Prevents: Crash on podD.podChunks[c] = cp when pod not yet registered locally.
Scenario: Pod keepalive hasn't arrived yet, or arrived out-of-order with chunk update. Eventual consistency maintained by keepalive events and periodic resync.
Prevents: Panic on map assignment if podChunks wasn't initialized (shouldn't happen if addPod() was called, but defensive).
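The four guards described above could be combined roughly as follows. This is a sketch under stated assumptions rather than the code merged in this PR: the `state` struct, its field names, the error values, and the function `applyOwnerUpdate` are all hypothetical, and the real handler in drsm/updates.go may order or log the checks differently.

```go
package main

import (
	"errors"
	"log"
)

// Hypothetical stand-ins for the DRSM types (assumptions, not the real definitions).
type podID struct{ PodName string }
type chunk struct{ Owner podID }
type podData struct{ podChunks map[int32]*chunk }

// state is a hypothetical container for the in-memory tables.
type state struct {
	globalChunks map[int32]*chunk
	pods         map[string]*podData
}

var (
	errEmptyOwner   = errors.New("empty owner field")
	errUnknownChunk = errors.New("chunk not found in global table")
	errUnknownPod   = errors.New("pod not registered locally")
)

// applyOwnerUpdate validates everything before mutating anything, so a skipped
// update leaves the tables untouched.
func (s *state) applyOwnerUpdate(c int32, owner string) error {
	if owner == "" {
		return errEmptyOwner
	}
	cp := s.globalChunks[c]
	if cp == nil {
		return errUnknownChunk
	}
	podD := s.pods[owner]
	if podD == nil {
		return errUnknownPod
	}
	if podD.podChunks == nil {
		// Defensive initialization; not expected if the pod was added normally.
		podD.podChunks = make(map[int32]*chunk)
	}
	cp.Owner.PodName = owner
	podD.podChunks[c] = cp
	return nil
}

func main() {
	s := &state{
		globalChunks: map[int32]*chunk{1: {}},
		pods:         map[string]*podData{"pod-a": {}},
	}
	updates := []struct {
		c     int32
		owner string
	}{{1, ""}, {2, "pod-a"}, {1, "pod-b"}, {1, "pod-a"}}
	for _, u := range updates {
		if err := s.applyOwnerUpdate(u.c, u.owner); err != nil {
			// Skip and warn instead of crashing; a later resync is expected to repair state.
			log.Printf("skipping update for chunk %d: %v", u.c, err)
		}
	}
}
```

Returning an error and logging at the call site keeps the skip decision in one place. Running all checks before the first assignment avoids the partially applied update that an early write followed by a failed lookup would leave behind.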
Design Decision: Skip vs. Crash vs. Create
For all three nil checks, we skip the update and log a warning rather than:
❌ Crash (current behavior): Causes service outage for all active sessions
❌ Create missing entries: Could create invalid state (e.g., pods without keepalives that never expire, chunks with wrong metadata)
✅ Skip and log: Safe approach because: