Skip to content

Fix DRSM nil pointer crashes in distributed deployments#197

Merged
gab-arrobo merged 4 commits intoomec-project:mainfrom
midwell:fix_drsm-nil-pointer-crashes
Oct 24, 2025
Merged

Fix DRSM nil pointer crashes in distributed deployments#197
gab-arrobo merged 4 commits intoomec-project:mainfrom
midwell:fix_drsm-nil-pointer-crashes

Conversation

@midwell
Copy link
Copy Markdown
Contributor

@midwell midwell commented Oct 22, 2025

Summary

Adds defensive nil checks in drsm/updates.go to prevent production crashes when processing MongoDB change stream events in distributed deployments. These fixes address critical stability issues that
occur during pod failovers, network partitions, and out-of-order event delivery.

Problem

The DRSM (Distributed Resource State Manager) MongoDB change stream handler crashes with nil pointer panics in production when:

  1. Out-of-order events: MongoDB change streams don't guarantee ordering across different documents. Chunk ownership updates can arrive before the chunk insert or pod keepalive events.
  2. Network partitions: During network issues between network functions and MongoDB, stale or incomplete update events may be delivered.
  3. Pod failover scenarios: When pods crash and chunks are reassigned, the change stream may contain references to chunks or pods not yet in local memory.
  4. Malformed events: Database corruption or manual operations can create update events with missing required fields.

I've seen multiple crashes related to these problems.

Current Crash Points

Line 156: No validation of owner field

owner := s.Update.UpdFields.PodId
// owner could be empty string ""

Line 163: No nil check after global table lookup

cp := d.globalChunkTbl[c]
// cp is nil if chunk not in table
cp.Owner.PodName = owner  // PANIC: nil pointer dereference

Line 165: No nil check before map access

podD := d.podMap[owner]
// podD is nil if pod not in map
podD.podChunks[c] = cp  // PANIC: nil pointer dereference

Impact Without Fix

  • Service outage: Network function crashes → all active sessions drop → subscribers lose connectivity
  • Cascading failures: Multiple instances can crash on the same stale event
  • Resource leaks: Silent data corruption from empty owner fields causes chunks to become unrecoverable

Changes

  1. Empty Owner Validation (Lines 157-160)
owner := s.Update.UpdFields.PodId
if owner == "" {
   logger.DrsmLog.Warnf("stream(Update): missing owner in update for doc %s", s.DId.Id)
   continue
}

Prevents: Assigning chunks to empty owner "", causing resource leaks and making chunks unrecoverable.

Scenario: Database corruption or manual MongoDB operations create incomplete documents.

  1. Nil Chunk Pointer Check (Lines 165-170)
cp := d.globalChunkTbl[c]
if cp == nil {
   logger.DrsmLog.Warnf("stream(Update): chunk %d not found in global table for owner %s - will be corrected by periodic resync", c, owner)
   continue
}

Prevents: Most common crash - cp.Owner.PodName = owner panics when chunk not yet in memory.

Scenario: Update event arrives before insert event, or chunk was deleted but update event is stale. The periodic checkAllChunks() task (runs every 3 seconds) maintains eventual consistency.

  1. Nil Pod Pointer Check (Lines 176-180)
podD, found := d.podMap[owner]
if !found || podD == nil {
   logger.DrsmLog.Warnf("stream(Update): pod %s not in local map for chunk %d update - will be corrected when keepalive arrives or during periodic resync", owner, c)
   continue
}

Prevents: Crash on podD.podChunks[c] = cp when pod not yet registered locally.

Scenario: Pod keepalive hasn't arrived yet, or arrived out-of-order with chunk update. Eventual consistency maintained by keepalive events and periodic resync.

  1. Nil Map Initialization (Lines 181-184)
if podD.podChunks == nil {
   podD.podChunks = make(map[int32]*chunk)
}

Prevents: Panic on map assignment if podChunks wasn't initialized (shouldn't happen if addPod() was called, but defensive).

Design Decision: Skip vs. Crash vs. Create

For all three nil checks, we skip the update with warning log rather than:

❌ Crash (current behavior): Causes service outage for all active sessions

❌ Create missing entries: Could create invalid state (e.g., pods without keepalives that never expire, chunks with wrong metadata)

✅ Skip and log: Safe approach because:

  • Periodic checkAllChunks() resync (every 3 seconds) corrects state from MongoDB
  • Normal event flow (insert/keepalive) will eventually process the resource
  • MongoDB is the source of truth
  • No service disruption
  • Warning logs enable debugging

@gab-arrobo
Copy link
Copy Markdown
Contributor

@midwell, thank you for your contribution. We will review your PR as soon as we fix an issue we are having with the GitHub Actions in this repository

@gab-arrobo
Copy link
Copy Markdown
Contributor

@midwell please rebase your PR. Thanks!

Add defensive nil checks to prevent crashes when processing MongoDB
change stream events in distributed deployments.

Fixes three crash scenarios:
1. Empty owner field - skip to prevent resource leaks
2. Nil chunk pointer - chunk not yet in global table (out-of-order events)
3. Nil pod pointer - pod not yet registered locally

Skip invalid updates with warning logs rather than crashing. Eventual
consistency maintained by periodic checkAllChunks() resync (3 seconds).

Tested with multiple instances during pod failures and
network partitions.

Signed-off-by: Edvin Lindqvist <edvin.lindqvist@forsway.com>
@midwell midwell force-pushed the fix_drsm-nil-pointer-crashes branch from 5e281f6 to 98ef42f Compare October 23, 2025 12:23
@midwell
Copy link
Copy Markdown
Contributor Author

midwell commented Oct 23, 2025

Rebase done.

@gab-arrobo gab-arrobo requested review from a team and Copilot October 23, 2025 13:38
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds critical defensive nil checks to the DRSM change stream handler to prevent nil pointer crashes in distributed deployments during pod failovers, network partitions, and out-of-order event processing.

Key Changes:

  • Validates owner field is non-empty before processing chunk ownership updates
  • Adds nil checks for chunk and pod lookups before dereferencing pointers
  • Includes defensive map initialization to prevent assignment panics

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@gab-arrobo
Copy link
Copy Markdown
Contributor

@midwell,
BTW, to create a release such that the NFs (amf and smf) can use the changes proposed in this PR, you need to modify the VERSION file by removing the -dev from its content. So, when your PR is approved/merged, a new release is going to be automatically created

diff --git a/VERSION b/VERSION
index 86a2e95..f01291b 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-1.5.7-dev
+1.5.7

@gab-arrobo gab-arrobo requested a review from a team October 23, 2025 23:10
midwell and others added 3 commits October 24, 2025 16:55
…ization

Co-authored-by: Arrobo, Gabriel <gabriel.arrobo@intel.com>
Signed-off-by: Edvin Lindqvist <edvin.lindqvist@forsway.com>
Signed-off-by: Edvin Lindqvist <edvin.lindqvist@forsway.com>
Copy link
Copy Markdown
Contributor

@gab-arrobo gab-arrobo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@gab-arrobo gab-arrobo merged commit d9d54c7 into omec-project:main Oct 24, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants