
Rewrite PG to fix race condition problems #411

Merged
slawlor merged 1 commit into main from pg_fixes on Mar 6, 2026
Conversation

@slawlor
Owner

@slawlor slawlor commented Mar 6, 2026


  1. ractor/src/pg.rs — Full rewrite of PgState and all functions

  New GroupState struct bundles members and per-group listeners into one atomically-accessible unit:
  struct GroupState {
      members: HashMap<ActorId, ActorCell>,
      listeners: Vec<ActorCell>,
  }

  New PgState still uses three maps, but reorganized so related state is no longer split across them:
  - map: DashMap<ScopeGroupKey, GroupState> — unified members + per-group listeners
  - index: DashMap<ScopeName, HashSet<GroupName>> — secondary index (now uses HashSet instead of Vec)
  - world_listeners: DashMap<ScopeGroupKey, Vec<ActorCell>> — scope-level/global monitors only
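A minimal sketch of the shapes described above, using std `Mutex<HashMap<..>>` in place of `DashMap` so it stays dependency-free; the type aliases and `ActorCell` stub are illustrative stand-ins, not ractor's real definitions:

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

// Hypothetical stand-ins for ractor's real types, for illustration only.
type ScopeName = String;
type GroupName = String;
type ScopeGroupKey = (ScopeName, GroupName);
type ActorId = u64;

#[derive(Clone, Debug, PartialEq)]
struct ActorCell(ActorId);

// Members and per-group listeners share one entry, so a single entry
// lock covers both (the cross-map inconsistency fix, #1).
#[derive(Default)]
struct GroupState {
    members: HashMap<ActorId, ActorCell>,
    listeners: Vec<ActorCell>,
}

// Mutex<HashMap<..>> stands in for DashMap to keep this sketch std-only;
// the real implementation relies on DashMap's sharded locking.
#[derive(Default)]
struct PgState {
    map: Mutex<HashMap<ScopeGroupKey, GroupState>>,
    index: Mutex<HashMap<ScopeName, HashSet<GroupName>>>,
    world_listeners: Mutex<HashMap<ScopeGroupKey, Vec<ActorCell>>>,
}
```

The key point of the layout is that a lookup by `ScopeGroupKey` yields members and per-group listeners under one lock, while `index` only accelerates scope-wide queries.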

  Key fixes:
  - Cross-map inconsistency (#1): Members and per-group listeners are now in one entry, atomically updated
  - Zombie actor insertion (#3): join_scoped filters out actors with status > Draining
  - Notifications under lock (#4): All functions clone listeners and drop entry guards before sending notifications
  - leave_all misses global monitors (#5): New notify_world_listeners helper checks both scoped and ALL_SCOPES_NOTIFICATION keys
  - demonitor_scope incomplete (#6): Now removes from world_listeners directly (clean)
  - monitor_scope duplicate notifications (#7): Registers ONLY in world_listeners, not also per-group
  - leave_all cleanup race (#9): Re-checks emptiness under entry lock before removing
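The "clone listeners, drop the guard, then notify" pattern behind fix #4 can be sketched as follows. This is a simplified model, again with `Mutex<HashMap<..>>` standing in for `DashMap` and a hypothetical `notify` callback standing in for actual message sends:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type ActorId = u64;

#[derive(Clone, Debug, PartialEq)]
struct ActorCell(ActorId);

struct GroupState {
    members: HashMap<ActorId, ActorCell>,
    listeners: Vec<ActorCell>,
}

// Fix #4 in miniature: snapshot the listener list while the entry lock
// is held, drop the guard, then notify without any lock. Sending while
// the guard is live risks deadlock if a notified actor re-enters the
// same shared state.
fn join_and_notify(
    groups: &Mutex<HashMap<String, GroupState>>,
    group: &str,
    joiner: ActorCell,
    notify: impl Fn(&ActorCell),
) {
    let snapshot = {
        let mut guard = groups.lock().unwrap();
        let entry = guard.entry(group.to_string()).or_insert_with(|| GroupState {
            members: HashMap::new(),
            listeners: Vec::new(),
        });
        entry.members.insert(joiner.0, joiner);
        entry.listeners.clone() // copy out under the lock
    }; // entry guard dropped here, before any notification
    for listener in &snapshot {
        notify(listener); // lock-free sends
    }
}
```

The same shape covers the leave paths: mutate and snapshot inside the scope block, notify outside it.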

  2. ractor/src/actor/actor_cell.rs — Guard double cleanup (#8)

  Added && self.get_status() < ActorStatus::Stopping condition so cleanup only runs once on the first transition to Stopping, avoiding redundant full-DashMap iterations.
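The guard condition reduces to a comparison on the current status. In this sketch the `ActorStatus` variant set and ordering are assumptions mirroring ractor's enum, not its exact definition:

```rust
// Illustrative model of the fix-#8 guard: cleanup runs only on the
// first transition into Stopping. Variant order is assumed, chosen so
// that derived PartialOrd ranks lifecycle stages chronologically.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
enum ActorStatus {
    Starting,
    Running,
    Draining,
    Stopping,
    Stopped,
}

// If the status is already Stopping (or Stopped), a previous transition
// has performed cleanup, so skip the full-DashMap walk.
fn should_run_cleanup(current: ActorStatus) -> bool {
    current < ActorStatus::Stopping
}
```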

  3. ractor/src/pg/tests.rs — 3 new tests

  - test_monitor_scope_no_duplicate_notifications — verifies exactly 1 notification when monitor_scope is called after group exists
  - test_demonitor_scope_fully_unsubscribes — verifies no notifications after demonitor_scope
  - test_stopped_actor_not_inserted — verifies stopped actors are rejected by join

Resolves #374
@codecov

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 86.52174% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.73%. Comparing base (e6174c8) to head (26c5fa6).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
ractor/src/pg/tests.rs 71.92% 16 Missing ⚠️
ractor/src/pg.rs 91.22% 15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #411      +/-   ##
==========================================
+ Coverage   85.56%   85.73%   +0.16%     
==========================================
  Files          73       73              
  Lines       13329    13361      +32     
==========================================
+ Hits        11405    11455      +50     
+ Misses       1924     1906      -18     

@github-actions

github-actions bot commented Mar 6, 2026

Benchmark Comparison (PR vs main)

Results
group                                                                   new
-----                                                                   ---
Creation of 100 actors                                                  1.00   443.1±29.58µs        ? ?/sec
Creation of 1000 actors                                                 1.00      3.9±0.11ms        ? ?/sec
Waiting on 100 actors to process first message                          1.00   104.5±16.16µs        ? ?/sec
Waiting on 100000 messages to be processed                              1.00     32.0±2.67ms        ? ?/sec
Waiting on 50 messages with large data in the Future to be processed    1.00     70.3±5.61µs        ? ?/sec
Waiting on 500 actors to process first message                          1.00   243.1±67.71µs        ? ?/sec
factory_queuer_dispatch/workers/10                                      1.00  1509.5±44.73µs 323.5 KElem/sec
factory_queuer_dispatch/workers/100                                     1.00  1511.0±58.04µs 323.2 KElem/sec
large_messages/256_byte_messages                                        1.00     66.5±2.90µs 1467.7 KElem/sec
large_messages/64_byte_messages                                         1.00     57.0±4.53µs 1712.8 KElem/sec
small_messages/16_byte_messages                                         1.00     54.9±4.35µs 1778.5 KElem/sec
small_messages/32_byte_messages                                         1.00     54.0±2.73µs 1808.4 KElem/sec
small_messages/8_byte_messages                                          1.00     56.5±3.23µs 1728.6 KElem/sec
spawn_throughput/10                                                     1.00     59.5±0.75µs 164.1 KElem/sec
spawn_throughput/100                                                    1.00    455.3±7.88µs 214.5 KElem/sec

Benchmarks run on shared CI hardware (ubuntu-latest). Differences under ~5% may be noise.

@slawlor slawlor marked this pull request as ready for review March 6, 2026 19:39
@slawlor slawlor merged commit d81cfda into main Mar 6, 2026
41 of 42 checks passed
@slawlor slawlor deleted the pg_fixes branch March 6, 2026 19:39


Development

Successfully merging this pull request may close these issues.

Potential Race conditions in pg::