Skip to content

fix: handle language runtime shutdown gracefully to prevent thread panics#710

Closed
HexaField wants to merge 12 commits intocoasys:devfrom
HexaField:fix/teardown-thread-panic
Closed

fix: handle language runtime shutdown gracefully to prevent thread panics#710
HexaField wants to merge 12 commits intocoasys:devfrom
HexaField:fix/teardown-thread-panic

Conversation

@HexaField
Copy link
Contributor

@HexaField HexaField commented Mar 4, 2026

During neighbourhood teardown, three lang-* threads panic at js_core/mod.rs:267:53 because execute_script() returns an error after the V8 isolate is being torn down, and the result is .unwrap()ed.

This replaces .unwrap() with ? so the error propagates gracefully through the Result return type instead of panicking the thread. This allows clean thread termination during language_remove().

Discovered during memory profiling in #689 — these panics prevent complete memory cleanup during teardown (currently only 7.6% recovery).

Change

rust-executor/src/js_core/mod.rs: execute_async.unwrap()execute_async?

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added comprehensive memory profiling reports and leak investigation guides.
  • Bug Fixes

    • Fixed memory leaks during perspective removal and executor shutdown.
    • Improved cleanup of application-related resources and signal streams.
    • Enhanced error handling in script execution.
  • Chores

    • Added profiling and memory investigation tools for analysis and diagnostics.

HexaField and others added 12 commits February 21, 2026 18:03
- Baseline profiling: 355 MB startup, 750 MB post-init, ~78 MB per neighbourhood
- Leak investigation: 0% memory recovery on neighbourhood teardown
- perspectiveRemove does not uninstall Holochain hApps or free WASM runtimes
- Bare perspectives leak ~2.4 MB each, language cloning leaks ~4.2 MB each
- Includes reproduction scripts (profiler, leak tester, publish-langs)
Detailed code-level analysis tracing all three categories of memory leaks:
1. CRITICAL: Neighbourhood teardown leaks 100% - perspectiveRemove only sets a flag,
   never uninstalls Holochain hApps, Prolog pools, SurrealDB, or JS languages
2. Bare perspectives leak ~2.4 MB each (Prolog pools + SurrealDB not freed)
3. Language cloning leaks ~4.2 MB per clone (permanent, no unload path)

Includes exact file/line references, proposed fixes ordered by priority,
and architecture recommendations (lifecycle contract, reference counting).
CRITICAL fixes:
- Fix 1: Proper teardown_background_tasks that cleans up Prolog pools,
  SurrealDB, link language, subscribed queries, and batch store
- Fix 2: Add language_remove method to Rust LanguageController to call
  JS languageController.languageRemove() during teardown
- Fix 3: Clean up Holochain signal callbacks on language removal (both
  JS #signalCallbacks and Rust signal stream StreamMap)
- Rename _remove_perspective_pool to remove_perspective_pool

MEDIUM fixes:
- Fix 4: Add reference counting for languages in LanguageController.ts
  (languageAddRef/languageReleaseRef)
- Fix 5: Add SurrealDB shutdown() method that drops all data and indexes
Baseline vs patched binary comparison confirms:
- AD4M-layer teardown works correctly (SurrealDB, signals, languages)
- Holochain conductor retains ~140MB/neighbourhood after uninstall_app
- 0% memory recovery on both original and patched binaries
- Root cause is conductor-level wasmer/LMDB memory management

Updated leak-investigation.mjs with v2 improvements:
- Fixed GQL schema for DecoratedLinkExpression
- Added detailed smaps breakdown per test phase
- Added large anon mapping tracking across lifecycle
- Added teardown log verification
- Remove languageAddRef/languageReleaseRef and #languageRefCounts from
  LanguageController.ts — these were never called from any code path
- Simplify SurrealDB shutdown() to just log — SurrealDB uses in-memory
  storage (Surreal::new::<Mem>), so explicit DELETE/REMOVE INDEX is
  unnecessary; memory is freed when the Arc<Surreal<Db>> is dropped
…ge comment

- Move docs/profiling/ → profiling/ (docs/ is generated output)
- Fix comment: SurrealDB uses RocksDb, not in-memory storage
- Merge upstream/dev into profiling/memory-investigation
- Escape address in JS source embedding (critical security fix)
- Narrow error suppression in languageRemove to only expected cases
- Fix Prolog pool teardown lock contention (drop lock before await)
- Reorder teardown: end subscriptions before removing Prolog pools
- Add GraphQL error checking to all profiling scripts
- Add try/finally cleanup to all profiling scripts
- Validate inputs (DID, linkLanguages) before use
- Fix markdown lint issues in profiling docs
- Correct plan date and add fence languages for code blocks
- Accept deletion of JS executor files (moved to Rust in dev)
…dency

- Fail fast when AD4M init times out instead of silently continuing
- Add profiling/package.json declaring ws dependency
- Applied to leak-investigation.mjs, profiler-v9.mjs, publish-langs.mjs
…nguageController refactor

- Remove old static language_remove() that used js_core.execute() (no longer valid
  after LanguageController refactor removed the single shared JS runtime)
- Update perspective_instance.rs teardown to use the instance method
  LanguageController::global_instance().language_remove()
- Fix markdown table spacing in profiling results

Addresses review comments from @lucksus and @coderabbitai
- Critical 1: Fix subscribed_queries double-clear in teardown. Queries were
  cleared in step 1 then drained in step 5 for Prolog notifications, yielding
  nothing. Now drain+notify happens first (step 1), then pools are removed.

- Improvement 3: Replace no-op SurrealDB shutdown() call with direct log line.

- Improvement 4: Add TODO warning about missing language ref counting in
  teardown — removing a shared link language affects all perspectives using it.

- Nit: Fix duplicate step numbers (two step 6s → sequential 5, 6).
- Nit: Fix whitespace-only diff in languages/mod.rs:1020.
- Nit: Add pre-coasys#693 architecture note to profiling/refactoring-plan.md.
…anic during teardown

During neighbourhood teardown, language runtime threads panic at js_core/mod.rs:267
because execute_script() fails after the V8 isolate is being torn down, and the
result is unwrapped. Replacing .unwrap() with ? propagates the error gracefully
instead of panicking the thread.

Related: coasys#689
@HexaField
Copy link
Contributor Author

Reopening with correct base branch

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 4, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aad5147b-0ed4-4461-95ce-84f4365f3b05

📥 Commits

Reviewing files that changed from the base of the PR and between 6dfad94 and 7b44b5c.

📒 Files selected for processing (13)
  • profiling/README.md
  • profiling/leak-investigation-2026-02-21.md
  • profiling/leak-investigation.mjs
  • profiling/package.json
  • profiling/profiler-v9.mjs
  • profiling/profiling-results-2026-02-21.md
  • profiling/publish-langs.mjs
  • profiling/refactoring-plan.md
  • rust-executor/src/holochain_service/mod.rs
  • rust-executor/src/js_core/mod.rs
  • rust-executor/src/perspectives/perspective_instance.rs
  • rust-executor/src/prolog_service/mod.rs
  • rust-executor/src/surreal_service/mod.rs

📝 Walkthrough

Walkthrough

This PR introduces comprehensive memory leak profiling documentation, investigation tooling, and implements resource cleanup fixes across the AD4M executor. Changes include detailed investigation reports, Node.js profiling scripts, and Rust implementation fixes addressing signal stream cleanup, perspective lifecycle management, Prolog pool teardown, and SurrealDB shutdown sequencing.

Changes

Cohort / File(s) Summary
Profiling Documentation
profiling/README.md, profiling/leak-investigation-2026-02-21.md, profiling/profiling-results-2026-02-21.md, profiling/refactoring-plan.md
Comprehensive analysis of memory leaks across neighbourhood teardown, perspective lifecycle, language cloning, and link accumulation; detailed profiling results with RSS/memory metrics; multi-phase refactoring plan with prioritized fixes for Rust and JS components.
Profiling Scripts & Configuration
profiling/leak-investigation.mjs, profiling/profiler-v9.mjs, profiling/publish-langs.mjs, profiling/package.json
Node.js scripts orchestrating memory profiling, leak investigation, and bootstrap language publishing; WebSocket GraphQL client integration for remote testing; RSS/SMAPS measurement utilities; process lifecycle management.
Holochain Signal Stream Cleanup
rust-executor/src/holochain_service/mod.rs
Added removed_app_ids channel to track and clean up per-app signal streams on app removal, preventing memory retention of inactive streams.
Perspective Lifecycle & Resource Teardown
rust-executor/src/perspectives/perspective_instance.rs
Expanded teardown_background_tasks with comprehensive cleanup: Prolog subscription/pool teardown, SurrealDB shutdown, conditional link-language unloading, and batch store clearing.
Prolog Service Pool Management
rust-executor/src/prolog_service/mod.rs
Exposed public remove_perspective_pool API with lock-avoidant teardown sequencing; acquires lock, removes pool, releases lock, then awaits pool cleanup to prevent deadlocks.
SurrealDB Lifecycle Signaling
rust-executor/src/surreal_service/mod.rs
Added public async shutdown method on SurrealDBService for lifecycle signaling; relies on Arc reference counting for actual resource deallocation.
Error Handling Improvement
rust-executor/src/js_core/mod.rs
Replaced unwrap() with ? operator in execute_script error handling, converting potential panics to propagated errors.

Sequence Diagram(s)

sequenceDiagram
    participant PerspectiveInstance
    participant PrologService
    participant HolochainService
    participant SurrealDB as SurrealDBService
    participant LinkLanguage

    PerspectiveInstance->>PerspectiveInstance: teardown_background_tasks()
    PerspectiveInstance->>PerspectiveInstance: Send teardown signal
    
    PerspectiveInstance->>PrologService: Notify subscriptions ending
    PrologService->>PrologService: End all subscriptions
    
    PerspectiveInstance->>PrologService: remove_perspective_pool()
    PrologService->>PrologService: Acquire write lock
    PrologService->>PrologService: Remove pool from map
    PrologService->>PrologService: Release lock
    PrologService->>PrologService: Await pool._drop_all()
    PrologService-->>PerspectiveInstance: Pool cleanup complete
    
    PerspectiveInstance->>SurrealDB: shutdown()
    SurrealDB-->>PerspectiveInstance: Shutdown signaled
    
    PerspectiveInstance->>LinkLanguage: Conditional unload (if neighbourhood)
    LinkLanguage-->>PerspectiveInstance: Language unloaded
    
    PerspectiveInstance->>PerspectiveInstance: Clear batch store
    PerspectiveInstance->>PerspectiveInstance: Clear link language ref
    PerspectiveInstance->>PerspectiveInstance: Log teardown complete
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • lucksus
  • jhweir

Poem

🐰 A leak once flowed through every view,
So we traced each task and thread through—
Now signal streams and pools unwind,
With SurrealDB's peace of mind.
Clean teardowns spring from root to shoot! 🌱

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@HexaField HexaField closed this Mar 4, 2026
@HexaField HexaField deleted the fix/teardown-thread-panic branch March 4, 2026 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants