fix: handle language runtime shutdown gracefully to prevent thread panics#710
fix: handle language runtime shutdown gracefully to prevent thread panics#710HexaField wants to merge 12 commits intocoasys:devfrom
Conversation
- Baseline profiling: 355 MB startup, 750 MB post-init, ~78 MB per neighbourhood - Leak investigation: 0% memory recovery on neighbourhood teardown - perspectiveRemove does not uninstall Holochain hApps or free WASM runtimes - Bare perspectives leak ~2.4 MB each, language cloning leaks ~4.2 MB each - Includes reproduction scripts (profiler, leak tester, publish-langs)
Detailed code-level analysis tracing all three categories of memory leaks: 1. CRITICAL: Neighbourhood teardown leaks 100% - perspectiveRemove only sets a flag, never uninstalls Holochain hApps, Prolog pools, SurrealDB, or JS languages 2. Bare perspectives leak ~2.4 MB each (Prolog pools + SurrealDB not freed) 3. Language cloning leaks ~4.2 MB per clone (permanent, no unload path) Includes exact file/line references, proposed fixes ordered by priority, and architecture recommendations (lifecycle contract, reference counting).
CRITICAL fixes: - Fix 1: Proper teardown_background_tasks that cleans up Prolog pools, SurrealDB, link language, subscribed queries, and batch store - Fix 2: Add language_remove method to Rust LanguageController to call JS languageController.languageRemove() during teardown - Fix 3: Clean up Holochain signal callbacks on language removal (both JS #signalCallbacks and Rust signal stream StreamMap) - Rename _remove_perspective_pool to remove_perspective_pool MEDIUM fixes: - Fix 4: Add reference counting for languages in LanguageController.ts (languageAddRef/languageReleaseRef) - Fix 5: Add SurrealDB shutdown() method that drops all data and indexes
Baseline vs patched binary comparison confirms: - AD4M-layer teardown works correctly (SurrealDB, signals, languages) - Holochain conductor retains ~140MB/neighbourhood after uninstall_app - 0% memory recovery on both original and patched binaries - Root cause is conductor-level wasmer/LMDB memory management Updated leak-investigation.mjs with v2 improvements: - Fixed GQL schema for DecoratedLinkExpression - Added detailed smaps breakdown per test phase - Added large anon mapping tracking across lifecycle - Added teardown log verification
- Remove languageAddRef/languageReleaseRef and #languageRefCounts from LanguageController.ts — these were never called from any code path - Simplify SurrealDB shutdown() to just log — SurrealDB uses in-memory storage (Surreal::new::<Mem>), so explicit DELETE/REMOVE INDEX is unnecessary; memory is freed when the Arc<Surreal<Db>> is dropped
…ge comment - Move docs/profiling/ → profiling/ (docs/ is generated output) - Fix comment: SurrealDB uses RocksDb, not in-memory storage
- Merge upstream/dev into profiling/memory-investigation - Escape address in JS source embedding (critical security fix) - Narrow error suppression in languageRemove to only expected cases - Fix Prolog pool teardown lock contention (drop lock before await) - Reorder teardown: end subscriptions before removing Prolog pools - Add GraphQL error checking to all profiling scripts - Add try/finally cleanup to all profiling scripts - Validate inputs (DID, linkLanguages) before use - Fix markdown lint issues in profiling docs - Correct plan date and add fence languages for code blocks - Accept deletion of JS executor files (moved to Rust in dev)
…dency - Fail fast when AD4M init times out instead of silently continuing - Add profiling/package.json declaring ws dependency - Applied to leak-investigation.mjs, profiler-v9.mjs, publish-langs.mjs
…nguageController refactor - Remove old static language_remove() that used js_core.execute() (no longer valid after LanguageController refactor removed the single shared JS runtime) - Update perspective_instance.rs teardown to use the instance method LanguageController::global_instance().language_remove() - Fix markdown table spacing in profiling results Addresses review comments from @lucksus and @coderabbitai
- Critical 1: Fix subscribed_queries double-clear in teardown. Queries were cleared in step 1 then drained in step 5 for Prolog notifications, yielding nothing. Now drain+notify happens first (step 1), then pools are removed. - Improvement 3: Replace no-op SurrealDB shutdown() call with direct log line. - Improvement 4: Add TODO warning about missing language ref counting in teardown — removing a shared link language affects all perspectives using it. - Nit: Fix duplicate step numbers (two step 6s → sequential 5, 6). - Nit: Fix whitespace-only diff in languages/mod.rs:1020. - Nit: Add pre-coasys#693 architecture note to profiling/refactoring-plan.md.
…anic during teardown During neighbourhood teardown, language runtime threads panic at js_core/mod.rs:267 because execute_script() fails after the V8 isolate is being torn down, and the result is unwrapped. Replacing .unwrap() with ? propagates the error gracefully instead of panicking the thread. Related: coasys#689
|
Reopening with correct base branch |
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (13)
📝 WalkthroughWalkthroughThis PR introduces comprehensive memory leak profiling documentation, investigation tooling, and implements resource cleanup fixes across the AD4M executor. Changes include detailed investigation reports, Node.js profiling scripts, and Rust implementation fixes addressing signal stream cleanup, perspective lifecycle management, Prolog pool teardown, and SurrealDB shutdown sequencing. Changes
Sequence Diagram(s)sequenceDiagram
participant PerspectiveInstance
participant PrologService
participant HolochainService
participant SurrealDB as SurrealDBService
participant LinkLanguage
PerspectiveInstance->>PerspectiveInstance: teardown_background_tasks()
PerspectiveInstance->>PerspectiveInstance: Send teardown signal
PerspectiveInstance->>PrologService: Notify subscriptions ending
PrologService->>PrologService: End all subscriptions
PerspectiveInstance->>PrologService: remove_perspective_pool()
PrologService->>PrologService: Acquire write lock
PrologService->>PrologService: Remove pool from map
PrologService->>PrologService: Release lock
PrologService->>PrologService: Await pool._drop_all()
PrologService-->>PerspectiveInstance: Pool cleanup complete
PerspectiveInstance->>SurrealDB: shutdown()
SurrealDB-->>PerspectiveInstance: Shutdown signaled
PerspectiveInstance->>LinkLanguage: Conditional unload (if neighbourhood)
LinkLanguage-->>PerspectiveInstance: Language unloaded
PerspectiveInstance->>PerspectiveInstance: Clear batch store
PerspectiveInstance->>PerspectiveInstance: Clear link language ref
PerspectiveInstance->>PerspectiveInstance: Log teardown complete
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
During neighbourhood teardown, three
lang-*threads panic atjs_core/mod.rs:267:53becauseexecute_script()returns an error after the V8 isolate is being torn down, and the result is.unwrap()ed.This replaces
.unwrap()with?so the error propagates gracefully through theResultreturn type instead of panicking the thread. This allows clean thread termination duringlanguage_remove().Discovered during memory profiling in #689 — these panics prevent complete memory cleanup during teardown (currently only 7.6% recovery).
Change
rust-executor/src/js_core/mod.rs:execute_async.unwrap()→execute_async?Summary by CodeRabbit
Release Notes
Documentation
Bug Fixes
Chores