fix(mothership): Use heartbeat mechanism for chat locks by TheodoreSpeaks · Pull Request #4286 · simstudioai/sim

TheodoreSpeaks · 2026-04-24T18:14:54Z

Summary

We ocassionally get mothership tasks that stall on every message due to an orphaned mothership task lock. Add heartbeat mechanism to refresh expiry every minute rather than the current 2 hours after expiry. This will mean the worst case for a orphaned task is that messages can't be sent for a single minute rather than 2 hours.

Type of Change

Testing

Tested locally using redis-cli. Validated that task is refreshed consistently and if sim side is interrupted that lock expires

Checklist

Code follows project style guidelines
Self-reviewed my changes
Tests added/updated and passing
No new warnings introduced
I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Screenshots/Videos

vercel · 2026-04-24T18:14:59Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
docs	Skipped		Apr 24, 2026 6:14pm

TheodoreSpeaks · 2026-04-24T18:15:00Z

@BugBot review

cursor · 2026-04-24T18:15:08Z

PR Summary

Medium Risk
Changes Redis-backed per-chat stream lock semantics (TTL and renewal) during active streams; mistakes could cause premature lock loss or concurrent stream contention.

Overview
Prevents chat streams from being blocked for hours by switching the per-chat Redis stream lock to a short TTL with heartbeats. startAbortPoller now accepts chatId and periodically calls a new extendLock helper to refresh copilot:chat-stream-lock:*, stopping heartbeats if lock ownership is lost.

Adds extendLock to the Redis config via a safe Lua GET+EXPIRE script (no-op when Redis is unavailable), updates test mocks to support eval, and introduces unit tests covering lock extension cadence, backward-compat (no chatId), retry behavior, and ownership-loss handling.

^{Reviewed by Cursor Bugbot for commit b3a6985. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

^{Reviewed by Cursor Bugbot for commit b3a6985. Configure here.}

greptile-apps · 2026-04-24T18:36:18Z

Greptile Summary

Replaces the 2-hour static TTL on the per-chat stream lock with a 60-second TTL + a 20-second heartbeat (extendLock via a new atomic Lua script), so orphaned locks from pod crashes self-heal in under a minute instead of two hours. The startAbortPoller gains a chatId option, and extendLock is wired in with ownership-loss detection that stops the heartbeat without stopping abort polling.

Confidence Score: 5/5

Safe to merge — the fix is well-scoped, atomic, and the one remaining finding is a minor idempotency edge case with no correctness impact.

All remaining findings are P2. The core Lua script is atomic (ownership check + EXPIRE in one round trip), error handling is correct, backward compat is preserved via the optional chatId, and tests cover all branches including error retry and ownership loss.

apps/sim/lib/copilot/request/session/abort.ts — heartbeat section can emit concurrent extendLock calls under high Redis latency (idempotent, no correctness risk)

Important Files Changed

Filename	Overview
apps/sim/lib/copilot/request/session/abort.ts	Core heartbeat logic added inside startAbortPoller — runs after the pollingStreams finally-block guard, which allows concurrent heartbeat invocations when extendLock takes >1 s (the poll interval)
apps/sim/lib/core/config/redis.ts	New extendLock function with atomic Lua script (ownership check + EXPIRE) — correct implementation mirroring the existing releaseLock pattern
apps/sim/lib/copilot/request/session/abort.test.ts	New test file covering heartbeat cadence, backward-compat no-chatId path, error retry behavior, and ownership-loss stopping — good coverage of all branches
apps/sim/lib/core/config/redis.test.ts	Tests added for extendLock: ownership match, mismatch, and Redis-unavailable no-op — correct assertions using existing eval mock
apps/sim/lib/copilot/request/lifecycle/start.ts	One-line change: passes chatId into startAbortPoller options to enable heartbeating
packages/testing/src/mocks/redis-config.mock.ts	Mock updated to expose mockExtendLock fn alongside existing acquireLock/releaseLock mocks
packages/testing/src/mocks/redis.mock.ts	createMockRedis now stubs the eval command (defaulting to 0) needed by extendLock and releaseLock tests

Sequence Diagram

sequenceDiagram
    participant Client
    participant SimPod as Sim Pod (startAbortPoller)
    participant Redis

    Client->>SimPod: Start chat stream
    SimPod->>Redis: acquireLock(chat-stream-lock, streamId, TTL=60s)
    Redis-->>SimPod: OK

    loop Every 1s (abort poll)
        SimPod->>Redis: hasAbortMarker(streamId)?
        Redis-->>SimPod: false
    end

    loop Every 20s (heartbeat)
        SimPod->>Redis: extendLock(chat-stream-lock, streamId, TTL=60s)
        Redis-->>SimPod: true (TTL refreshed)
    end

    alt Pod dies (SIGKILL / OOM)
        Note over Redis: Lock expires in 60s (was 2h)
        Client->>SimPod: Next message acquireLock succeeds after 60s
    else Normal stream end
        SimPod->>Redis: releaseLock(chat-stream-lock, streamId)
        Redis-->>SimPod: 1 (released)
        Client->>SimPod: Next message acquireLock succeeds immediately
    end

_{Reviews (1): Last reviewed commit: "fix(mothership): Use heartbeat mechanism..." | Re-trigger Greptile}

greptile-apps · 2026-04-24T18:36:22Z

+      if (!chatId || heartbeatOwnershipLost) return
+      if (Date.now() - lastHeartbeatAt < CHAT_STREAM_LOCK_HEARTBEAT_INTERVAL_MS) return
+
+      try {
+        const owned = await extendLock(
+          getChatStreamLockKey(chatId),
+          streamId,
+          CHAT_STREAM_LOCK_TTL_SECONDS
+        )
+        lastHeartbeatAt = Date.now()
+        if (!owned) {
+          heartbeatOwnershipLost = true
+          logger.warn('Lost ownership of chat stream lock — stopping heartbeat', {
+            chatId,
+            streamId,
+            ...(requestId ? { requestId } : {}),
+          })
+        }
+      } catch (error) {
+        logger.warn('Failed to extend chat stream lock TTL', {
+          chatId,
+          streamId,
+          ...(requestId ? { requestId } : {}),
+          error: toError(error).message,
+        })
+      }


Concurrent heartbeat calls possible when extendLock is slow

The heartbeat section executes after pollingStreams.delete(streamId) in the finally block, so by the time extendLock is awaited the next interval tick (1 s) can enter, pass the pollingStreams guard, finish its abort poll, delete again, and reach the time check while lastHeartbeatAt still hasn't been updated by the in-flight call. If Redis latency exceeds the 1 s poll interval — e.g. during the 5 s commandTimeout — several concurrent EXPIRE calls can pile up.

EXPIRE is idempotent so correctness is unaffected, but under degraded-Redis conditions this amplifies the load exactly when Redis is already struggling. Updating lastHeartbeatAt optimistically before the await would eliminate the duplicate calls:

lastHeartbeatAt = Date.now() // prevent concurrent calls even if extendLock is slow try { const owned = await extendLock( getChatStreamLockKey(chatId), streamId, CHAT_STREAM_LOCK_TTL_SECONDS ) if (!owned) { heartbeatOwnershipLost = true logger.warn('Lost ownership of chat stream lock — stopping heartbeat', { ... }) } } catch (error) { lastHeartbeatAt = 0 // reset so the next tick retries immediately logger.warn('Failed to extend chat stream lock TTL', { ... }) }

This preserves the fast-retry-on-error behaviour the test checks for while capping concurrent calls to one.

fix(mothership): Use heartbeat mechanism for chat locks

b3a6985

cursor Bot reviewed Apr 24, 2026

View reviewed changes

TheodoreSpeaks marked this pull request as ready for review April 24, 2026 18:32

greptile-apps Bot reviewed Apr 24, 2026

View reviewed changes

TheodoreSpeaks merged commit 04f1d01 into staging Apr 24, 2026
14 checks passed

TheodoreSpeaks deleted the chat-stream-lock-heartbeat branch April 24, 2026 18:36

TheodoreSpeaks mentioned this pull request Apr 24, 2026

v0.6.57: mothership reliability, ashby refactor, tables row count, copilot id fix, bun upgrade #4293

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mothership): Use heartbeat mechanism for chat locks#4286

fix(mothership): Use heartbeat mechanism for chat locks#4286
TheodoreSpeaks merged 1 commit intostagingfrom
chat-stream-lock-heartbeat

TheodoreSpeaks commented Apr 24, 2026 •

edited

Loading

Uh oh!

vercel Bot commented Apr 24, 2026

Uh oh!

TheodoreSpeaks commented Apr 24, 2026

Uh oh!

cursor Bot commented Apr 24, 2026 •

edited

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

greptile-apps Bot commented Apr 24, 2026

Uh oh!

greptile-apps Bot Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TheodoreSpeaks commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Type of Change

Testing

Checklist

Screenshots/Videos

Uh oh!

vercel Bot commented Apr 24, 2026

Uh oh!

TheodoreSpeaks commented Apr 24, 2026

Uh oh!

cursor Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

TheodoreSpeaks commented Apr 24, 2026 •

edited

Loading

cursor Bot commented Apr 24, 2026 •

edited

Loading