Skip to content

fix(mothership): Use heartbeat mechanism for chat locks#4286

Merged
TheodoreSpeaks merged 1 commit intostagingfrom
chat-stream-lock-heartbeat
Apr 24, 2026
Merged

fix(mothership): Use heartbeat mechanism for chat locks#4286
TheodoreSpeaks merged 1 commit intostagingfrom
chat-stream-lock-heartbeat

Conversation

@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator

@TheodoreSpeaks TheodoreSpeaks commented Apr 24, 2026

Summary

We ocassionally get mothership tasks that stall on every message due to an orphaned mothership task lock. Add heartbeat mechanism to refresh expiry every minute rather than the current 2 hours after expiry. This will mean the worst case for a orphaned task is that messages can't be sent for a single minute rather than 2 hours.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • Other: ___________

Testing

  • Tested locally using redis-cli. Validated that task is refreshed consistently and if sim side is interrupted that lock expires

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Screenshots/Videos

@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 24, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Apr 24, 2026 6:14pm

Request Review

@TheodoreSpeaks
Copy link
Copy Markdown
Collaborator Author

@BugBot review

@cursor
Copy link
Copy Markdown

cursor Bot commented Apr 24, 2026

PR Summary

Medium Risk
Changes Redis-backed per-chat stream lock semantics (TTL and renewal) during active streams; mistakes could cause premature lock loss or concurrent stream contention.

Overview
Prevents chat streams from being blocked for hours by switching the per-chat Redis stream lock to a short TTL with heartbeats. startAbortPoller now accepts chatId and periodically calls a new extendLock helper to refresh copilot:chat-stream-lock:*, stopping heartbeats if lock ownership is lost.

Adds extendLock to the Redis config via a safe Lua GET+EXPIRE script (no-op when Redis is unavailable), updates test mocks to support eval, and introduces unit tests covering lock extension cadence, backward-compat (no chatId), retry behavior, and ownership-loss handling.

Reviewed by Cursor Bugbot for commit b3a6985. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit b3a6985. Configure here.

@TheodoreSpeaks TheodoreSpeaks marked this pull request as ready for review April 24, 2026 18:32
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

Replaces the 2-hour static TTL on the per-chat stream lock with a 60-second TTL + a 20-second heartbeat (extendLock via a new atomic Lua script), so orphaned locks from pod crashes self-heal in under a minute instead of two hours. The startAbortPoller gains a chatId option, and extendLock is wired in with ownership-loss detection that stops the heartbeat without stopping abort polling.

Confidence Score: 5/5

Safe to merge — the fix is well-scoped, atomic, and the one remaining finding is a minor idempotency edge case with no correctness impact.

All remaining findings are P2. The core Lua script is atomic (ownership check + EXPIRE in one round trip), error handling is correct, backward compat is preserved via the optional chatId, and tests cover all branches including error retry and ownership loss.

apps/sim/lib/copilot/request/session/abort.ts — heartbeat section can emit concurrent extendLock calls under high Redis latency (idempotent, no correctness risk)

Important Files Changed

Filename Overview
apps/sim/lib/copilot/request/session/abort.ts Core heartbeat logic added inside startAbortPoller — runs after the pollingStreams finally-block guard, which allows concurrent heartbeat invocations when extendLock takes >1 s (the poll interval)
apps/sim/lib/core/config/redis.ts New extendLock function with atomic Lua script (ownership check + EXPIRE) — correct implementation mirroring the existing releaseLock pattern
apps/sim/lib/copilot/request/session/abort.test.ts New test file covering heartbeat cadence, backward-compat no-chatId path, error retry behavior, and ownership-loss stopping — good coverage of all branches
apps/sim/lib/core/config/redis.test.ts Tests added for extendLock: ownership match, mismatch, and Redis-unavailable no-op — correct assertions using existing eval mock
apps/sim/lib/copilot/request/lifecycle/start.ts One-line change: passes chatId into startAbortPoller options to enable heartbeating
packages/testing/src/mocks/redis-config.mock.ts Mock updated to expose mockExtendLock fn alongside existing acquireLock/releaseLock mocks
packages/testing/src/mocks/redis.mock.ts createMockRedis now stubs the eval command (defaulting to 0) needed by extendLock and releaseLock tests

Sequence Diagram

sequenceDiagram
    participant Client
    participant SimPod as Sim Pod (startAbortPoller)
    participant Redis

    Client->>SimPod: Start chat stream
    SimPod->>Redis: acquireLock(chat-stream-lock, streamId, TTL=60s)
    Redis-->>SimPod: OK

    loop Every 1s (abort poll)
        SimPod->>Redis: hasAbortMarker(streamId)?
        Redis-->>SimPod: false
    end

    loop Every 20s (heartbeat)
        SimPod->>Redis: extendLock(chat-stream-lock, streamId, TTL=60s)
        Redis-->>SimPod: true (TTL refreshed)
    end

    alt Pod dies (SIGKILL / OOM)
        Note over Redis: Lock expires in 60s (was 2h)
        Client->>SimPod: Next message acquireLock succeeds after 60s
    else Normal stream end
        SimPod->>Redis: releaseLock(chat-stream-lock, streamId)
        Redis-->>SimPod: 1 (released)
        Client->>SimPod: Next message acquireLock succeeds immediately
    end
Loading

Reviews (1): Last reviewed commit: "fix(mothership): Use heartbeat mechanism..." | Re-trigger Greptile

Comment on lines +310 to +335
if (!chatId || heartbeatOwnershipLost) return
if (Date.now() - lastHeartbeatAt < CHAT_STREAM_LOCK_HEARTBEAT_INTERVAL_MS) return

try {
const owned = await extendLock(
getChatStreamLockKey(chatId),
streamId,
CHAT_STREAM_LOCK_TTL_SECONDS
)
lastHeartbeatAt = Date.now()
if (!owned) {
heartbeatOwnershipLost = true
logger.warn('Lost ownership of chat stream lock — stopping heartbeat', {
chatId,
streamId,
...(requestId ? { requestId } : {}),
})
}
} catch (error) {
logger.warn('Failed to extend chat stream lock TTL', {
chatId,
streamId,
...(requestId ? { requestId } : {}),
error: toError(error).message,
})
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Concurrent heartbeat calls possible when extendLock is slow

The heartbeat section executes after pollingStreams.delete(streamId) in the finally block, so by the time extendLock is awaited the next interval tick (1 s) can enter, pass the pollingStreams guard, finish its abort poll, delete again, and reach the time check while lastHeartbeatAt still hasn't been updated by the in-flight call. If Redis latency exceeds the 1 s poll interval — e.g. during the 5 s commandTimeout — several concurrent EXPIRE calls can pile up.

EXPIRE is idempotent so correctness is unaffected, but under degraded-Redis conditions this amplifies the load exactly when Redis is already struggling. Updating lastHeartbeatAt optimistically before the await would eliminate the duplicate calls:

lastHeartbeatAt = Date.now()  // prevent concurrent calls even if extendLock is slow
try {
  const owned = await extendLock(
    getChatStreamLockKey(chatId),
    streamId,
    CHAT_STREAM_LOCK_TTL_SECONDS
  )
  if (!owned) {
    heartbeatOwnershipLost = true
    logger.warn('Lost ownership of chat stream lock — stopping heartbeat', { ... })
  }
} catch (error) {
  lastHeartbeatAt = 0  // reset so the next tick retries immediately
  logger.warn('Failed to extend chat stream lock TTL', { ... })
}

This preserves the fast-retry-on-error behaviour the test checks for while capping concurrent calls to one.

@TheodoreSpeaks TheodoreSpeaks merged commit 04f1d01 into staging Apr 24, 2026
14 checks passed
@TheodoreSpeaks TheodoreSpeaks deleted the chat-stream-lock-heartbeat branch April 24, 2026 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant