fix(mothership): Use heartbeat mechanism for chat locks#4286
fix(mothership): Use heartbeat mechanism for chat locks#4286TheodoreSpeaks merged 1 commit intostagingfrom
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
|
@BugBot review |
PR SummaryMedium Risk Overview Adds Reviewed by Cursor Bugbot for commit b3a6985. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit b3a6985. Configure here.
Greptile SummaryReplaces the 2-hour static TTL on the per-chat stream lock with a 60-second TTL + a 20-second heartbeat ( Confidence Score: 5/5Safe to merge — the fix is well-scoped, atomic, and the one remaining finding is a minor idempotency edge case with no correctness impact. All remaining findings are P2. The core Lua script is atomic (ownership check + EXPIRE in one round trip), error handling is correct, backward compat is preserved via the optional chatId, and tests cover all branches including error retry and ownership loss. apps/sim/lib/copilot/request/session/abort.ts — heartbeat section can emit concurrent extendLock calls under high Redis latency (idempotent, no correctness risk) Important Files Changed
Sequence DiagramsequenceDiagram
participant Client
participant SimPod as Sim Pod (startAbortPoller)
participant Redis
Client->>SimPod: Start chat stream
SimPod->>Redis: acquireLock(chat-stream-lock, streamId, TTL=60s)
Redis-->>SimPod: OK
loop Every 1s (abort poll)
SimPod->>Redis: hasAbortMarker(streamId)?
Redis-->>SimPod: false
end
loop Every 20s (heartbeat)
SimPod->>Redis: extendLock(chat-stream-lock, streamId, TTL=60s)
Redis-->>SimPod: true (TTL refreshed)
end
alt Pod dies (SIGKILL / OOM)
Note over Redis: Lock expires in 60s (was 2h)
Client->>SimPod: Next message acquireLock succeeds after 60s
else Normal stream end
SimPod->>Redis: releaseLock(chat-stream-lock, streamId)
Redis-->>SimPod: 1 (released)
Client->>SimPod: Next message acquireLock succeeds immediately
end
Reviews (1): Last reviewed commit: "fix(mothership): Use heartbeat mechanism..." | Re-trigger Greptile |
| if (!chatId || heartbeatOwnershipLost) return | ||
| if (Date.now() - lastHeartbeatAt < CHAT_STREAM_LOCK_HEARTBEAT_INTERVAL_MS) return | ||
|
|
||
| try { | ||
| const owned = await extendLock( | ||
| getChatStreamLockKey(chatId), | ||
| streamId, | ||
| CHAT_STREAM_LOCK_TTL_SECONDS | ||
| ) | ||
| lastHeartbeatAt = Date.now() | ||
| if (!owned) { | ||
| heartbeatOwnershipLost = true | ||
| logger.warn('Lost ownership of chat stream lock — stopping heartbeat', { | ||
| chatId, | ||
| streamId, | ||
| ...(requestId ? { requestId } : {}), | ||
| }) | ||
| } | ||
| } catch (error) { | ||
| logger.warn('Failed to extend chat stream lock TTL', { | ||
| chatId, | ||
| streamId, | ||
| ...(requestId ? { requestId } : {}), | ||
| error: toError(error).message, | ||
| }) | ||
| } |
There was a problem hiding this comment.
Concurrent heartbeat calls possible when
extendLock is slow
The heartbeat section executes after pollingStreams.delete(streamId) in the finally block, so by the time extendLock is awaited the next interval tick (1 s) can enter, pass the pollingStreams guard, finish its abort poll, delete again, and reach the time check while lastHeartbeatAt still hasn't been updated by the in-flight call. If Redis latency exceeds the 1 s poll interval — e.g. during the 5 s commandTimeout — several concurrent EXPIRE calls can pile up.
EXPIRE is idempotent so correctness is unaffected, but under degraded-Redis conditions this amplifies the load exactly when Redis is already struggling. Updating lastHeartbeatAt optimistically before the await would eliminate the duplicate calls:
lastHeartbeatAt = Date.now() // prevent concurrent calls even if extendLock is slow
try {
const owned = await extendLock(
getChatStreamLockKey(chatId),
streamId,
CHAT_STREAM_LOCK_TTL_SECONDS
)
if (!owned) {
heartbeatOwnershipLost = true
logger.warn('Lost ownership of chat stream lock — stopping heartbeat', { ... })
}
} catch (error) {
lastHeartbeatAt = 0 // reset so the next tick retries immediately
logger.warn('Failed to extend chat stream lock TTL', { ... })
}This preserves the fast-retry-on-error behaviour the test checks for while capping concurrent calls to one.
Summary
We ocassionally get mothership tasks that stall on every message due to an orphaned mothership task lock. Add heartbeat mechanism to refresh expiry every minute rather than the current 2 hours after expiry. This will mean the worst case for a orphaned task is that messages can't be sent for a single minute rather than 2 hours.
Type of Change
Testing
Checklist
Screenshots/Videos