fix: (shard-distributor) use one transaction for assignShardsInCurren… by eleonoradgr · Pull Request #7687 · cadence-workflow/cadence

eleonoradgr · 2026-02-09T13:50:15Z

…tHeartbeat

What changed?
When the assignment is from the heartbeat, instead of deleting the info stored in the executor key and repopulating them, just update theassignment.

Why?
When the change was introduced it was only the executor that was calling get shard-owner if the shard was not assigned to the current executor, so the following scenatio was not possible.
Now also spectators are calling GetShardOwner and the following happens:

Executor client heartbeat and executor handler delete the info stored in the etcd relted to the executor
Spectator client calls GetShardOwner and has a shardNotFoundInTheMap
AssignEphemeralShard prepares to assign the shard to the executor with minimal amount of shards
Executor handler store the assignment received in the heartbeat
In the AssignShard at the memory level we check again if the shard is already assigned and this is now the case
the call to getshardOwner fails
To avoid this we can simply update the executor info instead of deleting the key.

How did you test it?
added unit tests and:
go test github.com/uber/cadence/service/sharddistributor/handler

Potential risks
shard manager is in development phase, no production integration is recomanded.

Release notes

Documentation Changes

Reviewer Validation

PR Description Quality (check these before reviewing code):

…tHeartbeat Signed-off-by: edigregorio <edigregorio@uber.com>

gitar-bot · 2026-02-11T08:46:22Z

Running post-merge workflows

🔍 CI failure analysis for ac414a3: Integration test failed during shutdown phase due to SQLite database locking and context deadline exceeded errors, completely unrelated to the shard distributor code changes in this PR.

Issue

The CI failure in "Golang integration test with sqlite" is caused by test infrastructure issues during shutdown, specifically context deadline exceeded errors and SQLite database locking issues.

Root Cause

The integration test run failed with multiple error patterns during test shutdown:

Context Deadline Exceeded: Multiple async task dispatches timed out with context deadline exceeded errors in the matching engine
SQLite Database Locking: Multiple LeaseTaskList operation failed. Error: sqlite3: database is locked errors
Queue Processor Shutdown Errors: Transfer and timer queue processors failed to complete tasks during shutdown with queue processor has been shutdown errors
Replication Task Failures: Failed to get replication tasks with connection timeout errors

The final test result was:

FAIL
FAIL github.com/uber/cadence/host 620.275s
FAIL
make: *** [Makefile:707: cover_integration_profile] Error 1

Details

Why this is unrelated to PR changes:

This PR only modifies service/sharddistributor/handler/executor.go and its test file
The changes are to the shard distributor heartbeat handler logic (removing a delete-then-assign pattern in favor of atomic updates)
The failures occur during integration test shutdown phase, not in any shard distributor code paths
Shard distributor is an experimental feature not used by the integration tests
The errors are all related to:
- Matching engine task dispatch timeouts
- SQLite connection pool exhaustion/locking during concurrent shutdown
- Queue processor shutdown timing issues
- Replication task fetching failures (standby cluster connectivity)

Test Infrastructure Problem:

The integration tests use SQLite as the persistence layer
During test shutdown, multiple components shut down concurrently
SQLite has limited concurrent write capabilities and can experience locking under high contention
The "database is locked" errors indicate resource exhaustion during concurrent shutdown operations
Context deadline exceeded errors suggest the tests are hitting timeout limits during cleanup
These are classic symptoms of test infrastructure timing/resource issues, not application logic bugs

Evidence this is pre-existing/flaky:

The failure pattern (timeouts and database locks during shutdown) is characteristic of timing-sensitive test infrastructure issues
The actual test assertions all passed - the failure occurred during shutdown/cleanup
No actual workflow execution errors or logic failures in the logs
Shard distributor handler code has comprehensive unit tests that would catch logic bugs

Code Review 👍 Approved with suggestions 0 resolved / 1 findings

Solid fix replacing the two-step Delete+Assign with a single atomic update using ModRevision for optimistic concurrency. The race condition described in the PR is properly addressed. One previous minor finding about test coverage of ModRevision propagation remains unresolved.

💡 Edge Case: Test doesn't verify ModRevision propagation to AssignShards

📄 service/sharddistributor/handler/executor_test.go:224

The existing success-path test (Test Case 7) creates an assignedState with default ModRevision: 0 and only asserts on AssignedShards content (line 241). It doesn't verify that ModRevision from the existing state is correctly propagated into the AssignShardsRequest.

Since the core improvement of this PR is preserving ModRevision for optimistic concurrency control (to make the update atomic via etcd's compare-and-swap), it would strengthen confidence to have a test that:

Sets a non-zero ModRevision on the assignedState returned by GetHeartbeat (e.g., ModRevision: 42)
Asserts in the AssignShards mock that request.NewState.ShardAssignments[executorID].ModRevision == 42

This would ensure a future refactor doesn't accidentally lose the ModRevision propagation, which would silently break the optimistic concurrency guarantees.

Rules ❌ No requirements met

Repository Rules

❌ PR Description Quality Standards: Fix typos in What changed (theassignment → the assignment), Why (scenatio → scenario, relted → related), and Potential risks (recomanded → recommended). Add missing Project Issue link unless this is a minor bugfix.

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

`Auto-apply`	`Compact`
`gitar auto-apply:on`	`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

jakobht approved these changes Feb 10, 2026

View reviewed changes

fix: (shard-distributor) use one transaction for assignShardsInCurren…

ac414a3

…tHeartbeat Signed-off-by: edigregorio <edigregorio@uber.com>

eleonoradgr force-pushed the assignment-from-heartbeat branch from 0050b1d to ac414a3 Compare February 11, 2026 08:41

eleonoradgr merged commit e517396 into cadence-workflow:master Feb 11, 2026
58 of 59 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: (shard-distributor) use one transaction for assignShardsInCurren…#7687

fix: (shard-distributor) use one transaction for assignShardsInCurren…#7687
eleonoradgr merged 1 commit intocadence-workflow:masterfrom
eleonoradgr:assignment-from-heartbeat

eleonoradgr commented Feb 9, 2026 •

edited

Loading

Uh oh!

gitar-bot bot commented Feb 11, 2026 •

edited

Loading

Issue

Root Cause

Details

Repository Rules

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

eleonoradgr commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer Validation

Uh oh!

gitar-bot bot commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issue

Root Cause

Details

Repository Rules

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

eleonoradgr commented Feb 9, 2026 •

edited

Loading

gitar-bot bot commented Feb 11, 2026 •

edited

Loading