Skip to content

fix: (shard-distributor) use one transaction for assignShardsInCurren…#7687

Merged
eleonoradgr merged 1 commit intocadence-workflow:masterfrom
eleonoradgr:assignment-from-heartbeat
Feb 11, 2026
Merged

fix: (shard-distributor) use one transaction for assignShardsInCurren…#7687
eleonoradgr merged 1 commit intocadence-workflow:masterfrom
eleonoradgr:assignment-from-heartbeat

Conversation

@eleonoradgr
Copy link
Contributor

@eleonoradgr eleonoradgr commented Feb 9, 2026

…tHeartbeat

What changed?
When the assignment is from the heartbeat, instead of deleting the info stored in the executor key and repopulating them, just update theassignment.

Why?
When the change was introduced it was only the executor that was calling get shard-owner if the shard was not assigned to the current executor, so the following scenatio was not possible.
Now also spectators are calling GetShardOwner and the following happens:

  • Executor client heartbeat and executor handler delete the info stored in the etcd relted to the executor
  • Spectator client calls GetShardOwner and has a shardNotFoundInTheMap
  • AssignEphemeralShard prepares to assign the shard to the executor with minimal amount of shards
  • Executor handler store the assignment received in the heartbeat
  • In the AssignShard at the memory level we check again if the shard is already assigned and this is now the case
  • the call to getshardOwner fails
    To avoid this we can simply update the executor info instead of deleting the key.

How did you test it?
added unit tests and:
go test github.com/uber/cadence/service/sharddistributor/handler

Potential risks
shard manager is in development phase, no production integration is recomanded.

Release notes

Documentation Changes


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
    • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)

…tHeartbeat

Signed-off-by: edigregorio <edigregorio@uber.com>
@eleonoradgr eleonoradgr force-pushed the assignment-from-heartbeat branch from 0050b1d to ac414a3 Compare February 11, 2026 08:41
@gitar-bot
Copy link

gitar-bot bot commented Feb 11, 2026

Running post-merge workflows

🔍 CI failure analysis for ac414a3: Integration test failed during shutdown phase due to SQLite database locking and context deadline exceeded errors, completely unrelated to the shard distributor code changes in this PR.

Issue

The CI failure in "Golang integration test with sqlite" is caused by test infrastructure issues during shutdown, specifically context deadline exceeded errors and SQLite database locking issues.

Root Cause

The integration test run failed with multiple error patterns during test shutdown:

  1. Context Deadline Exceeded: Multiple async task dispatches timed out with context deadline exceeded errors in the matching engine
  2. SQLite Database Locking: Multiple LeaseTaskList operation failed. Error: sqlite3: database is locked errors
  3. Queue Processor Shutdown Errors: Transfer and timer queue processors failed to complete tasks during shutdown with queue processor has been shutdown errors
  4. Replication Task Failures: Failed to get replication tasks with connection timeout errors

The final test result was:

FAIL
FAIL github.com/uber/cadence/host 620.275s
FAIL
make: *** [Makefile:707: cover_integration_profile] Error 1

Details

Why this is unrelated to PR changes:

  • This PR only modifies service/sharddistributor/handler/executor.go and its test file
  • The changes are to the shard distributor heartbeat handler logic (removing a delete-then-assign pattern in favor of atomic updates)
  • The failures occur during integration test shutdown phase, not in any shard distributor code paths
  • Shard distributor is an experimental feature not used by the integration tests
  • The errors are all related to:
    • Matching engine task dispatch timeouts
    • SQLite connection pool exhaustion/locking during concurrent shutdown
    • Queue processor shutdown timing issues
    • Replication task fetching failures (standby cluster connectivity)

Test Infrastructure Problem:

  • The integration tests use SQLite as the persistence layer
  • During test shutdown, multiple components shut down concurrently
  • SQLite has limited concurrent write capabilities and can experience locking under high contention
  • The "database is locked" errors indicate resource exhaustion during concurrent shutdown operations
  • Context deadline exceeded errors suggest the tests are hitting timeout limits during cleanup
  • These are classic symptoms of test infrastructure timing/resource issues, not application logic bugs

Evidence this is pre-existing/flaky:

  • The failure pattern (timeouts and database locks during shutdown) is characteristic of timing-sensitive test infrastructure issues
  • The actual test assertions all passed - the failure occurred during shutdown/cleanup
  • No actual workflow execution errors or logic failures in the logs
  • Shard distributor handler code has comprehensive unit tests that would catch logic bugs
Code Review 👍 Approved with suggestions 0 resolved / 1 findings

Solid fix replacing the two-step Delete+Assign with a single atomic update using ModRevision for optimistic concurrency. The race condition described in the PR is properly addressed. One previous minor finding about test coverage of ModRevision propagation remains unresolved.

💡 Edge Case: Test doesn't verify ModRevision propagation to AssignShards

📄 service/sharddistributor/handler/executor_test.go:224

The existing success-path test (Test Case 7) creates an assignedState with default ModRevision: 0 and only asserts on AssignedShards content (line 241). It doesn't verify that ModRevision from the existing state is correctly propagated into the AssignShardsRequest.

Since the core improvement of this PR is preserving ModRevision for optimistic concurrency control (to make the update atomic via etcd's compare-and-swap), it would strengthen confidence to have a test that:

  1. Sets a non-zero ModRevision on the assignedState returned by GetHeartbeat (e.g., ModRevision: 42)
  2. Asserts in the AssignShards mock that request.NewState.ShardAssignments[executorID].ModRevision == 42

This would ensure a future refactor doesn't accidentally lose the ModRevision propagation, which would silently break the optimistic concurrency guarantees.

Rules ❌ No requirements met

Repository Rules

PR Description Quality Standards: Fix typos in What changed (theassignment → the assignment), Why (scenatio → scenario, relted → related), and Potential risks (recomanded → recommended). Add missing Project Issue link unless this is a minor bugfix.
Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@eleonoradgr eleonoradgr merged commit e517396 into cadence-workflow:master Feb 11, 2026
58 of 59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants