Skip to content

fix: (executor-client)Skip local assignment if no convergence with h…#7695

Merged
eleonoradgr merged 2 commits intocadence-workflow:masterfrom
eleonoradgr:convergence-error
Feb 13, 2026
Merged

fix: (executor-client)Skip local assignment if no convergence with h…#7695
eleonoradgr merged 2 commits intocadence-workflow:masterfrom
eleonoradgr:convergence-error

Conversation

@eleonoradgr
Copy link
Contributor

…eartbeat

What changed?
Modified compareAssignments() in clientimpl.go to return errors (ErrAssignmentDivergenceLocalShard and ErrAssignmentDivergenceHeartbeatShard) when local and heartbeat assignments diverge.
Updated four test functions in clientimpl_test.go to capture and verify these error return values, and fixed TestHeartbeatLoop_DistributedPassthrough_AppliesAssignment to pre-populate local shards for convergence testing. #6862

Why?
The compareAssignments() function needed to surface assignment divergence issues as errors rather than just logging/metrics, enabling proper error handling in migration modes (LOCAL_PASSTHROUGH_SHADOW and DISTRIBUTED_PASSTHROUGH). Tests were updated to verify that the correct errors are returned in divergence scenarios and nil on convergence, ensuring the error-handling contract is enforced.

How did you test it?
Ran go test ./service/sharddistributor/client/executorclient to verify all tests pass. Specifically verified:

  • TestCompareAssignments_Converged returns nil on matching assignments
  • TestCompareAssignments_Diverged_MissingShard returns ErrAssignmentDivergenceLocalShard when local shard missing from heartbeat
  • TestCompareAssignments_Diverged_ExtraShard returns ErrAssignmentDivergenceHeartbeatShard when heartbeat has extra shard
  • TestCompareAssignments_Diverged_WrongStatus returns ErrAssignmentDivergenceLocalShard when shard has wrong status

Potential risks
SM is still in development phase, the deployed environment using sm don't show examples of shard_distributor_executor_assignment_divergence, so we don;t expect to skip any assignment, but just fixing something that colud be an issue in the future.

Release notes

Documentation Changes


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
    • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)

…eartbeat

Signed-off-by: edigregorio <edigregorio@uber.com>
@eleonoradgr eleonoradgr changed the title fix: (executor-client) Skip local assignment if no convergence with h… fix: (executor-client)Skip local assignment if no convergence with h… Feb 11, 2026
Signed-off-by: edigregorio <edigregorio@uber.com>
@gitar-bot
Copy link

gitar-bot bot commented Feb 11, 2026

Running post-merge workflows

Code Review ✅ Approved 1 resolved / 1 findings

Clean change that surfaces assignment divergence as errors for proper handling in migration modes. The previously missing warn log has been restored, and tests properly verify the new error contract.

✅ 1 resolved
Quality: Removed warn log loses shard-id context for heartbeat divergence

📄 service/sharddistributor/client/executorclient/clientimpl.go:515
The Warn log line with tag.Dynamic("shard-id", shardID) was removed for the heartbeat-shard-not-in-local divergence case, but was retained for the local-shard-not-in-heartbeat case (line 505-506). The error propagates up to heartbeatloop() where it's logged at line 229 as "failed to heartbeat and assign shards", but the sentinel error message doesn't include the specific shard ID that caused the divergence.

This creates an asymmetry in observability: local shard divergence logs the offending shard ID, but heartbeat shard divergence does not. When debugging divergence issues in production, knowing which specific shard caused the mismatch is valuable.

Consider restoring the warn log (consistent with the local shard case) or wrapping the error with the shard ID context using fmt.Errorf.

Rules ✅ All requirements met

Repository Rules

PR Description Quality Standards: All required sections are present with substantive content. What changed: clear 1-2 line summary with issue link. Why: explains previous behavior, problem, and solution approach. Testing: includes exact go test command. Potential risks: thoughtfully addresses low risk with context. Release notes and Documentation: appropriately left empty for internal fix.
Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@eleonoradgr eleonoradgr merged commit 53137fc into cadence-workflow:master Feb 13, 2026
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants