Hotfix: Bumps version to 3.56.1#5690
Draft
NaluTripician wants to merge 2 commits intoreleases/3.56.1from
Draft
Conversation
… race condition in hedging cancellation (#5613) ## CrossRegionHedgingAvailabilityStrategy: Fixes `ArgumentNullException` race condition in hedging cancellation ### Bug Multiple production customers reported unobserved `ArgumentNullException: Value cannot be null. (Parameter 'request')` crashes originating from `CrossRegionHedgingAvailabilityStrategy.RequestSenderAndResultCheckAsync`. The exception was surfaced as a `TaskScheduler_UnobservedTaskException`, crashing the process. The affected code paths were: - `ContainerCore.ReadItemAsync` → `ReadItemStreamAsync` → `ProcessItemStreamAsync` → `RequestInvokerHandler.SendAsync` → `CrossRegionHedgingAvailabilityStrategy.ExecuteAvailabilityStrategyAsync` ### Root Cause A **race condition** caused by passing the wrong `CancellationToken` to the sender delegate: 1. `ExecuteAvailabilityStrategyAsync` creates a `hedgeRequestsCancellationTokenSource` (linked to the app-provided CT) to coordinate hedge request lifecycle. 2. `CloneAndSendAsync` clones the request inside a `using` block and calls `RequestSenderAndResultCheckAsync`. 3. **Bug:** `RequestSenderAndResultCheckAsync` called `sender.Invoke(request, cancellationToken)` with the **application-provided `CancellationToken`** — not `hedgeRequestsCancellationTokenSource.Token`. 4. When hedge Region B returned a final result (e.g., 200 OK), `hedgeRequestsCancellationTokenSource.Cancel()` was called. 5. **But the in-flight sender for Region A still held the app CT** (e.g., `CancellationToken.None`), which was **never cancelled**. 6. The `CloneAndSendAsync` `using` block exited, **disposing the cloned request**. 7. The Region A sender continued executing with a reference to the now-disposed request → **`ArgumentNullException: Value cannot be null. (Parameter 'request')`**. A secondary issue: when the application CT was cancelled (e2e timeout), the hedge timer (linked to app CT) would fire, and the old code would blindly continue the loop attempting to clone and send new requests on a cancelled path. ### Fix Two changes in `CrossRegionHedgingAvailabilityStrategy.cs`: **1. Pass `hedgeRequestsCancellationTokenSource.Token` to `sender.Invoke()` instead of the app CT** This ensures that when **any** hedge gets a final result and calls `hedgeRequestsCancellationTokenSource.Cancel()`, **all** in-flight senders immediately see their CT cancelled and stop before the cloned request is disposed. The `CancellationTokenSource` and `CancellationToken` parameters were also consolidated into a single `hedgeRequestsCancellationTokenSource` parameter passed through `CloneAndSendAsync` → `RequestSenderAndResultCheckAsync`. **2. Add `do/while` loop to handle spurious timer completions on app CT cancellation** When the app CT is cancelled (e2e timeout), the hedge timer fires via the linked CTS. The old code would `continue` the loop and try to clone a new request. The `do/while` loop now detects `applicationProvidedCancellationToken.IsCancellationRequested` and falls through to consolidate existing request outcomes instead of spawning new hedges. ### Tests Added (8 new unit tests) | Test | Validates | |---|---| | `HedgeCancellationCancelsInFlightRequests_NoNullRef` | Slow primary request's CT is cancelled when a hedge returns a final result — core regression test | | `SenderReceivesHedgeCancellationToken_NotAppToken` | Captures the actual CT passed to each sender and asserts all are from the hedge CTS, not the app CT | | `AppCancellationDuringHedging_DoesNotSpawnNewHedgeRequests` | E2e timeout (app CT cancelled) does not spawn new hedge requests — validates the do/while loop fix | | `MultiRegionHedging_RequestNotAccessedAfterDisposal` | Verifies the cloned request is still accessible when cancellation fires — exact scenario from the crash reports | | `HedgeCancellation_StreamRequest_NoNullRef` | Tests the stream-based code path (ReadItemStreamAsync) from the NullRef2/NullRef3 stack traces | | `PrimaryRequestFinalResult_NoAdditionalHedgesSent` | Fast primary response skips hedging entirely | | `AllHedgesTransientError_ReturnsLastResponse` | All regions return transient errors — strategy returns last response without NullRef | | `ConcurrentHedgingRequests_NoNullRef` | Stress test: 50 concurrent hedging requests with random delays — no NullRef under concurrency | ### Type of change - [x] Bug fix (non-breaking change which fixes an issue) --------- Co-authored-by: Nalu Tripician <27316859+NaluTripician@users.noreply.github.com>
- Updates ClientOfficialVersion to 3.56.1 - Updates ClientPreviewSuffixVersion to preview.1 - Adds changelog entries for 3.56.1 and 3.57.0-preview.1 - Adds API contract files for 3.56.1 and 3.57.0-preview.1 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Please follow the required format: "[Internal] Category: (Adds|Fixes|Refactors|Removes) Description"
Internal should be used for PRs that have no customer impact. This flag is used to help generate the changelog to know which PRs should be included. Examples:
Diagnostics: Adds GetElapsedClientLatency to CosmosDiagnostics
PartitionKey: Fixes null reference when using default(PartitionKey)
[v4] Client Encryption: Refactors code to external project
[Internal] Query: Adds code generator for CosmosNumbers for easy additions in the future.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hotfix 3.56.1
Cherry-picked PRs
ArgumentNullExceptionrace condition in hedging cancellationVersion Changes
Changelog
\\markdown
3.57.0-preview.1 - 2026-3-10
Fixed
3.56.1 - 2026-3-10
Fixed
\\
API Contract Diff (GA)
No API surface changes — bug fix only. Contract file \API_3.56.1.txt\ is identical to \API_3.56.0.txt.
API Contract Diff (Preview)
No API surface changes — bug fix only. Contract file \API_3.57.0-preview.1.txt\ is identical to \API_3.57.0-preview.0.txt.
Checklist