Skip to content

Hotfix: Bumps version to 3.56.1#5690

Draft
NaluTripician wants to merge 2 commits intoreleases/3.56.1from
hotfix/3.56.1
Draft

Hotfix: Bumps version to 3.56.1#5690
NaluTripician wants to merge 2 commits intoreleases/3.56.1from
hotfix/3.56.1

Conversation

@NaluTripician
Copy link
Contributor

Hotfix 3.56.1

Cherry-picked PRs

  • #5613 CrossRegionHedgingAvailabilityStrategy: Fixes ArgumentNullException race condition in hedging cancellation

Version Changes

  • ClientOfficialVersion: 3.56.0 → 3.56.1
  • ClientPreviewVersion: 3.57.0 (unchanged)
  • ClientPreviewSuffixVersion: preview.0 → preview.1

Changelog

\\markdown

3.57.0-preview.1 - 2026-3-10

Fixed

  • [5613] CrossRegionHedgingAvailabilityStrategy: Fixes ArgumentNullException race condition in hedging cancellation

3.56.1 - 2026-3-10

Fixed

  • [5613] CrossRegionHedgingAvailabilityStrategy: Fixes ArgumentNullException race condition in hedging cancellation
    \\

API Contract Diff (GA)

No API surface changes — bug fix only. Contract file \API_3.56.1.txt\ is identical to \API_3.56.0.txt.

API Contract Diff (Preview)

No API surface changes — bug fix only. Contract file \API_3.57.0-preview.1.txt\ is identical to \API_3.57.0-preview.0.txt.

Checklist

  • Cherry-picks verified
  • Contract file added to hotfix branch
  • Contract file synced to master (separate PR)
  • Kiran sign-off obtained

FabianMeiswinkel and others added 2 commits March 10, 2026 13:16
… race condition in hedging cancellation (#5613)

## CrossRegionHedgingAvailabilityStrategy: Fixes `ArgumentNullException`
race condition in hedging cancellation

### Bug

Multiple production customers reported unobserved
`ArgumentNullException: Value cannot be null. (Parameter 'request')`
crashes originating from
`CrossRegionHedgingAvailabilityStrategy.RequestSenderAndResultCheckAsync`.
The exception was surfaced as a `TaskScheduler_UnobservedTaskException`,
crashing the process. The affected code paths were:

- `ContainerCore.ReadItemAsync` → `ReadItemStreamAsync` →
`ProcessItemStreamAsync` → `RequestInvokerHandler.SendAsync` →
`CrossRegionHedgingAvailabilityStrategy.ExecuteAvailabilityStrategyAsync`

### Root Cause

A **race condition** caused by passing the wrong `CancellationToken` to
the sender delegate:

1. `ExecuteAvailabilityStrategyAsync` creates a
`hedgeRequestsCancellationTokenSource` (linked to the app-provided CT)
to coordinate hedge request lifecycle.
2. `CloneAndSendAsync` clones the request inside a `using` block and
calls `RequestSenderAndResultCheckAsync`.
3. **Bug:** `RequestSenderAndResultCheckAsync` called
`sender.Invoke(request, cancellationToken)` with the
**application-provided `CancellationToken`** — not
`hedgeRequestsCancellationTokenSource.Token`.
4. When hedge Region B returned a final result (e.g., 200 OK),
`hedgeRequestsCancellationTokenSource.Cancel()` was called.
5. **But the in-flight sender for Region A still held the app CT**
(e.g., `CancellationToken.None`), which was **never cancelled**.
6. The `CloneAndSendAsync` `using` block exited, **disposing the cloned
request**.
7. The Region A sender continued executing with a reference to the
now-disposed request → **`ArgumentNullException: Value cannot be null.
(Parameter 'request')`**.

A secondary issue: when the application CT was cancelled (e2e timeout),
the hedge timer (linked to app CT) would fire, and the old code would
blindly continue the loop attempting to clone and send new requests on a
cancelled path.

### Fix

Two changes in `CrossRegionHedgingAvailabilityStrategy.cs`:

**1. Pass `hedgeRequestsCancellationTokenSource.Token` to
`sender.Invoke()` instead of the app CT**

This ensures that when **any** hedge gets a final result and calls
`hedgeRequestsCancellationTokenSource.Cancel()`, **all** in-flight
senders immediately see their CT cancelled and stop before the cloned
request is disposed. The `CancellationTokenSource` and
`CancellationToken` parameters were also consolidated into a single
`hedgeRequestsCancellationTokenSource` parameter passed through
`CloneAndSendAsync` → `RequestSenderAndResultCheckAsync`.

**2. Add `do/while` loop to handle spurious timer completions on app CT
cancellation**

When the app CT is cancelled (e2e timeout), the hedge timer fires via
the linked CTS. The old code would `continue` the loop and try to clone
a new request. The `do/while` loop now detects
`applicationProvidedCancellationToken.IsCancellationRequested` and falls
through to consolidate existing request outcomes instead of spawning new
hedges.

### Tests Added (8 new unit tests)

| Test | Validates |
|---|---|
| `HedgeCancellationCancelsInFlightRequests_NoNullRef` | Slow primary
request's CT is cancelled when a hedge returns a final result — core
regression test |
| `SenderReceivesHedgeCancellationToken_NotAppToken` | Captures the
actual CT passed to each sender and asserts all are from the hedge CTS,
not the app CT |
| `AppCancellationDuringHedging_DoesNotSpawnNewHedgeRequests` | E2e
timeout (app CT cancelled) does not spawn new hedge requests — validates
the do/while loop fix |
| `MultiRegionHedging_RequestNotAccessedAfterDisposal` | Verifies the
cloned request is still accessible when cancellation fires — exact
scenario from the crash reports |
| `HedgeCancellation_StreamRequest_NoNullRef` | Tests the stream-based
code path (ReadItemStreamAsync) from the NullRef2/NullRef3 stack traces
|
| `PrimaryRequestFinalResult_NoAdditionalHedgesSent` | Fast primary
response skips hedging entirely |
| `AllHedgesTransientError_ReturnsLastResponse` | All regions return
transient errors — strategy returns last response without NullRef |
| `ConcurrentHedgingRequests_NoNullRef` | Stress test: 50 concurrent
hedging requests with random delays — no NullRef under concurrency |

### Type of change

- [x] Bug fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Nalu Tripician <27316859+NaluTripician@users.noreply.github.com>
- Updates ClientOfficialVersion to 3.56.1
- Updates ClientPreviewSuffixVersion to preview.1
- Adds changelog entries for 3.56.1 and 3.57.0-preview.1
- Adds API contract files for 3.56.1 and 3.57.0-preview.1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the required format: "[Internal] Category: (Adds|Fixes|Refactors|Removes) Description"

Internal should be used for PRs that have no customer impact. This flag is used to help generate the changelog to know which PRs should be included. Examples:
Diagnostics: Adds GetElapsedClientLatency to CosmosDiagnostics
PartitionKey: Fixes null reference when using default(PartitionKey)
[v4] Client Encryption: Refactors code to external project
[Internal] Query: Adds code generator for CosmosNumbers for easy additions in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants