Implements the updated Retry Behavior behind a feature flag#4391
Implements the updated Retry Behavior behind a feature flag#4391muhammad-othman wants to merge 2 commits intodevelopmentfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Implements SEP Retry Behavior 2.1 updates in the AWS .NET SDK behind the AWS_NEW_RETRIES_2026 feature flag, including revised retry-quota accounting, updated backoff timing, long-polling backoff-on-quota-exhaustion, service-specific behavior (DynamoDB), and support for the x-amz-retry-after header.
Changes:
- Adds feature-flagged SEP 2.1 retry behavior (capacity costs, backoff formula, retry-after clamping, long-polling special-case).
- Adjusts DynamoDB/DynamoDB Streams default retry count handling and generator metadata.
- Adds a new unit test suite covering SEP 2.1 standard mode scenarios and updates existing capacity/quota tests for the new CapacityManager constructor.
Reviewed changes
Copilot reviewed 15 out of 17 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/test/UnitTests/Custom/Runtime/RetryHandlerStandardModeTests.cs | Updates CapacityManager construction in standard-mode retry tests. |
| sdk/test/UnitTests/Custom/Runtime/RetryHandlerStandardMode21Tests.cs | Adds new SEP 2.1 test suite and test-only retry policy/jitter control. |
| sdk/test/UnitTests/Custom/Runtime/RetryHandlerAdaptiveModeTests.cs | Updates CapacityManager construction in adaptive-mode retry tests. |
| sdk/test/UnitTests/Custom/CapacityManagerTests.cs | Updates CapacityManager construction to match new constructor semantics. |
| sdk/src/Services/DynamoDBv2/Generated/AmazonDynamoDBConfig.cs | Removes generated MaxErrorRetry default for DynamoDB config. |
| sdk/src/Services/DynamoDBStreams/Generated/AmazonDynamoDBStreamsConfig.cs | Removes generated MaxErrorRetry default for DynamoDB Streams config. |
| sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/_async/StandardRetryPolicy.cs | Routes async wait logic through new executionContext-aware delay calculation. |
| sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/_async/RetryPolicy.cs | Adds feature-flagged capacity typing + retry-after header capture in async retry path. |
| sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/_async/AdaptiveRetryPolicy.cs | Routes async adaptive wait logic through new executionContext-aware delay calculation. |
| sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/StandardRetryPolicy.cs | Implements SEP 2.1 delay computation, DynamoDB max-retry override, and new CapacityManager defaults behind the flag. |
| sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/RetryPolicy.cs | Adds feature flag and x-amz-retry-after extraction + context storage. |
| sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/RetryHandler.cs | Adds long-polling backoff behavior and adjusts error logging attempt value. |
| sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/DefaultRetryPolicy.cs | Updates legacy policy CapacityManager instantiation to new constructor. |
| sdk/src/Core/Amazon.Runtime/CapacityManager.cs | Adds throttling capacity type and replaces constructor parameters/semantics. |
| generator/ServiceModels/streams.dynamodb/metadata.json | Removes max-retries generator metadata for DynamoDB Streams. |
| generator/ServiceModels/dynamodb/metadata.json | Removes max-retries generator metadata for DynamoDB. |
| generator/.DevConfigs/94c889f3-af3a-4d8f-940a-0205846afe42.json | Adds dev config entry for the change. |
| public CapacityManager(int initialRetryTokens, int retryCost, int noRetryIncrement, int timeoutRetryCost, int throttlingRetryCost) | ||
| { | ||
| retryCost = throttleRetryCost; | ||
| initialRetryTokens = throttleRetryCount; | ||
| noRetryIncrement = throttleCost; | ||
| this.retryCost = retryCost; | ||
| this.initialRetryTokens = initialRetryTokens; | ||
| this.noRetryIncrement = noRetryIncrement; |
There was a problem hiding this comment.
Details can be found in the breaking change assessment
There was a problem hiding this comment.
I get the reasoning but this class is not in the Internal namespace... I don't think customers would be using these constructors but perhaps the Obsolete makes sense here.
There was a problem hiding this comment.
But if customers aren't using it, why do we need to keep an additional obsolete constructor when we can remove it?
There was a problem hiding this comment.
I didn't say customers aren't using it, just that it's unlikely.
We've definitely seen other libraries take dependencies on components of the SDK that were supposed to be internal, and breaking them is not a good customer experience at all.
| if (UseNewRetries2026) | ||
| { | ||
| executionContext.RequestContext.LastCapacityType = IsThrottlingError(exception) ? | ||
| CapacityManager.CapacityType.Throttling : CapacityManager.CapacityType.Retry; | ||
| StoreRetryAfterHeader(executionContext, exception); | ||
| } | ||
| else | ||
| { | ||
| executionContext.RequestContext.LastCapacityType = IsServiceTimeoutError(exception) ? | ||
| CapacityManager.CapacityType.Timeout : CapacityManager.CapacityType.Retry; | ||
| } | ||
| return OnRetry(executionContext, isClockSkewError, IsThrottlingError(exception)); |
There was a problem hiding this comment.
This is by design for SEP 2.1, the spec only has RETRY_COST and THROTTLING_RETRY_COST, no timeout category, will keep open for others to review.
| if (UseNewRetries2026) | ||
| { | ||
| executionContext.RequestContext.LastCapacityType = IsThrottlingError(exception) ? | ||
| CapacityManager.CapacityType.Throttling : CapacityManager.CapacityType.Retry; | ||
| StoreRetryAfterHeader(executionContext, exception); |
There was a problem hiding this comment.
The missing else block in the async RetryAsync() is a pre-existing bug, not part of SEP 2.1. The sync Retry() method already had the LastCapacityType = IsServiceTimeoutError ? Timeout : Retry assignment, but the async path was missing it entirely. As This is a pre-existing gap I think it should be addressed separately to keep this PR without any changes for customers that don't enable UseNewRetries2026, will keep this comment open for others to review.
There was a problem hiding this comment.
Agree it makes sense to fix this on a separate PR.
4cfb2b7 to
311f8b8
Compare
| Backward compatibility review required - approval needed from one of the reviewers: normj, boblodgett |
311f8b8 to
d177b4d
Compare
| { | ||
| "core": { | ||
| "changeLogMessages": [ | ||
| "Implements the updated Retry Behavior behind a feature flag (AWS_NEW_RETRIES_2026)." |
There was a problem hiding this comment.
I know it's probably not available yet, but would be nice to link to the SDK guide (https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html). Looking at this entry how would I know what the "updated retry behavior" is?
There was a problem hiding this comment.
I don't think the changes will be available if we are planning to release it this week or the week after that.
| public CapacityManager(int initialRetryTokens, int retryCost, int noRetryIncrement, int timeoutRetryCost, int throttlingRetryCost) | ||
| { | ||
| retryCost = throttleRetryCost; | ||
| initialRetryTokens = throttleRetryCount; | ||
| noRetryIncrement = throttleCost; | ||
| this.retryCost = retryCost; | ||
| this.initialRetryTokens = initialRetryTokens; | ||
| this.noRetryIncrement = noRetryIncrement; |
There was a problem hiding this comment.
I get the reasoning but this class is not in the Internal namespace... I don't think customers would be using these constructors but perhaps the Obsolete makes sense here.
| if (UseNewRetries2026) | ||
| { | ||
| executionContext.RequestContext.LastCapacityType = IsThrottlingError(exception) ? | ||
| CapacityManager.CapacityType.Throttling : CapacityManager.CapacityType.Retry; | ||
| StoreRetryAfterHeader(executionContext, exception); |
There was a problem hiding this comment.
Agree it makes sense to fix this on a separate PR.
| /// Defaults to false. This flag will be removed at end of 2026 when the new | ||
| /// behavior becomes the default. | ||
| /// </summary> | ||
| internal static bool UseNewRetries2026 { get; set; } = |
There was a problem hiding this comment.
I don't see it in this PR (or the SEP...), but would it make sense to update the user agent to include whether the new retry option was set? (Otherwise how would we validate customers are using it?)
There was a problem hiding this comment.
This needs to be added to the feature ids list first, maybe they were planning to do so when removing the feature flag, but let me check and get back to you.
| { | ||
| // SEP 2.1: Long-polling operations must always back off when the error is | ||
| // retryable, even if retry quota is exhausted. | ||
| if (IsLongPollingOperation(executionContext) |
There was a problem hiding this comment.
Kiro found this which seems worth checking:
Long-polling backoff incorrectly triggers on max-attempts-exceeded path (RetryHandler.cs, both sync and async)
The long-polling guard checks !shouldRetry && IsLongPollingOperation && IsLastExceptionRetryable, but shouldRetry is false for both quota-exhausted and max-attempts-exceeded cases. The spec pseudocode only sleeps on the quota-exhausted path:
if not HasRetryQuota(response)
if IsLongPollingOperation():
sleep(ExponentialBackoff(attempts - 1))
return response
When max attempts is reached, the spec returns immediately with no sleep. But the current code adds an unnecessary backoff delay before throwing. The fix needs to distinguish why shouldRetry is false — quota exhaustion vs. max attempts vs. non-retryable error.
There was a problem hiding this comment.
Didn't consider this one, will need to fix it.
There was a problem hiding this comment.
Updated to account for max-attempts-exceeded when IsLongPollingOperation is true.
bef6550 to
c80f03d
Compare
c80f03d to
1d4c292
Compare
| Assert.AreEqual(1, Tester.CallCount); | ||
|
|
||
| // Should still back off even with empty quota — uses 1000ms throttling base | ||
| retryPolicy.AssertDelaysMatch(new int[] { 1000 }); |
There was a problem hiding this comment.
The test in the SEP for this scenario (Long-Polling Backoff After Throttling Error When Token Bucket Empty) has delay: 0.05 but you have 1000 here. Is that intentional?
There was a problem hiding this comment.
Reached out to update the SEP, 1000ms should be the correct delay.
Description
This PR implements the Retry Behavior 2.1 SEP specification updates for the AWS SDK for .NET, gated behind a feature flag (
AWS_NEW_RETRIES_2026environment variable). When enabled, the SDK uses updated retry parameters including revised backoff timing, updated retry quota costs, service-specific configurations, long-polling backoff behavior, andx-amz-retry-afterheader support.All changes are backward-compatible, the feature flag defaults to
false, preserving existing behavior. When the flag is removed at end of 2026, the new behavior will become the default.Motivation and Context
DOTNET-8570Testing
RetryHandlerStandardMode21Tests.cswith 18 test cases covering all SEP 2.1 standard mode scenarios.Dry-runs
Breaking Changes Assessment
1. Identify all breaking changes
This PR modifies several APIs in the
Amazon.Runtime.Internalnamespace. While technically public (due to C# accessibility), these are internal SDK implementation details not intended for external consumption.Change 1:
CapacityManager— Old constructors removedCapacityManager(throttleRetryCount, throttleRetryCost, throttleCost)and 4-parameterCapacityManager(..., timeoutRetryCost)constructors were replaced with a single 5-parameter constructorCapacityManager(initialRetryTokens, retryCost, noRetryIncrement, timeoutRetryCost, throttlingRetryCost).CapacityManageris in theAmazon.Runtime.Internalnamespace. No service DLL or external code constructsCapacityManagerdirectly, it is only instantiated byStandardRetryPolicyandDefaultRetryPolicywithin the Core assembly.initialRetryTokensfromthrottleRetryCount * throttleRetryCost, which was confusing and hid the actual token count. The new constructor takes the total token count directly, matching the SEP 2.1 specification terminology.Internalnamespace convention signals these are not part of the public API contract.mainbranch, then ran them with the new Core DLL, real AWS calls (S3 ListBuckets, STS GetCallerIdentity) succeeded without anyMissingMethodExceptionorTypeLoadExceptionChange 2:
StandardRetryPolicy.CalculateRetryDelay— Signature changedCalculateRetryDelay(int retries, int maxBackoffInMilliseconds)(protected static) was replaced withCalculateRetryDelay(IExecutionContext, int maxBackoffInMilliseconds)(internal static) to support service-aware backoff and x-amz-retry-after header.CalculateRetryDelay(int)with a different signature.IExecutionContextto determine the service type (DynamoDB vs others), error type (throttling vs transient), and x-amz-retry-after header value.Change 3:
StandardRetryPolicy.WaitBeforeRetry(int, int)— RemovedWaitBeforeRetry(int retries, int maxBackoffInMilliseconds)was removed. The virtualWaitBeforeRetry(IExecutionContext)override remains unchanged.WaitBeforeRetry(IExecutionContext)method, not the static helper.MissingMethodException.Change 4: DynamoDB/DynamoDB Streams
MaxErrorRetrydefaultMaxErrorRetry = 10assignment was removed from the generatedAmazonDynamoDBConfig/AmazonDynamoDBStreamsConfigconstructors. The retry policy now sets the default internally: 3 retries whenUseNewRetries2026is enabled, 10 retries when disabled.UseNewRetries2026is disabled (default), DynamoDB still defaults to 10 retries, no behavioral change. When enabled, it uses 3 retries per SEP 2.1.DynamoDBBaseBackoffAndIncreasedRetriesverifies 4 max attempts (3 retries) under the new flag.Screenshots (if appropriate)
Types of changes
Checklist
License