Skip to content

Implements the updated Retry Behavior behind a feature flag#4391

Open
muhammad-othman wants to merge 2 commits intodevelopmentfrom
muhamoth/DOTNET-8570-new-retries
Open

Implements the updated Retry Behavior behind a feature flag#4391
muhammad-othman wants to merge 2 commits intodevelopmentfrom
muhamoth/DOTNET-8570-new-retries

Conversation

@muhammad-othman
Copy link
Copy Markdown
Member

@muhammad-othman muhammad-othman commented Apr 16, 2026

Description

This PR implements the Retry Behavior 2.1 SEP specification updates for the AWS SDK for .NET, gated behind a feature flag (AWS_NEW_RETRIES_2026 environment variable). When enabled, the SDK uses updated retry parameters including revised backoff timing, updated retry quota costs, service-specific configurations, long-polling backoff behavior, and x-amz-retry-after header support.

All changes are backward-compatible, the feature flag defaults to false, preserving existing behavior. When the flag is removed at end of 2026, the new behavior will become the default.

Motivation and Context

DOTNET-8570

Testing

  • Added RetryHandlerStandardMode21Tests.cs with 18 test cases covering all SEP 2.1 standard mode scenarios.
  • All existing retry tests (standard mode, adaptive mode, capacity manager) continue to pass.

Dry-runs

  • DotNet Dry-run ID: 9bb5b6b7-b20c-4d74-a527-6ff284ed5070
    • Pending
    • Completed successfully
    • Failed
  • PowerShell Dry-run ID: d81ac223-6110-4dea-b584-fc56348066f9
    • Pending
    • Completed successfully
    • Failed

Breaking Changes Assessment

1. Identify all breaking changes

This PR modifies several APIs in the Amazon.Runtime.Internal namespace. While technically public (due to C# accessibility), these are internal SDK implementation details not intended for external consumption.

Change 1: CapacityManager — Old constructors removed

  • What functionality was changed? The 3-parameter CapacityManager(throttleRetryCount, throttleRetryCost, throttleCost) and 4-parameter CapacityManager(..., timeoutRetryCost) constructors were replaced with a single 5-parameter constructor CapacityManager(initialRetryTokens, retryCost, noRetryIncrement, timeoutRetryCost, throttlingRetryCost).
  • How will this impact customers? No impact. CapacityManager is in the Amazon.Runtime.Internal namespace. No service DLL or external code constructs CapacityManager directly, it is only instantiated by StandardRetryPolicy and DefaultRetryPolicy within the Core assembly.
  • Why does this need to be a breaking change? The old constructors computed initialRetryTokens from throttleRetryCount * throttleRetryCost, which was confusing and hid the actual token count. The new constructor takes the total token count directly, matching the SEP 2.1 specification terminology.
  • Are best practices being followed? Yes. The Internal namespace convention signals these are not part of the public API contract.
  • How have you tested this?
    • All 47 unit tests pass (18 SEP 2.1 + 13 standard + 12 adaptive + 4 capacity manager)
    • All service DLLs (EC2, S3, DynamoDB, STS) compile successfully against the new Core
    • Binary compatibility test: Built old S3/STS service DLLs from main branch, then ran them with the new Core DLL, real AWS calls (S3 ListBuckets, STS GetCallerIdentity) succeeded without any MissingMethodException or TypeLoadException

Change 2: StandardRetryPolicy.CalculateRetryDelay — Signature changed

  • What functionality was changed? CalculateRetryDelay(int retries, int maxBackoffInMilliseconds) (protected static) was replaced with CalculateRetryDelay(IExecutionContext, int maxBackoffInMilliseconds) (internal static) to support service-aware backoff and x-amz-retry-after header.
  • How will this impact customers? No impact. No service DLL or known external code calls this method. DynamoDB has its own private CalculateRetryDelay(int) with a different signature.
  • Why does this need to be a breaking change? The new formula requires IExecutionContext to determine the service type (DynamoDB vs others), error type (throttling vs transient), and x-amz-retry-after header value.
  • How have you tested this? Same as above, binary compatibility test confirmed no runtime errors.

Change 3: StandardRetryPolicy.WaitBeforeRetry(int, int) — Removed

  • What functionality was changed? The public static helper WaitBeforeRetry(int retries, int maxBackoffInMilliseconds) was removed. The virtual WaitBeforeRetry(IExecutionContext) override remains unchanged.
  • How will this impact customers? No impact. All service retry policies use the virtual WaitBeforeRetry(IExecutionContext) method, not the static helper.
  • How have you tested this? Binary compatibility test with old service DLLs confirmed no MissingMethodException.

Change 4: DynamoDB/DynamoDB Streams MaxErrorRetry default

  • What functionality was changed? The MaxErrorRetry = 10 assignment was removed from the generated AmazonDynamoDBConfig/AmazonDynamoDBStreamsConfig constructors. The retry policy now sets the default internally: 3 retries when UseNewRetries2026 is enabled, 10 retries when disabled.
  • How will this impact customers? When UseNewRetries2026 is disabled (default), DynamoDB still defaults to 10 retries, no behavioral change. When enabled, it uses 3 retries per SEP 2.1.
  • How have you tested this? Unit test DynamoDBBaseBackoffAndIncreasedRetries verifies 4 max attempts (3 retries) under the new flag.
  1. Has a senior/+ engineer been assigned to review this PR?

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project
  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have read the README document
  • I have added tests to cover my changes
  • All new and existing tests passed

License

  • I confirm that this pull request can be released under the Apache 2 license

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements SEP Retry Behavior 2.1 updates in the AWS .NET SDK behind the AWS_NEW_RETRIES_2026 feature flag, including revised retry-quota accounting, updated backoff timing, long-polling backoff-on-quota-exhaustion, service-specific behavior (DynamoDB), and support for the x-amz-retry-after header.

Changes:

  • Adds feature-flagged SEP 2.1 retry behavior (capacity costs, backoff formula, retry-after clamping, long-polling special-case).
  • Adjusts DynamoDB/DynamoDB Streams default retry count handling and generator metadata.
  • Adds a new unit test suite covering SEP 2.1 standard mode scenarios and updates existing capacity/quota tests for the new CapacityManager constructor.

Reviewed changes

Copilot reviewed 15 out of 17 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
sdk/test/UnitTests/Custom/Runtime/RetryHandlerStandardModeTests.cs Updates CapacityManager construction in standard-mode retry tests.
sdk/test/UnitTests/Custom/Runtime/RetryHandlerStandardMode21Tests.cs Adds new SEP 2.1 test suite and test-only retry policy/jitter control.
sdk/test/UnitTests/Custom/Runtime/RetryHandlerAdaptiveModeTests.cs Updates CapacityManager construction in adaptive-mode retry tests.
sdk/test/UnitTests/Custom/CapacityManagerTests.cs Updates CapacityManager construction to match new constructor semantics.
sdk/src/Services/DynamoDBv2/Generated/AmazonDynamoDBConfig.cs Removes generated MaxErrorRetry default for DynamoDB config.
sdk/src/Services/DynamoDBStreams/Generated/AmazonDynamoDBStreamsConfig.cs Removes generated MaxErrorRetry default for DynamoDB Streams config.
sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/_async/StandardRetryPolicy.cs Routes async wait logic through new executionContext-aware delay calculation.
sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/_async/RetryPolicy.cs Adds feature-flagged capacity typing + retry-after header capture in async retry path.
sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/_async/AdaptiveRetryPolicy.cs Routes async adaptive wait logic through new executionContext-aware delay calculation.
sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/StandardRetryPolicy.cs Implements SEP 2.1 delay computation, DynamoDB max-retry override, and new CapacityManager defaults behind the flag.
sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/RetryPolicy.cs Adds feature flag and x-amz-retry-after extraction + context storage.
sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/RetryHandler.cs Adds long-polling backoff behavior and adjusts error logging attempt value.
sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/DefaultRetryPolicy.cs Updates legacy policy CapacityManager instantiation to new constructor.
sdk/src/Core/Amazon.Runtime/CapacityManager.cs Adds throttling capacity type and replaces constructor parameters/semantics.
generator/ServiceModels/streams.dynamodb/metadata.json Removes max-retries generator metadata for DynamoDB Streams.
generator/ServiceModels/dynamodb/metadata.json Removes max-retries generator metadata for DynamoDB.
generator/.DevConfigs/94c889f3-af3a-4d8f-940a-0205846afe42.json Adds dev config entry for the change.

Comment on lines +79 to +83
public CapacityManager(int initialRetryTokens, int retryCost, int noRetryIncrement, int timeoutRetryCost, int throttlingRetryCost)
{
retryCost = throttleRetryCost;
initialRetryTokens = throttleRetryCount;
noRetryIncrement = throttleCost;
this.retryCost = retryCost;
this.initialRetryTokens = initialRetryTokens;
this.noRetryIncrement = noRetryIncrement;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Details can be found in the breaking change assessment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the reasoning but this class is not in the Internal namespace... I don't think customers would be using these constructors but perhaps the Obsolete makes sense here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if customers aren't using it, why do we need to keep an additional obsolete constructor when we can remove it?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't say customers aren't using it, just that it's unlikely.

We've definitely seen other libraries take dependencies on components of the SDK that were supposed to be internal, and breaking them is not a good customer experience at all.

Comment thread sdk/src/Services/DynamoDBv2/Generated/AmazonDynamoDBConfig.cs
Comment on lines +159 to 170
if (UseNewRetries2026)
{
executionContext.RequestContext.LastCapacityType = IsThrottlingError(exception) ?
CapacityManager.CapacityType.Throttling : CapacityManager.CapacityType.Retry;
StoreRetryAfterHeader(executionContext, exception);
}
else
{
executionContext.RequestContext.LastCapacityType = IsServiceTimeoutError(exception) ?
CapacityManager.CapacityType.Timeout : CapacityManager.CapacityType.Retry;
}
return OnRetry(executionContext, isClockSkewError, IsThrottlingError(exception));
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is by design for SEP 2.1, the spec only has RETRY_COST and THROTTLING_RETRY_COST, no timeout category, will keep open for others to review.

Comment thread sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/RetryHandler.cs Outdated
Comment on lines +53 to +57
if (UseNewRetries2026)
{
executionContext.RequestContext.LastCapacityType = IsThrottlingError(exception) ?
CapacityManager.CapacityType.Throttling : CapacityManager.CapacityType.Retry;
StoreRetryAfterHeader(executionContext, exception);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The missing else block in the async RetryAsync() is a pre-existing bug, not part of SEP 2.1. The sync Retry() method already had the LastCapacityType = IsServiceTimeoutError ? Timeout : Retry assignment, but the async path was missing it entirely. As This is a pre-existing gap I think it should be addressed separately to keep this PR without any changes for customers that don't enable UseNewRetries2026, will keep this comment open for others to review.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree it makes sense to fix this on a separate PR.

@muhammad-othman muhammad-othman force-pushed the muhamoth/DOTNET-8570-new-retries branch 2 times, most recently from 4cfb2b7 to 311f8b8 Compare April 17, 2026 00:05
@github-actions
Copy link
Copy Markdown

Backward compatibility review required - approval needed from one of the reviewers: normj, boblodgett

@muhammad-othman muhammad-othman force-pushed the muhamoth/DOTNET-8570-new-retries branch from 311f8b8 to d177b4d Compare April 17, 2026 01:05
{
"core": {
"changeLogMessages": [
"Implements the updated Retry Behavior behind a feature flag (AWS_NEW_RETRIES_2026)."
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's probably not available yet, but would be nice to link to the SDK guide (https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html). Looking at this entry how would I know what the "updated retry behavior" is?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the changes will be available if we are planning to release it this week or the week after that.

Comment thread sdk/src/Core/Amazon.Runtime/Pipeline/RetryHandler/RetryHandler.cs
Comment on lines +79 to +83
public CapacityManager(int initialRetryTokens, int retryCost, int noRetryIncrement, int timeoutRetryCost, int throttlingRetryCost)
{
retryCost = throttleRetryCost;
initialRetryTokens = throttleRetryCount;
noRetryIncrement = throttleCost;
this.retryCost = retryCost;
this.initialRetryTokens = initialRetryTokens;
this.noRetryIncrement = noRetryIncrement;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the reasoning but this class is not in the Internal namespace... I don't think customers would be using these constructors but perhaps the Obsolete makes sense here.

Comment on lines +53 to +57
if (UseNewRetries2026)
{
executionContext.RequestContext.LastCapacityType = IsThrottlingError(exception) ?
CapacityManager.CapacityType.Throttling : CapacityManager.CapacityType.Retry;
StoreRetryAfterHeader(executionContext, exception);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree it makes sense to fix this on a separate PR.

/// Defaults to false. This flag will be removed at end of 2026 when the new
/// behavior becomes the default.
/// </summary>
internal static bool UseNewRetries2026 { get; set; } =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see it in this PR (or the SEP...), but would it make sense to update the user agent to include whether the new retry option was set? (Otherwise how would we validate customers are using it?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be added to the feature ids list first, maybe they were planning to do so when removing the feature flag, but let me check and get back to you.

Comment thread generator/.DevConfigs/94c889f3-af3a-4d8f-940a-0205846afe42.json Outdated
{
// SEP 2.1: Long-polling operations must always back off when the error is
// retryable, even if retry quota is exhausted.
if (IsLongPollingOperation(executionContext)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kiro found this which seems worth checking:

Long-polling backoff incorrectly triggers on max-attempts-exceeded path (RetryHandler.cs, both sync and async)                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                     
  The long-polling guard checks !shouldRetry && IsLongPollingOperation && IsLastExceptionRetryable, but shouldRetry is false for both quota-exhausted and max-attempts-exceeded cases. The spec pseudocode only sleeps on the quota-exhausted path:                                                                  
                                                                                                                                                                                                                                                                                                                     
  if not HasRetryQuota(response)                                                                                                                                                                                                                                                                                     
      if IsLongPollingOperation():                                                                                                                                                                                                                                                                                   
          sleep(ExponentialBackoff(attempts - 1))                                                                                                                                                                                                                                                                    
      return response                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                     
  When max attempts is reached, the spec returns immediately with no sleep. But the current code adds an unnecessary backoff delay before throwing. The fix needs to distinguish why shouldRetry is false — quota exhaustion vs. max attempts vs. non-retryable error.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't consider this one, will need to fix it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to account for max-attempts-exceeded when IsLongPollingOperation is true.

@muhammad-othman muhammad-othman force-pushed the muhamoth/DOTNET-8570-new-retries branch from bef6550 to c80f03d Compare April 21, 2026 17:33
@muhammad-othman muhammad-othman force-pushed the muhamoth/DOTNET-8570-new-retries branch from c80f03d to 1d4c292 Compare April 21, 2026 20:32
Assert.AreEqual(1, Tester.CallCount);

// Should still back off even with empty quota — uses 1000ms throttling base
retryPolicy.AssertDelaysMatch(new int[] { 1000 });
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test in the SEP for this scenario (Long-Polling Backoff After Throttling Error When Token Bucket Empty) has delay: 0.05 but you have 1000 here. Is that intentional?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reached out to update the SEP, 1000ms should be the correct delay.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants