[bugfix]: fix inconsistency result when replay concurrency operations #348

Open
zhongkechen wants to merge 4 commits into main from concurrency

zhongkechen (Contributor) commented Apr 24, 2026

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Issue Link, if available

Fixes: #344

Description

This PR fixes two result-consistency issues that occur when concurrency operations are replayed:

  1. Branches/items that were skipped (neither succeeded nor failed) when the results were checkpointed could succeed or fail during replay. This occurs when a parallel operation or a map operation with large results has early termination conditions (min successful or max tolerated failures) and the items/branches complete in a different order during replay.

  2. Branches/items that had succeeded or failed when the results were checkpointed could be skipped during replay. This occurs when a parallel operation or a map operation with large results has early termination conditions (min successful or max tolerated failures) and, because of concurrency, more branches complete than the condition requires.
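
To illustrate the first issue, here is a minimal sketch of how a "min successful" condition makes the outcome depend on completion order. It is a simplified model, not the SDK's code: the class `CompletionOrderDemo`, its `run` method, and the assumption that every item succeeds are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not the SDK's code: every item succeeds, and the
// operation stops accepting results once minSuccessful items have succeeded.
class CompletionOrderDemo {
    enum Status { SUCCEEDED, SKIPPED }

    // completionOrder lists the items in the order in which they happen to finish.
    static Map<String, Status> run(List<String> completionOrder, int minSuccessful) {
        Map<String, Status> result = new LinkedHashMap<>();
        int succeeded = 0;
        for (String item : completionOrder) {
            if (succeeded < minSuccessful) {
                result.put(item, Status.SUCCEEDED);
                succeeded++;
            } else {
                // The early termination condition is already met.
                result.put(item, Status.SKIPPED);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Original execution: "c" finishes last and is skipped.
        System.out.println(run(List.of("a", "b", "c"), 2));
        // prints {a=SUCCEEDED, b=SUCCEEDED, c=SKIPPED}

        // Replay with a different completion order: "c" now succeeds and "b" is skipped.
        System.out.println(run(List.of("c", "a", "b"), 2));
        // prints {c=SUCCEEDED, a=SUCCEEDED, b=SKIPPED}
    }
}
```

The same three items produce two different results, which is the kind of divergence between the checkpointed result and the replayed result that this PR targets.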

Fixes:

  1. Always checkpoint the status of each branch/item in the concurrency operation's result.
  2. During replay, skip the items/branches whose checkpointed status is SKIPPED.
  3. During replay, always wait until the branches/items checkpointed as succeeded/failed have completed.
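
As a rough sketch of the first fix, the per-item statuses stored with the result can be summarized into the counts that replay needs. The types below (`CheckpointSketch`, `Status`, `CompletionStatus`, `summarize`) are hypothetical and only model the idea; the `completed` field is named after the `expectedCompletionStatus.completed` expression in the diff, but its exact definition in the SDK is an assumption here.

```java
import java.util.List;

// Hypothetical data model; these names are illustrative and are not the SDK's types.
class CheckpointSketch {
    enum Status { SUCCEEDED, FAILED, SKIPPED }

    // Counts derived from the per-item statuses stored with the checkpointed result.
    static final class CompletionStatus {
        final int succeeded;
        final int failed;
        final int skipped;
        final int completed; // assumed to mean succeeded + failed

        CompletionStatus(int succeeded, int failed, int skipped) {
            this.succeeded = succeeded;
            this.failed = failed;
            this.skipped = skipped;
            this.completed = succeeded + failed;
        }
    }

    // Summarize the checkpointed statuses, one entry per branch/item.
    static CompletionStatus summarize(List<Status> itemStatuses) {
        int succeeded = 0;
        int failed = 0;
        int skipped = 0;
        for (Status status : itemStatuses) {
            switch (status) {
                case SUCCEEDED: succeeded++; break;
                case FAILED: failed++; break;
                case SKIPPED: skipped++; break;
            }
        }
        return new CompletionStatus(succeeded, failed, skipped);
    }

    public static void main(String[] args) {
        CompletionStatus c = summarize(
                List.of(Status.SUCCEEDED, Status.SKIPPED, Status.FAILED, Status.SUCCEEDED));
        System.out.println(c.completed + " completed, " + c.skipped + " skipped");
        // prints 3 completed, 1 skipped
    }
}
```

With a record like this available, replay knows both which items must not run (the skipped ones) and how many items must finish before the operation may complete.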

Demo/Screenshots

Checklist

  • I have filled out every section of the PR template
  • I have thoroughly tested this change

Testing

Unit Tests

Have unit tests been written for these changes? Yes

Integration Tests

Have integration tests been written for these changes? Yes

Examples

Has a new example been added for the change? (if applicable)

@zhongkechen zhongkechen marked this pull request as ready for review April 24, 2026 22:14
@zhongkechen zhongkechen requested a review from a team April 24, 2026 22:14
SilanHe (Contributor) commented Apr 27, 2026

just for me, could you elaborate on what the fix is?

Fixed issues in replaying concurrency operations

skip the items/branches previously skipped to prevent a race condition that the skipped branches could succeed/fail
always wait until previously succeeded/failed branches to complete to prevent a race condition that the previously completed branches could be skipped

I'm having trouble understanding what this means

zhongkechen (Contributor, Author) commented

Updated description of the PR. Hope it's easier to understand.

@zhongkechen zhongkechen self-assigned this Apr 27, 2026
branches.add(childOp);
pendingQueue.add(childOp);
logger.debug("Item enqueued {}", name);
if (!skipped) {

Contributor:

Why even call enqueueItem if we're going to skip it anyway? Wouldn't this future never resolve?

Contributor Author:

branches will still contain the skipped items. We only skip adding them to pendingQueue, so that the skipped items are not executed.
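
A minimal sketch of the behavior described in this reply, assuming a simplified `enqueueItem(name, skipped)` signature and plain string items. The class `EnqueueSketch` is hypothetical; only the names `branches`, `pendingQueue`, `skipped`, and `enqueueItem` come from the thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for the enqueue logic discussed in this thread.
class EnqueueSketch {
    // Every item, including the ones marked as skipped.
    final List<String> branches = new ArrayList<>();
    // Only the items that will actually be executed.
    final Queue<String> pendingQueue = new ArrayDeque<>();

    void enqueueItem(String name, boolean skipped) {
        branches.add(name);
        if (!skipped) {
            // Skipped items never reach the queue, so they are never run.
            pendingQueue.add(name);
        }
    }

    public static void main(String[] args) {
        EnqueueSketch op = new EnqueueSketch();
        op.enqueueItem("a", false);
        op.enqueueItem("b", true); // checkpointed as SKIPPED
        op.enqueueItem("c", false);
        System.out.println(op.branches);     // prints [a, b, c]
        System.out.println(op.pendingQueue); // prints [a, c]
    }
}
```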

int failed = failedCount.get();

if (expectedCompletionStatus != null) {
    if (succeeded + failed >= expectedCompletionStatus.completed) {

Contributor:

What is this check for?

Contributor Author:

It makes sure that every item/branch checkpointed as succeeded or failed also succeeds or fails during replay.
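
One way to picture this check is sketched below. It is an assumption-laden simplification, not the PR's implementation: `CompletionCheckSketch` and `isDone` are hypothetical, only a "min successful" condition is modeled, and `expectedCompleted` stands in for `expectedCompletionStatus.completed` (null on first execution).

```java
// Hypothetical simplification of the completion check discussed in this thread.
class CompletionCheckSketch {
    // expectedCompleted is null on first execution; on replay it holds the number
    // of items that were checkpointed as succeeded or failed.
    static boolean isDone(int succeeded, int failed, int minSuccessful, Integer expectedCompleted) {
        if (expectedCompleted != null) {
            // Replay: wait for as many completions as were checkpointed,
            // instead of re-evaluating the early termination condition.
            return succeeded + failed >= expectedCompleted;
        }
        // First execution: stop as soon as the early termination condition is met.
        return succeeded >= minSuccessful;
    }

    public static void main(String[] args) {
        // First execution with minSuccessful = 2: two successes are enough.
        System.out.println(isDone(2, 0, 2, null)); // prints true
        // Replay where 3 items were checkpointed as completed: two are not enough yet.
        System.out.println(isDone(2, 0, 2, 3));    // prints false
        System.out.println(isDone(3, 0, 2, 3));    // prints true
    }
}
```

In the second case a third item had completed in the original run because of concurrency, so stopping after two completions on replay would wrongly turn it into a skipped item.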

Comment thread sdk/src/main/java/software/amazon/lambda/durable/operation/MapOperation.java Outdated
@zhongkechen zhongkechen requested review from a team and SilanHe April 27, 2026 23:55
@zhongkechen zhongkechen added the bug Something isn't working label Apr 27, 2026
@zhongkechen zhongkechen added this to the 1.1 - Apr 30 2026 milestone Apr 27, 2026
Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

[Bug]: inconsistent result when replaying concurrency operations

2 participants