Clear Stale Persistent Tasks in Stop/Pause API#1629

Open
mohit10011999 wants to merge 10 commits into opensearch-project:main from mohit10011999:stalePersistentTasks
Conversation

@mohit10011999
Contributor

@mohit10011999 mohit10011999 commented Jan 25, 2026

Description

When CCR is stopped or paused, all index and shard replication tasks should be stopped. If the stop/pause does not complete successfully, some replication tasks may keep running, which can cause conflicts when replication is restarted or resumed.

  1. StaleTaskUtils provides shared detection and removal of unassigned replication tasks by index name. Both the Start and Resume APIs call StaleTaskUtils.removeStaleTasksForIndex before creating new tasks. Detection uses exact regex matching for both the replication:index:{name} and replication:[{name}][{shard}] formats. ResourceNotFoundException is handled gracefully so that removal stays idempotent.
  2. The stop action wraps every step in try/catch so that failures are non-fatal and logged. The flow is: remove block → close index → retention leases → cluster state update → reopen → remove stale tasks → delete metadata. All operations are idempotent: calling stop multiple times, or on a non-replicated index, succeeds.
  3. validateNoActiveMetadata in the Start action checks the cluster state for existing replication metadata. It returns actionable errors for RUNNING/PAUSED states ("use resume API") and for STOPPED/FAILED states ("run Stop API to clean up"). The check runs after the existing hasIndex check to preserve the original ordering of error messages.
  4. Idempotent stop and stale task removal are implemented, and two integration tests are updated to reflect the idempotent stop behavior. Concurrency safety relies on OpenSearch's built-in atomic cluster state updates.
  5. The SecurityCustomRolesIT test "test for FOLLOWER that STOP replication works for user with valid permissions" expected stopReplication("follower-index1") to throw a ResponseException with "No replication in progress", but with the idempotent stop changes the stop API now succeeds silently when there is no active replication. The test now simply calls stopReplication and expects success, matching the pattern already applied in StopReplicationIT. The same issue is fixed in SingleClusterSanityIT.
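
The non-fatal step handling described in item 2 could look roughly like the sketch below. It is illustrative only: Step, StepOutcome, runStopSteps and exampleStopSteps are hypothetical names, and the step bodies are placeholders rather than the PR's actual cluster calls. Only the step order is taken from the description.

```kotlin
// Hypothetical sketch: these names are illustrative and do not come from the PR.
// Real step bodies would call the corresponding cluster APIs.
data class Step(val name: String, val action: () -> Unit)

data class StepOutcome(val name: String, val error: Exception?)

// Runs every step in order. A failing step is logged and recorded,
// but it never prevents the remaining steps from running.
fun runStopSteps(steps: List<Step>): List<StepOutcome> =
    steps.map { step ->
        try {
            step.action()
            StepOutcome(step.name, null)
        } catch (e: Exception) {
            println("stop step '${step.name}' failed: ${e.message}")
            StepOutcome(step.name, e)
        }
    }

// Step order taken from the PR description; the bodies are placeholders.
// "close index" throws here to simulate a partially stopped index.
fun exampleStopSteps(): List<Step> = listOf(
    Step("remove block") {},
    Step("close index") { throw IllegalStateException("index is already closed") },
    Step("retention leases") {},
    Step("cluster state update") {},
    Step("reopen") {},
    Step("remove stale tasks") {},
    Step("delete metadata") {},
)
```

Because each step is isolated, a second call to stop after a partial failure still reaches the later cleanup steps, which is what makes repeated calls safe under this assumption.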

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Member

@ankitkala ankitkala left a comment

Two major pieces of feedback:

  • The PR has a lot of additional code changes which don't seem to be related to the actual change. Can you remove all the unnecessary changes so it's easier to review?
  • Stale replication tasks are a problem when you're trying to create the task again (start or resume). I think we should be able to simplify by just handling this during task creation here (need to verify though, with any stack trace from the last few occurrences of the issue).

@mohit10011999 mohit10011999 force-pushed the stalePersistentTasks branch 23 times, most recently from f4aa9ba to 43629d5 Compare January 28, 2026 19:30
@mohit10011999 mohit10011999 force-pushed the stalePersistentTasks branch 2 times, most recently from db9f5df to 5aa641f Compare February 3, 2026 17:12
@mohit10011999 mohit10011999 force-pushed the stalePersistentTasks branch 3 times, most recently from 758eca1 to e8b534f Compare February 13, 2026 08:07
@ankitkala ankitkala enabled auto-merge (squash) February 21, 2026 11:50
auto-merge was automatically disabled March 23, 2026 13:56

Head branch was pushed to by a user without write access

@mohit10011999 mohit10011999 force-pushed the stalePersistentTasks branch 4 times, most recently from de6493e to 7e641a6 Compare March 23, 2026 16:15
mohitamg and others added 4 commits March 24, 2026 20:33
Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
This reverts commit 0e4b126.

Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
@mohit10011999 mohit10011999 reopened this Mar 25, 2026
@mohit10011999 mohit10011999 force-pushed the stalePersistentTasks branch 3 times, most recently from 00dbad5 to c274e16 Compare March 26, 2026 15:52
mohitamg and others added 5 commits March 27, 2026 14:30
Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
…ld clear all stale replication metadata

Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
…tion is already running, test start replication succeeds after stop cleans up, test idempotent stop replication can be called multiple times and test stop replication cleans up and allows restart

Signed-off-by: Mohit Kumar <mohitamg@amazon.com>
    ?: return emptyList()

return allTasks.tasks().filter { task ->
    isReplicationTaskForIndex(task, indexName) && !task.isAssigned
Member

One of the current issues is that assigned tasks are not getting cleared. We need to clear the assigned tasks as well.
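
To make the request concrete, here is a minimal sketch of the filter with the assignment check made optional. It is not the PR's code: TaskInfo is a hypothetical stand-in for the persistent task type, tasksToRemove and includeAssigned are invented for illustration, and a numeric shard number is assumed. Only the two name formats (replication:index:{name} and replication:[{name}][{shard}]) come from the PR description.

```kotlin
// Hypothetical stand-in for a persistent task entry; the real type lives in OpenSearch.
data class TaskInfo(val id: String, val isAssigned: Boolean)

// Matches the two formats named in the PR description:
//   replication:index:{name}   and   replication:[{name}][{shard}]
// Assumption: the shard part is a plain number.
fun isReplicationTaskForIndex(task: TaskInfo, indexName: String): Boolean {
    // Quote the index name so characters such as '.' or '-' are matched literally.
    val name = Regex.escape(indexName)
    val indexTask = Regex("replication:index:$name")
    val shardTask = Regex("replication:\\[$name\\]\\[\\d+\\]")
    // matches() requires the whole id to match, so "logs" never matches "logs-2".
    return indexTask.matches(task.id) || shardTask.matches(task.id)
}

// With includeAssigned = false this mirrors the filter under review
// (unassigned tasks only); with true it also selects assigned tasks.
fun tasksToRemove(
    tasks: List<TaskInfo>,
    indexName: String,
    includeAssigned: Boolean,
): List<TaskInfo> =
    tasks.filter { task ->
        isReplicationTaskForIndex(task, indexName) && (includeAssigned || !task.isAssigned)
    }
```

Whether assigned tasks can be removed safely while a node is still running them is a separate question that this sketch does not address.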

4 participants