
[BUG] [Remote Store] [Snapshots] Heavy Heap Usage on Master Node due to stuck snapshot deletions for Remote Store clusters #15065

@linuxpi

Description


Describe the bug

If segment uploads on a shard are stuck for any reason, an ongoing snapshot for that shard remains in the IN_PROGRESS state until we trigger a snapshot delete. Upon the delete trigger, the snapshot transitions to the ABORTED state.

On a non-Remote Store cluster, each snapshot thread on a data node snapshotting that shard periodically checks whether the snapshot is marked as aborted. If it is, the thread fails the shard snapshot, eventually transitioning the snapshot to a terminal failed state.

On a Remote Store cluster, the code flow is different since it is a shallow snapshot. For remote store shards, the snapshot thread waits for segment uploads to complete; if the uploads are stuck, the thread remains suspended until they resume. Hence, if the snapshot is aborted during this window, the thread never observes the abort and never fails the shard snapshot operation.

This leaves the snapshot in the ABORTED state indefinitely, until remote segment uploads resume. Snapshot deletion calls are blocking and wait indefinitely for the deletion to complete. They do so by queuing a listener that is invoked on deletion completion; this listener holds a reference to the cluster state object. Subsequent delete calls queue their own listeners, each holding a reference to the cluster state at that point in time. As a result, multiple cluster state objects accumulate on the heap and cannot be GC'ed until the listeners are invoked.

So if a snapshot is stuck because remote segment uploads are stuck, every retry of the delete also gets stuck, and each retry pins another cluster state reference in the heap, steadily increasing heap usage on the active master.
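The retention pattern described above can be sketched in a few lines. This is a minimal, illustrative simulation; `ClusterState`, `ActionListener`, and `deleteSnapshotBlocking` here are stand-ins, not the actual OpenSearch types. The point is only that each queued completion lambda strongly references the cluster state it captured, so nothing is reclaimable until the listeners fire.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class StuckDeleteRetention {

    // Stand-in for org.opensearch.cluster.ClusterState (a large object).
    public static class ClusterState {
        public final byte[] payload = new byte[1024 * 1024]; // ~1 MiB each
    }

    public interface ActionListener { void onDone(); }

    // Listeners queued for a deletion that never completes.
    public static final Deque<ActionListener> pendingListeners = new ArrayDeque<>();

    // Blocking delete: queues a completion listener that captures `state`.
    // The lambda strongly references `state` until it is invoked, so the
    // cluster state copy cannot be GC'ed while the deletion is stuck.
    public static void deleteSnapshotBlocking(ClusterState state) {
        pendingListeners.add(() -> {
            int retainedBytes = state.payload.length; // uses `state` on completion
        });
    }

    public static void main(String[] args) {
        // Each client retry observes a fresh cluster state and queues a new
        // listener; none are ever invoked while uploads remain stuck.
        for (int retry = 0; retry < 5; retry++) {
            deleteSnapshotBlocking(new ClusterState());
        }
        // 5 retries -> 5 cluster state copies (~5 MiB) pinned on the heap.
        System.out.println("pinned listeners: " + pendingListeners.size());
    }
}
```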

Related component

Storage:Snapshots

To Reproduce

  1. Simulate stuck segment uploads for a shard
  2. Trigger a snapshot; it gets stuck in the IN_PROGRESS state
  3. Trigger a delete for the same snapshot; the delete call times out and the delete operation gets stuck
  4. Keep sending delete snapshot requests and observe the active master's JVM heap usage increase
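Steps 2–4 can be driven with the standard snapshot APIs. A rough sketch against a test cluster follows; the repository name `repo` and snapshot name `snap-1` are placeholders, and it assumes a snapshot repository is already registered and step 1 (stuck uploads) is in effect.

```shell
# 2. Trigger a snapshot; with uploads stuck it stays IN_PROGRESS.
curl -X PUT "localhost:9200/_snapshot/repo/snap-1?wait_for_completion=false"

# Check the snapshot state (IN_PROGRESS, then ABORTED after the delete).
curl -X GET "localhost:9200/_snapshot/repo/snap-1/_status"

# 3. Trigger a delete; the call blocks and eventually times out.
curl -X DELETE "localhost:9200/_snapshot/repo/snap-1"

# 4. Repeat the delete, and watch heap usage grow on the active master.
curl -X GET "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent"
```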

Expected behavior

  • Delete snapshot calls can be made async, where the call only triggers the delete and the actual deletion happens in the background
  • Evaluate whether a check for an aborted snapshot is needed in the shallow copy snapshot code flow, so that snapshot operations can fail for shards whose remote uploads are stuck
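The second suggestion amounts to polling the abort flag while waiting on uploads, instead of parking unconditionally. A minimal sketch, under the assumption of hypothetical names (`awaitUploadsOrAbort` and its suppliers are illustrative, not the actual OpenSearch API):

```java
import java.util.function.BooleanSupplier;

public class AbortAwareWait {

    /**
     * Waits for remote segment uploads, re-checking the abort flag on each
     * poll. Returns true if uploads completed, false if the snapshot was
     * aborted first (or the wait gave up), so the caller can fail the
     * shard snapshot promptly instead of staying suspended.
     */
    public static boolean awaitUploadsOrAbort(BooleanSupplier uploadsComplete,
                                              BooleanSupplier aborted,
                                              long pollMillis,
                                              int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (aborted.getAsBoolean()) {
                return false; // abort observed: fail the shard snapshot
            }
            if (uploadsComplete.getAsBoolean()) {
                return true;  // uploads finished: proceed with the snapshot
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // gave up waiting
    }
}
```

With such a check in place, aborting a stuck snapshot would let the shard snapshot fail and the queued deletion listeners fire, releasing the retained cluster states.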

Additional Details

Thread dump

"opensearch[095012d6c1bfe3db7b0f51c4bb31f9d7][snapshot_shards][T#1]" #482 daemon prio=5 os_prio=0 cpu=104.25ms elapsed=2910.84s tid=0x0000ffff584f4ad0 nid=0x198a waiting on condition  [0x0000ffff15a5a000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.9/Native Method)
        - parking to wait for  <0x0000000610011bf0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(java.base@17.0.9/LockSupport.java:211)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.9/AbstractQueuedSynchronizer.java:715)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.9/AbstractQueuedSynchronizer.java:938)
        at java.util.concurrent.locks.ReentrantLock$Sync.lock(java.base@17.0.9/ReentrantLock.java:153)
        at java.util.concurrent.locks.ReentrantLock.lock(java.base@17.0.9/ReentrantLock.java:322)
        at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:238)
        at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1860)
        at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1975)
        at org.opensearch.index.engine.InternalEngine.acquireLastIndexCommit(InternalEngine.java:2199)
        at org.opensearch.index.shard.IndexShard.acquireLastIndexCommit(IndexShard.java:1695)
        at org.opensearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:675)
        at org.opensearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:393)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractPrioritizedRunnable.doRun(ThreadContext.java:979)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.9/ThreadPoolExecutor.java:1136)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.9/ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)

Screenshots
[Screenshot: active master JVM heap usage, 2024-05-22 3:29 PM]

Host/Environment (please complete the following information):

  • OS: Amazon Linux 2
  • Version: OS_2.11

Additional context
N/A
