Describe the bug
After upgrading from OpenSearch 2.15.0 to 2.18.0, deleting a snapshot sometimes resulted in the snapshot_deletion threadpool's active thread count getting stuck at 1 after encountering a failure. Logs on the cluster_manager node showed:
[2025-05-13T23:53:31,217][INFO ][o.o.s.SnapshotsService ] [es-master-mis-30-2-0] deleting snapshots [smartsearch_min-2025-05-03.183003] from repository [s3_repository]
[2025-05-13T23:54:05,259][WARN ][o.o.r.b.BlobStoreRepository] [es-master-mis-30-2-0] [s3_repository] Failed to delete following blobs during snapshot delete : [Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][10], v2=__e05l2htCSi2L3NcdX5b2ww], Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][81], v2=__KNHrjCURQ0uCeIT35WyA_w], Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][56], v2=__yRISy4RpTL67SWuJVBbX4g]...
[2025-05-13T23:54:03,817][WARN ][o.o.r.b.BlobStoreRepository] [es-master-mis-30-2-0] [s3_repository] Failed to delete following blobs during snapshot delete : [Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][111], v2=__YW_-KjNvTzmcDBDL8Onm2Q], Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][147], v2=__-VaRQ2tTSEyD-8TLi2aFxw], Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][137], v2=__JBgcvFxWRH6vUN6pa_fXuw]...
[2025-05-13T23:54:04,347][WARN ][o.o.r.b.BlobStoreRepository] [es-master-mis-30-2-0] [s3_repository] Failed to delete following blobs during snapshot delete : [Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][24], v2=__AFiAbKSBR2Gq4XcaYqa-lA], Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][66], v2=__OHzTRuKpQPuarw2u3OtvGw]...
...
There were no pending tasks (checked via GET _cluster/pending_tasks). The thread count only returned to 0 when the cluster manager node was restarted. When I then retrieved the list of snapshots, the deleted snapshot was gone, but deleting the next snapshot resulted in the same behavior, with the "Failed to delete following blobs ..." log entry listing new blobs as well as blobs from the previous snapshot deletion. These log entries only listed the blobs that failed to be deleted, with no cause for the failure, and there were no other log entries related to the snapshot deletion error.
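For reference, the checks above were done with the standard cluster APIs; a minimal sketch (the column selection is illustrative):

```
GET _cat/thread_pool/snapshot_deletion?v&h=node_name,name,active,queue,rejected
GET _cluster/pending_tasks
```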
Also, I noticed that whenever I made a repository settings update, such as changing the chunk size, subsequent requests to retrieve the list of snapshots returned the following error:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_state_exception",
        "reason": "Connection pool shut down"
      }
    ],
    "type": "illegal_state_exception",
    "reason": "Connection pool shut down"
  },
  "status": 500
}
The request would work again only after restarting cluster manager nodes.
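A sketch of the sequence that triggers the error (the repository name is from the logs above; the bucket and chunk_size values are placeholders, not the actual settings used):

```
PUT _snapshot/s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-bucket",
    "chunk_size": "1gb"
  }
}

GET _snapshot/s3_repository/_all
```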
Related component
Storage:Snapshots
To Reproduce
- Create a snapshot management policy that takes a snapshot of the entire cluster every day at 12am and 12pm, with a deletion schedule of every day at 2am and 2pm. Keep a minimum of 2 and a maximum of 7 snapshots.
- Let policy take effect for a few days.
- Observe that the snapshot_deletion threadpool active thread count remains at a non-zero value.
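The policy described above corresponds roughly to the following Snapshot Management policy (the repository name is from the logs; the policy name and cron expressions are illustrative):

```
POST _plugins/_sm/policies/daily-snapshots
{
  "description": "Snapshot entire cluster twice daily, keep between 2 and 7 snapshots",
  "creation": {
    "schedule": {
      "cron": { "expression": "0 0,12 * * *", "timezone": "UTC" }
    }
  },
  "deletion": {
    "schedule": {
      "cron": { "expression": "0 2,14 * * *", "timezone": "UTC" }
    },
    "condition": { "min_count": 2, "max_count": 7 }
  },
  "snapshot_config": {
    "repository": "s3_repository"
  }
}
```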
Note that this behavior doesn't happen every time, but once it does, subsequent snapshot deletions (after a cluster manager node restart) run into the same behavior. Also, this symptom seems to occur only on our large clusters, which have daily indices, some with 250 shards and 3-4 TB in size.
Expected behavior
The snapshot_deletion threadpool active thread count should return to 0 even after encountering a failure.