Describe the bug
After upgrading from OpenSearch 2.15.0 to 2.18.0, deleting a snapshot sometimes resulted in the snapshot_deletion threadpool's active thread count getting stuck at 1 after encountering a failure. Logs on the cluster_manager node showed:
[2025-05-13T23:53:31,217][INFO ][o.o.s.SnapshotsService ] [es-master-mis-30-2-0] deleting snapshots [smartsearch_min-2025-05-03.183003] from repository [s3_repository]
[2025-05-13T23:54:05,259][WARN ][o.o.r.b.BlobStoreRepository] [es-master-mis-30-2-0] [s3_repository] Failed to delete following blobs during snapshot delete : [Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][10], v2=__e05l2htCSi2L3NcdX5b2ww], Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][81], v2=__KNHrjCURQ0uCeIT35WyA_w], Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][56], v2=__yRISy4RpTL67SWuJVBbX4g]...
[2025-05-13T23:54:03,817][WARN ][o.o.r.b.BlobStoreRepository] [es-master-mis-30-2-0] [s3_repository] Failed to delete following blobs during snapshot delete : [Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][111], v2=__YW_-KjNvTzmcDBDL8Onm2Q], Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][147], v2=__-VaRQ2tTSEyD-8TLi2aFxw], Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][137], v2=__JBgcvFxWRH6vUN6pa_fXuw]...
[2025-05-13T23:54:04,347][WARN ][o.o.r.b.BlobStoreRepository] [es-master-mis-30-2-0] [s3_repository] Failed to delete following blobs during snapshot delete : [Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][24], v2=__AFiAbKSBR2Gq4XcaYqa-lA], Tuple [v1=[mis-30-2][indices][e7NMXjIlSOOS1MTWMTOr4Q][66], v2=__OHzTRuKpQPuarw2u3OtvGw]...
...
There were no pending tasks (checked via GET _cluster/pending_tasks). The thread count only returned to 0 when the cluster manager node was restarted. When I then retrieved the list of snapshots, the deleted snapshot was gone, but deleting the next snapshot resulted in the same behavior, with the "Failed to delete following blobs ..." log entry listing new blobs as well as blobs from the previous snapshot deletion. These log entries only listed the blobs that failed to be deleted, with no cause for the failure, and there were no other log entries related to the snapshot deletion error.
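For reference, the checks above were done with the standard cluster APIs; a minimal sketch (the column selection is illustrative):

```
GET _cat/thread_pool/snapshot_deletion?v&h=node_name,name,active,queue,rejected
GET _cluster/pending_tasks
```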
Also, I noticed that whenever I made a repository settings update, such as changing the chunk size, subsequent requests to retrieve the list of snapshots returned the following error:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_state_exception",
        "reason": "Connection pool shut down"
      }
    ],
    "type": "illegal_state_exception",
    "reason": "Connection pool shut down"
  },
  "status": 500
}
The request would work again only after restarting cluster manager nodes.
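A sketch of the sequence that triggers the error (the repository name is from the logs above; the bucket and chunk_size values are placeholders, not the actual settings used):

```
PUT _snapshot/s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-bucket",
    "chunk_size": "1gb"
  }
}

GET _snapshot/s3_repository/_all
```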
Related component
Storage:Snapshots
To Reproduce
- Create a snapshot management policy that takes a snapshot of the entire cluster every day at 12am and 12pm, with a deletion schedule of every day at 2am and 2pm. Keep a minimum of 2 and a maximum of 7 snapshots.
- Let policy take effect for a few days.
- Observe that the snapshot_deletion threadpool active thread count remains at a non-zero value.
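The policy described above corresponds roughly to the following Snapshot Management policy (the repository name is from the logs; the policy name and cron expressions are illustrative):

```
POST _plugins/_sm/policies/daily-snapshots
{
  "description": "Snapshot entire cluster twice daily, keep between 2 and 7 snapshots",
  "creation": {
    "schedule": {
      "cron": { "expression": "0 0,12 * * *", "timezone": "UTC" }
    }
  },
  "deletion": {
    "schedule": {
      "cron": { "expression": "0 2,14 * * *", "timezone": "UTC" }
    },
    "condition": { "min_count": 2, "max_count": 7 }
  },
  "snapshot_config": {
    "repository": "s3_repository"
  }
}
```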
Note that this behavior doesn't happen every time, but once it does, subsequent snapshot deletions (after a cluster manager node restart) run into the same behavior. Also, this symptom seems to occur only on our large clusters, which have daily indices, some with 250 shards and 3-4 TB in size.
Expected behavior
The snapshot_deletion threadpool active thread count should return to 0 even after encountering a failure.