Description
Code of Conduct
- I agree to follow this project's Code of Conduct
Search before asking
- I have searched in the issues and found no similar issues.
Describe the bug
In our cluster, pod deletion is denied by the webhook even though all applications were deleted a long time ago.
When I curl http://host:ip/metrics/server, app_num_with_node is 1.
The cause is that at least one application has leaked. The server log keeps repeating [INFO] ShuffleTaskManager.checkResourceStatus - Detect expired appId[appattempt_xxx_xx_xx] according to rss.server.app.expired.withoutHeartbeat.
Repeated jstack dumps of the server show clearResourceThread stuck indefinitely. Here is the call stack:
"clearResourceThread" #40 daemon prio=5 os_prio=0 cpu=3767.63ms elapsed=5393.50s tid=0x00007f24fe92e800 nid=0x8f waiting on condition [0x00007f24f7b33000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for <0x00007f28d5e29f20> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
        at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock([email protected]/ReentrantReadWriteLock.java:959)
        at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:756)
        at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
        at org.apache.uniffle.server.ShuffleTaskManager$$Lambda$216/0x00007f24f824cc40.run(Unknown Source)
        at java.lang.Thread.run([email protected]/Thread.java:829)
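A lock is evidently not being released: clearResourceThread is parked in WriteLock.lock() inside removeResources, and a write lock cannot be acquired while any read lock is still held. Looking at the code, the read lock taken in flushBuffer is not released correctly. The log line ShuffleBufferManager.flushBuffer - Shuffle[3066071] for app[appattempt_xxx] has already been removed, no need to flush the buffer is consistent with this, since it shows flushBuffer reaching the "already removed" path.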
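To illustrate the suspected failure mode, here is a minimal, self-contained sketch. It is not the Uniffle code: the class ReadLockLeakDemo, the methods flushBufferLeaky, flushBufferFixed and canRemoveResources, and the shuffleRemoved flag are all hypothetical names, and the early return is my assumption about how the read lock escapes release. It only demonstrates that a read lock leaked by one thread keeps another thread from ever getting the write lock, and that releasing in a finally block avoids this.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical reproduction; none of these names come from Uniffle.
public class ReadLockLeakDemo {
    // Stand-in for the per-application read/write lock.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Assumption: the shuffle was already removed when the flush runs.
    private final boolean shuffleRemoved = true;

    // Leaky variant: the early return skips unlock(), so the read lock stays held.
    public void flushBufferLeaky() {
        lock.readLock().lock();
        if (shuffleRemoved) {
            return; // "has already been removed, no need to flush the buffer"
        }
        // ... flush work would go here ...
        lock.readLock().unlock();
    }

    // Fixed variant: the finally block releases the read lock on every exit path.
    public void flushBufferFixed() {
        lock.readLock().lock();
        try {
            if (shuffleRemoved) {
                return;
            }
            // ... flush work would go here ...
        } finally {
            lock.readLock().unlock();
        }
    }

    // Stand-in for removeResources: a second thread tries to take the write lock.
    // A timed tryLock is used so the demo terminates instead of parking forever.
    public boolean canRemoveResources() {
        final boolean[] acquired = {false};
        Thread cleaner = new Thread(() -> {
            try {
                if (lock.writeLock().tryLock(200, TimeUnit.MILLISECONDS)) {
                    acquired[0] = true;
                    lock.writeLock().unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "clearResourceThread");
        cleaner.start();
        try {
            cleaner.join(); // join() makes the write to acquired[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return acquired[0];
    }

    public static void main(String[] args) {
        ReadLockLeakDemo leaky = new ReadLockLeakDemo();
        leaky.flushBufferLeaky();
        System.out.println("leaky: write lock acquired = " + leaky.canRemoveResources());

        ReadLockLeakDemo fixed = new ReadLockLeakDemo();
        fixed.flushBufferFixed();
        System.out.println("fixed: write lock acquired = " + fixed.canRemoveResources());
    }
}
```

With a plain lock() instead of the timed tryLock, the cleaner thread in the leaky case would park indefinitely, which matches the WAITING (parking) state in the jstack output above.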
Affects Version(s)
master
Uniffle Server Log Output
No response
Uniffle Engine Log Output
No response
Uniffle Server Configurations
No response
Uniffle Engine Configurations
No response
Additional context
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!