-
Notifications
You must be signed in to change notification settings - Fork 58
[BUG] Thread_Blocked_Time metric calculation is incorrect in Metrics API #92
Description
What is the bug?
When a thread is blocked continuously for a prolonged period of time, its contribution to the average Thread_Blocked_Time metric is expected to increase; but the metrics API call keeps returning a constant value of 0.5 for the thread_blocked_time metric aggregated by average.
performance-analyzer-rca/src/main/java/org/opensearch/performanceanalyzer/jvm/ThreadList.java
Lines 282 to 293 in eeb537f
| if (t.blockedTime != -1 && t.blockedCount > oldt.blockedCount) { | |
| t.avgBlockedTime = | |
| 1.0e-3 | |
| * (t.blockedTime - oldt.blockedTime) | |
| / (t.blockedCount - oldt.blockedCount); | |
| } else { | |
| CircularLongArray arr = ThreadHistory.blockedTidHistoryMap.get(t.nativeTid); | |
| // NOTE: this is an upper bound | |
| if (arr != null) { | |
| t.avgBlockedTime = 1.0 * arr.getAvgValue() / samplingInterval; | |
| } | |
| } |
To calculate the average thread_blocked_time of a thread, the code first checks if the thread's blocked_count from the current interval's threaddump is greater than the thread's blocked_count from the previous interval's threaddump
The problem with this check is that if the thread has simply been blocked on the same monitor across both intervals, the blocked count doesn't increase but the blocked time would have increased by an amount equal to the threadumps interval duration. Hence the thread_blocked_time value is calculated by a default fallback logic rather than showing the the increase in blocked time.
How can one reproduce the bug?
Thread dumps for a cluster shows one of the many blocked search threads:
"opensearch[0fd2634cfd28ef09d9924ce73c3a64b1][search][T#1]" #169 daemon prio=5 os_prio=0 cpu=59.36ms elapsed=644.32s tid=0x00007f91040dd000 nid=0x6172 waiting for monitor entry [0x00007f90d07e1000]
java.lang.Thread.State: BLOCKED (on object monitor)
- waiting to lock <0x000000070e6fea30> (a java.lang.Object)
at org.opensearch.action.search.SearchProgressListener.notifyQueryResult(SearchProgressListener.java:135)
at org.opensearch.action.search.QueryPhaseResultConsumer.consumeResult(QueryPhaseResultConsumer.java:127)
at org.opensearch.action.search.AbstractSearchAsyncAction.onShardResult(AbstractSearchAsyncAction.java:486)
at org.opensearch.action.search.SearchQueryThenFetchAsyncAction.onShardResult(SearchQueryThenFetchAsyncAction.java:120)
at org.opensearch.action.search.AbstractSearchAsyncAction$1.innerOnResponse(AbstractSearchAsyncAction.java:261)
at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:58)
at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:42)
at org.opensearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:81)
at org.opensearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:49)
at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:67)
at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:413)
at org.opensearch.transport.TransportService$6.handleResponse(TransportService.java:658)
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1207)
at org.opensearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1285)
at org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1265)
at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:65)
at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:56)
at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:40)
at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:71)
at org.opensearch.action.ActionRunnable$$Lambda$3388/0x0000000801044040.accept(Unknown Source)
at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:86)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:765)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.6/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.6/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)
PA metrics api call for ThreadBlockedTime metric with operation as dimension
curl "http://localhost:9650/_opendistro/_performanceanalyzer/metrics?metrics=Thread_Blocked_Time&agg=avg&dim=Operation"
{"hLHH2jmQSrupS4MSyictKg": {"timestamp": 1630994665000, "data": {"fields":[{"name":"Operation","type":"VARCHAR"},{"name":"Thread_Blocked_Time","type":"DOUBLE"}],"records":[["GC",0.0],["flush",0.0],["generic",0.0],["get",0.0],["management",0.0],["other",0.0],["refresh",0.0],["search",0.5]]}}}
The thread_blocked_time value for the search threads is shown as 0.5 while multiple threads had been blocked for atleast a minute.
What is the expected behavior?
The thread_blocked_time metric in Metrics API should more accurately account for average thread blocked time.