Skip to content

[BUG] Thread_Blocked_Time metric calculation is incorrect in Metrics API #92

@eirsep

Description

@eirsep

What is the bug?
When a thread is blocked continuously for a prolonged period of time, its contribution to the average Thread_Blocked_Time metric is expected to increase; but the metrics API call keeps returning a constant value of 0.5 for the thread_blocked_time metric aggregated by average.

if (t.blockedTime != -1 && t.blockedCount > oldt.blockedCount) {
t.avgBlockedTime =
1.0e-3
* (t.blockedTime - oldt.blockedTime)
/ (t.blockedCount - oldt.blockedCount);
} else {
CircularLongArray arr = ThreadHistory.blockedTidHistoryMap.get(t.nativeTid);
// NOTE: this is an upper bound
if (arr != null) {
t.avgBlockedTime = 1.0 * arr.getAvgValue() / samplingInterval;
}
}

To calculate the average thread_blocked_time of a thread, the code first checks if the thread's blocked_count from the current interval's threaddump is greater than the thread's blocked_count from the previous interval's threaddump

The problem with this check is that if the thread has simply been blocked on the same monitor across both intervals, the blocked count doesn't increase but the blocked time would have increased by an amount equal to the threadumps interval duration. Hence the thread_blocked_time value is calculated by a default fallback logic rather than showing the the increase in blocked time.

How can one reproduce the bug?
Thread dumps for a cluster shows one of the many blocked search threads:

"opensearch[0fd2634cfd28ef09d9924ce73c3a64b1][search][T#1]" #169 daemon prio=5 os_prio=0 cpu=59.36ms elapsed=644.32s tid=0x00007f91040dd000 nid=0x6172 waiting for monitor entry [0x00007f90d07e1000]
java.lang.Thread.State: BLOCKED (on object monitor)
- waiting to lock <0x000000070e6fea30> (a java.lang.Object)
at org.opensearch.action.search.SearchProgressListener.notifyQueryResult(SearchProgressListener.java:135)
at org.opensearch.action.search.QueryPhaseResultConsumer.consumeResult(QueryPhaseResultConsumer.java:127)
at org.opensearch.action.search.AbstractSearchAsyncAction.onShardResult(AbstractSearchAsyncAction.java:486)
at org.opensearch.action.search.SearchQueryThenFetchAsyncAction.onShardResult(SearchQueryThenFetchAsyncAction.java:120)
at org.opensearch.action.search.AbstractSearchAsyncAction$1.innerOnResponse(AbstractSearchAsyncAction.java:261)
at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:58)
at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:42)
at org.opensearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:81)
at org.opensearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:49)
at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:67)
at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:413)
at org.opensearch.transport.TransportService$6.handleResponse(TransportService.java:658)
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1207)
at org.opensearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1285)
at org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1265)
at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:65)
at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:56)
at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:40)
at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:71)
at org.opensearch.action.ActionRunnable$$Lambda$3388/0x0000000801044040.accept(Unknown Source)
at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:86)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:765)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.6/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.6/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)

PA metrics api call for ThreadBlockedTime metric with operation as dimension
curl "http://localhost:9650/_opendistro/_performanceanalyzer/metrics?metrics=Thread_Blocked_Time&agg=avg&dim=Operation"
{"hLHH2jmQSrupS4MSyictKg": {"timestamp": 1630994665000, "data": {"fields":[{"name":"Operation","type":"VARCHAR"},{"name":"Thread_Blocked_Time","type":"DOUBLE"}],"records":[["GC",0.0],["flush",0.0],["generic",0.0],["get",0.0],["management",0.0],["other",0.0],["refresh",0.0],["search",0.5]]}}}

The thread_blocked_time value for the search threads is shown as 0.5 while multiple threads had been blocked for atleast a minute.

What is the expected behavior?
The thread_blocked_time metric in Metrics API should more accurately account for average thread blocked time.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions