Skip to content

[BUG] Reduce TaskBatcher excessive logging in DEBUG mode #12249

@amkhar

Description

@amkhar

Describe the bug

if (toExecute.isEmpty() == false) {
final String tasksSummary = processTasksBySource.entrySet().stream().map(entry -> {
String tasks = updateTask.describeTasks(entry.getValue());
return tasks.isEmpty() ? entry.getKey() : entry.getKey() + "[" + tasks + "]";
}).reduce((s1, s2) -> s1 + ", " + s2).orElse("");

While executing a pending task, we first try to log the task summary. If the pending task batchingKey has 200K tasks in the linked list, we'll end up collecting task summary of all those tasks. This takes 10 minutes which blocks the overall execution of all the tasks. This summary is being used only for logging purpose in debug setting only.

Ideally we should not log excessive even in debug mode as calculating log string is taking minutes.

Related component

Cluster Manager

To Reproduce

  1. Create 200K primary shards in a cluster which can take load of these many shards
  2. kill opensearch process on all cluster manager nodes, so it triggers reroute flow
  3. You'll see shards will go in init mode quickly
  4. But actually starting these shards will take more than 10 minutes

Expected behavior

Ideally we should not log the same thing if content is same just the shardId is different. We should short circuit and log a smaller sized string to avid this delay.

Additional Details

Hot/threads

::: {6bb0987818c65f26bf4a1028fbc6d538}{l9ltzN3dT5GCddpWAkPLYg}{1JNBmxGoSrqq3Jpe4ywfgg}<redacted>
   Hot threads at 2024-02-06T16:08:40.878Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   99.8% (498.8ms out of 500ms) cpu usage by thread 'opensearch[6bb0987818c65f26bf4a1028fbc6d538][clusterManagerService#updateTask][T#1]'
     10/10 snapshots sharing following 15 elements
       java.base@17.0.6/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
       java.base@17.0.6/java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1850)
       java.base@17.0.6/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
       java.base@17.0.6/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
       java.base@17.0.6/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
       java.base@17.0.6/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
       java.base@17.0.6/java.util.stream.ReferencePipeline.reduce(ReferencePipeline.java:662)
       app//org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:202)
       app//org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:243)
       app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:756)
       app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282)
       app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245)
       java.base@17.0.6/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
       java.base@17.0.6/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       java.base@17.0.6/java.lang.Thread.run(Thread.java:833)

Metadata

Metadata

Assignees

Labels

Cluster ManagerbugSomething isn't workingv2.16.0Issues and PRs related to version 2.16.0

Type

No type

Projects

Status

✅ Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions