[BUG] Search pipeline seen executing on transport_worker thread #10248

@austintlee

Describe the bug
I have a search processor that uses an org.opensearch.client.Client object to execute a TransportAction. The invocation returns a BaseFuture, and the processor calls .get() on it, blocking until the client returns a response. Occasionally, the get() call blocks indefinitely and brings the whole cluster into a bad state.

A thread dump on the node that hung revealed that the search processor was executing on a transport_worker thread.

"opensearch[opensearch-node1][transport_worker][T#2]" #32 daemon prio=5 os_prio=0 cpu=61810.65ms elapsed=43062.56s allocated=2771M defined_classes=251      tid=0x0000fffef4009140 nid=0x13d waiting on condition  [0x0000ffff88cda000]
    java.lang.Thread.State: WAITING (parking)
     at jdk.internal.misc.Unsafe.park(java.base@17.0.8/Native Method)
     - parking to wait for  <0x00000000ee75b640> (a org.opensearch.common.util.concurrent.BaseFuture$Sync)
     at java.util.concurrent.locks.LockSupport.park(java.base@17.0.8/LockSupport.java:211)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.8/AbstractQueuedSynchronizer.java:715)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.8/AbstractQueuedSynchronizer.java:1047)
     at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:272)
     at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:104)
     at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:74)
     at org.opensearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:55)
     at org.opensearch.searchpipelines.questionanswering.generative.llm.DefaultLlmImpl.doChatCompletion(DefaultLlmImpl.java:84)
     at org.opensearch.searchpipelines.questionanswering.generative.GenerativeQAResponseProcessor.processResponse(GenerativeQAResponseProcessor.java:109)
     at org.opensearch.search.pipeline.Pipeline.transformResponse(Pipeline.java:177)
     at org.opensearch.search.pipeline.PipelinedRequest.transformResponse(PipelinedRequest.java:31)
     at org.opensearch.action.search.TransportSearchAction.lambda$executeRequest$0(TransportSearchAction.java:398)
     at org.opensearch.action.search.TransportSearchAction$$Lambda$4884/0x0000003001d5d850.accept(Unknown Source)
     at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
     at org.opensearch.core.action.ActionListener$5.onResponse(ActionListener.java:268)
     at org.opensearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:671)
     at org.opensearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:132)
     at org.opensearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:428)
     at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:422)

In a few other thread dumps I captured, the same processor was running on a search thread.
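
One plausible way a blocking get() on a worker thread can hang is that the response has to be delivered by the very thread that is waiting for it. The sketch below illustrates that mechanism with plain java.util.concurrent types. It is not OpenSearch code: the single-threaded executor is only a stand-in for a transport worker, the class and method names are hypothetical, and whether this is the exact cause of the hang in the dump above is an assumption.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BlockingOnWorkerDemo {

    /** Returns true if a blocking wait on the single worker thread timed out. */
    static boolean blockingWaitTimesOut() {
        // Stand-in for a transport worker: one thread handles all the work.
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> outcome = worker.submit(() -> {
                CompletableFuture<String> response = new CompletableFuture<>();
                // The "response" can only be delivered by the same worker thread,
                // which is busy running this task, so it stays queued.
                worker.execute(() -> response.complete("ok"));
                try {
                    // The processor in the issue calls get() without a timeout;
                    // a timeout is used here only so that the demo terminates.
                    response.get(200, TimeUnit.MILLISECONDS);
                    return false;
                } catch (TimeoutException e) {
                    return true;
                }
            });
            return outcome.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocking wait timed out: " + blockingWaitTimesOut());
    }
}
```

Run as a single file, this prints `blocking wait timed out: true`. Without the timeout, the wait would never return.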

From:

return Transports.assertNotTransportThread(BLOCKING_OP_REASON)

And:

public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
    assert Transports.assertDefaultThreadContext(transport.getThreadPool().getThreadContext());
    assert Transports.assertTransportThread();
    assert msg instanceof ByteBuf : "Expected message type ByteBuf, found: " + msg.getClass();

From these I can see that blocking calls are not expected on transport threads, and that BaseFuture.get contains code specifically to disallow invocation on transport threads. However, the assert does not trigger unless the OpenSearch process is run with the -ea JVM flag.
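
To make the point about -ea concrete, here is a minimal sketch of an assert-based thread guard. It is a simplified stand-in, not the real Transports class: the name-based check, the marker string, and all class and method names are assumptions made for illustration.

```java
class TransportThreadAssertDemo {

    // Hypothetical marker, modelled on the thread name seen in the dump above.
    static final String WORKER_MARKER = "[transport_worker]";

    static boolean isTransportThread(Thread t) {
        return t.getName().contains(WORKER_MARKER);
    }

    /** Returns true or throws AssertionError, so it can be used inside an assert statement. */
    static boolean assertNotTransportThread(String reason) {
        Thread t = Thread.currentThread();
        assert isTransportThread(t) == false
            : "Expected [" + t.getName() + "] to not be a transport thread. Reason: [" + reason + "]";
        return true;
    }

    /** Runs the guard on a thread named like a transport worker and reports whether it fired. */
    static boolean guardFiresOnWorker() {
        boolean[] fired = { false };
        Thread t = new Thread(() -> {
            try {
                // Without -ea this whole statement is skipped, including the method call.
                assert assertNotTransportThread("Blocking operation");
            } catch (AssertionError e) {
                fired[0] = true;
            }
        }, "opensearch[node1][transport_worker][T#1]");
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return fired[0];
    }

    /** Standard idiom: the assignment only runs when assertions are enabled. */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true;
        return enabled;
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
        System.out.println("guard fired: " + guardFiresOnWorker());
    }
}
```

The guard fires only when the JVM is started with -ea. In a default production launch both lines report false, and the blocking call would proceed on the worker thread unchecked.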

Can we ensure that search processors run on search threads? Or are they allowed to run on transport threads, in which case I should not make any blocking calls in my search processor?
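
If processors can run on transport threads, one way to avoid blocking is to register a continuation on the future instead of calling get(). The sketch below shows the idea with CompletableFuture. It is not the OpenSearch processor API: callLlmAsync, processResponseAsync, and the two executors are hypothetical stand-ins chosen for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class NonBlockingProcessorDemo {

    /** Hypothetical stand-in for an asynchronous client call; completes on ioPool. */
    static CompletableFuture<String> callLlmAsync(String prompt, ExecutorService ioPool) {
        return CompletableFuture.supplyAsync(() -> "answer to: " + prompt, ioPool);
    }

    /** Registers a continuation instead of blocking the calling thread. */
    static CompletableFuture<String> processResponseAsync(String searchResponse, ExecutorService ioPool) {
        return callLlmAsync(searchResponse, ioPool)
            .thenApply(answer -> searchResponse + " | " + answer);
    }

    static String runDemo() {
        ExecutorService worker = Executors.newSingleThreadExecutor(); // stand-in for a transport worker
        ExecutorService ioPool = Executors.newSingleThreadExecutor(); // delivers the response
        try {
            // The worker thread only starts the call and returns immediately.
            CompletableFuture<String> result = CompletableFuture
                .supplyAsync(() -> processResponseAsync("hits", ioPool), worker)
                .thenCompose(f -> f);
            // Only the demo driver (the main thread) waits here, never the worker.
            return result.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdownNow();
            ioPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

This prints `hits | answer to: hits`. The worker thread is free again as soon as the call has been started, so it stays available to handle the response.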


Metadata

Labels

  • Search: Search query, autocomplete ...etc
  • Search:Relevance
  • bug: Something isn't working
  • v2.12.0: Issues and PRs related to version 2.12.0
  • v3.0.0: Issues and PRs related to version 3.0.0

Projects

Status

✅ Done
