Skip to content

Introduce ConcurrentQueryProfiler to profile query using concurrent segment search path and support concurrency during rewrite and create weight#10352

Merged
reta merged 2 commits intoopensearch-project:mainfrom
ticheng-aws:cs-rewrite-time
Oct 20, 2023
Merged

Introduce ConcurrentQueryProfiler to profile query using concurrent segment search path and support concurrency during rewrite and create weight#10352
reta merged 2 commits intoopensearch-project:mainfrom
ticheng-aws:cs-rewrite-time

Conversation

@ticheng-aws
Copy link
Copy Markdown
Contributor

@ticheng-aws ticheng-aws commented Oct 4, 2023

Description

The rewrite_time field contains the cumulative total rewrite time for a query. In the concurrent search case, the issue is “rewrite” happened on "before" (i.e. non concurrent) and "during" the concurrent path. There is only 1 rewrite timer shared across all slices, which lead the racing condition happened in the multiple threadings case.

To resolve the issue, we need to have a distinct rewrite timer for each query, then compute the sum of the non-overlapping time across all slices and add the rewrite time from the non-concurrent path.

A similar race condition happened on building the profile tree stack during the create weight step. To address this issue, I added threadToProfileTree map into the profiler. This way, each thread has its own profile tree, ensuring that the stack won't access tree nodes from other threads. In the end, we simply combine all the trees into the profileResults list.

Related Issues

Resolves #9787

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • GitHub issue/PR created in OpenSearch documentation repo for the required public documentation changes (#)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Oct 4, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Oct 4, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Oct 4, 2023

Compatibility status:

Checks if related components are compatible with change 47a5802

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git]

@github-actions
Copy link
Copy Markdown
Contributor

Gradle Check (Jenkins) Run Completed with:

@ticheng-aws
Copy link
Copy Markdown
Contributor Author

ticheng-aws commented Oct 13, 2023

Gradle Check (Jenkins) Run Completed with:

This is a known issue #10006

@github-actions
Copy link
Copy Markdown
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Copy Markdown
Contributor

Gradle Check (Jenkins) Run Completed with:

@reta
Copy link
Copy Markdown
Contributor

reta commented Oct 19, 2023

@ticheng-aws I think the APIs are much cleaner now with separate profilers, I have only minor comment to cleanup the AbstractQueryProfileTree::stopAndAddRewriteTime a bit (if it makes sense) but LGTM otherwise, thank you!

@github-actions
Copy link
Copy Markdown
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Copy Markdown
Contributor

Gradle Check (Jenkins) Run Completed with:

@reta
Copy link
Copy Markdown
Contributor

reta commented Oct 19, 2023

@ticheng-aws could you please rebase? thank you

@reta
Copy link
Copy Markdown
Contributor

reta commented Oct 19, 2023

@sohami mind please rechecking this pull request? thank you.

…current segment search (opensearch-project#10352)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Copy Markdown
Contributor

Gradle Check (Jenkins) Run Completed with:

@ticheng-aws
Copy link
Copy Markdown
Contributor Author

Gradle Check (Jenkins) Run Completed with:

Tests with failures:
 - org.opensearch.cluster.allocation.ClusterRerouteIT.testDelayWithALargeAmountOfShards
 - org.opensearch.remotestore.RemoteStoreClusterStateRestoreIT.testFullClusterRestoreGlobalMetadata
 - org.opensearch.search.SearchWeightedRoutingIT.testShardRoutingWithNetworkDisruption_FailOpenEnabled

Known issues #10558, #10749, #10673

@github-actions
Copy link
Copy Markdown
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Ticheng Lin <ticheng@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all}
      1 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation

@reta
Copy link
Copy Markdown
Contributor

reta commented Oct 20, 2023

#5329
#9332

@opensearch-trigger-bot
Copy link
Copy Markdown
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-10352-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 200ad5d28a577877be530ecab507601898025c5c
# Push it to GitHub
git push --set-upstream origin backport/backport-10352-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-10352-to-2.x.

@reta
Copy link
Copy Markdown
Contributor

reta commented Oct 20, 2023

@ticheng-aws sadly backport to 2.x failed, could you please send a manual pull request? thank you

@sohami
Copy link
Copy Markdown
Contributor

sohami commented Oct 20, 2023

@ticheng-aws sadly backport to 2.x failed, could you please send a manual pull request? thank you

@ticheng-aws We will need to backport all the previous PRs for concurrent profiler as well. I think during 2.11 we didn't do it. But lets do it now. Then it will go thru for this one as well.

@reta
Copy link
Copy Markdown
Contributor

reta commented Oct 20, 2023

@ticheng-aws We will need to backport all the previous PRs for concurrent profiler as well. I think during 2.11 we didn't do it. But lets do it now. Then it will go thru for this one as well.

Oh wait , is it because we run into NPEs?

@ticheng-aws
Copy link
Copy Markdown
Contributor Author

@ticheng-aws We will need to backport all the previous PRs for concurrent profiler as well. I think during 2.11 we didn't do it. But lets do it now. Then it will go thru for this one as well.

Oh wait , is it because we run into NPEs?

You are right. We need to backport #9248, #10111, #10547, and the current one #10352.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 2.x Backport to 2.x branch backport-failed bug Something isn't working flaky-test Random test failure that succeeds on second run v3.0.0 Issues and PRs related to version 3.0.0

Projects

None yet

4 participants