optimize doc-level monitor workflow for index patterns by sbcd90 · Pull Request #1097 · opensearch-project/alerting

sbcd90 · 2023-08-17T23:19:19Z

Description of changes:
This PR addresses the following performance problems in the doc-level monitor execution workflow.

In the current workflow, if the doc-level monitor is monitoring an index pattern and a new index matching the pattern is created, the monitor duplicates all the field mappings and queries for that index. This is reproducible using integ test https://github.com/opensearch-project/alerting/pull/1097/files#diff-9b08d53e8cff3d739beb6ba304e4eaf466f08412762c3cd55886a52b722a77f1R503

Now, say we have 1000 queries and each concrete index behind the index pattern has 1000 field mappings. Also, let's assume 1 concrete index is generated every day. The default limit on the number of field mappings an index can have is 1000, and today, if the number of field mappings in the query index goes over 1000, we roll over the query index.

This means we create a new rollover query index every day and ingest the same 1000 queries into it every day. In 30 days, we create 30 indices (by default, 1 primary and 1 replica shard per index) which contain 30000 duplicate queries.
This fills up the data nodes, resulting in a cluster crash.
opensearch-project/security-analytics#509
https://forum.opensearch.org/t/security-analytics-error/14639/11
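
The growth described above can be sketched as simple arithmetic. This is only an illustrative model using the numbers assumed in the description (1000 queries, one rollover per day, 1 primary plus 1 replica shard per index); `QueryIndexGrowth` and `growthAfter` are hypothetical names, not part of the plugin.

```kotlin
// Hypothetical model of the scenario in the description; not plugin code.
data class QueryIndexGrowth(val queryIndices: Int, val storedQueries: Int, val shards: Int)

fun growthAfter(days: Int, queriesPerMonitor: Int = 1000, shardsPerIndex: Int = 2): QueryIndexGrowth {
    // Assumption from the description: each new daily index adds 1000 field
    // mappings, which hits the 1000-field limit and forces one rollover per day.
    val queryIndices = days
    return QueryIndexGrowth(
        queryIndices = queryIndices,
        // Every rollover index receives the full set of queries again.
        storedQueries = queryIndices * queriesPerMonitor,
        // 1 primary + 1 replica per index by default.
        shards = queryIndices * shardsPerIndex
    )
}
```

Under these assumptions, `growthAfter(30)` gives 30 query indices, 30000 stored queries and 60 shards, matching the figures quoted in the description.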

We do not notice this problem for a small number of queries, but the duplication of queries piles up over time.
This PR addresses this issue by continuously updating a single set of queries for all the concrete indices belonging to an index pattern. This provides a huge storage optimization.

In the current workflow, for every concrete index, we make an update mapping API call in no particular order. So, say index test1 has fields f1 & f2, index test2 has field f4, and both of them match pattern test*. If we make the first update mapping call for test1, the query index gets f1 & f2, but the next update mapping call for test2 completely overwrites it with f4. This is reproducible using integ test https://github.com/opensearch-project/alerting/pull/1097/files#diff-9b08d53e8cff3d739beb6ba304e4eaf466f08412762c3cd55886a52b722a77f1R551

This PR addresses this issue by first collecting the field mappings of all concrete indices belonging to an index pattern, and then making a single call to the update mappings API.
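
A minimal sketch of the difference, assuming mappings are simplified to a flat field-name to field-type map. `applySequentially` and `collectThenUpdate` are hypothetical helpers written for illustration; they model the behaviour described above and do not call any OpenSearch API.

```kotlin
// Old behaviour as described: each per-index update replaces whatever the
// previous call wrote, so only the last index's fields survive.
fun applySequentially(perIndex: List<Map<String, String>>): Map<String, String> =
    perIndex.fold(emptyMap<String, String>()) { _, next -> next }

// New behaviour as described: gather the fields of every concrete index
// first, then issue one update that carries the union of all of them.
fun collectThenUpdate(perIndex: List<Map<String, String>>): Map<String, String> =
    perIndex.fold(emptyMap<String, String>()) { acc, next -> acc + next }
```

With test1 contributing f1 and f2 and test2 contributing f4, the first function ends with only f4, while the second ends with f1, f2 and f4. It also needs one update call per pattern instead of one per concrete index.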

In the current workflow, for every concrete index, we make an update mapping API call in no particular order. So, if there are 100 concrete indices behind an index pattern, we make 100 update mapping API calls.

This PR optimizes the time complexity by making a single call to the update mappings API for each index pattern.

Here is what the optimization looks like: [image not preserved]

CheckList:

Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov · 2023-08-17T23:46:40Z

Codecov Report

Merging #1097 (6d5b661) into main (778e7ce) will increase coverage by 0.05%.
Report is 2 commits behind head on main.
The diff coverage is 92.15%.

❗ Current head 6d5b661 differs from pull request most recent head 4a85d13. Consider uploading reports for the commit 4a85d13 to get more accurate results

@@             Coverage Diff              @@
##               main    #1097      +/-   ##
============================================
+ Coverage     67.72%   67.77%   +0.05%     
  Complexity      105      105              
============================================
  Files           160      160              
  Lines         10343    10363      +20     
  Branches       1522     1521       -1     
============================================
+ Hits           7005     7024      +19     
- Misses         2672     2673       +1     
  Partials        666      666

Files Changed	Coverage Δ
.../kotlin/org/opensearch/alerting/util/IndexUtils.kt	`79.31% <ø> (-0.69%)`	⬇️
...opensearch/alerting/util/DocLevelMonitorQueries.kt	`77.27% <88.00%> (+1.79%)`	⬆️
.../opensearch/alerting/DocumentLevelMonitorRunner.kt	`84.53% <96.15%> (+0.42%)`	⬆️

... and 4 files with indirect coverage changes

eirsep · 2023-08-23T21:35:04Z

+                    monitorCtx.clusterService!!,
+                    monitorCtx.indexNameExpressionResolver!!
+                )
+                val updatedIndexName = indexName.replace("*", "_")


why do we do this indexName.replace("*", "_")?

this is because * in a query string query field name has a different meaning (it acts as a wildcard) & is also not allowed in some cases.
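
A small sketch of the substitution being discussed. `sanitizeIndexName` mirrors the reviewed line; `queryFieldName` and its `field_index` naming scheme are assumptions made only for illustration, since the thread does not spell out how the plugin builds query-index field names.

```kotlin
// Mirrors the reviewed line: indexName.replace("*", "_")
fun sanitizeIndexName(indexName: String): String = indexName.replace("*", "_")

// Hypothetical naming scheme: suffix a field with the sanitized index name
// so the resulting field name contains no wildcard character.
fun queryFieldName(field: String, indexName: String): String =
    "${field}_${sanitizeIndexName(indexName)}"
```

For example, `queryFieldName("f1", "test*")` yields `f1_test_`, while a concrete index name such as `test1` passes through unchanged.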

eirsep · 2023-08-23T21:50:46Z

                monitorCtx.indexNameExpressionResolver!!
            )
+            if (concreteIndices.isEmpty()) {
+                throw IndexNotFoundException(docLevelMonitorInput.indices[0])


can we add error log with monitor Id and indices list which were not found

in exception why are we logging only 0th index element of indices variable? can we log all elements in array/list

changed it now. it was copied from earlier logic.
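
One way the requested error could look, as a sketch only. `IndicesNotFoundException` is a local stand-in so the example compiles without OpenSearch on the classpath (the diff uses OpenSearch's `IndexNotFoundException`), and `requireConcreteIndices` and the message wording are hypothetical.

```kotlin
// Stand-in for OpenSearch's IndexNotFoundException; an assumption for this sketch.
class IndicesNotFoundException(message: String) : RuntimeException(message)

fun requireConcreteIndices(
    monitorId: String,
    requested: List<String>,
    concrete: List<String>
): List<String> {
    if (concrete.isEmpty()) {
        // Report the monitor id and every requested index, not just the first one.
        throw IndicesNotFoundException(
            "Monitor $monitorId: no concrete indices found for " +
                requested.joinToString(", ", "[", "]")
        )
    }
    return concrete
}
```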

eirsep · 2023-08-23T21:57:54Z

-                    // TODO: If dryrun, we should make it so we limit the search as this could still potentially give us lots of data
-                    if (isTempMonitor) {
-                        indexLastRunContext[shard] = max(-1, (indexUpdatedRunContext[shard] as String).toInt() - 10)
+            docLevelMonitorInput.indices.forEach { indexName ->


why did we change this list being looped from computed concrete indices list to the monitor indices list?

this is because we want to expand 1 index pattern at a time & process all its concrete indices separately. The previous logic expanded all index patterns up front & processed all the concrete indices one by one.
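
The loop change can be pictured with the sketch below. It assumes a simplified resolver that only understands `*` wildcards; `resolveConcreteIndices` and `groupByMonitorIndex` are hypothetical names, and the real code resolves names through the cluster state rather than a plain list.

```kotlin
// Hypothetical resolver: expands a name or "*" pattern against known index names.
fun resolveConcreteIndices(pattern: String, allIndices: List<String>): List<String> {
    // Escape literal parts and let each "*" match any run of characters.
    val regex = pattern.split("*").joinToString(".*") { Regex.escape(it) }.toRegex()
    return allIndices.filter { regex.matches(it) }
}

// Iterate over the monitor's own index entries and keep the concrete indices
// grouped under the pattern they came from, instead of one flattened list.
fun groupByMonitorIndex(
    monitorIndices: List<String>,
    allIndices: List<String>
): Map<String, List<String>> =
    monitorIndices.associateWith { resolveConcreteIndices(it, allIndices) }
```

Keeping the grouping is what makes it possible to store one set of queries and mappings per pattern rather than per concrete index.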

eirsep · 2023-08-23T23:00:17Z

To create query index mappings for an index pattern would be a very big optimization but it works only under the assumption that an index pattern test* resolves to indices that do not have fields with same names but different data types.

this fails in the following scenario:
indices test1 and test2 are resolved from test* pattern. test1 has a field named f1 of type long and test2 also has a field named f1 BUT with type keyword

when we store the field mapping in query index for f1_test* this would break the query execution due to incorrect mapping

eirsep · 2023-08-23T23:03:38Z

we can add this optimization for the scenarios where we can identify all indices covered in index pattern are guaranteed to have the same mappings i.e. data streams, rolling indices with index templates.

should we store a field in index mapping called index_template if we are creating index from index template? food for thought

eirsep · 2023-08-25T22:45:35Z

@sbcd90

Since we are batching all mappings can we check if the above mentioned field mapping mismatch is happening for any fields and then ingest one for the generic test* and one mapping for the specific index. Similarly while fetching mappings look for both index mappings with test* and test1. If latter is found prefer it. if not found, use the former.
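
The suggestion above could be sketched as follows, assuming flat field-name to type maps and a "first type seen wins" rule for the pattern-level mapping. `QueryIndexMappings`, `buildMappings` and `lookupType` are hypothetical and are not the implementation that was merged.

```kotlin
data class QueryIndexMappings(
    // One shared mapping stored for the whole pattern (e.g. test*).
    val patternLevel: Map<String, String>,
    // Extra per-index mappings, only for fields whose type differs (e.g. test2).
    val indexLevel: Map<String, Map<String, String>>
)

// perIndex: index name -> (field name -> field type)
fun buildMappings(perIndex: Map<String, Map<String, String>>): QueryIndexMappings {
    val patternLevel = mutableMapOf<String, String>()
    val indexLevel = mutableMapOf<String, MutableMap<String, String>>()
    for ((index, fields) in perIndex) {
        for ((field, type) in fields) {
            // Assumption: the first type seen becomes the pattern-level type.
            val shared = patternLevel.getOrPut(field) { type }
            if (shared != type) {
                // Mismatch: keep a specific mapping for this index only.
                indexLevel.getOrPut(index) { mutableMapOf() }.put(field, type)
            }
        }
    }
    return QueryIndexMappings(patternLevel, indexLevel)
}

// Prefer the index-specific mapping; fall back to the pattern-level one.
fun lookupType(m: QueryIndexMappings, index: String, field: String): String? =
    m.indexLevel[index]?.get(field) ?: m.patternLevel[field]
```

In the test1/test2 scenario, f1 is stored as long for the pattern, test2 gets its own keyword entry, and lookups for test2 prefer that entry.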

sbcd90 · 2023-08-26T00:15:04Z

hi @eirsep , that is the exact fix i'm working on.

sbcd90 · 2023-08-29T22:46:09Z

hi @eirsep , just addressed the comment with the test case https://github.com/opensearch-project/alerting/pull/1097/files#diff-9b08d53e8cff3d739beb6ba304e4eaf466f08412762c3cd55886a52b722a77f1R503.
Can you please look into the pr again?

eirsep · 2023-08-29T22:53:29Z

the java doc description for this method seems like it should describe adjustMaxFieldLimitForQueryIndex

can we update the code comments for this method and below method

updated it.

eirsep · 2023-08-29T23:41:10Z

NIT; since you are not explicitly creating index with different mappings
can you plz fetch second index's mapping and assert that the field type is actually different for test_field so as to validate the test scenario. that would make the test more readable. right now it's a bit hard to understand.

specifying explicit mappings now.

eirsep · 2023-08-29T23:43:13Z

can we plz add asserts on the expected mappings and content of the query index. that would be the essence of these tests and help us visualize the changes made in this PR

asserts added on actual queries stored in query index.

eirsep · 2023-08-29T23:45:58Z

this is a very critical piece of code. can we update java docs description of this method incorporating the new change?

updated the documentation for the method.

eirsep · 2023-08-30T00:00:28Z

does the below flatten logic handle both the nested and non-nested type objects correctly?

what would be the difference after flattening into string in the following 2 cases

"nested_field": { "type": "nested", "properties": { "test1": { "type": "keyword" } } }

and

"nested_field": { "properties": { "test1": { "type": "keyword" } } }

nested fields don't work with query string queries. https://stackoverflow.com/questions/69857071/how-can-i-use-query-string-to-match-both-nested-and-non-nested-fields-at-the-sam
but if we pair them with an index containing an object field, it will work. https://github.com/opensearch-project/alerting/pull/1097/files#diff-9b08d53e8cff3d739beb6ba304e4eaf466f08412762c3cd55886a52b722a77f1R636
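
To make the reviewer's question concrete, here is a naive flattening sketch. It is an assumption about how such a flatten could work, not the PR's actual logic: `flattenMappings` is a hypothetical helper that descends into `properties` and ignores the container's own `type`.

```kotlin
// Flattens a mapping tree into "dotted.path" -> type for leaf fields.
// Naive sketch: a container's own "type" (e.g. "nested") is not recorded.
@Suppress("UNCHECKED_CAST")
fun flattenMappings(properties: Map<String, Any?>, prefix: String = ""): Map<String, String> {
    val out = linkedMapOf<String, String>()
    for ((name, node) in properties) {
        val path = if (prefix.isEmpty()) name else "$prefix.$name"
        val def = node as? Map<String, Any?> ?: continue
        val children = def["properties"] as? Map<String, Any?>
        if (children != null) {
            // Object or nested container: recurse into its sub-fields.
            out.putAll(flattenMappings(children, path))
        } else {
            // Leaf field: record its type if one is declared.
            (def["type"] as? String)?.let { out[path] = it }
        }
    }
    return out
}
```

With a flatten like this, both mappings quoted above reduce to the single entry `nested_field.test1` of type keyword, so the nested/object distinction is lost unless it is tracked separately.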

eirsep · 2023-08-30T00:02:44Z

why do we need field name twice?
can you add some code comments around what is the Triple type object being returned to make it more readable?

this is used intentionally because leafNodeProcessor is used at another place where the 2nd param is actually used to pass modified field name.

eirsep

thanks for making these changes @sbcd90

i have some minor comments not related to logic but more around readability and testing.

Kindly also respond to earlier comments.

Signed-off-by: Subhobrata Dey <sbcd90@gmail.com>

opensearch-trigger-bot · 2023-09-06T01:26:04Z

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/alerting/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/alerting/backport-2.x
# Create a new branch
git switch --create backport-1097-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 7f0c7c7e77a9213ad5e976c2f1573321bc26b919
# Push it to GitHub
git push --set-upstream origin backport-1097-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/alerting/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport-1097-to-2.x.

* log error messages and clean up monitor when indexing doc level queries or metadata creation fails (#900)
* log errors and clean up monitor when indexing doc level queries or metadata creation fails
* refactor delete monitor action to re-use delete methods
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* optimize doc-level monitor workflow for index patterns (#1097)
Signed-off-by: Subhobrata Dey <sbcd90@gmail.com>
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* optimize doc-level monitor execution workflow for datastreams (#1302)
* optimize doc-level monitor execution for datastreams
Signed-off-by: Subhobrata Dey <sbcd90@gmail.com>
* add more tests to address comments
Signed-off-by: Subhobrata Dey <sbcd90@gmail.com>
* add integTest for multiple datastreams inside a single index pattern
* add integTest for multiple datastreams inside a single index pattern
Signed-off-by: Subhobrata Dey <sbcd90@gmail.com>
---------
Signed-off-by: Subhobrata Dey <sbcd90@gmail.com>
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* Bulk index findings and sequentially invoke auto-correlations (#1355)
* Bulk index findings and sequentially invoke auto-correlations
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
* Bulk index findings in batches of 10000 and make it configurable
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
* Addressing review comments
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
* Add integ tests to test bulk index findings
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
* Fix ktlint formatting
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
---------
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* Add jvm aware setting and max num docs settings for batching docs for percolate queries (#1435)
* add jvm aware and max docs settings for batching docs for percolate queries
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
* fix stats logging
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
* add queryfieldnames field in findings mapping
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
---------
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* optimize to fetch only fields relevant to doc level queries in doc level monitor instead of entire _source for each doc (#1441)
* optimize to fetch only fields relevant to doc level queries in doc level monitor
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
* fix test for settings check
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
* fix ktlint
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
---------
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* optimize sequence number calculation and reduce search requests in doc level monitor execution (#1445)
* optimize sequence number calculation and reduce search requests by n where n is number of shards being queried in the executino
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
* fix tests
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
* optimize check indices and execute to query only write index of aliases and datastreams during monitor creation
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
* fix test
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
* add javadoc
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
* add tests to verify seq_no calculation
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
---------
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* Fix tests
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* Fix BWC tests
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* clean up doc level queries on dry run (#1430)
Signed-off-by: Joanne Wang <jowg@amazon.com>
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* Fix import
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* Fix tests
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* Fix BWC version
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* Fix another test
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
* Revert order of operations change
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
---------
Signed-off-by: Subhobrata Dey <sbcd90@gmail.com>
Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Signed-off-by: Joanne Wang <jowg@amazon.com>
Co-authored-by: Surya Sashank Nistala <snistala@amazon.com>
Co-authored-by: Subhobrata Dey <sbcd90@gmail.com>
Co-authored-by: Megha Goyal <56077967+goyamegh@users.noreply.github.com>
Co-authored-by: Joanne Wang <jowg@amazon.com>

sbcd90 requested review from AWSHurneyt, bowenlan-amzn, eirsep, getsaurabh02, lezzago, praveensameneni, qreshi and rishabhmaurya as code owners August 17, 2023 23:19

sbcd90 force-pushed the test_perf branch from 6bf0853 to 6d5b661 Compare August 17, 2023 23:34

lezzago approved these changes Aug 18, 2023

eirsep reviewed Aug 23, 2023

lezzago self-requested a review August 25, 2023 20:34

sbcd90 force-pushed the test_perf branch from bd21422 to 287e9ef Compare August 29, 2023 22:22

eirsep reviewed Aug 29, 2023

eirsep reviewed Aug 30, 2023

eirsep requested changes Aug 30, 2023

optimize doc-level monitor workflow for index patterns

4a85d13

Signed-off-by: Subhobrata Dey <sbcd90@gmail.com>

sbcd90 force-pushed the test_perf branch from 287e9ef to 4a85d13 Compare August 31, 2023 21:18

eirsep approved these changes Sep 5, 2023

sbcd90 added the backport 2.x label Sep 6, 2023

sbcd90 merged commit 7f0c7c7 into opensearch-project:main Sep 6, 2023

opensearch-trigger-bot Bot added the failed backport label Sep 6, 2023

sbcd90 mentioned this pull request Sep 7, 2023

optimize doc-level monitor workflow for index patterns #1122

Merged

1 task

engechas mentioned this pull request Mar 14, 2024

Backport 1097, 1302, 1435, 1441, 1445, 1430 #1470

Merged

1 task

engechas mentioned this pull request Mar 18, 2024

Backports 2.7 #1482

Merged

1 task

engechas pushed a commit to engechas/alerting that referenced this pull request Mar 18, 2024

optimize doc-level monitor workflow for index patterns (opensearch-pr…

b6c4783

…oject#1097) Signed-off-by: Subhobrata Dey <sbcd90@gmail.com> Signed-off-by: Chase Engelbrecht <engechas@amazon.com>
