[ML] adding support for composite aggs in anomaly detection#69970
[ML] adding support for composite aggs in anomaly detection#69970benwtrent merged 27 commits intoelastic:masterfrom
Conversation
|
Pinging @elastic/ml-core (Team:ML) |
| TIP: If you use a terms aggregation and the cardinality of a term is high, the | ||
| aggregation might not be effective and you might want to just use the default | ||
| search and scroll behavior. | ||
| TIP: If you use a terms aggregation and the cardinality of a term is high, you |
There was a problem hiding this comment.
@szabosteve Mind reviewing all these doc updates? Its pretty encompassing.
szabosteve
left a comment
There was a problem hiding this comment.
Thanks for amending the docs! I left a couple of minor comments.
| TIP: If you use a terms aggregation and the cardinality of a term is high, the | ||
| aggregation might not be effective and you might want to just use the default | ||
| search and scroll behavior. | ||
| TIP: If you use a terms aggregation and the cardinality of a term is high, you |
There was a problem hiding this comment.
| TIP: If you use a terms aggregation and the cardinality of a term is high, you | |
| TIP: If you use a terms aggregation and the cardinality of a term is high, |
| aggregation might not be effective and you might want to just use the default | ||
| search and scroll behavior. | ||
| TIP: If you use a terms aggregation and the cardinality of a term is high, you | ||
| should use composite aggregations. |
There was a problem hiding this comment.
| should use composite aggregations. | |
| use composite aggregations instead. |
| of each bucket is the time of the last record in the bucket. | ||
|
|
||
| For `composite` aggregation support, there must be exactly one `date_histogram` value | ||
| source. Additionally, that value source must NOT be sorted in descending order. Additional |
There was a problem hiding this comment.
| source. Additionally, that value source must NOT be sorted in descending order. Additional | |
| source. That value source must not be sorted in descending order. Additional |
|
|
||
| Your {dfeed} can contain multiple aggregations, but only the ones with names | ||
| TIP: If you are utilizing a `term` aggregation to gather influencer or partition | ||
| detector field information, consider using a `composite` aggregation. It will perform |
There was a problem hiding this comment.
| detector field information, consider using a `composite` aggregation. It will perform | |
| detector field information, consider using a `composite` aggregation. It performs |
| Your {dfeed} can contain multiple aggregations, but only the ones with names | ||
| TIP: If you are utilizing a `term` aggregation to gather influencer or partition | ||
| detector field information, consider using a `composite` aggregation. It will perform | ||
| better than a `date_histogram` with a nested `term` aggregation and will also include |
There was a problem hiding this comment.
| better than a `date_histogram` with a nested `term` aggregation and will also include | |
| better than a `date_histogram` with a nested `term` aggregation and also includes |
| aggregation `term` source with the name `airline` works. Note its name | ||
| is the same as the field. | ||
| <4> The required `max` aggregation whose name is the time field in the | ||
| job analysis config |
There was a problem hiding this comment.
| job analysis config | |
| job analysis config. |
| When you define an aggregation in a {dfeed}, it must have the following form: | ||
| When you define an aggregation in a {dfeed}, it must have one of the following forms: | ||
|
|
||
| When using a `date_histogram` aggregation to bucket by time |
There was a problem hiding this comment.
| When using a `date_histogram` aggregation to bucket by time | |
| When using a `date_histogram` aggregation to bucket by time: |
| ---------------------------------- | ||
| // NOTCONSOLE | ||
|
|
||
| When using a `composite` aggregation |
There was a problem hiding this comment.
| When using a `composite` aggregation | |
| When using a `composite` aggregation: |
| aggregation. For more information, see | ||
| {ref}/search-aggregations-bucket-datehistogram-aggregation.html[Date histogram aggregation]. | ||
| sub-aggregation that is a `date_histogram`, the top level aggregation is the | ||
| required `date_histogram`, or the top leve aggregation is the required `composite`. |
There was a problem hiding this comment.
| required `date_histogram`, or the top leve aggregation is the required `composite`. | |
| required `date_histogram`, or the top level aggregation is the required `composite`. |
| You can optionally specify a terms aggregation, which creates buckets for | ||
| different values of a field. | ||
|
|
||
| TIP: Instead of nesting a `term` aggregation, try using `composite` aggs. |
There was a problem hiding this comment.
| TIP: Instead of nesting a `term` aggregation, try using `composite` aggs. | |
| TIP: Instead of nesting a `term` aggregation, use `composite` aggs. |
|
The datafeed config does not seem to match the job config in the description. |
|
@dimitris-athanasiou updated :). I copied the wrong one from my kibana console! |
dimitris-athanasiou
left a comment
There was a problem hiding this comment.
Very very cool! Just a few minor things.
.../plugin/core/src/test/java/org/elasticsearch/xpack/core/ml/datafeed/DatafeedConfigTests.java
Outdated
Show resolved
Hide resolved
.../plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportStartDatafeedAction.java
Show resolved
Hide resolved
...gin/ml/src/main/java/org/elasticsearch/xpack/ml/datafeed/extractor/DataExtractorFactory.java
Outdated
Show resolved
Hide resolved
...gin/ml/src/main/java/org/elasticsearch/xpack/ml/datafeed/extractor/DataExtractorFactory.java
Outdated
Show resolved
Hide resolved
...va/org/elasticsearch/xpack/ml/datafeed/extractor/aggregation/AggregationToJsonProcessor.java
Outdated
Show resolved
Hide resolved
...elasticsearch/xpack/ml/datafeed/extractor/aggregation/CompositeAggregationDataExtractor.java
Outdated
Show resolved
Hide resolved
...elasticsearch/xpack/ml/datafeed/extractor/aggregation/CompositeAggregationDataExtractor.java
Outdated
Show resolved
Hide resolved
...elasticsearch/xpack/ml/datafeed/extractor/aggregation/CompositeAggregationDataExtractor.java
Show resolved
Hide resolved
|
OK, doing some additional testing, we are running into an issue with requiring
It is possible that bucket This can result in many THOUSANDS of out of order records and cause processing to slow to a crawl as the model has to constantly reorder buckets as they come in. This needs to be thought about further before merging this PR. How do we handle bucket ordering in |
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/datafeed/DatafeedConfig.java
Show resolved
Hide resolved
| Iterator<Map.Entry<Long, List<Map<String, Object>>>> iterator = docsByBucketTimestamp.entrySet().iterator(); | ||
| while (iterator.hasNext()) { | ||
| Map.Entry<Long, List<Map<String, Object>>> entry = iterator.next(); | ||
| if (shouldCancel.test(entry.getKey())) { |
There was a problem hiding this comment.
OK, I thought about this and the discrepancy with the other version of writeDocs. I believe we can just keep the new version and use it where we used the old version as well.
Here is a bit of the story here.
The first implementation of allowing datafeeds with aggs to cancel added the idea of looking at how many key-value pairs we wrote. At that point, the implementation was not checking if we wrote the entire bucket. Thus, it was possible to stop the datafeed half through the bucket. This was fixed later, when the current implementation was added.
The current implementation collects data in buckets. We then write whole buckets and only check to see if we should cancel after a whole bucket was written. We still look for whether we reached the key-value pair batch size of 1000. But I think this is completely unnecessary now.
The point of checking key-value pairs was to create a good moment to check whether we should cancel without having to check after each record. But checking after each histogram bucket is good enough. I hope this makes sense.
There was a problem hiding this comment.
Having said that, the other benefit of splitting is to contain the size of the output stream. But I guess this might be unnecessary as it is bound by the size of the search response, right? If the search response already fits in memory, then the output stream we create shouldn't be a problem.
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Show resolved
Hide resolved
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
…ed-composite-aggs
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
…s.asciidoc Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc
Outdated
Show resolved
Hide resolved
…s.asciidoc Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
|
run elasticsearch-ci/2 |
dimitris-athanasiou
left a comment
There was a problem hiding this comment.
Looks good! Just a couple of minor comments and we're set to go.
| return (DateHistogramValuesSourceBuilder)valuesSourceBuilder; | ||
| } | ||
| } | ||
| return null; |
There was a problem hiding this comment.
It seems that every caller of this assumes it doesn't return null. I think this is because at this point we have checked there is a histogram aggregation specified. I wonder if we should just throw an IllegalStateException here instead of returning null and remove the assertions in callers.
| Collections.singletonList(indexName)) | ||
| .setParsedAggregations(aggs) | ||
| .setFrequency(TimeValue.timeValueHours(1)) | ||
| .setChunkingConfig(ChunkingConfig.newManual(TimeValue.timeValueHours(1))) |
There was a problem hiding this comment.
Why are we setting manual chunking of 1 hour and later update it to auto? It might be something worth capturing in a comment in the test.
| .execute(ActionListener.wrap( | ||
| _unused -> listener.onResponse(finished), | ||
| ex -> { | ||
| logger.info( |
There was a problem hiding this comment.
Should this be an error?
There was a problem hiding this comment.
How about a warn? error seems like some sort of system failure and failing to refresh the job is not a horrific thing in this scenario.
There was a problem hiding this comment.
warn is even more appropriate indeed
…69970) This commit allows for composite aggregations in datafeeds. Composite aggs provide a much better solution for having influencers, partitions, etc. on high volume data. Instead of worrying about long scrolls in the datafeed, the calculation is distributed across cluster via the aggregations. The restrictions for this support are as follows: - The composite aggregation must have EXACTLY one `date_histogram` source - The sub-aggs of the composite aggregation must have a `max` aggregation on the SAME timefield as the aforementioned `date_histogram` source - The composite agg must be the ONLY top level agg and it cannot have a `composite` or `date_histogram` sub-agg - If using a `date_histogram` to bucket time, it cannot have a `composite` sub-agg. - The top-level `composite` agg cannot have a sibling pipeline agg. Pipeline aggregations are supported as a sub-agg (thus a pipeline agg INSIDE the bucket). Some key user interaction differences: - Speed + resources used by the cluster should be controlled by the `size` parameter in the `composite` aggregation. Previously, we said if you are using aggs, use a specific `chunking_config`. But, with composite, that is not necessary. - Users really shouldn't use nested `terms` aggs anylonger. While this is still a "valid" configuration and MAY be desirable for some users (only wanting the top 10 of certain terms), typically when users want influencers, partition fields, etc. they want the ENTIRE population. Previously, this really wasn't possible with aggs, with `composite` it is. - I cannot really think of a typical usecase that SHOULD ever use a multi-bucket aggregation that is NOT supported by composite.
…9970) (#71052) * [ML] adding support for composite aggs in anomaly detection (#69970) This commit allows for composite aggregations in datafeeds. Composite aggs provide a much better solution for having influencers, partitions, etc. on high volume data. Instead of worrying about long scrolls in the datafeed, the calculation is distributed across cluster via the aggregations. The restrictions for this support are as follows: - The composite aggregation must have EXACTLY one `date_histogram` source - The sub-aggs of the composite aggregation must have a `max` aggregation on the SAME timefield as the aforementioned `date_histogram` source - The composite agg must be the ONLY top level agg and it cannot have a `composite` or `date_histogram` sub-agg - If using a `date_histogram` to bucket time, it cannot have a `composite` sub-agg. - The top-level `composite` agg cannot have a sibling pipeline agg. Pipeline aggregations are supported as a sub-agg (thus a pipeline agg INSIDE the bucket). Some key user interaction differences: - Speed + resources used by the cluster should be controlled by the `size` parameter in the `composite` aggregation. Previously, we said if you are using aggs, use a specific `chunking_config`. But, with composite, that is not necessary. - Users really shouldn't use nested `terms` aggs anylonger. While this is still a "valid" configuration and MAY be desirable for some users (only wanting the top 10 of certain terms), typically when users want influencers, partition fields, etc. they want the ENTIRE population. Previously, this really wasn't possible with aggs, with `composite` it is. - I cannot really think of a typical usecase that SHOULD ever use a multi-bucket aggregation that is NOT supported by composite.
This commit allows for composite aggregations in datafeeds.
Composite aggs provide a much better solution for having influencers, partitions, etc. on high volume data. Instead of worrying about long scrolls in the datafeed, the calculation is distributed across cluster via the aggregations.
The restrictions for this support are as follows:
date_histogramsourcemaxaggregation on the SAME timefield as the aforementioneddate_histogramsourcecompositeordate_histogramsub-aggdate_histogramto bucket time, it cannot have acompositesub-agg.compositeagg cannot have a sibling pipeline agg. Pipeline aggregations are supported as a sub-agg (thus a pipeline agg INSIDE the bucket).Some key user interaction differences:
sizeparameter in thecompositeaggregation. Previously, we said if you are using aggs, use a specificchunking_config. But, with composite, that is not necessary.termsaggs anylonger. While this is still a "valid" configuration and MAY be desirable for some users (only wanting the top 10 of certain terms), typically when users want influencers, partition fields, etc. they want the ENTIRE population. Previously, this really wasn't possible with aggs, withcompositeit is.Example
here is a job that was traditionally restricted to using a scroll datafeed but now can be completed with a composite agg:
Job:
datafeed:
All the links, results, etc. are the same as if this was done via a regular scroll.
Even the UI works as the internal aggregations are simple enough (just a count). This is not always the case, as with aggregations now, they are so flexible that the UI can have issues recreating the underlying visuals. But, in this particular case, it works fine because there is no internal
termaggregation (just in the composite definition).The UI should probably validate the composite agg definition to make sure it is only
date_histogramandtermsources. But, support for filtering by the other composite sources shouldn't be too difficult in time. Kibana issue