Composite aggs seems to sort too slowly with filter queries #70035
Description
Piggy-backing off of previous work: #28745
During the work in #69970, some troubling performance data reared its ugly head.
Given the following query:
{"bool":{"filter":[{"term":{"event.dataset":"nginx.access"}}]}}
The following composite agg moves at an almost glacial pace:
"aggs": {
  "buckets": {
    "composite": {
      "size": 1000,
      "sources": [
        {
          "date": {
            "date_histogram": {
              "field": "@timestamp",
              "fixed_interval": "15m"
            }
          }
        },
        {
          "source.address": {
            "terms": {
              "field": "source.address"
            }
          }
        }
      ]
    },
    "aggregations": {
      "@timestamp": {
        "max": {
          "field": "@timestamp"
        }
      }
    }
  }
}
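For reference, here is a rough Python sketch of how the filter query and the composite agg above fit into one search body, and how successive pages would be fetched. The helper names `build_request` and `iter_pages`, the injected `search` callable, and the top-level `"size": 0` are assumptions for illustration, not what datafeeds actually run. The only Elasticsearch behaviour relied on is that a composite agg response carries an `after_key` that is passed back as `after` to get the next page.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def build_request(after_key: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build the search body for one page of the composite agg."""
    composite: Dict[str, Any] = {
        "size": 1000,
        "sources": [
            {"date": {"date_histogram": {"field": "@timestamp", "fixed_interval": "15m"}}},
            {"source.address": {"terms": {"field": "source.address"}}},
        ],
    }
    if after_key is not None:
        # Resume after the last bucket of the previous page.
        composite["after"] = after_key
    return {
        "size": 0,  # assumption: only the aggregation is needed, not the hits
        "query": {"bool": {"filter": [{"term": {"event.dataset": "nginx.access"}}]}},
        "aggs": {
            "buckets": {
                "composite": composite,
                "aggregations": {"@timestamp": {"max": {"field": "@timestamp"}}},
            }
        },
    }


def iter_pages(search: Callable[[Dict[str, Any]], Dict[str, Any]]) -> Iterator[List[Dict[str, Any]]]:
    """Yield pages of composite buckets until the agg is exhausted.

    `search` is a hypothetical callable that sends a body to the cluster
    and returns the parsed JSON response.
    """
    after_key: Optional[Dict[str, Any]] = None
    while True:
        agg = search(build_request(after_key))["aggregations"]["buckets"]
        if not agg["buckets"]:
            return
        yield agg["buckets"]
        after_key = agg.get("after_key")
        if after_key is None:
            return
```

Each call to `search` here counts as one search in the stats below.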
Here are some doc stats:
total_hits: 14479391
cardinality(source.address): 851502
max_timestamp: "2017-03-11T23:59:56.537Z"
min_timestamp: "2017-02-01T00:00:00.189Z"
In datafeeds we "chunk" the time range when scrolling through data. Consequently, we still hit every document, but spread across multiple smaller queries. We do this because sorting by timestamp can be costly when a single search hits many docs.
So, our scrolling datafeed had the following performance:
search_count | 16,649
bucket_count | 935
average_search_time_per_bucket_ms | 81.901
~4.6 ms per search, i.e. (bucket_count * average_search_time_per_bucket_ms) / search_count
Job finished in ~6 minutes
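The per-search figure is just that formula applied to the datafeed stats. A minimal Python sketch, where the function name `ms_per_search` is made up for illustration:

```python
def ms_per_search(bucket_count: int, avg_ms_per_bucket: float, search_count: int) -> float:
    """Average time per search: total bucket time divided by number of searches."""
    return bucket_count * avg_ms_per_bucket / search_count


# Stats from the scrolling datafeed run above.
scrolling = ms_per_search(bucket_count=935, avg_ms_per_bucket=81.901, search_count=16_649)
print(f"{scrolling:.1f} ms per search")
```

The same function applies unchanged to the stats of the other runs in this issue.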
Doing composite agg without chunking:
🐌 🐌 🐌
search_count | 3,795
bucket_count | 935
average_search_time_per_bucket_ms | 2,705.224
~666.5 ms per search
🐌 🐌 🐌
Job finished in 40+ minutes
It seems to me that the composite agg is doing WAY too much work. I think it may be sorting WAY too many documents given the sources.
As an experiment, I added some time-based query chunking in 25264688 ms intervals (roughly 7 hours, calculated based on term cardinality, count, and total time range).
🔥 🔥 🔥
search_count | 4,124
bucket_count | 935
average_search_time_per_bucket_ms | 112.775
~25 ms per search
🔥 🔥 🔥
Job finished in ~4 minutes
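A rough Python sketch of what that time-based chunking amounts to, assuming each chunk simply adds a `range` filter on `@timestamp` next to the user's filter. The helpers `to_epoch_ms`, `chunk_ranges` and `chunked_query` are hypothetical names. The interval is taken as a given constant, since how it was derived from cardinality, count and time range is not spelled out here.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, Iterator, Tuple

CHUNK_MS = 25_264_688  # interval used in the experiment above


def to_epoch_ms(ts: str) -> int:
    """Convert a UTC timestamp like 2017-02-01T00:00:00.189Z to epoch millis."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    return (dt - datetime(1970, 1, 1)) // timedelta(milliseconds=1)


def chunk_ranges(start_ms: int, end_ms: int, chunk_ms: int = CHUNK_MS) -> Iterator[Tuple[int, int]]:
    """Yield contiguous half-open [lo, hi) windows covering start_ms..end_ms inclusive."""
    lo = start_ms
    while lo <= end_ms:
        hi = min(lo + chunk_ms, end_ms + 1)
        yield lo, hi
        lo = hi


def chunked_query(user_filter: Dict[str, Any], lo: int, hi: int) -> Dict[str, Any]:
    """Wrap the user-provided filter with a time window for one chunk."""
    window = {"range": {"@timestamp": {"gte": lo, "lt": hi, "format": "epoch_millis"}}}
    return {"bool": {"filter": [user_filter, window]}}


start = to_epoch_ms("2017-02-01T00:00:00.189Z")  # min_timestamp from the doc stats
end = to_epoch_ms("2017-03-11T23:59:56.537Z")    # max_timestamp from the doc stats
user_filter = {"term": {"event.dataset": "nginx.access"}}
queries = [chunked_query(user_filter, lo, hi) for lo, hi in chunk_ranges(start, end)]
```

The composite agg is then paged to completion inside each window before moving on to the next one, so each search only has to consider the docs in its own slice of time.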
Datafeeds (and transforms) will ALWAYS use a filter-based query (ignoring scores). These queries are user-provided, so they could be anything. But it seems to me that there is still room for improvement in the composite agg.