-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Description
Is your feature request related to a problem? Please describe
Background
This idea is vaguely inspired by Lucene's ImpactsDISI + MaxScoreCache. On eligible queries, Lucene allows doc id set iterators to start skipping blocks of docs that have non-competitive scores, once the collector for the query already has trackTotalHitsUpTo hits (10k by default). This is acceptable since after trackTotalHitsUpTo hits, we've given up counting the total number of hits, and there's no reason to check if non-competitive docs match the query at all. If docs have equal scores, ties are broken by picking the one with lower doc id, which means later docs can safely be skipped.
This only applies in certain cases (exactly 1 must clause which uses impacts, plus stuff about its cost relative to other clauses) but when it works it can have huge performance improvements since it can let you skip most documents. On simple 2-clause http_logs query this can be as much as a 90x speedup (2494 -> 28 ms).
Example query:
"bool": {
"filter": {
"match": {
"status": "200"
}
},
"must": {
"match": {
"request": "images"
}
}
}
Idea
Lucene does not skip non-competitive docs in an all-filter case. (It's using ScoreMode.COMPLETE_NO_SCORES instead of ScoreMode.TOP_SCORES). So, the above query but with both clauses being filter can take much longer than 28 ms.
But, in a filter clause, all documents are equally competitive, since scores don't matter. Like before, we break ties by choosing the doc with lower doc id, and we don't care about counting exact hits past trackTotalHitsUpTo. So, as I understand it, there's no benefit in continuing to iterate past trackTotalHitsUpTo matching documents. If we stopped early we should get similar speedups that ImpactsDISI gives us.
(This doesn't apply if we're doing aggs, pagination, sorting, or other similar things, but for a simple query I think it's correct).
I think this speedup would apply to a much broader class of boolean queries than ImpactsDISI especially since we're planning on automatically rewriting certain must clauses --> filter clauses as described in #17586.
Describe the solution you'd like
We already have a mechanism to stop iteration early after a certain number of collected docs: the query param terminate_after. This lives in the SearchContext.
Ideally there would be some way for a top-level boolean query to signal to the SearchContext to set this value to trackTotalHitsUpTo if, after rewriting is done, there are only filter clauses. (I think this should also still work if there are must_not clauses). This should be possible at least in theory since QueryBuilder rewriting and conversion to Query is completed before we construct the Collector, which is the thing that needs to accept terminate_after.
I ran a benchmark testing the same all-filter queries with and without terminate_after and found some good speedups. Details in the Additional Context section.
Related component
Search:Performance
Describe alternatives you've considered
This could also be implemented at the Lucene level, but it might be tricker since I think there's no shared SearchContext type of object, and it seems tricky to get queries to talk to the parts of the code building the collectors.
Additional context
Benchmark results on http_logs are below. I didn't implement this suggestion yet, I'm just running some all-filter queries in OSB (branch link) and explicitly setting terminate_after on some of them. I also ran some where terminate_after is 1k instead of the default 10k to see if that'd have a significantly larger impact than 10k.
| Query | Current p50 (ms) | p50 with terminate_after=10k | Speedup | p50 with terminate_after=1k | Speedup |
|---|---|---|---|---|---|
status matches 200 & request matches "images" |
851 | 12.6 | 68x | 10.5 | 81x |
status matches 404 & request matches "images" |
40.7 | 15.5 | 2.6x | 12.4 | 3.3x |
@timestamp between 6/10-6/13 & request matches "images" |
454 | 10.9 | 42x | 10.6 | 43x |
request.raw in list of top 10 terms & @timestamp between 6/10-6/13 |
216 | 9.2 | 23x | 8.1 | 27x |
We can see the speedup is greatest when the queries matched a lot of docs, like status=200.
Using 1k instead of the default 10k helps a bit but clearly the main effect is from enabling terminate_after at all.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status